[
https://issues.apache.org/jira/browse/CARBONDATA-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381080#comment-17381080
]
Indhumathi commented on CARBONDATA-4132:
----------------------------------------
Please refer to the comment that I added in CARBONDATA-4239. It can help you
use MVs more effectively for your scenario and get better storage and
performance.
> Number of records not matching in MVs
> -------------------------------------
>
> Key: CARBONDATA-4132
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4132
> Project: CarbonData
> Issue Type: Improvement
> Components: core
> Affects Versions: 2.0.1
> Environment: Apache carbondata 2.0.1
> Reporter: suyash yadav
> Priority: Major
> Fix For: 2.0.1
>
>
> Hi Team,
> We are working on a POC where we need to insert 300K records/second into a
> table on which we have already created timeseries MVs with minute, hour, and
> day granularity.
>
> As per our expectation, the minute-based MV should contain 300K records until
> the next minute's data is inserted. Likewise, the hour-based and day-based MVs
> should contain 300K records until the next hour's and next day's data arrive,
> respectively.
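
For readers unfamiliar with the granularities mentioned above, here is a minimal sketch of what bucketing a timestamp by minute, hour, or day looks like. It assumes that timeseries(end_ms, '<granularity>') rounds a timestamp down to the start of its bucket; the object TimeBuckets and the function bucket are hypothetical names used only for illustration and are not part of CarbonData.

```scala
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

object TimeBuckets {
  // Round a timestamp down to the start of its minute, hour, or day.
  // Assumption: this mirrors what timeseries(end_ms, '<granularity>') does.
  def bucket(ts: LocalDateTime, granularity: String): LocalDateTime =
    granularity match {
      case "minute" => ts.truncatedTo(ChronoUnit.MINUTES)
      case "hour"   => ts.truncatedTo(ChronoUnit.HOURS)
      case "day"    => ts.truncatedTo(ChronoUnit.DAYS)
      case other    =>
        throw new IllegalArgumentException(s"unsupported granularity: $other")
    }

  def main(args: Array[String]): Unit = {
    val ts = LocalDateTime.of(2021, 7, 15, 10, 42, 37)
    // Print the bucket each granularity assigns to the same timestamp.
    Seq("minute", "hour", "day").foreach { g =>
      println(s"$g -> ${bucket(ts, g)}")
    }
  }
}
```

All records whose end_ms falls in the same bucket, and which share the same values for the other group-by columns, collapse into a single MV row.
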
>
> But the record count in the MVs does not match our expectation; it is always
> higher.
> The strange thing is that when we drop the MV and re-create it after inserting
> the data into the table, the record count comes out correct. So it is clear
> that there is no problem with the MV definition or the data.
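
One possible reading of this symptom, which is not confirmed anywhere in this thread: if each load into Flow_Raw_TS is aggregated into the MV separately, the same group-by key can be stored once per load, so a plain row count on the MV comes out higher than after a full rebuild. The sketch below simulates that with plain Scala collections. The record shape, the object MvRowCounts, and its functions are hypothetical and greatly simplified.

```scala
object MvRowCounts {
  // Hypothetical, simplified record: (minute bucket, src_ip, in_octets).
  type Rec = (String, String, Long)

  // Aggregate a set of records: one output row per distinct (minute, src_ip).
  def aggregate(records: Seq[Rec]): Map[(String, String), Long] =
    records.groupBy(r => (r._1, r._2)).map { case (key, rs) =>
      key -> rs.map(_._3).sum
    }

  // Full rebuild: all loads are aggregated together in one pass.
  def fullRebuildRows(loads: Seq[Seq[Rec]]): Int =
    aggregate(loads.flatten).size

  // Per-load aggregation (assumed behaviour): each load is aggregated on its
  // own and the partial results are stored side by side.
  def perLoadRows(loads: Seq[Seq[Rec]]): Int =
    loads.map(load => aggregate(load).size).sum

  def main(args: Array[String]): Unit = {
    val load1 = Seq(("10:42", "10.0.0.1", 100L), ("10:42", "10.0.0.2", 50L))
    val load2 = Seq(("10:42", "10.0.0.1", 70L), ("10:43", "10.0.0.1", 10L))
    val loads = Seq(load1, load2)
    println(s"rows after full rebuild: ${fullRebuildRows(loads)}")
    println(s"rows with per-load aggregation: ${perLoadRows(loads)}")
  }
}
```

In this toy data the key ("10:42", "10.0.0.1") appears in both loads, so it is counted twice under per-load aggregation but only once after a rebuild. Whether this is what happens in the reported setup would need to be checked against the actual MV contents.
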
>
> Kindly help us resolve this issue on priority. Please find more details
> below:
> Table definition:
> ===========
> spark.sql("create table Flow_Raw_TS(export_ms bigint,exporter_ip
> string,pkt_seq_num bigint,flow_seq_num int,src_ip string,dst_ip
> string,protocol_id smallint,src_tos smallint,dst_tos smallint,raw_src_tos
> smallint,raw_dst_tos smallint,src_mask smallint,dst_mask smallint,tcp_bits
> int,src_port int,in_if_id bigint,in_if_entity_id bigint,in_if_enabled
> boolean,dst_port int,out_if_id bigint,out_if_entity_id bigint,out_if_enabled
> boolean,direction smallint,in_octets bigint,out_octets bigint,in_packets
> bigint,out_packets bigint,next_hop_ip string,bgp_src_as_num
> bigint,bgp_dst_as_num bigint,bgp_next_hop_ip string,end_ms timestamp,start_ms
> timestamp,app_id string,app_name string,src_ip_group string,dst_ip_group
> string,policy_qos_classification_hierarchy string,policy_qos_queue_id
> bigint,worker_id int,day bigint ) stored as carbondata TBLPROPERTIES
> ('local_dictionary_enable'='false')")
> MV definition:
> ==============
> +*Minute based*+
> spark.sql("create materialized view Flow_Raw_TS_agg_001_min as select
> timeseries(end_ms,'minute') as
> end_ms,src_ip,dst_ip,app_name,in_if_id,src_tos,src_ip_group,dst_ip_group,protocol_id,bgp_src_as_num,
> bgp_dst_as_num,policy_qos_classification_hierarchy,
> policy_qos_queue_id,sum(in_octets) as octects, sum(in_packets) as packets,
> sum(out_packets) as out_packets, sum(out_octets) as out_octects FROM
> Flow_Raw_TS group by
> timeseries(end_ms,'minute'),src_ip,dst_ip,app_name,in_if_id,src_tos,src_ip_group,
> dst_ip_group,protocol_id,bgp_src_as_num,bgp_dst_as_num,policy_qos_classification_hierarchy,
> policy_qos_queue_id").show()
> +*Hour Based*+
> val startTime = System.nanoTime
> spark.sql("create materialized view Flow_Raw_TS_agg_001_hour as select
> timeseries(end_ms,'hour') as end_ms,app_name,sum(in_octets) as octects,
> sum(in_packets) as packets, sum(out_packets) as out_packets, sum(out_octets)
> as out_octects, in_if_id,src_tos,src_ip_group,
> dst_ip_group,protocol_id,src_ip, dst_ip,bgp_src_as_num,
> bgp_dst_as_num,policy_qos_classification_hierarchy, policy_qos_queue_id FROM
> Flow_Raw_TS group by
> timeseries(end_ms,'hour'),in_if_id,app_name,src_tos,src_ip_group,dst_ip_group,protocol_id,src_ip,
> dst_ip,bgp_src_as_num,bgp_dst_as_num,policy_qos_classification_hierarchy,
> policy_qos_queue_id").show()
> val endTime = System.nanoTime
> val elapsedSeconds = (endTime - startTime) / 1e9d
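
The start/end bookkeeping around the hour-based MV creation above can be wrapped in a small reusable helper. This is a minimal sketch: the object Timing and the function timed are hypothetical names, and the workload in main is a stand-in for the spark.sql(...) call so the example runs without Spark.

```scala
object Timing {
  // Run `block` once and return its result together with the elapsed
  // wall-clock time in seconds, measured with System.nanoTime.
  def timed[A](block: => A): (A, Double) = {
    val startTime = System.nanoTime
    val result = block
    val elapsedSeconds = (System.nanoTime - startTime) / 1e9d
    (result, elapsedSeconds)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the report this would be the MV creation call.
    val (sum, seconds) = timed { (1L to 1000000L).sum }
    println(f"sum=$sum took $seconds%.3f s")
  }
}
```

Returning the elapsed time alongside the result keeps the measurement next to the statement being measured, which makes it harder to forget to print or log it.
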
--
This message was sent by Atlassian Jira
(v8.3.4#803005)