[ https://issues.apache.org/jira/browse/CARBONDATA-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381080#comment-17381080 ]
Indhumathi commented on CARBONDATA-4132:
----------------------------------------

Please refer to the comment I have added on CARBONDATA-4239, which can help you use MVs in a better way for your scenario to get the storage and performance benefits.

> Number of records not matching in MVs
> -------------------------------------
>
>                 Key: CARBONDATA-4132
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4132
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.0.1
>        Environment: Apache carbondata 2.0.1
>            Reporter: suyash yadav
>           Priority: Major
>             Fix For: 2.0.1
>
> Hi Team,
> We are working on a POC where we need to insert 300k records/second into a
> table on which we have already created timeseries MVs with minute, hour, and
> day granularity.
>
> As per our expectation, the minute-based MV should contain 300K records until
> the next minute's data is inserted. Likewise, the hour- and day-based MVs
> should contain 300K records until the next hour's and next day's data arrive,
> respectively.
>
> However, the count of records in the MVs does not match our expectation; it
> is always higher than expected.
> The strange thing is: when we drop the MV and recreate it after the data has
> already been inserted into the table, the record count comes out correct. So
> it is clear there is no problem with the MV definition or the data.
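The minute bucketing that timeseries(end_ms, 'minute') implies can be sketched outside Spark. The snippet below is a plain-Scala illustration (hypothetical names, not CarbonData's implementation): every epoch-millisecond timestamp inside the same minute truncates to one bucket, which is why all raw rows sharing a minute and the other group-by keys are expected to collapse into a single MV row.

```scala
// Plain-Scala sketch of minute-granularity bucketing; names are hypothetical
// and this is NOT CarbonData's timeseries implementation.
object MinuteBucketSketch {
  // Truncate an epoch-millisecond timestamp to the start of its minute.
  def minuteBucket(epochMs: Long): Long = epochMs - (epochMs % 60000L)

  // 1,000 timestamps spread across a single minute.
  val sameMinute: Seq[Long] = (0L until 60000L by 60L).toSeq

  // All of them map to one bucket, so they would feed one aggregated MV row
  // per distinct combination of the remaining group-by keys.
  val distinctBuckets: Int = sameMinute.map(minuteBucket).distinct.size
}
```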
> Kindly help us resolve this issue on priority. Please find more details below:
>
> Table definition:
> =================
> spark.sql("create table Flow_Raw_TS(export_ms bigint, exporter_ip string, pkt_seq_num bigint, flow_seq_num int, src_ip string, dst_ip string, protocol_id smallint, src_tos smallint, dst_tos smallint, raw_src_tos smallint, raw_dst_tos smallint, src_mask smallint, dst_mask smallint, tcp_bits int, src_port int, in_if_id bigint, in_if_entity_id bigint, in_if_enabled boolean, dst_port int, out_if_id bigint, out_if_entity_id bigint, out_if_enabled boolean, direction smallint, in_octets bigint, out_octets bigint, in_packets bigint, out_packets bigint, next_hop_ip string, bgp_src_as_num bigint, bgp_dst_as_num bigint, bgp_next_hop_ip string, end_ms timestamp, start_ms timestamp, app_id string, app_name string, src_ip_group string, dst_ip_group string, policy_qos_classification_hierarchy string, policy_qos_queue_id bigint, worker_id int, day bigint) stored as carbondata TBLPROPERTIES ('local_dictionary_enable'='false')")
>
> MV definition:
> ==============
> +*Minute based*+
> spark.sql("create materialized view Flow_Raw_TS_agg_001_min as select timeseries(end_ms,'minute') as end_ms, src_ip, dst_ip, app_name, in_if_id, src_tos, src_ip_group, dst_ip_group, protocol_id, bgp_src_as_num, bgp_dst_as_num, policy_qos_classification_hierarchy, policy_qos_queue_id, sum(in_octets) as octects, sum(in_packets) as packets, sum(out_packets) as out_packets, sum(out_octets) as out_octects FROM Flow_Raw_TS group by timeseries(end_ms,'minute'), src_ip, dst_ip, app_name, in_if_id, src_tos, src_ip_group, dst_ip_group, protocol_id, bgp_src_as_num, bgp_dst_as_num, policy_qos_classification_hierarchy, policy_qos_queue_id").show()
>
> +*Hour Based*+
> val startTime = System.nanoTime
> spark.sql("create materialized view Flow_Raw_TS_agg_001_hour as select timeseries(end_ms,'hour') as end_ms, app_name, sum(in_octets) as octects, sum(in_packets) as packets, sum(out_packets) as out_packets, sum(out_octets) as out_octects, in_if_id, src_tos, src_ip_group, dst_ip_group, protocol_id, src_ip, dst_ip, bgp_src_as_num, bgp_dst_as_num, policy_qos_classification_hierarchy, policy_qos_queue_id FROM Flow_Raw_TS group by timeseries(end_ms,'hour'), in_if_id, app_name, src_tos, src_ip_group, dst_ip_group, protocol_id, src_ip, dst_ip, bgp_src_as_num, bgp_dst_as_num, policy_qos_classification_hierarchy, policy_qos_queue_id").show()
> val endTime = System.nanoTime
> val elapsedSeconds = (endTime - startTime) / 1e9d

-- This message was sent by Atlassian Jira (v8.3.4#803005)
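One hypothesis consistent with the drop-and-recreate behaviour described above (my assumption, not confirmed in this ticket) is that each incremental load is aggregated on its own, so group keys repeated across loads yield extra MV rows until the MV is rebuilt in a single pass over all the data. A minimal plain-Scala model of that difference:

```scala
// Hypothetical model (not CarbonData code) of why an incrementally refreshed
// MV could hold more rows than one recreated after all loads have finished.
// Assumption: each load produces its own aggregated output, and those outputs
// are appended rather than merged.
object MvRollupSketch {
  case class Flow(minute: Long, app: String, octets: Long)

  // Aggregate one batch: one output row per (minute, app) group.
  def aggregate(batch: Seq[Flow]): Map[(Long, String), Long] =
    batch.groupBy(f => (f.minute, f.app))
         .map { case (key, rows) => key -> rows.map(_.octets).sum }

  // Two loads that fall in the same minute and share the same group keys.
  val load1 = Seq(Flow(0, "dns", 100), Flow(0, "http", 200))
  val load2 = Seq(Flow(0, "dns", 300), Flow(0, "http", 400))

  // Incremental refresh: each load aggregated separately, rows appended.
  val incrementalRowCount: Int = aggregate(load1).size + aggregate(load2).size

  // MV dropped and recreated after both loads: a single pass over all data.
  val rebuiltRowCount: Int = aggregate(load1 ++ load2).size
}
```

If this model matches what CarbonData does, merging the MV's aggregated segments (or recreating the MV) would bring the counts back in line, which is what the reporter observes after a drop and recreate.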