[jira] [Commented] (CARBONDATA-4239) Carbondata 2.1.1 MV : Incremental refresh : Doesnot aggregate data correctly

2021-07-15 Thread Indhumathi (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381175#comment-17381175
 ] 

Indhumathi commented on CARBONDATA-4239:


MV can be used for real-time data loading, even for data arriving every 15 
minutes, but each load should carry more data.

If you use INSERT to add a single row every 5/15 minutes, it will not give 
much benefit.

As I already suggested in previous comments, you can still use MV for your 
scenario with manual refresh.
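
For reference, a manually refreshed MV is brought up to date with a single SQL 
statement, which an external job can issue at whatever interval suits your load 
pattern (a minimal sketch; mv_name is a placeholder for your view):

refresh materialized view mv_name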

 

> Carbondata 2.1.1 MV : Incremental refresh : Doesnot aggregate data correctly 
> -----------------------------------------------------------------------------
>
> Key: CARBONDATA-4239
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4239
> Project: CarbonData
>  Issue Type: Bug
>  Components: core, data-load
>Affects Versions: 2.1.1
> Environment: RHEL  spark-2.4.5-bin-hadoop2.7 for carbon 2.1.1 
>Reporter: Sushant Sammanwar
>Priority: Major
>  Labels: Materialistic_Views, materializedviews, refreshnodes
>
> Hi Team,
> We are doing a POC with CarbonData using MV.
> Our MV does not contain the AVG function, as we wanted to utilize the 
> incremental refresh feature.
> But with incremental refresh, we noticed the MV does not aggregate values 
> correctly.
> If a row is inserted, it creates another row in the MV instead of adding the 
> incremental value.
> As a result, the number of rows in the MV is almost the same as in the raw 
> table.
> This does not happen with a full-refresh MV.
> Below is the data in the MV with 3 rows:
> scala> carbon.sql("select * from fact_365_1_eutrancell_21_30_minute").show()
> ++---+---+--+-+-++
> |fact_365_1_eutrancell_21_tags_id|fact_365_1_eutrancell_21_metric| ts| 
> sum_value|min_value|max_value|fact_365_1_eutrancell_21_ts2|
> ++---+---+--+-+-++
> | ff6cb0f7-fba0-413...| eUtranCell.HHO.X2...|2020-09-25 
> 06:30:00|5412.68105| 31.345| 4578.112| 2020-09-25 05:30:00|
> | ff6cb0f7-fba0-413...| eUtranCell.HHO.X2...|2020-09-25 05:30:00| 1176.7035| 
> 392.2345| 392.2345| 2020-09-25 05:30:00|
> | ff6cb0f7-fba0-413...| eUtranCell.HHO.X2...|2020-09-25 06:00:00| 58.112| 
> 58.112| 58.112| 2020-09-25 05:30:00|
> ++---+---+--+-+-++
> Below, I am inserting data for the 6th hour, and it should add incremental 
> values to the 6th-hour row of the MV.
> Note the data being inserted: the columns which are part of the group-by 
> clause have the same values as the existing data.
> scala> carbon.sql("insert into fact_365_1_eutrancell_21 values ('2020-09-25 06:05:00','eUtranCell.HHO.X2.InterFreq.PrepAttOut','ff6cb0f7-fba0-4134-81ee-55e820574627',118.112,'2020-09-25 05:30:00')").show()
> 21/06/28 16:01:31 AUDIT audit: {"time":"June 28, 2021 4:01:31 PM IST","username":"root","opName":"INSERT INTO","opId":"7332282307468267","opStatus":"START"}
> 21/06/28 16:01:32 WARN CarbonOutputIteratorWrapper: try to poll a row batch one more time.
> 21/06/28 16:01:32 WARN CarbonOutputIteratorWrapper: try to poll a row batch one more time.
> 21/06/28 16:01:32 WARN CarbonOutputIteratorWrapper: try to poll a row batch one more time.
> 21/06/28 16:01:33 AUDIT audit: {"time":"June 28, 2021 4:01:33 PM IST","username":"root","opName":"INSERT INTO","opId":"7332284066443156","opStatus":"START"}
> [Stage 40:=>(199 + 1) / 200]21/06/28 16:01:44 WARN CarbonOutputIteratorWrapper: try to poll a row batch one more time.
> 21/06/28 16:01:44 WARN CarbonOutputIteratorWrapper: try to poll a row batch one more time.
> 21/06/28 16:01:44 WARN CarbonOutputIteratorWrapper: try to poll a row batch one more time.
> 21/06/28 16:01:44 AUDIT audit: {"time":"June 28, 2021 4:01:44 PM IST","username":"root","opName":"INSERT INTO","opId":"7332284066443156","opStatus":"SUCCESS","opTime":"11343 ms","table":"default.fact_365_1_eutrancell_21_30_minute","extraInfo":{}}
> 21/06/28 16:01:44 AUDIT audit: {"time":"June 28, 2021 4:01:44 PM IST","username":"root","opName":"INSERT INTO","opId":"7332282307468267","opStatus":"SUCCESS","opTime":"13137 ms","table":"default.fact_365_1_eutrancell_21","extraInfo":{}}
> +----------+
> |Segment ID|
> +----------+
> |         8|
> +----------+
> Below we can see it has added another row for 2020-09-25 06:00:00.
> Note: all values of the columns which are part of the group-by clause have 
> the same value.
> This means there should have been a single row for 2020-09-25 06:00:00.
> scala> carbon.sql("select * from

[jira] [Commented] (CARBONDATA-4239) Carbondata 2.1.1 MV : Incremental refresh : Doesnot aggregate data correctly

2021-07-15 Thread Sushant Sammanwar (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381160#comment-17381160
 ] 

Sushant Sammanwar commented on CARBONDATA-4239:
---

Thanks [~indhumuthumurugesh] [~Indhumathi27] for your response.

Does this mean MV should NOT be used for real-time (continuous, incremental) 
data loading? Should it be used only for bulk data loads (for example, loading 
data every 30 mins or 1 hr instead of every 5 or 15 mins)?
Only then will it benefit storage and query time.
Is my understanding correct?


[jira] [Commented] (CARBONDATA-4132) Numer of records not matching in MVs

2021-07-15 Thread Indhumathi (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381080#comment-17381080
 ] 

Indhumathi commented on CARBONDATA-4132:


Please refer to the comment that I have added on CARBONDATA-4239, which can 
help you use MV in a better way for your scenario to get storage and 
performance benefits.

> Numer of records not matching in MVs
> ------------------------------------
>
> Key: CARBONDATA-4132
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4132
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.0.1
> Environment: Apache carbondata 2.0.1
>Reporter: suyash yadav
>Priority: Major
> Fix For: 2.0.1
>
>
> Hi Team,
> We are working on a POC where we need to insert 300k records/second into a 
> table where we have already created timeseries MVs with minute, hour, and day 
> granularity.
>  
> As per our expectation, the minute-based MV should contain 300K records until 
> the next minute's data is inserted. Likewise, the hour- and day-based MVs 
> should contain 300K records until the next hour's and next day's data arrive, 
> respectively.
>  
> But the count of records in the MV is not coming out as expected; it is 
> always more than we expect.
> The strange thing is, when we drop the MV and recreate it after inserting the 
> data into the table, the count of records comes out correct. So it is clear 
> there is no problem with the MV definition or the data.
>  
> Kindly help us resolve this issue on priority. Please find more details 
> below:
> Table definition:
> ===
> spark.sql("create table Flow_Raw_TS(export_ms bigint,exporter_ip 
> string,pkt_seq_num bigint,flow_seq_num int,src_ip string,dst_ip 
> string,protocol_id smallint,src_tos smallint,dst_tos smallint,raw_src_tos 
> smallint,raw_dst_tos smallint,src_mask smallint,dst_mask smallint,tcp_bits 
> int,src_port int,in_if_id bigint,in_if_entity_id bigint,in_if_enabled 
> boolean,dst_port int,out_if_id bigint,out_if_entity_id bigint,out_if_enabled 
> boolean,direction smallint,in_octets bigint,out_octets bigint,in_packets 
> bigint,out_packets bigint,next_hop_ip string,bgp_src_as_num 
> bigint,bgp_dst_as_num bigint,bgp_next_hop_ip string,end_ms timestamp,start_ms 
> timestamp,app_id string,app_name string,src_ip_group string,dst_ip_group 
> string,policy_qos_classification_hierarchy string,policy_qos_queue_id 
> bigint,worker_id int,day bigint ) stored as carbondata TBLPROPERTIES 
> ('local_dictionary_enable'='false')")
> MV definition:
>  
> ==
> Minute based:
> spark.sql("create materialized view Flow_Raw_TS_agg_001_min as select 
> timeseries(end_ms,'minute') as 
> end_ms,src_ip,dst_ip,app_name,in_if_id,src_tos,src_ip_group,dst_ip_group,protocol_id,bgp_src_as_num,
>  bgp_dst_as_num,policy_qos_classification_hierarchy, 
> policy_qos_queue_id,sum(in_octets) as octects, sum(in_packets) as packets, 
> sum(out_packets) as out_packets, sum(out_octets) as out_octects FROM 
> Flow_Raw_TS group by 
> timeseries(end_ms,'minute'),src_ip,dst_ip,app_name,in_if_id,src_tos,src_ip_group,
>  
> dst_ip_group,protocol_id,bgp_src_as_num,bgp_dst_as_num,policy_qos_classification_hierarchy,
>  policy_qos_queue_id").show()
> Hour based:
> val startTime = System.nanoTime
> spark.sql("create materialized view Flow_Raw_TS_agg_001_hour as select 
> timeseries(end_ms,'hour') as end_ms,app_name,sum(in_octets) as octects, 
> sum(in_packets) as packets, sum(out_packets) as out_packets, sum(out_octets) 
> as out_octects, in_if_id,src_tos,src_ip_group, 
> dst_ip_group,protocol_id,src_ip, dst_ip,bgp_src_as_num, 
> bgp_dst_as_num,policy_qos_classification_hierarchy, policy_qos_queue_id FROM 
> Flow_Raw_TS group by 
> timeseries(end_ms,'hour'),in_if_id,app_name,src_tos,src_ip_group,dst_ip_group,protocol_id,src_ip,
>  dst_ip,bgp_src_as_num,bgp_dst_as_num,policy_qos_classification_hierarchy, 
> policy_qos_queue_id").show()
> val endTime = System.nanoTime
> val elapsedSeconds = (endTime - startTime) / 1e9d





[jira] [Commented] (CARBONDATA-4239) Carbondata 2.1.1 MV : Incremental refresh : Doesnot aggregate data correctly

2021-07-15 Thread Indhumathi (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381075#comment-17381075
 ] 

Indhumathi commented on CARBONDATA-4239:


 

For loading data (LOAD using csv or .txt files, where each load carries more 
data), incremental refresh will save time and benefit load performance. If your 
case is an INSERT scenario, then the MV table with automatic refresh (which is 
enabled by default) will not benefit you in terms of either storage or 
performance.

For your scenario, I suggest you use an MV with manual refresh. You can refresh 
the MV at some interval (say, every hour, which would load 4 segments of the 
main table into a single segment of the MV); this will benefit both storage 
cost and MV performance.

To create an MV with manual refresh, use either:

create materialized view mv_name with deferred refresh as SELECT(..)

(or)

create materialized view mv_name properties('refresh_trigger_mode'='on_manual') 
as SELECT(...)
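
Putting the two together, a minimal sketch of the manual-refresh cycle (the 
table, view, and column names are illustrative placeholders; CarbonData does 
not schedule the refresh itself, so the call must come from an external job):

create materialized view sales_mv with deferred refresh as
  select country, sum(amount) as total_amount from sales group by country

-- issued later by an external scheduler, e.g. once per hour; each call loads
-- the new segments of the main table into a single segment of the MV
refresh materialized view sales_mv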

 

Refer to 
https://github.com/apache/carbondata/blob/master/docs/mv-guide.md#loading-data
