[ 
https://issues.apache.org/jira/browse/CARBONDATA-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392172#comment-17392172
 ] 

Nihal kumar ojha commented on CARBONDATA-4177:
----------------------------------------------

Hi,

    Currently, CarbonData doesn't support limit pushdown for either the main 
table or the MV table; it is supported only for secondary indexes (SI). Because 
of this, even when we select rows with LIMIT 10, Carbon first fills a vector of 
4096 rows and then sends it to Spark. Spark then applies the limit and returns 
the result. So even for 10 rows we fetch 4096 rows, and that is what takes the 
time. In the future we may support limit pushdown for the main table and MVs, 
so that only 10 rows are fetched instead of 4096; your query would then 
benefit.
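The batch-then-limit behaviour described above can be sketched in plain Python. This is an illustration only, not CarbonData code; the 4096-row batch size is the figure quoted in this comment, and the reader/limit split stands in for the Carbon-scan-then-Spark-limit boundary:

```python
BATCH_SIZE = 4096  # vector batch size the reader fills before the engine above can apply LIMIT (per this comment)

def scan_with_batches(rows, limit, batch_size=BATCH_SIZE):
    """Simulate a reader with no limit pushdown: it must materialize a
    full batch before the caller can apply LIMIT on top of it."""
    it = iter(rows)
    result = []
    fetched = 0  # rows actually materialized by the reader
    while len(result) < limit:
        batch = []
        for _ in range(batch_size):
            try:
                batch.append(next(it))
            except StopIteration:
                break
        if not batch:
            break
        fetched += len(batch)
        # LIMIT is applied only after the whole batch was fetched
        result.extend(batch[: limit - len(result)])
    return result, fetched

rows = range(1_000_000)
result, fetched = scan_with_batches(rows, limit=10)
# Even though only 10 rows are returned, a full 4096-row batch was materialized.
```

With limit pushdown, the reader itself would stop after 10 rows instead of filling the 4096-row vector first, which is the improvement the comment describes as possible future work.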

> performance issue with Query
> ----------------------------
>
>                 Key: CARBONDATA-4177
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4177
>             Project: CarbonData
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.0.1
>            Reporter: suyash yadav
>            Priority: Major
>             Fix For: 2.0.1
>
>
> Hi Team,
> We are working on a POC using CarbonData 2.0.1 and have come across a 
> performance issue. Below are the details:
> 1.Table creation query:
> ======================
> spark.sql("create table Flow_TS_1day_stats_16042021(start_time 
> timestamp,end_time timestamp,source_ip_address string,destintion_ip_address 
> string,appname string,protocol_name string,source_tos smallint,in_interface 
> smallint,out_interface smallint,src_as bigint,dst_as bigint,source_mask 
> smallint,destination_mask smallint, dst_tos smallint,input_pkt 
> bigint,input_byt bigint,output_pkt bigint,output_byt bigint,source_port 
> int,destination_port int) stored as carbondata TBLPROPERTIES 
> ('local_dictionary_enable'='false')").show()
> Two MVs exist on this table. Below are the queries for those MVs:
> ===========================================================
> 1. Network MV
> ================
> spark.sql("create materialized view 
> Network_Level_Agg_10min_MV_with_ip_15042021_again as select 
> timeseries(end_time,'ten_minute') as end_time,source_ip_address, 
> destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,sum(input_pkt)
>  as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as 
> output_pkt,sum(output_byt) as output_byt from 
> Flow_TS_1day_stats_15042021_again group by 
> timeseries(end_time,'ten_minute'),source_ip_address,destintion_ip_address, 
> appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as 
> order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)
> 2. Interface MV:
> ==================
> spark.sql("create materialized view Interface_Level_Agg_10min_MV_16042021 as 
> select timeseries(end_time,'ten_minute') as end_time, 
> source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface,sum(input_pkt)
>  as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as 
> output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_16042021 
> group by timeseries(end_time,'ten_minute'), 
> source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface
>  order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)
> +*We are firing the below query for fetching data, which is taking almost 10 
> seconds:*+
> *Select appname,input_byt from Flow_TS_1day_stats_16042021 where end_time >= 
> '2021-03-02 00:00:00' and end_time < '2021-03-03 00:00:00' group by 
> appname,input_byt order by input_byt desc LIMIT 10*
>  
> The above query fetches only 10 records, but it takes almost 10 seconds to 
> complete.
> Could you please review the above schemas and help us understand how we can 
> improve the query execution time? We expect the response to be in 
> sub-seconds.
> Table Name: RAW Table (1 Day - 300K/Sec); #Records: 25920000000
> Regards,
> Suyash Yadav



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
