[ 
https://issues.apache.org/jira/browse/KYLIN-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405298#comment-15405298
 ] 

hongbin ma commented on KYLIN-1936:
-----------------------------------

Kylin basically limits the footprint of each storage visit, so the number of 
rows that may be read is inversely proportional to the size of each row. Cuboid 
rows grow larger when there is a distinct count measure, which is why you are 
observing a threshold of 49121; this is normal.
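To make the inverse relationship concrete, here is a minimal sketch of the idea. The byte budget, method name, and row sizes below are illustrative assumptions, not Kylin's actual implementation or configuration:

```java
public class ScanThreshold {
    // Hypothetical model: cap the total bytes scanned per storage visit,
    // so the allowed row count shrinks as the per-row size grows (e.g. a
    // distinct count measure inflates each cuboid row with its counter).
    static long maxRows(long byteBudget, long bytesPerRow) {
        return byteBudget / bytesPerRow;
    }

    public static void main(String[] args) {
        long budget = 64L * 1024 * 1024;          // illustrative 64 MB scan budget
        System.out.println(maxRows(budget, 64));  // small rows -> many rows allowed
        System.out.println(maxRows(budget, 1300));// fat rows -> tens of thousands only
    }
}
```

With a fixed budget, doubling the row size halves the row threshold, which matches the behavior described above.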

The abnormal part is Kylin's behavior when there is a limit clause, especially 
for cases like Tianshuo's, where the query is:

select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by 
FCRASH_ANALYSIS_ID,FCRASHTIME limit N

The query does not have any filters, so we should be able to read the first N 
rows from cuboid (FCRASH_ANALYSIS_ID,FCRASHTIME) and return the result to 
users. Yang Li tried to fix the issue in 
https://issues.apache.org/jira/browse/KYLIN-1787, but that approach still looks 
a little too conservative to me. The KYLIN-1787 patch does not enable the 
storage read limit as long as the cube has a partition time column (and the 
query is not grouping by it), because rows from different segments still need 
to be further aggregated. This is why 1.5.3 does not behave as Tianshuo 
expects.

However, there is still room for improvement even when further aggregation is 
required across multiple segments. For Tianshuo's case, we can ask for N cuboid 
rows from each segment and merge them on the query server side. Since the 
cuboid rows within each segment are already sorted, the merged result is 
guaranteed to be correct.
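The per-segment merge described above can be sketched as a standard k-way merge. This is an illustrative sketch, not Kylin's actual merge code; rows are modeled as plain strings, and rows with identical keys from different segments would still need to be aggregated afterwards:

```java
import java.util.*;

public class SegmentMerge {
    // k-way merge: each segment contributes its first n rows, already
    // sorted by row key; a min-heap yields a globally sorted stream,
    // of which only the first n rows are kept.
    static List<String> mergeTopN(List<List<String>> segments, int n) {
        // heap entries are [segmentIndex, positionInSegment],
        // ordered by the row key they currently point at
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing(e -> segments.get(e[0]).get(e[1])));
        for (int s = 0; s < segments.size(); s++)
            if (!segments.get(s).isEmpty()) heap.add(new int[]{s, 0});

        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < n) {
            int[] top = heap.poll();
            result.add(segments.get(top[0]).get(top[1]));
            if (top[1] + 1 < segments.get(top[0]).size())
                heap.add(new int[]{top[0], top[1] + 1});
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> seg1 = Arrays.asList("a", "c", "e");
        List<String> seg2 = Arrays.asList("b", "c", "d");
        // first 4 rows of the merged, sorted stream: [a, b, c, c]
        System.out.println(mergeTopN(Arrays.asList(seg1, seg2), 4));
    }
}
```

Because each segment is sorted, pulling only N rows per segment is sufficient: no row beyond a segment's first N can appear in the global top N.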

However, it is a different story if the query contains filters, as in 
Tiansheng's case. A filter on a dimension may prevent the limit clause from 
being pushed down, especially when that dimension is not the first dimension in 
the row key. Below is Tiansheng's case:

Error while executing SQL "select 
"DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as 
clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as 
cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where 
COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by 
"DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 
0": Scan row count exceeded threshold: 49121, please add filter condition to 
narrow down backend scan range, like where clause.
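The intuition for why such filters block the limit push-down can be sketched as a prefix check. This is a simplified illustration of the idea, not Kylin's actual rule or code; the method and names are hypothetical:

```java
import java.util.*;

public class LimitPushDown {
    // Simplified intuition: a LIMIT can translate into an early scan stop
    // only when the filtered dimensions form a prefix of the cuboid row key,
    // so that all matching rows are contiguous in the scan range. A filter
    // on a non-leading dimension forces scanning and post-filtering many
    // rows before N survivors are found.
    static boolean canPushDownLimit(List<String> rowKeyOrder, Set<String> filteredDims) {
        for (int i = 0; i < filteredDims.size(); i++)
            if (!filteredDims.contains(rowKeyOrder.get(i))) return false;
        return true;
    }

    public static void main(String[] args) {
        List<String> rowKey = Arrays.asList("DATE", "COMPANYID", "ADGROUPID");
        // filter on the leading dimension: matching rows are contiguous
        System.out.println(canPushDownLimit(rowKey, Set.of("DATE")));
        // filter only on a non-leading dimension: matches are scattered
        System.out.println(canPushDownLimit(rowKey, Set.of("COMPANYID")));
    }
}
```

In Tiansheng's query the ORDER BY on an aggregated measure is a further obstacle: even with an early scan stop, ranking by SUM(IMPS) requires seeing all matching groups before the top 10 are known.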

> Improve enable limit logic (exactAggregation is too strict)
> -----------------------------------------------------------
>
>                 Key: KYLIN-1936
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1936
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: hongbin ma
>            Assignee: hongbin ma
>
> from zhaotians...@meizu.com:
> recently I got the following error while executing a query on a cube which is 
> not that big (about 400 MB, 20 million records)
> ==================
> Error while executing SQL "select FCRASHTIME,count(1) from 
> UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME limit 
> 1": Scan row count exceeded threshold: 10000000, please add filter condition 
> to narrow down backend scan range, like where clause.
> I guess what it scanned was the intermediate result, but the query doesn't 
> have any order by, and the result count is limited to just 1, so it could 
> simply scan until it finds any record with those two dimensions and voila.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)