Already solved: I changed those two UHC date dimensions into yyMMdd-formatted
integers (trimming the HHmmss part), which dramatically reduces their
cardinality, and then put the mandatory filter dimension at the beginning of
the row key. Thanks.
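
For reference, a minimal sketch of the kind of encoding described above
(illustrative only; the actual change was presumably made in the model/ETL,
and the class and method names here are made up):

  import java.time.LocalDateTime;
  import java.time.format.DateTimeFormatter;

  public class DateDimEncoding {
      // Encode a full timestamp as a yyMMdd integer, dropping the HHmmss part,
      // so one month of data yields ~30 distinct values instead of millions of
      // second-level timestamps.
      private static final DateTimeFormatter YYMMDD = DateTimeFormatter.ofPattern("yyMMdd");

      static int toYyMmDd(LocalDateTime ts) {
          return Integer.parseInt(ts.format(YYMMDD));
      }

      public static void main(String[] args) {
          System.out.println(toYyMmDd(LocalDateTime.of(2016, 8, 8, 11, 23, 45))); // 160808
      }
  }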

________________________________
赵天烁
Kevin Zhao
[email protected]<mailto:[email protected]>

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com

From: ShaoFeng Shi <[email protected]>
Date: 2016-08-08 11:23
To: user <[email protected]>
Subject: Re: Re: Scan row count exceeded threshold
"two of then cardinality are bigger than 100milion"

==> You must do optimization for your cube when having UHC dimensions; 
otherwise its expansion rate may be very high.
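
(A rough illustration with hypothetical numbers: if the 200-million-row
source flat table occupies about 20 GB, a 1.7 TB cube implies an expansion
rate of roughly 1700 GB / 20 GB ≈ 85x, which is the kind of blow-up UHC
dimensions tend to cause and what the yyMMdd encoding above is meant to
avoid.)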

2016-08-06 12:20 GMT+08:00 张天生 <[email protected]>:
I also need to adjust the threshold to a bigger value. We set the JVM -Xmx to
32G, but it was not fully used.

赵天烁 <[email protected]<mailto:[email protected]>>于2016年8月4日周四 
上午11:54写道:
I dug into the code and found that this threshold is determined in
CubeStorageQuery's setThreshold method. There is already a property called
kylin.query.mem.budget, which defines how much memory the current JVM may use
for a query (default 3G); dividing that budget by an estimated size of a
single row gives the threshold. So I think changing kylin.query.mem.budget to
a higher value will do.
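
A minimal sketch of that derivation as I read it (illustrative only, not the
actual CubeStorageQuery code; the row-size figures are made up):

  // threshold ≈ memory budget / estimated bytes per returned row
  public class ScanThresholdSketch {
      static final long MEM_BUDGET = 3L * 1024 * 1024 * 1024; // ~3G, per kylin.query.mem.budget

      static long threshold(long memBudgetBytes, long estimatedRowSizeBytes) {
          return memBudgetBytes / Math.max(1, estimatedRowSizeBytes);
      }

      public static void main(String[] args) {
          // a plain aggregation row of a few hundred bytes allows millions of rows...
          System.out.println(threshold(MEM_BUDGET, 300));       // 10737418
          // ...while a row carrying a large count-distinct counter allows far fewer
          System.out.println(threshold(MEM_BUDGET, 64 * 1024)); // 49152
      }
  }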

BTW, my cube is growing like hell... the biggest one is almost 1.7TB (and it
only keeps one month of data). The source is around 200 million records with
9 dimensions, two of which have cardinality greater than 100 million. We
already have performance issues; maybe you Kylin experts could provide some
optimization suggestions? Waiting...

________________________________
赵天烁
Kevin Zhao
[email protected]<mailto:[email protected]>

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com

From: hongbin ma <[email protected]>
Date: 2016-08-03 12:50
To: user.kylin <[email protected]>
Subject: Re: Re: Scan row count exceeded threshold
Hi Tiansheng and Tianshuo

Kylin is basically limiting the memory footprint of each storage visit, so
the row-count threshold is inversely proportional to the size of each row.
The cuboid row size grows when there is a count-distinct measure, which is
why you are observing a threshold of 49121; this is normal.
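
(For a rough sense of scale, with illustrative figures: an approximate
count-distinct counter can occupy tens of kilobytes per row, versus a few
bytes for a plain SUM, so under a fixed ~3G budget the threshold drops from
millions of rows to the tens of thousands you are seeing.)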

The abnormal part is Kylin's behavior when there is a limit clause,
especially for cases like Tianshuo's, where the query is:

select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by 
FCRASH_ANALYSIS_ID,FCRASHTIME limit N

The query does not have any filters, so we should be able to read the first N
rows from the cuboid (FCRASH_ANALYSIS_ID, FCRASHTIME) and return the result
to users. Yang Li tried to fix the issue in
https://issues.apache.org/jira/browse/KYLIN-1787; however, the approach still
looks a little too conservative to me. The patch in KYLIN-1787 does not
enable the storage read limit as long as the cube has a partition time column
(and the query is not grouping by that partition time column), because rows
from different segments still need to be further aggregated. This is why
1.5.3 does not behave as Tianshuo expects.

However, there is still room for improvement even when further aggregation is
required across multiple segments. For Tianshuo's case, we can ask for N
cuboid rows from each segment and merge them on the query server side. Since
the cuboid rows are already sorted within each segment, the result is
guaranteed to be correct.
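
A rough sketch of that idea (purely illustrative, not Kylin code): each
segment returns its first N cuboid rows in key order, and the query server
merges the sorted lists and keeps the first N overall (rows with the same key
coming from different segments would still be aggregated afterwards).

  import java.util.*;

  public class SegmentLimitMerge {
      // No row beyond the first n of any segment can be among the global
      // first n, so merging the per-segment prefixes is sufficient.
      static List<String> mergeFirstN(List<List<String>> perSegmentSortedRows, int n) {
          PriorityQueue<String> heap = new PriorityQueue<>();
          for (List<String> segment : perSegmentSortedRows) {
              heap.addAll(segment); // each segment contributes at most n rows
          }
          List<String> result = new ArrayList<>();
          while (!heap.isEmpty() && result.size() < n) {
              result.add(heap.poll());
          }
          return result;
      }

      public static void main(String[] args) {
          List<String> seg1 = Arrays.asList("a", "c", "e");
          List<String> seg2 = Arrays.asList("b", "d", "f");
          System.out.println(mergeFirstN(Arrays.asList(seg1, seg2), 3)); // [a, b, c]
      }
  }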

However, it is a different story if the query contains filters, as in
Tiansheng's case. A filter on a dimension may prevent the limit clause from
being pushed down, especially when that dimension is not the first dimension
in the row key (a small sketch of why the position matters follows the error
message below). Here is Tiansheng's case:

Error while executing SQL "select 
"DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as 
clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as 
cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where 
COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by 
"DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 
0": Scan row count exceeded threshold: 49121, please add filter condition to 
narrow down backend scan range, like where clause.
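
To illustrate why the row-key position matters (illustrative only; the
dimension order and key layout here are made up, not the actual cube
definition): cuboid rows are stored sorted by a row key that concatenates the
dimensions in row-key order, so a filter on the leading dimension selects one
contiguous key range, while a filter only on a later dimension matches keys
scattered across the whole cuboid and may force a full scan.

  import java.util.*;

  public class RowKeyPrefixSketch {
      public static void main(String[] args) {
          // toy cuboid keyed as DATE | COMPANYID | ADGROUPID
          TreeMap<String, Long> cuboid = new TreeMap<>();
          cuboid.put("160101|295|a1", 10L);
          cuboid.put("160101|296|a2", 20L);
          cuboid.put("160102|296|a3", 30L);
          cuboid.put("160103|295|a4", 40L);

          // filter on the leading dimension (DATE = 160101): one contiguous range scan
          System.out.println(cuboid.subMap("160101|", "160101|~"));

          // filter only on a later dimension (COMPANYID = 296): every row must be checked
          cuboid.forEach((k, v) -> {
              if (k.split("\\|")[1].equals("296")) System.out.println(k + "=" + v);
          });
      }
  }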

I have opened a JIRA to fix this.

On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <[email protected]> wrote:
Hi ShaoFeng:

My SQL is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as
imps,SUM(CLKS) as clks,SUM(CONS) as cons,
(SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT
CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and
"DATE">='2016-01-01' and "DATE"<'2016-01-05' group by
"DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10
offset 0". I found the critical factor is "COUNT(DISTINCT CLK_DEVICEID) as
clk_uv"; when I remove this measure, the query runs fine even over a long
time span.

赵天烁 <[email protected]<mailto:[email protected]>>于2016年8月3日周三 
上午10:16写道:
OK, I'll try to make it configurable in kylin.properties.
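
In the meantime, raising the existing budget already moves the threshold; for
example, in kylin.properties (value in bytes, matching the ~3G default
discussed above; the name of a new, dedicated threshold property is not
settled in this thread):

  # per-query memory budget the scan threshold is derived from (default ~3G)
  kylin.query.mem.budget=6442450944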

________________________________
赵天烁
Kevin Zhao
[email protected]<mailto:[email protected]>

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com

From: ShaoFeng Shi <[email protected]>
Date: 2016-08-02 22:37
To: user <[email protected]>
Subject: Re: Re: Scan row count exceeded threshold
This is a new change introduced in 1.5.3; the JIRA is:
https://issues.apache.org/jira/browse/KYLIN-1787

In short, it tries to avoid scanning too many rows (1 million) in one query.
But the threshold wasn't externalized as a parameter. @Tianshuo, would you
mind contributing a patch?

Besides, the query has "limit 1", but Kylin isn't yet smart enough to take
advantage of that.

@Tiansheng, what does the SQL query look like in your case?

2016-08-02 21:14 GMT+08:00 张天生 <[email protected]>:
I met the same issue. In version 1.5.2.1 it worked fine, but after switching
to 1.5.3 it often reports this error.

赵天烁 <[email protected]<mailto:[email protected]>>于2016年8月2日周二 
上午11:57写道:
Yes, it is 1.5.3. After limiting the date range to a week, it still takes
about 7 seconds to respond. Is there any way to get the query execution
details, similar to the diagnosis log?

________________________________
赵天烁
Kevin Zhao
[email protected]<mailto:[email protected]>

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com

From: ShaoFeng Shi <[email protected]>
Date: 2016-08-01 20:20
To: user <[email protected]>
Subject: Re: Scan row count exceeded threshold
Are you on 1.5.3? If not, I suggest upgrading and then retrying. Besides, you
can add some "where" conditions to reduce the records to scan, as the error
message suggests.

2016-08-01 15:37 GMT+08:00 赵天烁 <[email protected]>:
Recently I got the following error while executing a query on a cube which is
not that big (about 400MB, 20 million records):
==================
Error while executing SQL "select FCRASHTIME,count(1) from 
UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME limit 1": 
Scan row count exceeded threshold: 10000000, please add filter condition to 
narrow down backend scan range, like where clause.

I guess what it scanned was the intermediate result, but the query doesn't
have any order by, and the result count is limited to just 1, so it should
only need to scan until it finds any record with those two dimensions, and
voila. Waiting for a Kylin expert to give me some details on that.
BTW, how can I configure that threshold? Is there any property in
kylin.properties?
________________________________
赵天烁
Kevin Zhao
[email protected]<mailto:[email protected]>

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com



