"two of then cardinality are bigger than 100milion"

==> You must optimize your cube when it has ultra-high-cardinality (UHC)
dimensions; otherwise its expansion rate may be very high.

2016-08-06 12:20 GMT+08:00 张天生 <[email protected]>:

> I also needed to adjust the threshold to a bigger value. We set the JVM
> -Xmx to 32G, but it was not fully utilized.
>
> 赵天烁 <[email protected]> wrote on Thu, Aug 4, 2016 at 11:54 AM:
>
>> I dug into the code and found that this threshold is determined in the
>> CubeStorageQuery.setThreshold method. There is already a property called
>> kylin.query.mem.budget, which defines the amount of memory the current JVM
>> may use (default 3G); dividing that by an estimated size of a single row
>> gives the threshold. So I think changing kylin.query.mem.budget to a
>> higher value will do.
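>>
>> In other words, the threshold is roughly the memory budget divided by the
>> estimated row size. A minimal sketch of that arithmetic (illustrative
>> values and class name, not Kylin's actual code):
>>
>>   public class ThresholdSketch {
>>       public static void main(String[] args) {
>>           long memBudget = 3L << 30;        // kylin.query.mem.budget default: 3 GB
>>           long rowSizeEstimate = 64 * 1024; // assumed bytes per cuboid row
>>           System.out.println(memBudget / rowSizeEstimate); // prints 49152
>>       }
>>   }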
>>
>> BTW, my cube is growing like hell... The biggest one is almost 1.7 TB
>> (keeping only one month of data); the source is at the 200 million record
>> level, with 9 dimensions, two of them have cardinality above 100 million.
>> We already have performance issues; maybe you Kylin experts could provide
>> some optimization suggestions? Waiting...
>>
>> ------------------------------
>>
>> 赵天烁
>>
>> Kevin Zhao
>>
>> *[email protected] <[email protected]>*
>>
>>
>>
>>
>> MEIZU Technology Co., Ltd.
>>
>>
>> MEIZU Tech Bldg., Technology & Innovation Coast
>>
>> Zhuhai, 519085, Guangdong, China
>>
>> meizu.com
>>
>>
>> *From:* hongbin ma <[email protected]>
>> *Date:* 2016-08-03 12:50
>> *To:* user.kylin <[email protected]>
>>
>> *Subject:* Re: Re: Scan row count exceeded threshold
>>
>> Hi Tiansheng and Tianshuo
>>
>> Kylin is basically limiting the memory footprint of each storage visit,
>> so the size of each row is inversely proportional to the number of rows
>> that can be read. The cuboid row size grows when there is a distinct
>> count measure, so the threshold of 49121 you are observing is normal.
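>>
>> For instance, assuming the default 3 GB kylin.query.mem.budget, a
>> threshold of 49121 implies an estimated cuboid row size of roughly
>> 3 * 2^30 / 49121 ≈ 64 KB, which is plausible once a distinct count
>> measure inflates each row.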
>>
>> The abnormal part is Kylin's behavior when there is a limit clause,
>> especially for cases like Tianshuo's, where the query is:
>>
>> select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by
>> FCRASH_ANALYSIS_ID,FCRASHTIME limit N
>>
>> The query does not have any filters, so we should be able to read the
>> first N rows from cuboid (FCRASH_ANALYSIS_ID,FCRASHTIME) and return the
>> result to users. Yang Li tried to fix the issue in
>> https://issues.apache.org/jira/browse/KYLIN-1787, however the approach
>> still looks a little too conservative to me. The patch in KYLIN-1787
>> does not enable the storage read limit as long as the cube has a partition
>> time column (and the query is not grouping by the partition time column),
>> because we'll need to further aggregate rows from different segments.
>> This is why 1.5.3 does not behave as Tianshuo expects.
>>
>> However, there is still room for improvement even if further aggregation
>> is required across multiple segments. For Tianshuo's case, we can ask for
>> N cuboid rows from each segment and merge them at the query server side.
>> Since the cuboid rows are already sorted within each segment, the result
>> is guaranteed to be correct; a sketch of the idea follows.
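>>
>> A minimal sketch of that merge-then-limit idea (not Kylin's actual code;
>> it ignores the further aggregation of identical keys across segments):
>>
>>   import java.util.ArrayList;
>>   import java.util.Arrays;
>>   import java.util.List;
>>   import java.util.PriorityQueue;
>>
>>   public class SegmentLimitMergeSketch {
>>       // Each segment contributes its first n rows (already sorted within
>>       // that segment); merging them yields the correct global first n rows.
>>       static <T extends Comparable<T>> List<T> mergeFirstN(List<List<T>> segments, int n) {
>>           PriorityQueue<T> heap = new PriorityQueue<>();
>>           for (List<T> segment : segments) {
>>               heap.addAll(segment.subList(0, Math.min(n, segment.size())));
>>           }
>>           List<T> result = new ArrayList<>();
>>           while (!heap.isEmpty() && result.size() < n) {
>>               result.add(heap.poll());
>>           }
>>           return result;
>>       }
>>
>>       public static void main(String[] args) {
>>           List<List<Integer>> segments = Arrays.asList(
>>                   Arrays.asList(1, 4, 7, 10),  // segment 1, sorted
>>                   Arrays.asList(2, 3, 9),      // segment 2, sorted
>>                   Arrays.asList(5, 6, 8));     // segment 3, sorted
>>           System.out.println(mergeFirstN(segments, 4)); // [1, 2, 3, 4]
>>       }
>>   }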
>>
>> However, it's a different story if the query contains filters, like in
>> Tiansheng's case. A filter on a dimension may prevent the limit clause
>> from being pushed down, especially when that dimension is not the first
>> dimension in the row key. Below is Tiansheng's case:
>>
>> Error while executing SQL "select 
>> "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS)
>> as imps,SUM(CLKS) as clks,SUM(CONS) as cons, 
>> (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE))
>> as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN
>> where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05'
>> group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc
>> limit 10 offset 0": Scan row count exceeded threshold: 49121, please add
>> filter condition to narrow down backend scan range, like where clause.
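>>
>> To illustrate why the filter blocks the push-down, assume (hypothetically;
>> the real order depends on the cube's rowkey design) a row key laid out as:
>>
>>   [DATE, ADGROUPID, CAMPAIGNID, COMPANYID, APPID]
>>
>> The filter COMPANYID='296' is on the fourth column, not a prefix of the
>> row key, so the matching rows are scattered across the scan range and the
>> storage engine cannot simply stop after the first 10 rows it reads.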
>>
>> I have opened a JIRA to fix this.
>>
>> On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <[email protected]> wrote:
>>
>>> Hi ShaoFeng:
>>>
>>> My sql is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS)
>>> as imps,SUM(CLKS) as clks,SUM(CONS) as cons, 
>>> (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE))
>>> as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN
>>> where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05'
>>> group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc
>>> limit 10 offset 0". I analyzed the critical reason is "COUNT(DISTINCT
>>> CLK_DEVICEID) as clk_uv", when i delete this measure, it queried fine even
>>> for a long time span query.
>>>
>>> 赵天烁 <[email protected]> wrote on Wed, Aug 3, 2016 at 10:16 AM:
>>>
>>>> OK, I'll try to make it configurable in kylin.properties.
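>>>>
>>>> Perhaps something like this in kylin.properties (a hypothetical property
>>>> name; it does not exist yet as of 1.5.3):
>>>>
>>>>   # maximum row count one query may scan from storage (hypothetical)
>>>>   kylin.query.scan.threshold=10000000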
>>>>
>>>> ------------------------------
>>>>
>>>> 赵天烁
>>>>
>>>> Kevin Zhao
>>>>
>>>> *[email protected] <[email protected]>*
>>>>
>>>>
>>>>
>>>>
>>>> MEIZU Technology Co., Ltd.
>>>>
>>>>
>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>
>>>> Zhuhai, 519085, Guangdong, China
>>>>
>>>> meizu.com
>>>>
>>>>
>>>> *From:* ShaoFeng Shi <[email protected]>
>>>> *Date:* 2016-08-02 22:37
>>>> *To:* user <[email protected]>
>>>> *Subject:* Re: Re: Scan row count exceeded threshold
>>>>
>>>> This is a new change introduced in 1.5.3, JIRA is:
>>>> https://issues.apache.org/jira/browse/KYLIN-1787
>>>>
>>>> In short, it tries to avoid scanning too many rows (10 million, as the
>>>> error message shows) in one query. But the threshold wasn't externalized
>>>> as a parameter. @Tianshuo, would you mind contributing a patch?
>>>>
>>>> Besides, the query has "limit 1", but Kylin wasn't smart enough to take
>>>> advantage of it.
>>>>
>>>> @Tiansheng, what does the SQL query look like in your case?
>>>>
>>>> 2016-08-02 21:14 GMT+08:00 张天生 <[email protected]>:
>>>>
>>>>> I met the same issue. In version 1.5.2.1 it worked fine, but after
>>>>> switching to 1.5.3 it often reports this error.
>>>>>
>>>>> 赵天烁 <[email protected]> wrote on Tue, Aug 2, 2016 at 11:57 AM:
>>>>>
>>>>>> Yes, it is 1.5.3. After limiting the date range to a week, it still
>>>>>> takes about 7 seconds to respond. Is there any way to get the query
>>>>>> execution details, just like the diagnosis log?
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> 赵天烁
>>>>>>
>>>>>> Kevin Zhao
>>>>>>
>>>>>> *[email protected] <[email protected]>*
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> MEIZU Technology Co., Ltd.
>>>>>>
>>>>>>
>>>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>>>
>>>>>> Zhuhai, 519085, Guangdong, China
>>>>>>
>>>>>> meizu.com
>>>>>>
>>>>>>
>>>>>> *From:* ShaoFeng Shi <[email protected]>
>>>>>> *Date:* 2016-08-01 20:20
>>>>>> *To:* user <[email protected]>
>>>>>> *Subject:* Re: Scan row count exceeded threshold
>>>>>>
>>>>>> Are you on 1.5.3? If not, I suggest upgrading and retrying. Besides,
>>>>>> you can add some "where" conditions to reduce the records to scan, as
>>>>>> the error message suggests.
>>>>>>
>>>>>> 2016-08-01 15:37 GMT+08:00 赵天烁 <[email protected]>:
>>>>>>
>>>>>>> Recently I got the following error while executing a query on a cube
>>>>>>> which is not that big (about 400 MB, 20 million records):
>>>>>>> ==================
>>>>>>> Error while executing SQL "select FCRASHTIME,count(1) from
>>>>>>> UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME 
>>>>>>> limit
>>>>>>> 1": Scan row count exceeded threshold: 10000000, please add filter
>>>>>>> condition to narrow down backend scan range, like where clause.
>>>>>>>
>>>>>>> I guess what it scanned was the intermediate result, but the query
>>>>>>> doesn't have any ORDER BY, and the result count is limited to just 1,
>>>>>>> so it should only need to scan until it finds any record with those
>>>>>>> two dimensions, and voilà. Waiting for a Kylin expert to give me some
>>>>>>> details on that.
>>>>>>> BTW, how do I configure that threshold? Is there any property in
>>>>>>> kylin.properties?
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> 赵天烁
>>>>>>>
>>>>>>> Kevin Zhao
>>>>>>>
>>>>>>> *[email protected] <[email protected]>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> MEIZU Technology Co., Ltd.
>>>>>>>
>>>>>>>
>>>>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>>>>
>>>>>>> Zhuhai, 519085, Guangdong, China
>>>>>>>
>>>>>>> meizu.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>>
>>>>>> Shaofeng Shi
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi
>>>>
>>>>
>>
>>
>> --
>> Regards,
>>
>> *Bin Mahone | 马洪宾*
>>
>>


-- 
Best regards,

Shaofeng Shi
