Cool, you get it :-)

2016-08-08 11:30 GMT+08:00 赵天烁 <[email protected]>:

already solved by changing those two UHC date dimensions into yyMMdd-formatted integers (trimming the HHmmss part), which dramatically reduces the cardinality, and then putting the mandatory filter dimension at the beginning of the row key. thx

------------------------------

赵天烁

Kevin Zhao

[email protected]

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com
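(For illustration only: a minimal sketch of the conversion described above, truncating a second-precision timestamp to a yyMMdd integer. The class name, method name, and input format are hypothetical and not part of the cube model discussed in this thread.)

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DateDimTrim {
    // Assumed input format of the original second-precision timestamp column.
    private static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    // Target precision: day only, e.g. "2016-08-08 11:30:45" -> 160808.
    private static final DateTimeFormatter OUT = DateTimeFormatter.ofPattern("yyMMdd");

    // Trim HHmmss and keep a small yyMMdd integer as the dimension value.
    static int toYyMMdd(String timestamp) {
        return Integer.parseInt(LocalDateTime.parse(timestamp, IN).format(OUT));
    }

    public static void main(String[] args) {
        System.out.println(toYyMMdd("2016-08-08 11:30:45")); // prints 160808
    }
}

Collapsing second-level timestamps to day precision leaves at most a few hundred distinct values per year instead of tens of millions, which is what makes the row-key ordering and mandatory-filter change effective.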
*From:* ShaoFeng Shi <[email protected]>
*Date:* 2016-08-08 11:23
*To:* user <[email protected]>
*Subject:* Re: Re: Scan row count exceeded threshold

"two of them have cardinality bigger than 100 million"

==> You must optimize your cube when it has UHC dimensions; otherwise its expansion rate may be very high.

2016-08-06 12:20 GMT+08:00 张天生 <[email protected]>:

I also need to adjust the threshold to a bigger value. We set the JVM -Xmx to 32G, but it was far from fully used.

On Thursday, August 4, 2016 at 11:54, 赵天烁 <[email protected]> wrote:

I dug into the code and found that this threshold is determined in the CubeStorageQuery setThreshold method. There is already a property called kylin.query.mem.budget, which defines how much memory the current JVM may use (default 3G); dividing that budget by an estimated size of a single row gives the threshold. So I think changing kylin.query.mem.budget to a higher value will do.

btw my cube is growing like hell..... the biggest one is almost 1.7TB (keeping only one month of data), the source is around 200 million records, with 9 dimensions, two of which have cardinality above 100 million. We already have performance issues; maybe you Kylin experts could provide some optimization suggestions? waiting......

Kevin Zhao
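(A rough model of the arithmetic described above, not the actual Kylin source; the per-row size figures below are hypothetical and only chosen to show the order of magnitude.)

public class ScanThresholdSketch {
    // kylin.query.mem.budget default mentioned above: 3 GB.
    static final long MEM_BUDGET_BYTES = 3L * 1024 * 1024 * 1024;

    // threshold = memory budget / estimated size of one returned cuboid row.
    // A distinct-count measure makes each row carry a large sketch, so the
    // estimated row size grows and the threshold shrinks accordingly.
    static long threshold(long memBudgetBytes, long estimatedRowSizeBytes) {
        return memBudgetBytes / estimatedRowSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(threshold(MEM_BUDGET_BYTES, 300));       // small rows      -> ~10.7 million
        System.out.println(threshold(MEM_BUDGET_BYTES, 64 * 1024)); // 64 KB rows      -> 49152, i.e. tens of thousands
    }
}

Under this model it is unsurprising that the two thresholds reported in this thread differ so much (10,000,000 for the plain count(1) query versus 49,121 for the COUNT(DISTINCT ...) query), and raising kylin.query.mem.budget would raise the threshold proportionally.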
*From:* hongbin ma <[email protected]>
*Date:* 2016-08-03 12:50
*To:* user.kylin <[email protected]>
*Subject:* Re: Re: Scan row count exceeded threshold

Hi Tiansheng and Tianshuo,

Kylin is basically limiting the memory footprint of each storage visit, so the number of rows it will read is inversely proportional to the size of each row. The cuboid row size grows when there is a distinct-count measure, so the threshold of 49121 you are observing is normal.

The abnormal part is Kylin's behavior when there is a limit clause, especially for cases like Tianshuo's, where the query is:

select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME limit N

The query does not have any filters, so we should be able to read the first N rows from cuboid (FCRASH_ANALYSIS_ID, FCRASHTIME) and return the result to the user. Yang Li tried to fix this in https://issues.apache.org/jira/browse/KYLIN-1787, but the approach is still a little too conservative for my taste. The KYLIN-1787 patch does not enable the storage read limit as long as the cube has a partition time column (and the query is not grouping by that partition time column), because rows from different segments still need to be further aggregated. This is why 1.5.3 does not behave as Tianshuo expects.

However, there is still room for improvement even when further aggregation is required across multiple segments. For Tianshuo's case we can ask for N cuboid rows from each segment and merge them on the query server side. Since the cuboid rows are sorted within each segment, the result is guaranteed to be correct.

It is a different story when the query contains filters, as in Tiansheng's case. A filter on a dimension may prevent the limit clause from being pushed down, especially when that dimension is not the first dimension in the row key. Below is Tiansheng's case:

Error while executing SQL "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 0": Scan row count exceeded threshold: 49121, please add filter condition to narrow down backend scan range, like where clause.

I have opened a JIRA to fix this.

--
Regards,

Bin Mahone | 马洪宾
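(A minimal sketch of the per-segment merge idea above, not Kylin code: it takes at most N sorted rows from each segment, pools them, and keeps the global first N. It deliberately ignores the extra step of aggregating measures when the same dimension key arrives from different segments.)

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SegmentLimitMerge {

    // "limit N" across several segments: take at most N rows from each segment
    // (each segment already returns its rows in sorted order, so those are that
    // segment's N smallest), pool them, re-sort, and keep the global first N.
    // No segment ever needs to ship more than N rows to the query server.
    static <R> List<R> limitAcrossSegments(List<List<R>> sortedSegments, Comparator<R> order, int n) {
        List<R> pooled = new ArrayList<>();
        for (List<R> segment : sortedSegments) {
            pooled.addAll(segment.subList(0, Math.min(n, segment.size())));
        }
        pooled.sort(order);
        return pooled.subList(0, Math.min(n, pooled.size()));
    }

    public static void main(String[] args) {
        // Stand-in cuboid rows: each inner list plays the role of one segment's sorted scan.
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("a", "c", "f"),
                Arrays.asList("b", "d"),
                Arrays.asList("a", "e"));
        System.out.println(limitAcrossSegments(segments, Comparator.naturalOrder(), 4)); // [a, a, b, c]
    }
}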
On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <[email protected]> wrote:

Hi ShaoFeng:

My SQL is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 0". I analyzed it and the critical factor is "COUNT(DISTINCT CLK_DEVICEID) as clk_uv": when I remove this measure, the query runs fine even over a long time span.

On Wednesday, August 3, 2016 at 10:16, 赵天烁 <[email protected]> wrote:

ok, I'll try to make it configurable in kylin.properties

Kevin Zhao

*From:* ShaoFeng Shi <[email protected]>
*Date:* 2016-08-02 22:37
*To:* user <[email protected]>
*Subject:* Re: Re: Scan row count exceeded threshold

This is a new change introduced in 1.5.3; the JIRA is: https://issues.apache.org/jira/browse/KYLIN-1787

In short, it tries to avoid scanning too many rows (1 million) in one query. But the threshold wasn't externalized as a parameter. @Tianshuo, would you mind contributing a patch?

Besides, the query has "limit 1" but Kylin wasn't smart enough to take advantage of that.

@Tiansheng, what does the SQL query look like in your case?

2016-08-02 21:14 GMT+08:00 张天生 <[email protected]>:

I hit the same problem. In version 1.5.2.1 it worked fine, but after switching to 1.5.3 it often reports this error.

On Tuesday, August 2, 2016 at 11:57, 赵天烁 <[email protected]> wrote:

yes, it is 1.5.3. After limiting the date range to a week it still takes about 7 seconds to respond; is there any way to get the query execution details, just like the diagnosis log?

Kevin Zhao

*From:* ShaoFeng Shi <[email protected]>
*Date:* 2016-08-01 20:20
*To:* user <[email protected]>
*Subject:* Re: Scan row count exceeded threshold

Are you on 1.5.3? If not, I suggest upgrading and then retrying. Besides, you can add some "where" conditions to reduce the records to scan, as the error message suggests.

2016-08-01 15:37 GMT+08:00 赵天烁 <[email protected]>:

recently I got the following error while executing a query on a cube which is not that big (about 400MB, 20 million records):
==================
Error while executing SQL "select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME limit 1": Scan row count exceeded threshold: 10000000, please add filter condition to narrow down backend scan range, like where clause.

I guess what it scanned was the intermediate result, but there isn't any order by, and the result count is limited to just 1, so it should only have to scan until it finds any record for those two dimensions, and voilà. Waiting for a Kylin expert to give me some details on that.
btw how do I configure that threshold, is there any property in kylin.properties?

Kevin Zhao


--
Best regards,

Shaofeng Shi
