Sorry, The pictures can't arrive you OK See this: http://www.slideshare.net/HBaseCon/apache-kylins-performance-boost-from-apache-hbase#9
2016-11-01 7:52 GMT+01:00 张磊 <121762...@qq.com>: > You say Kylin is "smart" when compose Hbase Row Key, there is something i > can not see. Could you send again How Hbase Row Key compose? > > > > > ------------------ 原始邮件 ------------------ > 发件人: "a.ramonportoles";<a.ramonporto...@gmail.com>; > 发送时间: 2016年10月28日(星期五) 晚上6:51 > 收件人: "dev"<dev@kylin.apache.org>; > > 主题: Re: Exceed scan threshold at 10000001 > > > > Q1: If i query select count(1) from table group by letter,number limit > 2,it should scan the first two rows(letter number Agg groups)? > > > A1: Kylin build Hbase Key with Dimensions: > > > Kylin is "smart" when compose Hbase Row Key: > Is not the same Group by / filter by Dim1 that Dim3 :) > > Dim1: Range scan--> you read that you need --> fast > > Dim3: full scan --> you read more rows that you need --> slow > > > how to solve it? (I think:) you can build several cubes / uses different > aggregation groups on same project > > > > > > Q2: when i query select count(1) from table group by letter limit 2,it > should scan the two rows(letter Agg group) > > > A2: Yes,, if you define count(1) as measure and letter as Dim, you will > have a pre-calculated results > > > > Also: check the cardinaliy of your data, Isn't normal: > > limit 10000 --> scan 1000 rows > > limit 10001 ---> scan millions of rows > > If this is true your data isn't balanced, I don't know any solution for > this > > > Alb > > > > > > > > > > > > > > > > 2016-10-28 10:01 GMT+02:00 张磊 <121762...@qq.com>: > kylin do put pre-calculate results in hbase, if cude desc is below: > Dimensions:letter number > Measures:count > in hbase the result is > count letter number > 1 A 1 > 1 A 2 > 1 B 1 > 1 B 2 > 1 B 3 > 1 B 4 > count letter > 2 A > 4 B > If i query select count(1) from table group by letter,number limit 2,it > should scan the first two rows(letter number Agg groups)? > when i query select count(1) from table group by letter limit 2,it should > scan the two rows(letter Agg group) > Do i say right? > > > ------------------ 原始邮件 ------------------ > 发件人: "a.ramonportoles";<a.ramonporto...@gmail.com>; > 发送时间: 2016年10月28日(星期五) 下午3:43 > 收件人: "dev"<dev@kylin.apache.org>; > > 主题: Re: Exceed scan threshold at 10000001 > > > > hummmm > but are you using "group by LO_CUSTKEY,LO_PARTKEY" > > And limit apply to final result no to scan rows > > Example: > table with two columns Letter / Number > A:1 > A:2 > B:1 > B:2 > B:3 > B:4 > > select count (1), Letter from TB group by Letter limit 1 > Result: 2:A > Scans 2 rows > > select count (1), Letter from TB group by Letter limit 2 > Result: 2:A > 4:B > Scans 2 +4 rows > > > Alb > > > > 2016-10-28 8:33 GMT+02:00 张磊 <121762...@qq.com>: > > > Query1:select count(1),sum(LO_REVENUE) from lineorder group by > > LO_CUSTKEY,LO_PARTKEY > > LIMIT 10000 > > > > > > I find it scan 10000 rows from HBase > > > > > > Query2: select count(1),sum(LO_REVENUE) from lineorder group by > > LO_CUSTKEY,LO_PARTKEY > > LIMIT 10001 > > > > > > I find it scan 10000001 rows from Hbase > > > > > > I do not know why? Should not scan 10001 row? > > > > > > The two query i scan the same HTable KYLIN_78ROC49NQY > > Kylin log:Endpoint RPC returned from HTable KYLIN_78ROC49NQY > > > > > > > > > > ------------------ 原始邮件 ------------------ > > 发件人: "ShaoFeng Shi";<shaofeng...@apache.org>; > > 发送时间: 2016年10月28日(星期五) 中午11:20 > > 收件人: "dev"<dev@kylin.apache.org>; > > > > 主题: Re: Exceed scan threshold at 10000001 > > > > > > > > Alberto, thanks for your explaination, you got the points and is > already an > > Kylin expert I believe. > > > > In order to protect HBase and Kylin from crashing by bad queries (which > > scan too many rows), Kylin add this mechnisam to interrupt when reach > some > > threshold. Usually in an OLAP scenario, the result wouldn't be too > large. > > This is also a reminder for user to rethink the design; If you really > want > > to get the threshold be enlarged, you can allocate more memory to Kylin > and > > set "kylin.query.mem.budget" to bigger value. > > > > 2016-10-27 18:39 GMT+08:00 Alberto Ramón <a.ramonporto...@gmail.com>: > > > > > NOTE: I'm not a expert on Kylin ;) > > > > > > Where is mandatory? No > > > Where is recommended? yes > > > Where bypass the threshold? No, I think this limit is hardcoded ¿? > > > > > > The real question must be: why this limit exists ?: (opinion) > > > - The target of Kylin is Real / Near RT, limit rows --> limit response > > time > > > - If Your are using JDBC, this is not a good option by performance > > > - Protect the HBase Coprocesor > > > - Perhaps you need a new Dim, to precalculate This Aggregate or > filter by > > > this new Dim > > > > > > For Extra-Large queries, you can also check: > > > -kylin.query.mem.budget= 3GB > > > -hbase.server.scanner.max.result.size = 100MB (limit from HBase, > you > > can > > > disable with -1) > > > > > > Good Luck, Alb > > > > > > 2016-10-27 11:56 GMT+02:00 张磊 <121762...@qq.com>: > > > > > > > Do you mean when i query, i should add where clause, > > > > but in some case, the number of records > threshold, how can i do? > > > > For example, order by all groups, the number of the all groups > > > > > threshold > > > > > > > > > > > > > > > > > > > > ------------------ 原始邮件 ------------------ > > > > 发件人: "Alberto Ramón";<a.ramonporto...@gmail.com>; > > > > 发送时间: 2016年10月27日(星期四) 下午5:47 > > > > 收件人: "dev"<dev@kylin.apache.org>; > > > > > > > > 主题: Re: Exceed scan threshold at 10000001 > > > > > > > > > > > > > > > > ERROR: Scan row count exceeded threshold > > > > > > > > MailList > > > > <http://mail-archives.apache.org/mod_mbox/kylin-user/ > > > > 201608.mbox/%3CCALjEW7M_YYi7Xs55OqPdxS6pzNvD0% > > 2BamN2AX3hetnF0%3D9uFnow% > > > > 40mail.gmail.com%3E> > > > > Kilin > > > > 1787 <https://issues.apache.org/jira/browse/KYLIN-1787>v1.5.3 > > > > > > > > *Scan row count exceeded threshold: 1000000, please add filter > > condition > > > to > > > > narrow down backend scan range, like where clause* > > > > > > > > > > > > BR, Alb > > > > > > > > 2016-10-27 11:40 GMT+02:00 张磊 <121762...@qq.com>: > > > > > > > > > Hi > > > > > > > > > > > > > > > When i query a sql, I do not know why should scan hbase? How can i > > do? > > > > > Thanks! > > > > > > > > > > > > > > > Table: lineorder 12,000,000 row records > > > > > Dimensions: LO_CUSTKEY,LO_PARTKEY > > > > > Measures: count(1), sum(LO_REVENUE) > > > > > > > > > > > > > > > Query SQL: select count(1),sum(LO_REVENUE) from lineorder group by > > > > > LO_CUSTKEY,LO_PARTKEY order by LO_CUSTKEY,LO_PARTKEY limit 50 > > > > > > > > > > > > > > > I build a cude with two Dimensions and two Measures(count and > sum), > > the > > > > > size of the Htable is 98 MB, when i execute a query in insight, it > > > shows > > > > > Error in coprocessor; and i check the hbase log, i find blow > messages > > > > > > > > > > > > > > > 2016-10-27 02:06:13,470 INFO [B.defaultRpcServer.handler=4, > > > > queue=1,port=16020] > > > > > gridtable.GTScanRequest: pre aggregation is not beneficial, skip > it > > > > > 2016-10-27 02:06:13,470 INFO [B.defaultRpcServer.handler=4, > > > > queue=1,port=16020] > > > > > endpoint.CubeVisitService: Scanned 1 rows from HBase. > > > > > > > > > > > > > > > 2016-10-27 02:24:20,884 INFO [B.defaultRpcServer.handler=6, > > > > queue=0,port=16020] > > > > > endpoint.CubeVisitService: Scanned 9999001 rows from HBase. > > > > > 2016-10-27 02:24:20,889 INFO [B.defaultRpcServer.handler=6, > > > > queue=0,port=16020] > > > > > endpoint.CubeVisitService: The cube visit did not finish normally > > > because > > > > > scan num exceeds threshold > > > > > org.apache.kylin.gridtable.GTScanExceedThresholdException: Exceed > > scan > > > > > threshold at 10000001 > > > > > at org.apache.kylin.storage.hbase.cube.v2.coprocessor. > > > > > endpoint.CubeVisitService$2.hasNext(CubeVisitService.java:267) > > > > > at org.apache.kylin.storage.hbase.cube.v2. > > > > HBaseReadonlyStore$1$1. > > > > > hasNext(HBaseReadonlyStore.java:111) > > > > > at org.apache.kylin.storage.hbase.cube.v2.coprocessor. > > > > > endpoint.CubeVisitService.visitCube(CubeVisitService.java:299) > > > > > at org.apache.kylin.storage.hbase.cube.v2.coprocessor. > > > > > endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod( > > > > > CubeVisitProtos.java:3952) > > > > > at org.apache.hadoop.hbase.regionserver.HRegion. > > > > > execService(HRegion.java:7815) > > > > > at org.apache.hadoop.hbase.regionserver.RSRpcServices. > > > > > execServiceOnRegion(RSRpcServices.java:1986) > > > > > at org.apache.hadoop.hbase.regionserver.RSRpcServices. > > > > > execService(RSRpcServices.java:1968) > > > > > at org.apache.hadoop.hbase.protobuf.generated. > > > > > ClientProtos$ClientService$2.callBlockingMethod( > > > ClientProtos.java:33652) > > > > > at org.apache.hadoop.hbase.ipc. > > RpcServer.call(RpcServer.java: > > > > 2178) > > > > > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner. > > > > java:112) > > > > > at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop( > > > > > RpcExecutor.java:133) > > > > > at org.apache.hadoop.hbase.ipc. > > RpcExecutor$1.run(RpcExecutor. > > > > > java:108) > > > > > at java.lang.Thread.run(Thread.java:745) > > > > > > > > > > > > > > > -- > > Best regards, > > > > Shaofeng Shi 史少锋 > > >