Re: Exceed scan threshold at 10000001

Alberto Ramón Tue, 01 Nov 2016 03:45:07 -0700

Sorry, the pictures did not reach you correctly.

See this:
http://www.slideshare.net/HBaseCon/apache-kylins-performance-boost-from-apache-hbase#9


2016-11-01 7:52 GMT+01:00 张磊 <[email protected]>:

> You say Kylin is "smart" when composing the HBase row key, but there is
> something I cannot see. Could you send again how the HBase row key is composed?
>
>
>
>
> ------------------ Original Message ------------------
> From: "a.ramonportoles";<[email protected]>;
> Sent: Friday, October 28, 2016, 6:51 PM
> To: "dev"<[email protected]>;
>
> Subject: Re: Exceed scan threshold at 10000001
>
>
>
> Q1: If I query select count(1) from table group by letter,number limit 2,
> should it scan only the first two rows (the letter, number aggregation group)?
>
>
> A1: Kylin builds the HBase row key from the dimensions.
>
>
> Kylin is "smart" when composing the HBase row key, so grouping or filtering
> by Dim1 is not the same as grouping or filtering by Dim3   :)
>
> Dim1: range scan --> you read only what you need --> fast
>
> Dim3: full scan --> you read more rows than you need --> slow
>
>
> How to solve it?  (I think:) you can build several cubes or use different
> aggregation groups in the same project
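
A toy sketch of the range scan vs. full scan point above, in Python. This is
not Kylin or HBase code. The three-part key (dim1, dim2, dim3), the data, and
the function names are all assumptions made for illustration; the only idea
taken from the thread is that rows are stored sorted by a key that starts
with Dim1.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for an HTable: row keys are (dim1, dim2, dim3) tuples kept in
# sorted order, the way HBase keeps rows sorted by key. All values are made up.
ROWS = sorted((d1, d2, d3) for d1 in "ABCD" for d2 in range(5) for d3 in range(5))
DIM1 = [row[0] for row in ROWS]  # leading key part; sorted because ROWS is sorted


def scan_by_dim1(value):
    """Filter on the leading key part: seek to the matching range and read only it."""
    lo = bisect_left(DIM1, value)
    hi = bisect_right(DIM1, value)
    return ROWS[lo:hi], hi - lo  # (matching rows, rows read)


def scan_by_dim3(value):
    """Filter on the trailing key part: matches are scattered, so read every row."""
    matches = [row for row in ROWS if row[2] == value]
    return matches, len(ROWS)  # (matching rows, rows read)


if __name__ == "__main__":
    print(scan_by_dim1("B")[1])  # rows read for a Dim1 filter
    print(scan_by_dim3(3)[1])    # rows read for a Dim3 filter
```

In this toy table of 100 rows, the Dim1 filter reads only the 25 rows it
returns, while the Dim3 filter reads all 100 rows to return 20.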
>
>
>
>
>
> Q2: When I query select count(1) from table group by letter limit 2, should
> it scan the two rows (the letter aggregation group)?
>
>
> A2: Yes. If you define count(1) as a measure and letter as a dimension, you
> will have pre-calculated results
>
>
>
> Also: check the cardinality of your data. This isn't normal:
>
> limit 10000  --> scan 10000 rows
>
> limit 10001  --> scan millions of rows
>
> If this is true, your data isn't balanced, and I don't know any solution for
> this
>
>
> Alb
>
> 2016-10-28 10:01 GMT+02:00 张磊 <[email protected]>:
>  Kylin does put pre-calculated results in HBase. If the cube desc is as below:
>  Dimensions: letter, number
>  Measures: count
>  then in HBase the result is
>  count  letter  number
>  1      A       1
>  1      A       2
>  1      B       1
>  1      B       2
>  1      B       3
>  1      B       4
>  count  letter
>  2      A
>  4      B
>  If I query select count(1) from table group by letter,number limit 2, should
>  it scan the first two rows (the letter, number aggregation group)?
>  When I query select count(1) from table group by letter limit 2, should it
>  scan the two rows (the letter aggregation group)?
>  Am I right?
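
A minimal Python sketch of the two pre-calculated result sets described
above, built from the six example rows. It only illustrates the idea of one
aggregate per dimension combination; the variable names are hypothetical and
this is not how Kylin stores cuboids internally.

```python
from collections import Counter

# The six base rows from the example: (letter, number).
base_rows = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("B", 3), ("B", 4)]

# Pre-calculated count for the (letter, number) combination.
cuboid_letter_number = Counter(base_rows)

# Pre-calculated count for the (letter) combination alone.
cuboid_letter = Counter(letter for letter, _number in base_rows)

if __name__ == "__main__":
    for (letter, number), count in sorted(cuboid_letter_number.items()):
        print(count, letter, number)
    for letter, count in sorted(cuboid_letter.items()):
        print(count, letter)
```

A "group by letter" query can then be answered from the two-row result set
instead of the six-row one.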
>
>
>  ------------------ Original Message ------------------
>  From: "a.ramonportoles";<[email protected]>;
>  Sent: Friday, October 28, 2016, 3:43 PM
>  To: "dev"<[email protected]>;
>
>  Subject: Re: Exceed scan threshold at 10000001
>
>
>
>  Hmm,
>  but you are using "group by LO_CUSTKEY,LO_PARTKEY"
>
>  And LIMIT applies to the final result, not to the scanned rows
>
>  Example:
>  table with two columns Letter / Number
>  A:1
>  A:2
>  B:1
>  B:2
>  B:3
>  B:4
>
>  select count (1), Letter from TB group by Letter limit 1
>     Result: 2:A
>     Scans 2 rows
>
>  select count (1), Letter from TB group by Letter limit 2
>     Result: 2:A
>                 4:B
>     Scans 2 +4 rows
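
The Letter / Number example above can be sketched in Python as a streaming
group-by with a limit on the number of groups. This is a toy model, not
Kylin's query engine: it assumes the rows arrive sorted by Letter, and the
function name and the "rows aggregated" counter are hypothetical.

```python
from itertools import groupby, islice

# The example table, already sorted by Letter: (letter, number).
ROWS = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("B", 3), ("B", 4)]


def count_by_letter(rows, limit):
    """Return ([(count, letter), ...], rows aggregated) for the first `limit` groups."""
    result = []
    rows_aggregated = 0
    # LIMIT is applied to groups (the final result), not to input rows.
    for letter, group in islice(groupby(rows, key=lambda r: r[0]), limit):
        count = sum(1 for _ in group)  # every row of the group must be read
        rows_aggregated += count
        result.append((count, letter))
    return result, rows_aggregated


if __name__ == "__main__":
    print(count_by_letter(ROWS, 1))
    print(count_by_letter(ROWS, 2))
```

With limit 1 the sketch aggregates 2 rows, and with limit 2 it aggregates
2 + 4 rows, matching the counts in the example. (The underlying iterator also
has to peek at the next row to see that a group has ended.)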
>
>
>  Alb
>
>
>
>  2016-10-28 8:33 GMT+02:00 张磊 <[email protected]>:
>
>  > Query1: select count(1),sum(LO_REVENUE) from lineorder group by
>  > LO_CUSTKEY,LO_PARTKEY
>  > LIMIT 10000
>  >
>  >
>  > I find it scans 10000 rows from HBase
>  >
>  >
>  > Query2: select count(1),sum(LO_REVENUE) from lineorder group by
>  > LO_CUSTKEY,LO_PARTKEY
>  > LIMIT 10001
>  >
>  >
>  > I find it scans 10000001 rows from HBase
>  >
>  >
>  > I do not know why. Shouldn't it scan 10001 rows?
>  >
>  >
>  > Both queries scan the same HTable, KYLIN_78ROC49NQY
>  > Kylin log: Endpoint RPC returned from HTable KYLIN_78ROC49NQY
>  >
>  >
>  >
>  >
>  > ------------------ Original Message ------------------
>  > From: "ShaoFeng Shi";<[email protected]>;
>  > Sent: Friday, October 28, 2016, 11:20 AM
>  > To: "dev"<[email protected]>;
>  >
>  > Subject: Re: Exceed scan threshold at 10000001
>  >
>  >
>  >
>  > Alberto, thanks for your explanation. You got the points, and I believe
>  > you are already a Kylin expert.
>  >
>  > In order to protect HBase and Kylin from crashing on bad queries (which
>  > scan too many rows), Kylin adds this mechanism to interrupt the scan when
>  > it reaches some threshold. Usually in an OLAP scenario, the result
>  > wouldn't be too large. This is also a reminder for the user to rethink the
>  > design. If you really want the threshold to be enlarged, you can allocate
>  > more memory to Kylin and set "kylin.query.mem.budget" to a bigger value.
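
The protection described above can be pictured with a small Python sketch: a
wrapper that counts scanned rows and interrupts the scan once the count passes
a threshold. This is only an illustration of the idea, not Kylin's
CubeVisitService code; the exception class, the function name, and the
threshold parameter are hypothetical.

```python
class ScanExceedThresholdError(Exception):
    """Raised when a scan reads more rows than the configured threshold."""


def guarded_scan(rows, threshold):
    """Yield rows one by one, aborting once more than `threshold` rows are scanned."""
    for scanned, row in enumerate(rows, start=1):
        if scanned > threshold:
            # Interrupt the scan instead of letting a bad query exhaust memory.
            raise ScanExceedThresholdError(f"Exceed scan threshold at {scanned}")
        yield row


if __name__ == "__main__":
    print(list(guarded_scan(range(3), threshold=5)))
    try:
        list(guarded_scan(range(10), threshold=2))
    except ScanExceedThresholdError as exc:
        print(exc)
```

With a threshold of 10000000, a guard written this way would stop at row
10000001, which is the number reported in the exception in this thread.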
>  >
>  > 2016-10-27 18:39 GMT+08:00 Alberto Ramón <[email protected]>:
>  >
>  > > NOTE: I'm not an expert on Kylin  ;)
>  > >
>  > > Is a WHERE clause mandatory? No
>  > > Is a WHERE clause recommended? Yes
>  > > Does a WHERE clause bypass the threshold? No, I think this limit is
>  > > hardcoded (?)
>  > >
>  > > The real question should be: why does this limit exist? (opinion)
>  > > - The target of Kylin is real-time / near-real-time queries; limiting
>  > >   rows --> limits response time
>  > > - If you are using JDBC, this is not a good option for performance
>  > > - It protects the HBase coprocessor
>  > > - Perhaps you need a new Dim, to precalculate this aggregate or filter
>  > >   by this new Dim
>  > >
>  > > For extra-large queries, you can also check:
>  > >  - kylin.query.mem.budget = 3GB
>  > >  - hbase.server.scanner.max.result.size = 100MB  (limit from HBase; you
>  > >    can disable it with -1)
>  > >
>  > > Good luck, Alb
>  > >
>  > > 2016-10-27 11:56 GMT+02:00 张磊 <[email protected]>:
>  > >
>  > > > Do you mean that when I query, I should add a where clause?
>  > > > But in some cases the number of records > threshold; what can I do?
>  > > > For example, with order by over all groups, the number of groups >
>  > > > threshold
>  > > >
>  > > >
>  > > >
>  > > >
>  > > > ------------------ Original Message ------------------
>  > > > From: "Alberto Ramón";<[email protected]>;
>  > > > Sent: Thursday, October 27, 2016, 5:47 PM
>  > > > To: "dev"<[email protected]>;
>  > > >
>  > > > Subject: Re: Exceed scan threshold at 10000001
>  > > >
>  > > >
>  > > >
>  > > >  ERROR: Scan row count exceeded threshold
>  > > >
>  > > > MailList
>  > > > <http://mail-archives.apache.org/mod_mbox/kylin-user/201608.mbox/%3CCALjEW7M_YYi7Xs55OqPdxS6pzNvD0%2BamN2AX3hetnF0%3D9uFnow%40mail.gmail.com%3E>
>  > > > KYLIN-1787 <https://issues.apache.org/jira/browse/KYLIN-1787> v1.5.3
>  > > >
>  > > > *Scan row count exceeded threshold: 1000000, please add filter condition
>  > > > to narrow down backend scan range, like where clause*
>  > > >
>  > > >
>  > > > BR, Alb
>  > > >
>  > > > 2016-10-27 11:40 GMT+02:00 张磊 <[email protected]>:
>  > > >
>  > > > > Hi
>  > > > >
>  > > > >
>  > > > > When I run a SQL query, I do not know why it has to scan HBase. What
>  > > > > can I do? Thanks!
>  > > > >
>  > > > >
>  > > > > Table: lineorder  12,000,000 row records
>  > > > > Dimensions: LO_CUSTKEY,LO_PARTKEY
>  > > > > Measures: count(1), sum(LO_REVENUE)
>  > > > >
>  > > > >
>  > > > > Query SQL: select count(1),sum(LO_REVENUE) from lineorder group by
>  > > > > LO_CUSTKEY,LO_PARTKEY order by LO_CUSTKEY,LO_PARTKEY limit 50
>  > > > >
>  > > > >
>  > > > > I built a cube with two dimensions and two measures (count and sum).
>  > > > > The size of the HTable is 98 MB. When I execute a query in Insight, it
>  > > > > shows "Error in coprocessor", and when I check the HBase log, I find
>  > > > > the messages below
>  > > > >
>  > > > >
>  > > > > 2016-10-27 02:06:13,470 INFO  [B.defaultRpcServer.handler=4,queue=1,port=16020] gridtable.GTScanRequest: pre aggregation is not beneficial, skip it
>  > > > > 2016-10-27 02:06:13,470 INFO  [B.defaultRpcServer.handler=4,queue=1,port=16020] endpoint.CubeVisitService: Scanned 1 rows from HBase.
>  > > > >
>  > > > >
>  > > > > 2016-10-27 02:24:20,884 INFO  [B.defaultRpcServer.handler=6,queue=0,port=16020] endpoint.CubeVisitService: Scanned 9999001 rows from HBase.
>  > > > > 2016-10-27 02:24:20,889 INFO  [B.defaultRpcServer.handler=6,queue=0,port=16020] endpoint.CubeVisitService: The cube visit did not finish normally because scan num exceeds threshold
>  > > > > org.apache.kylin.gridtable.GTScanExceedThresholdException: Exceed scan threshold at 10000001
>  > > > >         at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService$2.hasNext(CubeVisitService.java:267)
>  > > > >         at org.apache.kylin.storage.hbase.cube.v2.HBaseReadonlyStore$1$1.hasNext(HBaseReadonlyStore.java:111)
>  > > > >         at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:299)
>  > > > >         at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:3952)
>  > > > >         at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7815)
>  > > > >         at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1986)
>  > > > >         at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1968)
>  > > > >         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652)
>  > > > >         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2178)
>  > > > >         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
>  > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>  > > > >         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>  > > > >         at java.lang.Thread.run(Thread.java:745)
>  > > >
>  > >
>  >
>  >
>  >
>  > --
>  > Best regards,
>  >
>  > Shaofeng Shi 史少锋
>  >
>
