Greetings,
We are building a cube with a number of TopN metrics on a moderate-size data 
set that has pronounced skew on two dimensions.
The TopN metrics are defined with the default settings, topn(500,4).
Once the number of records processed by CuboidReducer.doReduce() passes a 
certain threshold, the time to compute the aggregate metrics seems to grow 
far faster than linearly.
Here are some excerpts from the cube build MR job logs, with extra logging 
added to show the count of values processed.
Note that about 8K input values are processed in roughly 40 seconds, while 
about 66K input values take roughly 28 minutes.
…
2019-07-19 12:04:19,478 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do 
setup, available memory: 7700m
2019-07-19 12:04:19,479 INFO [main] org.apache.kylin.engine.mr.KylinReducer: 
Accepting Reducer Key with ordinal: 1
2019-07-19 12:04:19,479 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do 
reduce, available memory: 7700m
2019-07-19 12:04:19,479 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Handling value with ordinal 
(This is not KV number!): 1
2019-07-19 12:04:24,457 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2377
2019-07-19 12:04:24,511 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2425
2019-07-19 12:04:24,511 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2432
2019-07-19 12:04:24,512 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2462
2019-07-19 12:04:24,536 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2485
2019-07-19 12:04:24,537 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2486
2019-07-19 12:04:24,537 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2491
2019-07-19 12:04:24,537 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2518
2019-07-19 12:04:24,537 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 2520
2019-07-19 12:04:59,430 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 8287
2019-07-19 12:04:59,468 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do 
cleanup, available memory: 7382m
2019-07-19 12:04:59,468 INFO [main] org.apache.kylin.engine.mr.KylinReducer: 
Total rows: 10
…
2019-07-19 12:05:01,639 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do 
setup, available memory: 7254m
2019-07-19 12:05:01,639 INFO [main] org.apache.kylin.engine.mr.KylinReducer: 
Accepting Reducer Key with ordinal: 1
2019-07-19 12:05:01,640 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do 
reduce, available memory: 7254m
2019-07-19 12:05:01,640 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Handling value with ordinal 
(This is not KV number!): 1
2019-07-19 12:05:01,640 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 1
2019-07-19 12:05:01,682 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 385
2019-07-19 12:05:01,684 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 386
2019-07-19 12:05:01,684 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 391
2019-07-19 12:05:01,684 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 426
2019-07-19 12:05:01,685 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 427
2019-07-19 12:05:01,685 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 429
2019-07-19 12:33:14,119 INFO [main] 
org.apache.kylin.engine.mr.steps.CuboidReducer: Total number of values 
processed: 65997
2019-07-19 12:33:14,348 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do 
cleanup, available memory: 7184m
2019-07-19 12:33:14,348 INFO [main] org.apache.kylin.engine.mr.KylinReducer: 
Total rows: 8
…
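As a rough check on the two excerpts above, the growth rate can be estimated 
from the logged timestamps alone. The Python sketch below assumes that run time 
follows a power law in the number of values (two data points cannot confirm 
that) and that the "Do reduce" line marks the start of each reduce call. The 
helper name elapsed_seconds and the variable names are made up for illustration.

```python
import math
from datetime import datetime

# Timestamp layout used in the MR job logs above.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# (timestamp of "Do reduce", timestamp of last "values processed" line, count)
runs = [
    ("2019-07-19 12:04:19,479", "2019-07-19 12:04:59,430", 8287),
    ("2019-07-19 12:05:01,640", "2019-07-19 12:33:14,119", 65997),
]

(s1, e1, n1), (s2, e2, n2) = runs
t1 = elapsed_seconds(s1, e1)
t2 = elapsed_seconds(s2, e2)

# If time ~ n**k, then k = log(t2 / t1) / log(n2 / n1).
exponent = math.log(t2 / t1) / math.log(n2 / n1)

print(f"values grew {n2 / n1:.1f}x, time grew {t2 / t1:.1f}x")
print(f"implied power-law exponent: {exponent:.2f}")
```

With these two data points the value count grows about 8x while the time grows 
about 42x, so the implied exponent is close to 2. That looks more like 
quadratic than exponential growth, though two points are not enough to tell.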
Beyond a certain point, cuboid aggregations never complete: the job times out 
even after 3+ hours of waiting.
Adding more memory to the mapper (via mapreduce.map.memory.mb / 
mapreduce.map.java.opts in kylin_job_conf.xml) doesn’t seem to make much 
difference.

There doesn’t seem to be any documented way to combat data skew during the 
cuboid build step.

Any suggestions on how to deal with this issue would be greatly 
appreciated.

Thanks,
Seva.

