Hi Shrikant,

How much memory are you allocating to the reducers? Please consider allocating
more memory to them, as Kylin builds the dictionary in the reducers.
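
As a rough sketch of how the reducer memory could be raised: Hadoop sizes the
reducer container with mapreduce.reduce.memory.mb and the JVM heap with
mapreduce.reduce.java.opts, and Kylin passes Hadoop settings to its jobs through
properties prefixed with kylin.engine.mr.config-override. The helper below is
hypothetical (it is not part of Kylin), the 4096 MB figure is only an
illustration, and the 80% heap share is an assumed rule of thumb. Please verify
the property names against your Kylin and Hadoop versions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Kylin: derives reducer memory overrides.
class ReducerMemorySketch {
    // Prefix for passing Hadoop settings to Kylin's MapReduce jobs.
    static final String PREFIX = "kylin.engine.mr.config-override.";

    // Returns the two override properties for a given container size in MB.
    static Map<String, String> reducerOverrides(int containerMb) {
        // Assumption: give the JVM heap 80% of the container (integer math),
        // leaving the rest for non-heap memory.
        int heapMb = containerMb * 4 / 5;
        Map<String, String> props = new LinkedHashMap<>();
        props.put(PREFIX + "mapreduce.reduce.memory.mb", String.valueOf(containerMb));
        props.put(PREFIX + "mapreduce.reduce.java.opts", "-Xmx" + heapMb + "m");
        return props;
    }

    public static void main(String[] args) {
        // 4096 MB is an illustrative size, not a recommendation.
        reducerOverrides(4096).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The printed lines could then go into kylin.properties or into the cube-level
configuration overrides.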

You can also disable building the dictionary in the reducer; Kylin will then
build it in its own JVM. This may cause your Kylin process to run out of
memory (OOM) if there is an ultra high cardinality (UHC) column.

kylin.engine.mr.build-dict-in-reducer=false
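
To see why cardinality drives memory use wherever the dictionary is built, here
is a simplified sketch of dictionary encoding. It uses a sorted map as a
stand-in and is not Kylin's TrieDictionaryBuilder; the class and method names
are made up for illustration. The point is only that every distinct value has to
be held in memory while the dictionary is built.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for a dictionary encoder; not Kylin's trie implementation.
class DictionarySketch {
    // Assigns each distinct value an integer ID following sorted order.
    static Map<String, Integer> buildDictionary(List<String> values) {
        TreeMap<String, Integer> dict = new TreeMap<>();
        for (String v : values) {
            dict.put(v, 0); // every distinct value is kept in memory here
        }
        int id = 0;
        for (Map.Entry<String, Integer> e : dict.entrySet()) {
            e.setValue(id++); // IDs follow the sorted order of the values
        }
        return dict;
    }

    // Replaces each value with its dictionary ID.
    static List<Integer> encode(List<String> values, Map<String, Integer> dict) {
        List<Integer> ids = new ArrayList<>();
        for (String v : values) {
            ids.add(dict.get(v));
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> column = List.of("beijing", "austin", "beijing", "cairo");
        Map<String, Integer> dict = buildDictionary(column);
        System.out.println(dict);
        System.out.println(encode(column, dict));
    }
}
```

With millions of distinct values, the map above (like any dictionary structure)
grows with the number of distinct values rather than with the number of rows.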


Do you know how high the cardinality of that dimension is? For a UHC column
whose cardinality exceeds 3 million, we don't recommend using dictionary
encoding. You may need to use "fixed_length" or "integer" (if the column is
of integer type).
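
The rule of thumb above can be written down as a small decision helper. This is
only a sketch: the function is hypothetical and not a Kylin API, the 3 million
threshold is the one mentioned above, and "dict" is assumed to be the name of
the dictionary encoding.

```java
// Hypothetical helper, not a Kylin API: picks an encoding name from cardinality.
class EncodingChoiceSketch {
    // Threshold taken from the rule of thumb above.
    static final long UHC_THRESHOLD = 3_000_000L;

    static String suggestEncoding(long cardinality, boolean integerTyped) {
        if (cardinality <= UHC_THRESHOLD) {
            return "dict"; // assumed name of the dictionary encoding
        }
        // Above the threshold, avoid building a dictionary at all.
        return integerTyped ? "integer" : "fixed_length";
    }

    public static void main(String[] args) {
        System.out.println(suggestEncoding(50_000L, false));
        System.out.println(suggestEncoding(10_000_000L, true));
        System.out.println(suggestEncoding(10_000_000L, false));
    }
}
```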


2018-08-16 16:50 GMT+08:00 Ashish Singhi <ashishsin...@apache.org>:

> Hi Shrikant,
>
> Please refer to http://kylin.apache.org/blog/2015/08/13/kylin-dictionary/
> You might find it useful.
>
> Regards,
> Ashish
>
> On Thu, Aug 16, 2018 at 10:33 AM, Shrikant Bang <b.shrikan...@gmail.com>
> wrote:
>
>> Thank you, ShaoFeng & Billy for responses.
>>
>> I was able to set hierarchies in the dimension.
>>
>> While building the cube, the "fact distinct column" step is failing in a
>> reducer with an out-of-memory exception.
>>
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
>> at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
>> at org.apache.kylin.dict.TrieDictionaryBuilder.buildTrieBytes(TrieDictionaryBuilder.java:476)
>> at org.apache.kylin.dict.TrieDictionaryBuilder.build(TrieDictionaryBuilder.java:418)
>> at org.apache.kylin.dict.TrieDictionaryForestBuilder.build(TrieDictionaryForestBuilder.java:109)
>> at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.build(DictionaryGenerator.java:220)
>> at org.apache.kylin.engine.mr.steps.FactDistinctColumnsReducer.doCleanup(FactDistinctColumnsReducer.java:216)
>> at org.apache.kylin.engine.mr.KylinReducer.cleanup(KylinReducer.java:103)
>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>
>>
>> While debugging, I found that the dictionary is built in the
>> reducer's cleanup method.
>>
>> I am curious to learn the internals. Can you please help me with the following:
>>
>>   1.  Any pointer/reference/JIRA for understanding how the trie (dictionary)
>> of a dimension's values is used in the subsequent steps?
>>
>>   2.  Any best practices/references for tuning the "fact distinct column" job
>> for reducers that handle high-cardinality columns? For now I am trying to
>> increase memory, as the partitioning and the number of reducers depend on
>> the number of cuboids.
>>
>>
>> P.S. I am using v2.4 of Kylin with HBase 1.x
>>
>> Thank You,
>> Shrikant Bang
>>
>> On Tue, Aug 14, 2018 at 8:33 PM ShaoFeng Shi <shaofeng...@apache.org>
>> wrote:
>>
>>> For question 1), in Cube's "advanced setting" step, you can specify the
>>> cuboid whitelist to build.
>>>
>>> 2018-08-13 22:26 GMT+08:00 Billy Liu <billy...@apache.org>:
>>>
>>>> Hello Shrikant,
>>>>
>>>> For 1, it seems the 4 dimensions form a hierarchy. You could
>>>> define them as hierarchy dimensions in the cube, and keep A as a
>>>> mandatory dimension.
>>>>
>>>> For 2, select 'user_activity' as the partition column in the model design.
>>>> There are a few built-in formats; most date types are supported.
>>>>
>>>> With Warm regards
>>>>
>>>> Billy Liu
>>>> Shrikant Bang <b.shrikan...@gmail.com> wrote on Mon, Aug 13, 2018 at 5:39 PM:
>>>> >
>>>> > Hi Team,
>>>> >
>>>> >      We are doing a PoC on building OLAP cubes. Could you please help
>>>> > me get answers to the queries below?
>>>> >
>>>> > Selective Cuboids:
>>>> > We need to have selective cuboids as part of OLAP cubes.
>>>> > Let's say we have 4 dimensions: A, B, C, D; then we need just
>>>> > (A,B,C,D), (A,B,C), (A,B) and (A).
>>>> >
>>>> > Refresh Settings:
>>>> > How do we specify the partition column and format for the fact table
>>>> > when building a cube?
>>>> > e.g. user_activity is partitioned by date 'yyyy-MM-dd' and the cube
>>>> > should be refreshed every day with the previous day's computation.
>>>> >
>>>> >
>>>> > Thank You,
>>>> > Shrikant Bang
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>>
>>>
>


-- 
Best regards,

Shaofeng Shi 史少锋
