Re: Precisely Count Distinct on 100 million string values column

Yerui Sun Thu, 25 Aug 2016 20:41:13 -0700

Hmm... You're right, a hybrid cube couldn't resolve your problem.
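
To spell out where the 21 days in the quoted messages below comes from: it is the int id space of the dictionary divided by the number of new ids per day. A rough sketch of that arithmetic, assuming about 100 million ids per daily segment that are all new, and a capacity of Integer.MAX_VALUE. The class and method names are made up for illustration and are not part of Kylin:

```java
/** Illustrative only: estimates how long an int-sized id space lasts. */
final class DictCapacityEstimate {

    /** Whole daily segments that fit before the id space is used up. */
    static long fullDaysUntilExhausted(long capacity, long newIdsPerDay) {
        if (capacity < 0 || newIdsPerDay <= 0) {
            throw new IllegalArgumentException("capacity must be >= 0 and newIdsPerDay > 0");
        }
        return capacity / newIdsPerDay; // integer division drops the partial day
    }

    public static void main(String[] args) {
        long capacity = Integer.MAX_VALUE; // 2,147,483,647 ids
        long newIdsPerDay = 100_000_000L;  // assumption taken from the thread
        System.out.println(fullDaysUntilExhausted(capacity, newIdsPerDay)); // prints 21
    }
}
```

So the 22nd daily segment would be the first one that cannot fit, which is why splitting into cubes does not help if the ids must stay comparable across cubes.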

It's really a challenge to count distinct on such a huge dataset. 
A possible solution is to expand the dict id from int to bigint, and make 
RoaringBitmap support bigint. However, that needs quite a lot of changes to the current code.
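
To illustrate the idea only: one way a bitmap can accept bigint ids is to split each id into a high part that selects a bucket and a low part that is an index inside an ordinary bitmap. The sketch below uses java.util.BitSet as a stand-in for the per-bucket bitmap. It is not Kylin code and not the RoaringBitmap API, it does no compression, and the class name and the 16-bit split are assumptions chosen for the example:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a set of non-negative long ids layered over small BitSets. */
final class LongIdBitmap {
    private static final int LOW_BITS = 16;                    // bits indexed inside one bucket
    private static final long LOW_MASK = (1L << LOW_BITS) - 1;

    // high part of the id -> bitmap of the low parts seen under it
    private final TreeMap<Long, BitSet> buckets = new TreeMap<>();

    void add(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative: " + id);
        }
        buckets.computeIfAbsent(id >>> LOW_BITS, k -> new BitSet())
               .set((int) (id & LOW_MASK));
    }

    boolean contains(long id) {
        if (id < 0) {
            return false;
        }
        BitSet bucket = buckets.get(id >>> LOW_BITS);
        return bucket != null && bucket.get((int) (id & LOW_MASK));
    }

    /** Number of distinct ids; a long, so it may exceed Integer.MAX_VALUE. */
    long cardinality() {
        long total = 0;
        for (BitSet bucket : buckets.values()) {
            total += bucket.cardinality();
        }
        return total;
    }

    /** In-place union, the operation needed to merge per-segment results. */
    void or(LongIdBitmap other) {
        for (Map.Entry<Long, BitSet> e : other.buckets.entrySet()) {
            buckets.computeIfAbsent(e.getKey(), k -> new BitSet()).or(e.getValue());
        }
    }

    public static void main(String[] args) {
        LongIdBitmap day1 = new LongIdBitmap();
        day1.add(5L);
        day1.add(3_000_000_000L);     // larger than Integer.MAX_VALUE
        LongIdBitmap day2 = new LongIdBitmap();
        day2.add(3_000_000_000L);     // duplicate across segments
        day2.add(5_000_000_000L);
        day1.or(day2);
        System.out.println(day1.cardinality()); // prints 3
    }
}
```

The bitmap part is the easy half. The larger change would be everywhere the dictionary id is stored or serialized as an int.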


> On Aug 25, 2016, at 16:40, lxw <[email protected]> wrote:
> 
> 1. Yes, USER_ID has duplicated values between segments: per segment, about 
> 100 million are new and maybe another 150 million are old.
> 2. I think the "Hybrid Model" also has a problem in my scenario. Just like 
> the "default dictionary across segments", this is a "global dictionary 
> across cubes", am I right?
> 
> 
> 
> 
> 
> ------------------ Original ------------------
> From:  "Yerui Sun";<[email protected]>;
> Date:  Thu, Aug 25, 2016 04:22 PM
> To:  "dev"<[email protected]>;
> 
> Subject:  Re: Precisely Count Distinct on 100 million string values column
> 
> 
> 
> That depends on your USER_ID cardinality. I think your USER_ID should have 
> duplicated values between segments; that's why you use count **distinct**. 
> If each USER_ID is always different and shows up only once, a plain count 
> should be fine, with no need for count **distinct**.
> 
> If the USER_ID cardinality is indeed over 2 billion, maybe you need to create 
> one cube every 21 days and combine them into one hybrid cube? I'm not sure 
> whether that would work; you can check 
> http://kylin.apache.org/blog/2015/09/25/hybrid-model/ and give it a try.
> 
>> On Aug 25, 2016, at 12:56, lxw <[email protected]> wrote:
>> 
>> Thanks, I got it.
>> 
>> We have 100 million new USER_IDs per day (segment). Does that mean that 
>> after 21 days the build task will fail?
>> And that we can't use "Precisely Count Distinct" in our scenario?
>> 
>> 
>> 
>> 
>> 
>> ------------------ Original ------------------
>> From:  "Yerui Sun";<[email protected]>;
>> Date:  Thu, Aug 25, 2016 11:55 AM
>> To:  "dev"<[email protected]>;
>> 
>> Subject:  Re: Precisely Count Distinct on 100 million string values column
>> 
>> 
>> 
>> lxw,
>> If the values exceed Integer.MAX_VALUE, an exception will be thrown during 
>> dictionary building.
>> 
>> You can first disable the cube and then edit the JSON in the web UI. The 
>> action button is under "Admins" in the cube list table.
>> 
>> BTW, the 255 limitation could be removed in theory; however, that would make 
>> the logic more complicated. You can give it a try and contribute a patch if 
>> you're interested.
>> 
>> Yiming,
>> I will post a patch with a clearer exception message and some minor 
>> improvements to GlobalDictionary. 
>> But maybe later, it's quite a busy week... 
>> 
>>> On Aug 25, 2016, at 10:05, lxw <[email protected]> wrote:
>>> 
>>> Sorry, 
>>> 
>>> About question 1: 
>>> I mean, if the number of distinct values of the column across all segments 
>>> exceeds Integer.MAX_VALUE, what will happen?
>>> 
>>> 
>>> 
>>> ------------------ Original ------------------
>>> From:  "lxw";<[email protected]>;
>>> Date:  Thu, Aug 25, 2016 10:01 AM
>>> To:  "dev"<[email protected]>;
>>> 
>>> Subject:  Re: Precisely Count Distinct on 100 million string values column
>>> 
>>> 
>>> 
>>> I have 2 more questions:
>>> 
>>> 1. Is the capacity of the global dictionary Integer.MAX_VALUE? If the 
>>> distinct values of the column are counted across all segments, what will 
>>> happen: duplication or an error?
>>> 
>>> 2. Where can I manually edit a cube desc JSON? Currently I use the Java API 
>>> to create or update cubes.
>>> 
>>> Thanks!
>>> 
>>> 
>>> 
>>> ------------------ Original ------------------
>>> From:  "Yiming Liu";<[email protected]>;
>>> Date:  Thu, Aug 25, 2016 09:41 AM
>>> To:  "dev"<[email protected]>; "sunyerui"<[email protected]>;
>>> 
>>> Subject:  Re: Precisely Count Distinct on 100 million string values column
>>> 
>>> 
>>> 
>>> Good find.
>>> 
>>> The code in AppendTrieDictionary, line 604:
>>> 
>>> // nValueBytes
>>> if (n.part.length > 255)
>>>  throw new RuntimeException();
>>> 
>>> Hi Yerui,
>>> 
>>> Could you add more comments about the 255 limit, and throw a more 
>>> meaningful exception?
>>> 
>>> 
>>> 2016-08-24 20:44 GMT+08:00 lxw <[email protected]>:
>>> 
>>>> It was caused by length(USER_ID) > 255.
>>>> After excluding this dirty data, it works.
>>>> 
>>>> 
>>>> With 150 million records in total, I executed this query:
>>>> 
>>>> select city_code,
>>>> sum(bid_request) as bid_request,
>>>> count(distinct user_id) as uv
>>>> from liuxiaowen.TEST_T_PBS_UV_FACT
>>>> group by city_code
>>>> order by uv desc limit 100
>>>> 
>>>> Kylin took 7 seconds and Hive took 180 seconds; the results are the same.
>>>> 
>>>> 
>>>> 
>>>> ------------------ Original ------------------
>>>> From:  "lxw";<[email protected]>;
>>>> Date:  Wed, Aug 24, 2016 05:27 PM
>>>> To:  "dev"<[email protected]>;
>>>> 
>>>> Subject:  Precisely Count Distinct on 100 million string values column
>>>> 
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> I am trying to use "Precisely Count Distinct" on the column "USER_ID", 
>>>> which has 100 million string values. I updated the cube JSON:
>>>> 
>>>> "dictionaries": [
>>>>   {
>>>>     "column": "USER_ID",
>>>>     "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
>>>>   }
>>>> ],
>>>> 
>>>> "override_kylin_properties": {
>>>>   "kylin.job.mr.config.override.mapred.map.child.java.opts": "-Xmx7g",
>>>>   "kylin.job.mr.config.override.mapreduce.map.memory.mb": "7168"
>>>> }
>>>> 
>>>> When I build the cube, an error occurs at "#4 Step Name: Build Dimension 
>>>> Dictionary".
>>>> The error log in "kylin.log":
>>>> 
>>>> 2016-08-24 17:27:53,282 ERROR [pool-7-thread-10] dict.CachedTreeMap:239 :
>>>> write value into /kylin_test1/kylin_metadata_test1/resources/GlobalDict/
>>>> dict/LIUXIAOWEN.TEST_T_PBS_UV_FACT/USER_ID.tmp/cached_
>>>> AQEByQXVzFd8r0YviP4x84YqUv-NcRiuCI2d exception: java.lang.RuntimeException
>>>> java.lang.RuntimeException
>>>>      at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
>>>> build_writeNode(AppendTrieDictionary.java:605)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
>>>> buildTrieBytes(AppendTrieDictionary.java:576)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
>>>> write(AppendTrieDictionary.java:523)
>>>>      at org.apache.kylin.dict.CachedTreeMap.writeValue(
>>>> CachedTreeMap.java:234)
>>>>      at org.apache.kylin.dict.CachedTreeMap.write(
>>>> CachedTreeMap.java:374)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary.flushIndex(
>>>> AppendTrieDictionary.java:1043)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> build(AppendTrieDictionary.java:954)
>>>>      at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
>>>> GlobalDictionaryBuilder.java:82)
>>>>      at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
>>>> DictionaryGenerator.java:81)
>>>>      at org.apache.kylin.dict.DictionaryManager.buildDictionary(
>>>> DictionaryManager.java:323)
>>>>      at org.apache.kylin.cube.CubeManager.buildDictionary(
>>>> CubeManager.java:185)
>>>>      at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:51)
>>>>      at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:42)
>>>>      at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
>>>> CreateDictionaryJob.java:56)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>      at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
>>>> doWork(HadoopShellExecutable.java:63)
>>>>      at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>>      at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(
>>>> DefaultChainedExecutable.java:57)
>>>>      at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>>      at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>>>> JobRunner.run(DefaultScheduler.java:127)
>>>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>> ThreadPoolExecutor.java:1145)
>>>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>>> ThreadPoolExecutor.java:615)
>>>>      at java.lang.Thread.run(Thread.java:744)
>>>> 2016-08-24 17:27:53,340 ERROR [pool-7-thread-10]
>>>> common.HadoopShellExecutable:65 : error execute HadoopShellExecutable{id=
>>>> 3a0f2751-dd2a-4a3b-a27a-58bfc0edbbfd-03, name=Build Dimension Dictionary,
>>>> state=RUNNING}
>>>> java.lang.RuntimeException
>>>>      at org.apache.kylin.dict.CachedTreeMap.writeValue(
>>>> CachedTreeMap.java:240)
>>>>      at org.apache.kylin.dict.CachedTreeMap.write(
>>>> CachedTreeMap.java:374)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary.flushIndex(
>>>> AppendTrieDictionary.java:1043)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> build(AppendTrieDictionary.java:954)
>>>>      at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
>>>> GlobalDictionaryBuilder.java:82)
>>>>      at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
>>>> DictionaryGenerator.java:81)
>>>>      at org.apache.kylin.dict.DictionaryManager.buildDictionary(
>>>> DictionaryManager.java:323)
>>>>      at org.apache.kylin.cube.CubeManager.buildDictionary(
>>>> CubeManager.java:185)
>>>>      at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:51)
>>>>      at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:42)
>>>>      at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
>>>> CreateDictionaryJob.java:56)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>      at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
>>>> doWork(HadoopShellExecutable.java:63)
>>>>      at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>>      at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(
>>>> DefaultChainedExecutable.java:57)
>>>>      at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>>      at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>>>> JobRunner.run(DefaultScheduler.java:127)
>>>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>> ThreadPoolExecutor.java:1145)
>>>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>>> ThreadPoolExecutor.java:615)
>>>>      at java.lang.Thread.run(Thread.java:744)
>>>> 
>>>> And the error log in "kylin.out":
>>>> 
>>>> Aug 24, 2016 5:25:32 PM com.google.common.cache.LocalCache
>>>> processPendingNotifications
>>>> WARNING: Exception thrown by removal listener
>>>> java.lang.RuntimeException
>>>>      at org.apache.kylin.dict.CachedTreeMap.writeValue(
>>>> CachedTreeMap.java:240)
>>>>      at org.apache.kylin.dict.CachedTreeMap.access$300(
>>>> CachedTreeMap.java:52)
>>>>      at org.apache.kylin.dict.CachedTreeMap$1.onRemoval(
>>>> CachedTreeMap.java:149)
>>>>      at com.google.common.cache.LocalCache.processPendingNotifications(
>>>> LocalCache.java:2011)
>>>>      at com.google.common.cache.LocalCache$Segment.
>>>> runUnlockedCleanup(LocalCache.java:3501)
>>>>      at com.google.common.cache.LocalCache$Segment.
>>>> postWriteCleanup(LocalCache.java:3477)
>>>>      at com.google.common.cache.LocalCache$Segment.put(
>>>> LocalCache.java:2940)
>>>>      at com.google.common.cache.LocalCache.put(LocalCache.java:4202)
>>>>      at com.google.common.cache.LocalCache$LocalManualCache.
>>>> put(LocalCache.java:4798)
>>>>      at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:284)
>>>>      at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:52)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> addValue(AppendTrieDictionary.java:829)
>>>>      at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> addValue(AppendTrieDictionary.java:804)
>>>>      at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
>>>> GlobalDictionaryBuilder.java:78)
>>>>      at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
>>>> DictionaryGenerator.java:81)
>>>>      at org.apache.kylin.dict.DictionaryManager.buildDictionary(
>>>> DictionaryManager.java:323)
>>>>      at org.apache.kylin.cube.CubeManager.buildDictionary(
>>>> CubeManager.java:185)
>>>>      at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:51)
>>>>      at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:42)
>>>>      at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
>>>> CreateDictionaryJob.java:56)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>      at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
>>>> doWork(HadoopShellExecutable.java:63)
>>>>      at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>>      at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(
>>>> DefaultChainedExecutable.java:57)
>>>>      at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>>      at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>>>> JobRunner.run(DefaultScheduler.java:127)
>>>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>> ThreadPoolExecutor.java:1145)
>>>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>>> ThreadPoolExecutor.java:615)
>>>>      at java.lang.Thread.run(Thread.java:744)
>>>> 
>>>> usage: CreateDictionaryJob
>>>> -cubename <cubename>         Cube name. For exmaple, flat_item_cube
>>>> -input <input>               Input path
>>>> -segmentname <segmentname>   Cube segment name
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> With Warm regards
>>> 
>>> Yiming Liu
