Hmm... You're right, a hybrid cube couldn't resolve your problem. Counting distinct precisely on such a huge dataset is really a challenge. A possible solution is to expand the dictionary id from int to bigint and make RoaringBitmap support bigint. However, that needs quite a lot of changes to the current code.
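
For reference, here is a rough sketch of the arithmetic behind the int id limit. It assumes every distinct value consumes one dictionary id and that ids are never reused; I have not checked that against the GlobalDictionary code. The class and method names are made up for illustration and are not part of Kylin.

```java
public class DictIdCapacity {

    /**
     * Number of whole days of new ids that fit into an id space of size maxId,
     * assuming newIdsPerDay brand-new distinct values arrive every day.
     */
    static long fullDaysBeforeOverflow(long maxId, long newIdsPerDay) {
        if (newIdsPerDay <= 0) {
            throw new IllegalArgumentException("newIdsPerDay must be positive");
        }
        return maxId / newIdsPerDay;
    }

    public static void main(String[] args) {
        long newIdsPerDay = 100_000_000L; // 100 million new USER_IDs per segment

        // int ids: Integer.MAX_VALUE = 2,147,483,647
        System.out.println("int ids last (days):  "
                + fullDaysBeforeOverflow(Integer.MAX_VALUE, newIdsPerDay));

        // long (bigint) ids: Long.MAX_VALUE = 9,223,372,036,854,775,807
        System.out.println("long ids last (days): "
                + fullDaysBeforeOverflow(Long.MAX_VALUE, newIdsPerDay));
    }
}
```

With int ids this gives 21 full days, which matches the "21 days" figure mentioned below. With long ids the id space itself stops being the bottleneck, although the bitmap side would still need to handle 64-bit values.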
> On Aug 25, 2016, at 16:40, lxw <[email protected]> wrote:
>
> 1. Yes, USER_ID has duplicated values between segments: 100 million are new,
> maybe another 150 million are old per segment.
> 2. I think the "Hybrid Model" also has a problem in my scenario. Just like the
> "default dictionary cross segments", this is "global dictionary cross cubes", am I
> right?
>
>
> ------------------ Original ------------------
> From: "Yerui Sun";<[email protected]>;
> Date: Thu, Aug 25, 2016 04:22 PM
> To: "dev"<[email protected]>;
>
> Subject: Re: Precisely Count Distinct on 100 million string values column
>
>
>
> That depends on your USER_ID cardinality. I think your USER_ID should have
> duplicated values between segments, that's why you use count **distinct**.
> If the USER_IDs are always different and show up only once, a plain count should be
> fine, no need to count **distinct**.
>
> If the USER_ID cardinality is indeed over 2 billion, maybe you need to create one
> cube every 21 days, and combine them into one hybrid cube? I'm not sure
> whether it works, you can check
> http://kylin.apache.org/blog/2015/09/25/hybrid-model/ and have a try.
>
>> On Aug 25, 2016, at 12:56, lxw <[email protected]> wrote:
>>
>> Thanks, I got it.
>>
>> We have 100 million new USER_IDs per day (segment). Does that mean that after 21
>> days the build task will fail?
>> And we can't use "Precisely Count Distinct" in our scenario?
>>
>>
>> ------------------ Original ------------------
>> From: "Yerui Sun";<[email protected]>;
>> Date: Thu, Aug 25, 2016 11:55 AM
>> To: "dev"<[email protected]>;
>>
>> Subject: Re: Precisely Count Distinct on 100 million string values column
>>
>>
>>
>> lxw,
>> If the values exceed Integer.MAX_VALUE, an exception will be thrown during
>> dictionary building.
>>
>> You can first disable the cube and then edit the JSON in the web UI. The action
>> button is in the "Admins" of the cube list table.
>>
>> BTW, the 255 limitation could be removed in theory, however, that would make the
>> logic more complicated. You can have a try and contribute the patch if
>> you're interested.
>>
>> Yiming,
>> I will post a patch with a clearer exception message and some minor
>> improvements to GlobalDictionary.
>> But maybe later, it's quite a busy week...
>>
>>> On Aug 25, 2016, at 10:05, lxw <[email protected]> wrote:
>>>
>>> Sorry,
>>>
>>> About question 1,
>>> I mean: if the count of distinct values of the column across all segments exceeds
>>> Integer.MAX_VALUE, what will happen?
>>>
>>>
>>>
>>> ------------------ Original ------------------
>>> From: "lxw";<[email protected]>;
>>> Date: Thu, Aug 25, 2016 10:01 AM
>>> To: "dev"<[email protected]>;
>>>
>>> Subject: Re: Precisely Count Distinct on 100 million string values column
>>>
>>>
>>>
>>> I have 2 more questions:
>>>
>>> 1. Is the capacity of the global dictionary Integer.MAX_VALUE? If counting
>>> distinct values of the column across all segments, what will happen?
>>> Duplication or an error?
>>>
>>> 2. Where can I manually edit a cube desc JSON? Currently I use the Java API to create
>>> or update cubes.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> ------------------ Original ------------------
>>> From: "Yiming Liu";<[email protected]>;
>>> Date: Thu, Aug 25, 2016 09:41 AM
>>> To: "dev"<[email protected]>; "sunyerui"<[email protected]>;
>>>
>>> Subject: Re: Precisely Count Distinct on 100 million string values column
>>>
>>>
>>>
>>> Good find.
>>>
>>> The code at AppendTrieDictionary line 604:
>>>
>>> // nValueBytes
>>> if (n.part.length > 255)
>>> throw new RuntimeException();
>>>
>>> Hi Yerui,
>>>
>>> Could you add more comments for the 255 limit, with a more meaningful
>>> exception?
>>>
>>>
>>> 2016-08-24 20:44 GMT+08:00 lxw <[email protected]>:
>>>
>>>> It was caused by length(USER_ID) > 255.
>>>> After excluding this dirty data, it works.
>>>>
>>>>
>>>> Total 150 million records, execute this query:
>>>>
>>>> select city_code,
>>>> sum(bid_request) as bid_request,
>>>> count(distinct user_id) as uv
>>>> from liuxiaowen.TEST_T_PBS_UV_FACT
>>>> group by city_code
>>>> order by uv desc limit 100
>>>>
>>>> Kylin took 7 seconds and Hive took 180 seconds; the results are the same.
>>>>
>>>>
>>>>
>>>> ------------------ Original ------------------
>>>> From: "lxw";<[email protected]>;
>>>> Date: Wed, Aug 24, 2016 05:27 PM
>>>> To: "dev"<[email protected]>;
>>>>
>>>> Subject: Precisely Count Distinct on 100 million string values column
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I am trying to use "Precisely Count Distinct" on a 100 million string
>>>> values column "USER_ID". I updated the cube JSON:
>>>> "dictionaries": [ { "column": "USER_ID", "builder":
>>>> "org.apache.kylin.dict.GlobalDictionaryBuilder" } ],
>>>>
>>>> "override_kylin_properties": {
>>>> "kylin.job.mr.config.override.mapred.map.child.java.opts":
>>>> "-Xmx7g", "kylin.job.mr.config.override.mapreduce.map.memory.mb":
>>>> "7168" }
>>>> When I build the cube, an error occurs at "#4 Step Name:
>>>> Build Dimension Dictionary".
>>>> The error log in "kylin.log":
>>>>
>>>> 2016-08-24 17:27:53,282 ERROR [pool-7-thread-10] dict.CachedTreeMap:239 :
>>>> write value into /kylin_test1/kylin_metadata_test1/resources/GlobalDict/
>>>> dict/LIUXIAOWEN.TEST_T_PBS_UV_FACT/USER_ID.tmp/cached_
>>>> AQEByQXVzFd8r0YviP4x84YqUv-NcRiuCI2d exception: java.lang.RuntimeException
>>>> java.lang.RuntimeException
>>>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
>>>> build_writeNode(AppendTrieDictionary.java:605)
>>>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
>>>> buildTrieBytes(AppendTrieDictionary.java:576)
>>>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
>>>> write(AppendTrieDictionary.java:523)
>>>> at org.apache.kylin.dict.CachedTreeMap.writeValue(
>>>> CachedTreeMap.java:234)
>>>> at org.apache.kylin.dict.CachedTreeMap.write(
>>>> CachedTreeMap.java:374)
>>>> at org.apache.kylin.dict.AppendTrieDictionary.flushIndex(
>>>> AppendTrieDictionary.java:1043)
>>>> at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> build(AppendTrieDictionary.java:954)
>>>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
>>>> GlobalDictionaryBuilder.java:82)
>>>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
>>>> DictionaryGenerator.java:81)
>>>> at org.apache.kylin.dict.DictionaryManager.buildDictionary(
>>>> DictionaryManager.java:323)
>>>> at org.apache.kylin.cube.CubeManager.buildDictionary(
>>>> CubeManager.java:185)
>>>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:51)
>>>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:42)
>>>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
>>>> CreateDictionaryJob.java:56)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
>>>> doWork(HadoopShellExecutable.java:63)
>>>> at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(
>>>> DefaultChainedExecutable.java:57)
>>>> at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>>>> JobRunner.run(DefaultScheduler.java:127)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>> ThreadPoolExecutor.java:1145)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>>> ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:744)
>>>> 2016-08-24 17:27:53,340 ERROR [pool-7-thread-10]
>>>> common.HadoopShellExecutable:65 : error execute HadoopShellExecutable{id=
>>>> 3a0f2751-dd2a-4a3b-a27a-58bfc0edbbfd-03, name=Build Dimension Dictionary,
>>>> state=RUNNING}
>>>> java.lang.RuntimeException
>>>> at org.apache.kylin.dict.CachedTreeMap.writeValue(
>>>> CachedTreeMap.java:240)
>>>> at org.apache.kylin.dict.CachedTreeMap.write(
>>>> CachedTreeMap.java:374)
>>>> at org.apache.kylin.dict.AppendTrieDictionary.flushIndex(
>>>> AppendTrieDictionary.java:1043)
>>>> at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> build(AppendTrieDictionary.java:954)
>>>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
>>>> GlobalDictionaryBuilder.java:82)
>>>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
>>>> DictionaryGenerator.java:81)
>>>> at org.apache.kylin.dict.DictionaryManager.buildDictionary(
>>>> DictionaryManager.java:323)
>>>> at org.apache.kylin.cube.CubeManager.buildDictionary(
>>>> CubeManager.java:185)
>>>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:51)
>>>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:42)
>>>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
>>>> CreateDictionaryJob.java:56)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
>>>> doWork(HadoopShellExecutable.java:63)
>>>> at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(
>>>> DefaultChainedExecutable.java:57)
>>>> at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>>>> JobRunner.run(DefaultScheduler.java:127)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>> ThreadPoolExecutor.java:1145)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>>> ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> and the error log in "kylin.out" :
>>>>
>>>> Aug 24, 2016 5:25:32 PM com.google.common.cache.LocalCache
>>>> processPendingNotifications
>>>> WARNING: Exception thrown by removal listener
>>>> java.lang.RuntimeException
>>>> at org.apache.kylin.dict.CachedTreeMap.writeValue(
>>>> CachedTreeMap.java:240)
>>>> at org.apache.kylin.dict.CachedTreeMap.access$300(
>>>> CachedTreeMap.java:52)
>>>> at org.apache.kylin.dict.CachedTreeMap$1.onRemoval(
>>>> CachedTreeMap.java:149)
>>>> at com.google.common.cache.LocalCache.processPendingNotifications(
>>>> LocalCache.java:2011)
>>>> at com.google.common.cache.LocalCache$Segment.
>>>> runUnlockedCleanup(LocalCache.java:3501)
>>>> at com.google.common.cache.LocalCache$Segment.
>>>> postWriteCleanup(LocalCache.java:3477)
>>>> at com.google.common.cache.LocalCache$Segment.put(
>>>> LocalCache.java:2940)
>>>> at com.google.common.cache.LocalCache.put(LocalCache.java:4202)
>>>> at com.google.common.cache.LocalCache$LocalManualCache.
>>>> put(LocalCache.java:4798)
>>>> at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:284)
>>>> at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:52)
>>>> at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> addValue(AppendTrieDictionary.java:829)
>>>> at org.apache.kylin.dict.AppendTrieDictionary$Builder.
>>>> addValue(AppendTrieDictionary.java:804)
>>>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
>>>> GlobalDictionaryBuilder.java:78)
>>>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
>>>> DictionaryGenerator.java:81)
>>>> at org.apache.kylin.dict.DictionaryManager.buildDictionary(
>>>> DictionaryManager.java:323)
>>>> at org.apache.kylin.cube.CubeManager.buildDictionary(
>>>> CubeManager.java:185)
>>>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:51)
>>>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
>>>> processSegment(DictionaryGeneratorCLI.java:42)
>>>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
>>>> CreateDictionaryJob.java:56)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
>>>> doWork(HadoopShellExecutable.java:63)
>>>> at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(
>>>> DefaultChainedExecutable.java:57)
>>>> at org.apache.kylin.job.execution.AbstractExecutable.
>>>> execute(AbstractExecutable.java:112)
>>>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
>>>> JobRunner.run(DefaultScheduler.java:127)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>> ThreadPoolExecutor.java:1145)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>>> ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:744)
>>>>
>>>> usage: CreateDictionaryJob
>>>> -cubename <cubename> Cube name.
For exmaple, flat_item_cube
>>>> -input <input> Input path
>>>> -segmentname <segmentname> Cube segment name
>>>>
>>>
>>>
>>>
>>> --
>>> With Warm regards
>>>
>>> Yiming Liu
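
As a follow-up to the 255 limit discussed in this thread, here is a minimal sketch of how oversized values could be detected or filtered before dictionary building, with a more descriptive exception than a bare RuntimeException. It assumes values are encoded as UTF-8 and that the limit applies to the whole encoded value, which matches what lxw observed but is not verified against AppendTrieDictionary. The class, method, and constant names are hypothetical and not part of Kylin.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DictValueLengthCheck {

    // Assumed limit, taken from the "n.part.length > 255" check quoted in the thread.
    static final int MAX_VALUE_BYTES = 255;

    /** True if the UTF-8 encoding of the value is at most MAX_VALUE_BYTES long. */
    static boolean fitsLimit(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_VALUE_BYTES;
    }

    /** Fails with a descriptive message instead of a bare RuntimeException. */
    static void requireFitsLimit(String column, String value) {
        int length = value.getBytes(StandardCharsets.UTF_8).length;
        if (length > MAX_VALUE_BYTES) {
            throw new IllegalArgumentException("Value of column " + column + " is "
                    + length + " bytes, which exceeds the limit of "
                    + MAX_VALUE_BYTES + " bytes");
        }
    }

    /** Returns only the values that fit the limit, preserving their order. */
    static List<String> dropOversized(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (fitsLimit(value)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        char[] chars = new char[256];
        Arrays.fill(chars, 'a');
        String tooLong = new String(chars); // 256 ASCII chars = 256 UTF-8 bytes

        // Only the short value survives the filter.
        System.out.println(dropOversized(Arrays.asList("user_1", tooLong)));
    }
}
```

In practice the same filter could be applied upstream, for example in the Hive query that feeds the cube, which is effectively what excluding the dirty data did.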
