Re: Precisely Count Distinct on 100 million string values column

Yiming Liu Wed, 24 Aug 2016 20:59:44 -0700

Thanks Yerui for the response.

2016-08-25 11:55 GMT+08:00 Yerui Sun <[email protected]>:


> lxw,
> If the number of values exceeds Integer.MAX_VALUE, an exception will be
> thrown during dictionary building.
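
To illustrate the overflow behaviour described here, below is a minimal sketch of an ID allocator that fails loudly instead of wrapping around once the int range is used up. IdAllocator and allocate() are hypothetical names for illustration only; this is not the GlobalDictionary code, just the general idea of rejecting new values rather than handing out duplicate or negative IDs.

```java
/** Hypothetical sketch (not Kylin code): hands out int IDs and refuses to wrap around. */
final class IdAllocator {
    private int nextId;         // next ID to hand out
    private boolean exhausted;  // true once Integer.MAX_VALUE has been handed out

    IdAllocator(int firstId) {
        this.nextId = firstId;
    }

    /** Returns the next unused ID, or throws once the int range is used up. */
    int allocate() {
        if (exhausted) {
            throw new IllegalStateException(
                "Dictionary is full: no int IDs left after " + Integer.MAX_VALUE);
        }
        int id = nextId;
        if (nextId == Integer.MAX_VALUE) {
            exhausted = true;   // do not increment: that would overflow to a negative ID
        } else {
            nextId++;
        }
        return id;
    }

    public static void main(String[] args) {
        // Start close to the limit so the failure is easy to see.
        IdAllocator ids = new IdAllocator(Integer.MAX_VALUE - 1);
        System.out.println(ids.allocate());
        System.out.println(ids.allocate());
        try {
            ids.allocate();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```
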
>
> You can first disable the cube and then edit the JSON in the web UI. The
> action button is under ‘Admins’ in the cube list table.
>
> BTW, the 255 limitation could be removed in theory; however, that would
> make the logic more complicated. You can give it a try and contribute a
> patch if you’re interested.
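
As a starting point for such a patch, a more descriptive check could look roughly like the sketch below. It assumes the 255 limit comes from the length being written as a single unsigned byte (the "// nValueBytes" comment in the snippet quoted further down suggests this, but I have not verified it). It also checks the whole value rather than each trie node part, which is stricter than the real check. NodePartCheck, checkedBytes and the column argument are made-up names, not existing Kylin API.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not Kylin code): validates a value before it goes into the trie. */
final class NodePartCheck {
    // Assumption: the limit exists because the length is stored in one unsigned byte.
    static final int MAX_PART_BYTES = 255;

    /** Returns the UTF-8 bytes of the value, or throws with a message that names the problem. */
    static byte[] checkedBytes(String column, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > MAX_PART_BYTES) {
            // Only show the start of the value so the log line stays readable.
            String preview = value.length() > 32 ? value.substring(0, 32) + "..." : value;
            throw new IllegalArgumentException(
                "Value in column " + column + " is " + bytes.length
                + " bytes, which exceeds the " + MAX_PART_BYTES
                + "-byte limit; value starts with '" + preview + "'");
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(checkedBytes("USER_ID", "user-42").length);

        StringBuilder tooLong = new StringBuilder();
        for (int i = 0; i < 300; i++) {
            tooLong.append('x');
        }
        try {
            checkedBytes("USER_ID", tooLong.toString());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a message like this, the build log would point at the column and the offending value instead of a bare java.lang.RuntimeException.
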
>
> Yiming,
> I will post a patch with a clearer exception message and some minor
> improvements to GlobalDictionary.
> But maybe later, it’s quite a busy week...
>
> > On Aug 25, 2016, at 10:05, lxw <[email protected]> wrote:
> >
> > Sorry,
> >
> > About question 1,
> > I mean: if the number of distinct values of the column across all
> > segments exceeds Integer.MAX_VALUE, what will happen?
> >
> >
> >
> > ------------------ Original Message ------------------
> > From: "lxw";<[email protected]>;
> > Sent: Thursday, Aug 25, 2016, 10:01 AM
> > To: "dev"<[email protected]>;
> >
> > Subject: Re: Precisely Count Distinct on 100 million string values column
> >
> >
> >
> > I have 2 more questions:
> >
> > 1. Is the capacity of the global dictionary Integer.MAX_VALUE? If
> > distinct values of the column are counted across all segments, what will
> > happen: duplication or an error?
> >
> > 2. Where can I manually edit a cube desc JSON? Right now I use the Java
> > API to create or update cubes.
> >
> > Thanks!
> >
> >
> >
> > ------------------ Original Message ------------------
> > From: "Yiming Liu";<[email protected]>;
> > Sent: Thursday, Aug 25, 2016, 9:41 AM
> > To: "dev"<[email protected]>; "sunyerui"<[email protected]>;
> >
> > Subject: Re: Precisely Count Distinct on 100 million string values column
> >
> >
> >
> > Good find.
> >
> > The code in AppendTrieDictionary, line 604:
> >
> > // nValueBytes
> > if (n.part.length > 255)
> >    throw new RuntimeException();
> >
> > Hi Yerui,
> >
> > Could you add more comments about the 255 limit, and throw a more
> > meaningful exception?
> >
> >
> > 2016-08-24 20:44 GMT+08:00 lxw <[email protected]>:
> >
> >> It was caused by length(USER_ID) > 255.
> >> After excluding the dirty data, it works.
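
One thing to keep in mind when excluding the dirty data: the check in AppendTrieDictionary appears to be on a byte array (n.part.length), while a SQL length() typically counts characters. A value well under 255 characters can therefore still be over 255 bytes if it contains multi-byte characters. Here is a minimal sketch of a byte-based filter, assuming the values are encoded as UTF-8 (an assumption on my side, as are the names DirtyValueFilter, fits and keepFitting):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-filter (not Kylin code): keeps only values that fit in 255 UTF-8 bytes. */
final class DirtyValueFilter {
    static final int MAX_BYTES = 255;

    /** True if the UTF-8 encoding of the value is at most MAX_BYTES long. */
    static boolean fits(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES;
    }

    /** Returns the values that fit, in their original order. */
    static List<String> keepFitting(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (fits(v)) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // 100 CJK characters: only 100 chars, but 3 bytes each in UTF-8.
        StringBuilder cjk = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            cjk.append('\u4e2d');
        }
        List<String> input = Arrays.asList("user-1", cjk.toString(), "user-2");
        System.out.println(keepFitting(input));
    }
}
```

The same idea can be applied upstream, wherever the source table is cleaned, as long as the comparison is done on encoded bytes rather than on character count.
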
> >>
> >>
> >> Total 150 million records, execute this query:
> >>
> >> select city_code,
> >> sum(bid_request) as bid_request,
> >> count(distinct user_id) as uv
> >> from liuxiaowen.TEST_T_PBS_UV_FACT
> >> group by city_code
> >> order by uv desc limit 100
> >>
> >> Kylin took 7 seconds and Hive took 180 seconds; the results are the same.
> >>
> >>
> >>
> >> ------------------ Original ------------------
> >> From:  "lxw";<[email protected]>;
> >> Date:  Wed, Aug 24, 2016 05:27 PM
> >> To:  "dev"<[email protected]>;
> >>
> >> Subject:  Precisely Count Distinct on 100 million string values column
> >>
> >>
> >>
> >> Hi,
> >>
> >>    I am trying to use "Precisely Count Distinct" on a column "USER_ID"
> >> with 100 million string values. I updated the cube JSON:
> >>
> >> "dictionaries": [
> >>   {
> >>     "column": "USER_ID",
> >>     "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder"
> >>   }
> >> ],
> >>
> >> "override_kylin_properties": {
> >>   "kylin.job.mr.config.override.mapred.map.child.java.opts": "-Xmx7g",
> >>   "kylin.job.mr.config.override.mapreduce.map.memory.mb": "7168"
> >> }
> >>
> >> When I build the cube, an error occurs at "#4 Step Name: Build Dimension
> >> Dictionary".
> >> The error log in "kylin.log":
> >>
> >> 2016-08-24 17:27:53,282 ERROR [pool-7-thread-10] dict.CachedTreeMap:239
> :
> >> write value into /kylin_test1/kylin_metadata_
> test1/resources/GlobalDict/
> >> dict/LIUXIAOWEN.TEST_T_PBS_UV_FACT/USER_ID.tmp/cached_
> >> AQEByQXVzFd8r0YviP4x84YqUv-NcRiuCI2d exception:
> java.lang.RuntimeException
> >> java.lang.RuntimeException
> >>        at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
> >> build_writeNode(AppendTrieDictionary.java:605)
> >>        at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
> >> buildTrieBytes(AppendTrieDictionary.java:576)
> >>        at org.apache.kylin.dict.AppendTrieDictionary$DictNode.
> >> write(AppendTrieDictionary.java:523)
> >>        at org.apache.kylin.dict.CachedTreeMap.writeValue(
> >> CachedTreeMap.java:234)
> >>        at org.apache.kylin.dict.CachedTreeMap.write(
> >> CachedTreeMap.java:374)
> >>        at org.apache.kylin.dict.AppendTrieDictionary.flushIndex(
> >> AppendTrieDictionary.java:1043)
> >>        at org.apache.kylin.dict.AppendTrieDictionary$Builder.
> >> build(AppendTrieDictionary.java:954)
> >>        at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
> >> GlobalDictionaryBuilder.java:82)
> >>        at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
> >> DictionaryGenerator.java:81)
> >>        at org.apache.kylin.dict.DictionaryManager.buildDictionary(
> >> DictionaryManager.java:323)
> >>        at org.apache.kylin.cube.CubeManager.buildDictionary(
> >> CubeManager.java:185)
> >>        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
> >> processSegment(DictionaryGeneratorCLI.java:51)
> >>        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
> >> processSegment(DictionaryGeneratorCLI.java:42)
> >>        at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
> >> CreateDictionaryJob.java:56)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >>        at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
> >> doWork(HadoopShellExecutable.java:63)
> >>        at org.apache.kylin.job.execution.AbstractExecutable.
> >> execute(AbstractExecutable.java:112)
> >>        at org.apache.kylin.job.execution.DefaultChainedExecutable.
> doWork(
> >> DefaultChainedExecutable.java:57)
> >>        at org.apache.kylin.job.execution.AbstractExecutable.
> >> execute(AbstractExecutable.java:112)
> >>        at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
> >> JobRunner.run(DefaultScheduler.java:127)
> >>        at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1145)
> >>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:615)
> >>        at java.lang.Thread.run(Thread.java:744)
> >> 2016-08-24 17:27:53,340 ERROR [pool-7-thread-10]
> >> common.HadoopShellExecutable:65 : error execute
> HadoopShellExecutable{id=
> >> 3a0f2751-dd2a-4a3b-a27a-58bfc0edbbfd-03, name=Build Dimension
> Dictionary,
> >> state=RUNNING}
> >> java.lang.RuntimeException
> >>        at org.apache.kylin.dict.CachedTreeMap.writeValue(
> >> CachedTreeMap.java:240)
> >>        at org.apache.kylin.dict.CachedTreeMap.write(
> >> CachedTreeMap.java:374)
> >>        at org.apache.kylin.dict.AppendTrieDictionary.flushIndex(
> >> AppendTrieDictionary.java:1043)
> >>        at org.apache.kylin.dict.AppendTrieDictionary$Builder.
> >> build(AppendTrieDictionary.java:954)
> >>        at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
> >> GlobalDictionaryBuilder.java:82)
> >>        at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
> >> DictionaryGenerator.java:81)
> >>        at org.apache.kylin.dict.DictionaryManager.buildDictionary(
> >> DictionaryManager.java:323)
> >>        at org.apache.kylin.cube.CubeManager.buildDictionary(
> >> CubeManager.java:185)
> >>        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
> >> processSegment(DictionaryGeneratorCLI.java:51)
> >>        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
> >> processSegment(DictionaryGeneratorCLI.java:42)
> >>        at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
> >> CreateDictionaryJob.java:56)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >>        at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
> >> doWork(HadoopShellExecutable.java:63)
> >>        at org.apache.kylin.job.execution.AbstractExecutable.
> >> execute(AbstractExecutable.java:112)
> >>        at org.apache.kylin.job.execution.DefaultChainedExecutable.
> doWork(
> >> DefaultChainedExecutable.java:57)
> >>        at org.apache.kylin.job.execution.AbstractExecutable.
> >> execute(AbstractExecutable.java:112)
> >>        at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
> >> JobRunner.run(DefaultScheduler.java:127)
> >>        at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1145)
> >>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:615)
> >>        at java.lang.Thread.run(Thread.java:744)
> >>
> >>    and the error log in "kylin.out" :
> >>
> >> Aug 24, 2016 5:25:32 PM com.google.common.cache.LocalCache
> >> processPendingNotifications
> >> WARNING: Exception thrown by removal listener
> >> java.lang.RuntimeException
> >>        at org.apache.kylin.dict.CachedTreeMap.writeValue(
> >> CachedTreeMap.java:240)
> >>        at org.apache.kylin.dict.CachedTreeMap.access$300(
> >> CachedTreeMap.java:52)
> >>        at org.apache.kylin.dict.CachedTreeMap$1.onRemoval(
> >> CachedTreeMap.java:149)
> >>        at com.google.common.cache.LocalCache.
> processPendingNotifications(
> >> LocalCache.java:2011)
> >>        at com.google.common.cache.LocalCache$Segment.
> >> runUnlockedCleanup(LocalCache.java:3501)
> >>        at com.google.common.cache.LocalCache$Segment.
> >> postWriteCleanup(LocalCache.java:3477)
> >>        at com.google.common.cache.LocalCache$Segment.put(
> >> LocalCache.java:2940)
> >>        at com.google.common.cache.LocalCache.put(LocalCache.java:4202)
> >>        at com.google.common.cache.LocalCache$LocalManualCache.
> >> put(LocalCache.java:4798)
> >>        at org.apache.kylin.dict.CachedTreeMap.put(
> CachedTreeMap.java:284)
> >>        at org.apache.kylin.dict.CachedTreeMap.put(
> CachedTreeMap.java:52)
> >>        at org.apache.kylin.dict.AppendTrieDictionary$Builder.
> >> addValue(AppendTrieDictionary.java:829)
> >>        at org.apache.kylin.dict.AppendTrieDictionary$Builder.
> >> addValue(AppendTrieDictionary.java:804)
> >>        at org.apache.kylin.dict.GlobalDictionaryBuilder.build(
> >> GlobalDictionaryBuilder.java:78)
> >>        at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(
> >> DictionaryGenerator.java:81)
> >>        at org.apache.kylin.dict.DictionaryManager.buildDictionary(
> >> DictionaryManager.java:323)
> >>        at org.apache.kylin.cube.CubeManager.buildDictionary(
> >> CubeManager.java:185)
> >>        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
> >> processSegment(DictionaryGeneratorCLI.java:51)
> >>        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.
> >> processSegment(DictionaryGeneratorCLI.java:42)
> >>        at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(
> >> CreateDictionaryJob.java:56)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >>        at org.apache.kylin.engine.mr.common.HadoopShellExecutable.
> >> doWork(HadoopShellExecutable.java:63)
> >>        at org.apache.kylin.job.execution.AbstractExecutable.
> >> execute(AbstractExecutable.java:112)
> >>        at org.apache.kylin.job.execution.DefaultChainedExecutable.
> doWork(
> >> DefaultChainedExecutable.java:57)
> >>        at org.apache.kylin.job.execution.AbstractExecutable.
> >> execute(AbstractExecutable.java:112)
> >>        at org.apache.kylin.job.impl.threadpool.DefaultScheduler$
> >> JobRunner.run(DefaultScheduler.java:127)
> >>        at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1145)
> >>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:615)
> >>        at java.lang.Thread.run(Thread.java:744)
> >>
> >> usage: CreateDictionaryJob
> >> -cubename <cubename>         Cube name. For exmaple, flat_item_cube
> >> -input <input>               Input path
> >> -segmentname <segmentname>   Cube segment name
> >>
> >
> >
> >
> > --
> > With Warm regards
> >
> > Yiming Liu (刘一鸣)
>
>


-- 
With Warm regards

Yiming Liu (刘一鸣)
