Hi Zhixin,

Thanks for the feedback.
We no longer use "distribute by rand()" because it may lose data. Instead, we now distribute by multiple columns. In the current version of Kylin, the behavior is:

1. If a "shard by" column is specified, Kylin uses it to distribute;
2. Otherwise, it uses the first 3 columns of the rowkey to distribute.

Usually, we order the rowkey columns by cardinality, with high-cardinality columns ahead of low-cardinality ones, so we assume the first 3 columns make the distribution even. But there may be other cases, as you mentioned: the first 3 columns are all low cardinality, or they are correlated, which may cause data skew. In such cases, distributing randomly or by all columns may be better.

Please feel free to report a JIRA on this, with your observations. Thank you!

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: [email protected]
Kyligence Inc: https://kyligence.io/
Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]

liuzhixin <[email protected]> wrote on Fri, Nov 2, 2018 at 8:47 PM:

> Hi ShaoFeng Shi:
>
> High-cardinality dimensions in the table (e.g. request_id or timestamp) cause dimension explosion, which led to OOM;
> and some other dimensions with somewhat lower (but still high) cardinality are unevenly distributed themselves, so the data ends up unevenly distributed as well;
> #
> Much of the data is inherently unevenly distributed. Without rand(), how should Kylin handle this?
>
> Best Wishes
>
> > On Nov 2, 2018, at 1:42 PM, ShaoFeng Shi <[email protected]> wrote:
> >
> > Please move the high-cardinality dimensions to the leading position of the
> > rowkey; that will make the data distribution more even.
> >
> > Chao Long <[email protected]> wrote on Fri, Nov 2, 2018 at 1:38 PM:
> >
> >> Hi zhixin,
> >> Data may become incorrect if you use "distribute by rand()".
> >> https://issues.apache.org/jira/browse/KYLIN-3388
> >>
> >> ------------------ Original Message ------------------
> >> From: "liuzhixin"<[email protected]>;
> >> Sent: Friday, Nov 2, 2018, 12:53 PM
> >> To: "dev"<[email protected]>;
> >> Cc: "ShaoFeng Shi"<[email protected]>;
> >> Subject: Re: Redistribute intermediate table default not by rand()
> >>
> >> Hi kylin team:
> >>
> >> Step: Redistribute intermediate table
> >> #
> >> By default, the first three dimension columns are used as the DISTRIBUTE BY keys, instead of DISTRIBUTE BY RAND().
> >> If there are no suitable dimension columns, this default strategy makes the data even more skewed.
> >>
> >> Best Regards!
> >>
> >>> On Nov 2, 2018, at 12:03 PM, liuzhixin <[email protected]> wrote:
> >>>
> >>> Hi kylin team:
> >>>
> >>> Version: Kylin2.5-hadoop3.1 for hdp3.0
> >>> #
> >>> Step: Redistribute intermediate table
> >>> #
> >>> The DISTRIBUTE BY statement is:
> >>> INSERT OVERWRITE TABLE table_intermediate SELECT * FROM table_intermediate DISTRIBUTE BY Field1, Field2, Field3;
> >>> #
> >>> Not DISTRIBUTE BY RAND()
> >>> #
> >>> Is DISTRIBUTE BY Field1, Field2, Field3 the default? How can I use DISTRIBUTE BY RAND() instead?
> >>>
> >>> Best wishes.
> >>>
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi 史少锋
> >
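
To illustrate the skew case discussed in this thread, here is a minimal Python sketch (it is not Kylin or Hive code). It assumes rows are assigned to reducers by hashing the DISTRIBUTE BY key. The SHA-256-based hash, the helper names make_rows and reducer_loads, and the sample columns (country, device, is_new, request_id) are all hypothetical and only stand in for Hive's real partitioning.

```python
import hashlib


def make_rows(n):
    """Build n sample rows: three low-cardinality columns plus a unique request_id."""
    countries = ["CN", "US", "DE"]
    devices = ["ios", "android"]
    return [
        {
            "country": countries[i % 3],
            "device": devices[i % 2],
            # is_new is fully correlated with device, so it adds no new key values.
            "is_new": i % 2 == 0,
            "request_id": f"req-{i}",
        }
        for i in range(n)
    ]


def reducer_loads(rows, key_columns, num_reducers):
    """Return the row count per reducer when rows are hash-partitioned on key_columns."""
    loads = [0] * num_reducers
    for row in rows:
        key = repr(tuple(row[c] for c in key_columns)).encode("utf-8")
        # Stable stand-in hash (an assumption; Hive uses its own hash function).
        digest = hashlib.sha256(key).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_reducers
        loads[bucket] += 1
    return loads


rows = make_rows(10_000)
low = reducer_loads(rows, ["country", "device", "is_new"], 50)
high = reducer_loads(rows, ["request_id"], 50)
print("low-cardinality keys: busy reducers =", sum(1 for n in low if n), "max load =", max(low))
print("high-cardinality key: busy reducers =", sum(1 for n in high if n), "max load =", max(high))
```

In this sample the three leading columns produce only 6 distinct key combinations, so at most 6 of the 50 reducers receive any rows, while the unique request_id key spreads the rows far more evenly. This is the situation where distributing by more columns, or by a different key, may help.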
