[ https://issues.apache.org/jira/browse/KYLIN-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
PENG Zhengshuai updated KYLIN-4083:
-----------------------------------
    Description: 
In the Fact Distinct Column step, Kylin uses MapReduce (MR) to de-duplicate column values.

If a column is a UHC (ultra high cardinality) column and the property *kylin.engine.mr.uhc-reducer-count* is set to a value greater than *1*, the Mapper task distributes that column's values across multiple reducers through *FactDistinctColumnPartitioner*, according to the reducer id.

The reducer id is computed from a hash in *FactDistinctColumnsReducerMapping#getReducerIdForCol()*:

*reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount*

When value.hashCode() is Integer.MIN_VALUE, Math.abs(value.hashCode()) also returns Integer.MIN_VALUE, so the remainder is zero or negative and the resulting reducer id may be negative or fall below reducerBeginIndex. This may cause the Fact Distinct Column step to fail, or the UHC column value may be sent to a reducer that does not belong to the UHC column.

> Fact Distinct Column Step may fail or lose values when the hashCode of a UHC
> column value is Integer.MIN_VALUE
> ----------------------------------------------------------------------------
>
>                 Key: KYLIN-4083
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4083
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: PENG Zhengshuai
>            Assignee: PENG Zhengshuai
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
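
The overflow described in this issue can be reproduced in isolation. The sketch below is not Kylin's source code or its actual fix: the class name ReducerIdDemo and both method names are hypothetical, and the use of Math.floorMod is just one possible guard, assumed here for illustration. It takes the hash as a plain int so the Integer.MIN_VALUE case can be passed in directly.

```java
final class ReducerIdDemo {

    // Mirrors the formula quoted in the description (hypothetical helper, not Kylin's code).
    // Math.abs(Integer.MIN_VALUE) overflows and stays Integer.MIN_VALUE, so the
    // remainder can be negative.
    static int unsafeReducerId(int reducerBeginIndex, int hash, int uhcReducerCount) {
        return reducerBeginIndex + Math.abs(hash) % uhcReducerCount;
    }

    // One possible guard (an assumption, not necessarily the project's fix):
    // Math.floorMod returns a value in [0, uhcReducerCount) for a positive divisor.
    static int safeReducerId(int reducerBeginIndex, int hash, int uhcReducerCount) {
        return reducerBeginIndex + Math.floorMod(hash, uhcReducerCount);
    }

    public static void main(String[] args) {
        int begin = 1;  // illustrative reducerBeginIndex
        int count = 3;  // illustrative uhcReducerCount

        System.out.println(Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE); // true
        System.out.println(unsafeReducerId(begin, Integer.MIN_VALUE, count)); // -1
        System.out.println(safeReducerId(begin, Integer.MIN_VALUE, count));   // 2
    }
}
```

Note that Math.floorMod maps other negative hash codes to different reducers than Math.abs(hash) % count would. That is harmless as long as every mapper uses the same function, but it does change the distribution compared with the original formula.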