[ 
https://issues.apache.org/jira/browse/KYLIN-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PENG Zhengshuai updated KYLIN-4083:
-----------------------------------
    Description: 
In the Fact Distinct Column step, Kylin uses MR to de-duplicate column values.
If a column is a UHC (ultra high cardinality) column and the property 
*kylin.engine.mr.uhc-reducer-count* is set to a value greater than *1*, the 
Mapper task distributes the UHC column values across several reducers via 
*FactDistinctColumnPartitioner*, according to the reducer id.

The reducer id is computed from a hash in 
*FactDistinctColumnsReducerMapping#getReducerIdForCol()*. In this method, 
*reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount*

When value.hashCode() is Integer.MIN_VALUE, Math.abs(value.hashCode()) also 
returns Integer.MIN_VALUE, because its positive counterpart does not fit in an 
int. The remainder can then be negative, so the reducer id may be negative or 
fall below reducerBeginIndex. This can cause the Fact Distinct Column step to 
fail, or the UHC column value may be routed to a reducer that does not belong 
to the UHC column.
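
The overflow can be reproduced outside Kylin with a minimal sketch. The class 
ReducerIdSketch, its method names, and the values 10 and 3 are hypothetical and 
chosen only for illustration; only the formula comes from the description 
above. Math.floorMod is shown as one possible way to keep the offset 
non-negative, not necessarily the fix adopted for this issue.

```java
// Hypothetical stand-alone sketch; not the actual Kylin class.
class ReducerIdSketch {

    // Formula as quoted in the issue description.
    static int reducerIdWithAbs(int reducerBeginIndex, int hash, int uhcReducerCount) {
        return reducerBeginIndex + Math.abs(hash) % uhcReducerCount;
    }

    // Assumed alternative: floorMod always returns a value in
    // [0, uhcReducerCount) when uhcReducerCount is positive.
    static int reducerIdWithFloorMod(int reducerBeginIndex, int hash, int uhcReducerCount) {
        return reducerBeginIndex + Math.floorMod(hash, uhcReducerCount);
    }

    public static void main(String[] args) {
        int reducerBeginIndex = 10; // illustrative value
        int uhcReducerCount = 3;    // illustrative value
        int hash = Integer.MIN_VALUE;

        // Math.abs cannot represent +2147483648 as an int.
        System.out.println(Math.abs(hash) == Integer.MIN_VALUE); // prints true

        // The remainder is -2, so the id drops below reducerBeginIndex.
        System.out.println(reducerIdWithAbs(reducerBeginIndex, hash, uhcReducerCount)); // prints 8

        // The remainder is 1, so the id stays inside the UHC reducer range.
        System.out.println(reducerIdWithFloorMod(reducerBeginIndex, hash, uhcReducerCount)); // prints 11
    }
}
```

With a reducerBeginIndex of 0 the first formula would yield -2 here, which 
matches the negative reducer id described above. Note that floorMod maps other 
negative hash codes to different reducers than Math.abs does, which only 
matters if existing assignments must be preserved.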

  was:
In the Fact Distinct Column step, Kylin uses MR to de-duplicate column values.
If a column is a UHC (ultra high cardinality) column and the property 
*kylin.engine.mr.uhc-reducer-count* is set to a value greater than *1*, the 
Mapper task distributes the UHC column values across several reducers via 
*FactDistinctColumnPartitioner*, according to the reducer id.

The reducer id is computed from a hash in 
*FactDistinctColumnsReducerMapping#getReducerIdForCol*. In this method, 
*reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % uhcReducerCount*

When value.hashCode() is Integer.MIN_VALUE, Math.abs(value.hashCode()) also 
returns Integer.MIN_VALUE. Thus the reducer id may be negative. 
This can cause the Fact Distinct Column step to fail, or the UHC column value 
may be routed to a reducer that does not belong to the UHC column.


> Fact Distinct Column step may fail or lose values when the hash code of a UHC 
> column value is Integer.MIN_VALUE
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-4083
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4083
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: PENG Zhengshuai
>            Assignee: PENG Zhengshuai
>            Priority: Major
>
> In the Fact Distinct Column step, Kylin uses MR to de-duplicate column 
> values.
> If a column is a UHC (ultra high cardinality) column and the property 
> *kylin.engine.mr.uhc-reducer-count* is set to a value greater than *1*, 
> the Mapper task distributes the UHC column values across several reducers 
> via *FactDistinctColumnPartitioner*, according to the reducer id.
> The reducer id is computed from a hash in 
> *FactDistinctColumnsReducerMapping#getReducerIdForCol()*. In this method, 
> *reducer id = reducerBeginIndex + Math.abs(value.hashCode()) % 
> uhcReducerCount*
> When value.hashCode() is Integer.MIN_VALUE, Math.abs(value.hashCode()) also 
> returns Integer.MIN_VALUE. Thus the reducer id may be negative. This can 
> cause the Fact Distinct Column step to fail, or the UHC column value may be 
> routed to a reducer that does not belong to the UHC column.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
