[ https://issues.apache.org/jira/browse/KYLIN-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624044#comment-15624044 ]
Shaofeng SHI commented on KYLIN-2135:
-------------------------------------

I have a couple of comments on the patch:

1. This function will benefit performance whenever there is a UHC column, regardless of whether it uses a Global Dictionary or a normal Dictionary. It would be great if it could be made generic (for example, by adding a method like isUHC(); a column that uses a Global Dict is one kind of UHC, and we can add more possibilities there later).
2. For a Global Dictionary, do only the new values need to be built into the dict each time? If so, is it possible to exclude the existing values on the mapper side to reduce the IO?
3. The check "if (reducerIndex > 255)" can be moved to before the job is submitted to Hadoop (it is currently inside the mapper).
4. In the new DFSFileTableReader.java, the close() method should catch the exception and continue the loop; also, the next() method seems incorrect, as each time it will read from the first element in "readerList" (I couldn't apply this patch, so this might be wrong; please double check).

> Enlarge FactDistinctColumns reducer number
> ------------------------------------------
>
>                 Key: KYLIN-2135
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2135
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: v1.5.4.1
>            Reporter: kangkaisen
>            Assignee: kangkaisen
>         Attachments: KYLIN-2135.patch, new.png, old.png
>
>
> When the hive table has billions of rows and uses a global dictionary for
> precise count distinct measures, the {{Extract Fact Table Distinct Columns}}
> job will run for a long time.
> So we could use more reducers to deal with a single column.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
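
To illustrate comment 4 above, here is a minimal sketch of the behaviour being asked for. It is not the code from KYLIN-2135.patch and not Kylin's actual DFSFileTableReader: the names RowReader, MultiFileReader and ListRowReader are hypothetical stand-ins, and only the "readerList" field name is taken from the comment. The sketch assumes a reader that wraps several per-file readers, where next() should move on to the following reader once the current one is exhausted, and close() should keep going when one reader fails to close.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical per-file reader; stands in for whatever the patch wraps. */
interface RowReader extends Closeable {
    boolean next() throws IOException;
    String getRow();
}

/** Reads several RowReaders one after another as if they were a single reader. */
class MultiFileReader implements Closeable {
    private final List<RowReader> readerList;
    private int current = 0; // index of the reader currently being consumed

    MultiFileReader(List<RowReader> readerList) {
        this.readerList = readerList;
    }

    /** Advances to the next row, moving to the next reader when one is exhausted. */
    public boolean next() throws IOException {
        while (current < readerList.size()) {
            if (readerList.get(current).next()) {
                return true;
            }
            current++; // this reader is done; do not restart from the first element
        }
        return false;
    }

    /** Returns the row that the last successful next() positioned on. */
    public String getRow() {
        return readerList.get(current).getRow();
    }

    /** Closes every reader; a failure on one does not stop the loop. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (RowReader reader : readerList) {
            try {
                reader.close();
            } catch (IOException e) {
                // Remember the failure and continue closing the remaining readers.
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first; // assumption: report the first failure after all are closed
        }
    }
}

/** In-memory RowReader used only to exercise the sketch. */
class ListRowReader implements RowReader {
    private final Iterator<String> rows;
    private final boolean failOnClose;
    private String row;
    boolean closed = false;

    ListRowReader(List<String> rows, boolean failOnClose) {
        this.rows = rows.iterator();
        this.failOnClose = failOnClose;
    }

    @Override
    public boolean next() {
        if (!rows.hasNext()) {
            return false;
        }
        row = rows.next();
        return true;
    }

    @Override
    public String getRow() {
        return row;
    }

    @Override
    public void close() throws IOException {
        closed = true;
        if (failOnClose) {
            throw new IOException("simulated close failure");
        }
    }
}
```

Rethrowing the first exception once the loop has finished is one possible choice; logging each failure and returning normally would satisfy the comment equally well.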