[ 
https://issues.apache.org/jira/browse/KYLIN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808587#comment-16808587
 ] 

ASF GitHub Bot commented on KYLIN-3925:
---------------------------------------

kyotoYaho commented on pull request #580: KYLIN-3925 Add reduce step for 
FilterRecommendCuboidDataJob & UpdateO…
URL: https://github.com/apache/kylin/pull/580
 
 
   …ldCuboidShardJob to avoid generating small hdfs files
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to 
> avoid generating small hdfs files
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-3925
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3925
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>
> Previously, cube optimization used two map-only MR jobs: 
> *FilterRecommendCuboidDataJob* and *UpdateOldCuboidShardJob*. The benefit of 
> a map-only job is that it avoids shuffling. However, this benefit comes with 
> a more severe problem: too many small HDFS files.
> Suppose the current cuboid data consists of 10 HDFS files of 500 MB each. 
> If the block size is 100 MB, the map-only job *FilterRecommendCuboidDataJob* 
> will run 10*(500/100) = 50 mappers. Each mapper generates one HDFS file, so 
> the job ends up with 50 HDFS files. Since *FilterRecommendCuboidDataJob* 
> keeps only the cuboid data needed for future use, each output file will be 
> smaller than 100 MB. In some cases, it will be even smaller than 50 MB.
> To avoid this small-file problem, it is better to add a reduce step 
> that controls the final number of output HDFS files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
