[ https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Shao updated HIVE-1118:
-----------------------------

    Summary: Add hive.merge.size.per.task to HiveConf  (was: Hive merge map files should have different bytes/mapper setting)

> Add hive.merge.size.per.task to HiveConf
> ----------------------------------------
>
>                 Key: HIVE-1118
>                 URL: https://issues.apache.org/jira/browse/HIVE-1118
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: HIVE-1118.1.patch, HIVE-1118.2.patch, HIVE-1118.3.patch
>
>
> Currently, by default, we get one reducer for each 1 GB of input data. This is also true for the conditional merge job that runs when the average file size is smaller than a threshold.
> This makes those jobs very slow, because each reducer has to consume 1 GB of data.
> Alternatively, we can use that threshold to determine the number of reducers per job (or introduce a new parameter).
> Say the threshold is 1 MB: then we only start the merge job if the average file size is less than 1 MB, and the resulting files will each be around 1 MB (or another small number).
> This removes the extreme cases where we have thousands of empty files, while still keeping normal jobs fast.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
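The sizing logic described in the issue can be sketched roughly as follows. This is an illustrative model only, not Hive's actual implementation; the function name and the 1 MB threshold are taken from the example in the description, and the per-task size target stands in for the proposed hive.merge.size.per.task setting.

```python
def plan_merge_job(total_input_bytes, num_files, size_per_task_bytes):
    """Decide whether to run the conditional merge job and, if so,
    how many reducers to use. Each reducer consumes roughly
    size_per_task_bytes (the proposed per-task size target) instead
    of the 1 GB default, so merged output files end up around that
    size as well.
    Returns the reducer count, or None if no merge is needed."""
    avg_file_size = total_input_bytes / num_files
    if avg_file_size >= size_per_task_bytes:
        # Files are already large enough on average; skip the merge job.
        return None
    # Ceiling division: one reducer per size_per_task_bytes of input.
    return max(1, -(-total_input_bytes // size_per_task_bytes))

# Example from the description: a 1 MB threshold. With 2000 tiny files
# totalling 10 MB, the merge runs with 10 reducers; with 5 files of
# 2 MB average size, no merge job is started.
print(plan_merge_job(10 * 1024 * 1024, 2000, 1024 * 1024))  # → 10
print(plan_merge_job(10 * 1024 * 1024, 5, 1024 * 1024))     # → None
```

With the old behavior (effectively a 1 GB per-reducer target), the same 10 MB of input would be funneled through a single reducer, which is the slowness the issue describes.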