[ 
https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Summary: Add hive.merge.size.per.task to HiveConf  (was: Hive merge map files should have different bytes/mapper setting)

> Add hive.merge.size.per.task to HiveConf
> ----------------------------------------
>
>                 Key: HIVE-1118
>                 URL: https://issues.apache.org/jira/browse/HIVE-1118
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: HIVE-1118.1.patch, HIVE-1118.2.patch, HIVE-1118.3.patch
>
>
> Currently, by default, we get one reducer for each 1GB of input data.
> This is also true for the conditional merge job that runs when the average 
> file size is smaller than a threshold.
> This makes those merge jobs very slow, because each reducer needs to 
> consume 1GB of data.
> Alternatively, we can use that same threshold to determine the number of 
> reducers per job (or introduce a new parameter).
> Say the threshold is 1MB: then we only start the merge job if the 
> average file size is less than 1MB, and each result file will be 
> around 1MB (or another small number).
> This removes the extreme cases where we have thousands of empty files, 
> but still keeps normal jobs fast enough.
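The proposed sizing logic can be sketched as follows. This is a minimal, hypothetical illustration of the idea in the description, not code from the HIVE-1118 patches; the function name and defaults are invented for clarity, with the threshold set to the 1MB example above:

```python
# Hypothetical sketch of the proposal: use the small-file threshold
# (here hive.merge.size.per.task, assumed 1MB) both to decide whether
# the conditional merge job runs and to size its reducers, instead of
# the default 1GB per reducer.
def should_merge_and_reducers(total_bytes, num_files, size_per_task=1_000_000):
    avg = total_bytes / num_files
    if avg >= size_per_task:
        # Files are already large enough on average; skip the merge job.
        return False, 0
    # Target roughly size_per_task bytes per output file, so each
    # reducer consumes ~1MB instead of ~1GB (ceiling division).
    reducers = max(1, -(-total_bytes // size_per_task))
    return True, reducers
```

For example, 100 files totaling 5MB (average 50KB) would trigger a merge with 5 reducers, each producing a ~1MB file, whereas 5 files totaling 10MB (average 2MB) would skip the merge entirely.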

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
