[ 
https://issues.apache.org/jira/browse/HIVE-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-1118:
-----------------------------

    Attachment: HIVE-1118.1.patch

Actually the option is already there. I just modified the default to be: 
condition = 16MB, merged file size = 32MB.
I think this setting is a good default.

I also added the missing conf variable to hive-default.xml.
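
For reference, the hive-default.xml entries would look roughly like this. This is a hypothetical sketch based on the 16MB/32MB defaults mentioned above; the exact property names, values, and descriptions are assumptions, not quoted from the patch:

```xml
<!-- Hypothetical sketch; names/values assumed from the comment above, not verbatim from the patch. -->
<property>
  <!-- Merge condition: merge is triggered when the average output file
       size is below this threshold (~16MB). -->
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value>
</property>
<property>
  <!-- Target size of the merged output files (~32MB). -->
  <name>hive.merge.size.per.task</name>
  <value>32000000</value>
</property>
```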


> Hive merge map files should have different bytes/mapper setting
> ---------------------------------------------------------------
>
>                 Key: HIVE-1118
>                 URL: https://issues.apache.org/jira/browse/HIVE-1118
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
>         Attachments: HIVE-1118.1.patch
>
>
> Currently, by default, we get one reducer for each 1GB of input data.
> It's also true for the conditional merge job that will run if the average 
> file size is smaller than a threshold.
> This actually makes those jobs very slow, because each reducer needs to 
> consume 1GB of data.
> Alternatively, we can just use that threshold to determine the number of 
> reducers per job (or introduce a new parameter).
> Let's say the threshold is 1MB, then we only start the merge job if the 
> average file size is less than 1MB, and the eventual result file size will be 
> around 1MB (or another small number).
> This will remove the extreme cases where we have thousands of empty files, 
> but still make normal jobs fast enough.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.