[ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861818#comment-13861818 ]
Eric Chu commented on HIVE-6134:
--------------------------------
[~brocknoland] and [~xuefuz]: I was talking to Yin Huai about this issue, and he
suggested I ping you about it, especially regarding how it affects the HUE UX,
as described in the issue.
> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
> Key: HIVE-6134
> URL: https://issues.apache.org/jira/browse/HIVE-6134
> Project: Hive
> Issue Type: Bug
> Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
> Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive
> will launch an additional MR job to merge the small output files at the end
> of a map-only job when the average output file size is smaller than
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles
> to true, Hive will merge the output files of a map-reduce job.
> I expected this to hold for all MR queries. However, what I observe is that
> it only holds for CTAS queries. In GenMRFileSink1.java, HIVEMERGEMAPFILES and
> HIVEMERGEMAPREDFILES are only used
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a
> regular SELECT query that doesn't have move tasks, these properties are not
> used.
> Is my understanding correct, and if so, what's the reasoning behind not
> supporting this for regular SELECT queries? It seems to me that it should be
> supported for regular SELECT queries as well. One scenario where this hits us
> hard is when users try to download the result in HUE, and HUE times out
> because there are thousands of output files. The workaround is to re-run the
> query as CTAS, but it's a significant time sink.
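
To make the difference between the documented and the observed behavior concrete, here is a minimal sketch in Java. It is not Hive's code: the class MergeDecisionSketch, its MergeSettings holder, and both methods are hypothetical names invented for illustration. They only model the rule as the description states it (merge when the relevant flag is on and the average output file size is below hive.merge.smallfiles.avgsize) and the extra move-task condition reported for GenMRFileSink1.java.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only; none of these names exist in Hive.
final class MergeDecisionSketch {

    /** Hypothetical holder for the three settings named in the issue. */
    static final class MergeSettings {
        final boolean mergeMapFiles;     // models hive.merge.mapfiles
        final boolean mergeMapRedFiles;  // models hive.merge.mapredfiles
        final long smallFilesAvgSize;    // models hive.merge.smallfiles.avgsize (bytes)

        MergeSettings(boolean mergeMapFiles, boolean mergeMapRedFiles, long smallFilesAvgSize) {
            this.mergeMapFiles = mergeMapFiles;
            this.mergeMapRedFiles = mergeMapRedFiles;
            this.smallFilesAvgSize = smallFilesAvgSize;
        }
    }

    /** Behavior as the documentation is summarized in the issue. */
    static boolean wouldMergePerDocs(MergeSettings s, boolean mapOnlyJob, List<Long> fileSizes) {
        // A map-only job consults the map-files flag; a map-reduce job the mapred flag.
        boolean enabled = mapOnlyJob ? s.mergeMapFiles : s.mergeMapRedFiles;
        if (!enabled || fileSizes.isEmpty()) {
            return false;
        }
        long total = 0L;
        for (long size : fileSizes) {
            total += size;
        }
        long average = total / fileSizes.size();
        return average < s.smallFilesAvgSize;
    }

    /** Behavior as reported: the flags are only consulted when move tasks exist. */
    static boolean wouldMergeAsReported(MergeSettings s, boolean mapOnlyJob,
                                        boolean hasMoveTasks, List<Long> fileSizes) {
        return hasMoveTasks && wouldMergePerDocs(s, mapOnlyJob, fileSizes);
    }

    public static void main(String[] args) {
        MergeSettings settings = new MergeSettings(true, true, 16_000_000L);
        List<Long> smallFiles = Arrays.asList(1_000_000L, 2_000_000L, 3_000_000L);

        // Same settings and files; only the presence of move tasks differs.
        System.out.println("per docs: " + wouldMergePerDocs(settings, true, smallFiles));
        System.out.println("plain SELECT (no move tasks): "
                + wouldMergeAsReported(settings, true, false, smallFiles));
        System.out.println("CTAS (has move tasks): "
                + wouldMergeAsReported(settings, true, true, smallFiles));
    }
}
```

Under these assumptions, the only input that separates a plain SELECT from a CTAS is hasMoveTasks, which matches the reported symptom that the same settings produce merged output for one and thousands of small files for the other.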
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)