[jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries

Eric Chu (JIRA) Mon, 06 Jan 2014 11:25:20 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863292#comment-13863292
 ]


Eric Chu commented on HIVE-6134:
--------------------------------

Thanks [~ashutoshc] for pointing out the concatenate command. However, I think 
the ability to merge files for a table partition is orthogonal to supporting 
hive.merge.mapfiles, hive.merge.mapredfiles, and hive.merge.smallfiles.avgsize 
for "regular" queries (i.e., that don't result in a new table). Even if we have 
the optimal number of input files for each partition, a query with just SELECT 
FROM WHERE clauses over a large number of partitions will still produce a large 
number of small output files, with negative side effects such as Hue timeouts 
and a large number of mappers in the next job.

Can someone explain why the properties are supported only for queries with move 
tasks? Was it just a matter of scoping, or is there some reason that makes this 
inappropriate for queries without a move task? We are considering adding this 
support on our own and would like to get some insights on the original design 
considerations. Thanks!



> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive 
> will launch an additional MR job to merge the small output files at the end 
> of a map-only job when the average output file size is smaller than 
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles 
> to true, Hive will merge the output files of a map-reduce job. 
> My expectation is that this is true for all MR queries. However, my 
> observation is that this is only true for CTAS queries. In 
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used 
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a 
> regular SELECT query that doesn't have move tasks, these properties are not 
> used.
> Is my understanding correct and, if so, what's the reasoning behind the logic 
> of not supporting this for regular SELECT queries? It seems to me that this 
> should be supported for regular SELECT queries as well. One scenario where 
> this hits us hard is when users try to download the result in HUE, and HUE 
> times out because there are thousands of output files. The workaround is to 
> re-run the query as CTAS, but it's a significant time sink.
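
For readers following the thread, the behavior described in the issue can be
sketched roughly as below. This is a minimal Python model of the reported
behavior, not Hive's actual code: MergeSettings, should_merge, and the
has_move_task flag are hypothetical names chosen for illustration, and the
16,000,000-byte threshold in the usage lines is only an example value.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MergeSettings:
    # Hypothetical stand-ins for the Hive properties named in the report.
    merge_mapfiles: bool      # hive.merge.mapfiles
    merge_mapredfiles: bool   # hive.merge.mapredfiles
    smallfiles_avgsize: int   # hive.merge.smallfiles.avgsize, in bytes


def should_merge(settings: MergeSettings,
                 output_file_sizes: Sequence[int],
                 map_only: bool,
                 has_move_task: bool) -> bool:
    """Model of the reported rule for launching the extra merge job."""
    # Reported behavior: the merge properties are consulted only when the
    # query plan has a move task (e.g. CTAS), so a plain SELECT never merges.
    if not has_move_task:
        return False
    # Pick the flag that matches the job type (map-only vs. map-reduce).
    enabled = settings.merge_mapfiles if map_only else settings.merge_mapredfiles
    if not enabled or not output_file_sizes:
        return False
    # Merge only when the average output file is smaller than the threshold.
    average = sum(output_file_sizes) / len(output_file_sizes)
    return average < settings.smallfiles_avgsize


settings = MergeSettings(True, True, 16_000_000)
small_files = [1_000_000] * 2000  # thousands of small output files
ctas_merges = should_merge(settings, small_files, map_only=True, has_move_task=True)
select_merges = should_merge(settings, small_files, map_only=True, has_move_task=False)
```

Under this model the CTAS-style query is merged and the plain SELECT is not,
which matches the observation in the report. Dropping the has_move_task check
is what supporting the properties for regular queries would amount to in this
sketch.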



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
