[ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861846#comment-13861846 ]

Xuefu Zhang commented on HIVE-6134:
-----------------------------------

It seems reasonable to me that these flags kick in only for CTAS, or other queries that result in a new table. In other words, merging small files for a table should be applied to the table (upon request) rather than coming into effect for any query that touches the table. I think what is missing is a new command/query, something like "MERGE FILES FOR TABLE table_name". This might be further automated in a scheduled fashion in HiveServer2. Of course, the scope is much larger.

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive
> will launch an additional MR job to merge the small output files at the end
> of a map-only job when the average output file size is smaller than
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles
> to true, Hive will merge the output files of a map-reduce job.
> My expectation is that this is true for all MR queries. However, my
> observation is that this is only true for CTAS queries. In
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a
> regular SELECT query that doesn't have move tasks, these properties are not
> used.
> Is my understanding correct and if so, what's the reasoning behind the logic
> of not supporting this for regular SELECT queries? It seems to me that this
> should be supported for regular SELECT queries as well. One scenario where
> this hits us hard is when users try to download the result in HUE, and HUE
> times out b/c there are thousands of output files. The workaround is to
> re-run the query as CTAS, but it's a significant time sink.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
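
To make the reported behavior concrete, here is a minimal Java sketch of the decision as the issue describes it: the merge flags are consulted only when the plan has a move task, and a merge is worthwhile only when the average output file size falls below the threshold. This is not Hive's actual GenMRFileSink1 code. The class `MergeDecisionSketch`, its methods, and all parameter names are hypothetical, and the sketch folds the size check into a single function purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical model of the merge decision described in HIVE-6134.
 * Not Hive code: every name here is an assumption made for illustration.
 */
public class MergeDecisionSketch {

    /** Average size in bytes of the given output files; 0 if there are none. */
    public static long averageSize(List<Long> fileSizes) {
        if (fileSizes.isEmpty()) {
            return 0L;
        }
        long total = 0L;
        for (long size : fileSizes) {
            total += size;
        }
        return total / fileSizes.size();
    }

    /**
     * @param hasMoveTask      stands in for the ctx.getMvTask() non-empty check
     * @param mapOnly          true for a map-only job, false for map-reduce
     * @param mergeMapFiles    stands in for hive.merge.mapfiles
     * @param mergeMapRedFiles stands in for hive.merge.mapredfiles
     * @param avgSizeThreshold stands in for hive.merge.smallfiles.avgsize
     */
    public static boolean shouldMerge(boolean hasMoveTask, boolean mapOnly,
                                      boolean mergeMapFiles, boolean mergeMapRedFiles,
                                      long avgSizeThreshold, List<Long> fileSizes) {
        // Per the issue: without a move task the merge flags are never read.
        if (!hasMoveTask) {
            return false;
        }
        // Pick the flag that matches the job type.
        boolean flagEnabled = mapOnly ? mergeMapFiles : mergeMapRedFiles;
        if (!flagEnabled || fileSizes.isEmpty()) {
            return false;
        }
        // Merge only when the files are small on average.
        return averageSize(fileSizes) < avgSizeThreshold;
    }

    public static void main(String[] args) {
        List<Long> smallFiles = Arrays.asList(1_000_000L, 2_000_000L, 3_000_000L);
        long threshold = 16_000_000L; // example value only
        System.out.println("CTAS-style, map-only:   "
                + shouldMerge(true, true, true, false, threshold, smallFiles));
        System.out.println("Plain SELECT, map-only: "
                + shouldMerge(false, true, true, false, threshold, smallFiles));
    }
}
```

The two calls in `main` differ only in `hasMoveTask`, which mirrors the reporter's observation: the same small output files are merged for a CTAS-style query and left alone for a plain SELECT.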