[ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641987#comment-17641987
 ] 

Arnaud Linz edited comment on HIVE-21100 at 12/1/22 5:54 PM:
-------------------------------------------------------------

The workaround does not always work, because the merge step is sometimes 
skipped even with hive.merge.tezfiles=true set (the files must be smaller than 
{{hive.merge.size.per.task}} / {{hive.merge.smallfiles.avgsize}}).

So, to be sure, we have to add a "hand-made" HDFS move after each query that 
contains unions, in order to keep the flat directory structure that many tools 
(like Dataiku) require.

Because this post-processing is done outside a Hive lock, with direct HDFS 
access, it is a fragile step, and a very cumbersome one.
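
To illustrate the kind of post-processing meant here, below is a minimal 
sketch of the flattening logic. It is not the actual script we run: it works on 
a local directory with pathlib as a stand-in for HDFS, and the function name 
flatten_table_dir and the "subdirectory name as prefix" renaming scheme are 
assumptions made for the example. A real version would go through an HDFS 
client or the hdfs dfs command line instead.

```python
import shutil
from pathlib import Path


def flatten_table_dir(table_dir: Path) -> list[Path]:
    """Move files from direct subdirectories of table_dir up into table_dir.

    Hypothetical helper: each file is renamed to "<subdir>_<filename>" so that
    files with the same name in different UNION branches do not collide.
    """
    moved = []
    for sub in sorted(p for p in table_dir.iterdir() if p.is_dir()):
        for f in sorted(p for p in sub.iterdir() if p.is_file()):
            target = table_dir / f"{sub.name}_{f.name}"
            if target.exists():
                # Refuse to overwrite data that is already in the parent.
                raise FileExistsError(target)
            shutil.move(str(f), str(target))
            moved.append(target)
        # Remove the subdirectory only once it is empty.
        if not any(sub.iterdir()):
            sub.rmdir()
    return moved
```

Note that nothing in this sketch takes a Hive lock, so a query reading or 
writing the table while the move is in progress can see a half-flattened 
directory. That is exactly the fragility described above.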

This issue is not minor for us: it was discovered during a Hive 2 
(MR/Spark) -> Hive 3 (Tez) migration and has led to numerous production issues.



> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-21100
>                 URL: https://issues.apache.org/jira/browse/HIVE-21100
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: George Pachitariu
>            Assignee: George Pachitariu
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: HIVE-21100.1.patch, HIVE-21100.2.patch, 
> HIVE-21100.3.patch, HIVE-21100.patch
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Right now, when a query writes data into a table with the Tez engine and 
> UNION ALL is the last step of the query, Hive on Tez creates a subdirectory 
> for each branch of the UNION ALL.
> With this patch, the subdirectories are removed, and the files are renamed 
> and moved to the parent directory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
