[ 
https://issues.apache.org/jira/browse/HIVE-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205023#comment-15205023
 ] 

Sanjay Radia commented on HIVE-13321:
-------------------------------------

bq. A common pattern in MapReduce and Hive is to write all output into a 
temporary folder and then rename this temporary folder to match the final 
output location. When using some of the newer FileSystems with Hive, the 
performance can be improved by directly writing output and avoiding the 
temporary folder write & rename.
Note: the temp folder was necessary to deal with failures and with multiple 
attempts. Renames in traditional filesystems are very cheap and involve no 
copy of data unless they cross volumes. In MapReduce the tmp folder is a 
subdir of the output folder, so the rename never crosses volumes. In cloud 
object stores (like S3) a rename requires a data copy (hence HADOOP-9565's 
proposal to add a server-side copy, but that is still an extra copy, which 
you are trying to avoid in this Jira).
Optimization for cloud storage makes a lot of sense, but one has to deal with 
the failure case and with multiple attempts/speculative execution; the output 
directory cannot be left in a mess. Could you please elaborate on how you plan 
to deal with failures?
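
To make the concern concrete, here is a rough sketch of the temp-folder 
pattern on a local filesystem. It is not Hive or MapReduce code: the function 
names, the "_temporary" layout and the "part-N" file names are assumptions 
for illustration only, and the check-then-rename commit is only safe because 
the sketch runs sequentially.

```python
import os
import shutil
import tempfile


def run_attempt(output_dir, task_id, attempt_id, data, fail=False):
    """Write one task attempt into its own temp dir under output_dir."""
    # Hypothetical layout: <output>/_temporary/<task>_<attempt>/part-<task>
    attempt_dir = os.path.join(output_dir, "_temporary", f"{task_id}_{attempt_id}")
    os.makedirs(attempt_dir)
    with open(os.path.join(attempt_dir, f"part-{task_id}"), "w") as f:
        f.write(data)
        if fail:
            # A failed attempt leaves its partial file in the temp dir only.
            raise RuntimeError("simulated task failure")
    return attempt_dir


def commit_attempt(output_dir, task_id, attempt_dir):
    """Promote an attempt's file with a rename; the first committer wins."""
    src = os.path.join(attempt_dir, f"part-{task_id}")
    dst = os.path.join(output_dir, f"part-{task_id}")
    if os.path.exists(dst):
        # Another attempt of the same task already committed.
        return False
    # Same volume, so this is a metadata-only operation with no data copy.
    os.rename(src, dst)
    return True


def cleanup(output_dir):
    """Drop everything left behind by failed or losing attempts."""
    shutil.rmtree(os.path.join(output_dir, "_temporary"), ignore_errors=True)


def demo():
    out = tempfile.mkdtemp()
    try:
        run_attempt(out, 0, 0, "partial", fail=True)  # failed attempt
    except RuntimeError:
        pass
    a1 = run_attempt(out, 0, 1, "complete")  # retry
    a2 = run_attempt(out, 0, 2, "complete")  # speculative duplicate
    first = commit_attempt(out, 0, a1)
    second = commit_attempt(out, 0, a2)
    cleanup(out)
    files = sorted(os.listdir(out))
    with open(os.path.join(out, "part-0")) as f:
        content = f.read()
    shutil.rmtree(out)
    return first, second, files, content
```

With direct writes, the failed attempt and the speculative duplicate above 
would both be writing to the final location, which is the case I am asking 
about.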

> Add support for different output strategies
> -------------------------------------------
>
>                 Key: HIVE-13321
>                 URL: https://issues.apache.org/jira/browse/HIVE-13321
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rob Leidle
>
> The Hadoop ecosystem has expanded to support a wider variety of data-stores 
> and filesystems than simply HDFS. These FileSystems have different write 
> atomicity and read consistency guarantees.  There are enhancements we can 
> make to Hive to ensure Hive works even better with a wider variety of 
> FileSystems in the Hadoop ecosystem. We can see work going on in the Hadoop 
> project to robustly support these FileSystems. One such example is 
> HADOOP-9565 where the behavior of MapReduce output is enhanced to do what is 
> optimal for different FileSystems.
>  
> A common pattern in MapReduce and Hive is to write all output into a 
> temporary folder and then rename this temporary folder to match the final 
> output location. When using some of the newer FileSystems with Hive, the 
> performance can be improved by directly writing output and avoiding the 
> temporary folder write & rename.
>  
> The proposal is to enhance Hive to support different strategies for file 
> output. One such strategy would be a concept named “DirectWrite”. DirectWrite 
> will be optionally enabled, likely on a per-FileSystem basis. When 
> DirectWrite is enabled, all Hive job output will be written directly to the 
> output location.
>  
> This is an umbrella JIRA for all the tasks related to this functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
