[ 
https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649356#comment-15649356
 ] 

Sahil Takiar commented on HIVE-14271:
-------------------------------------

We might want to consider re-opening this ticket, but with a different 
approach. To clarify, right now the FileSinkOperator (FSOP) always writes 
all of its data to a scratch directory. The FSOP first writes to {{outPaths}} 
and then renames the data to {{finalPaths}}, but both locations are under 
the scratch directory, so no data is exposed to users or future ETL jobs yet.

There are two different ways to modify this to improve performance on S3:

1: FSOP implements the "direct output committer" strategy (similar to 
HIVE-1620): all data is written directly to the final table location, and no 
data is written to a staging file or to the scratch directory. Hive's MoveTask 
(which runs in HiveServer2) does nothing.

2: FSOP still writes data to the scratch directory, but it writes to 
{{finalPaths}} instead of {{outPaths}} (remember that both of these paths 
are under the scratch directory), so no temp file is written. Hive's MoveTask 
(which runs inside HiveServer2) then copies the data from the scratch 
directory to the final table location. This improves performance since it 
avoids the {{outPaths}} to {{finalPaths}} rename, which on S3 means copying 
the data.
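
The difference between the current behavior and approach 2 can be sketched 
roughly as below. This is only an illustration on the local filesystem using 
{{java.nio.file}}, not Hive code: the class {{FsopCommitSketch}}, its method 
names, and the {{_tmp.}} file prefix are all hypothetical, and on S3 the 
{{Files.move}} step would correspond to a copy rather than a cheap rename.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch; none of these names come from the Hive code base.
class FsopCommitSketch {

    // Current behavior: write to a temp path (outPath), then rename it to
    // finalPath. Both paths live under the scratch directory.
    static Path writeThenRename(Path scratchDir, String name, String rows) throws IOException {
        Path outPath = scratchDir.resolve("_tmp." + name); // assumed temp naming
        Path finalPath = scratchDir.resolve(name);
        Files.write(outPath, rows.getBytes(StandardCharsets.UTF_8));
        // On S3 this rename would be carried out as a copy of the data.
        return Files.move(outPath, finalPath, StandardCopyOption.REPLACE_EXISTING);
    }

    // Approach 2: write straight to finalPath (still under the scratch
    // directory), so there is no temp file and no rename.
    static Path writeDirect(Path scratchDir, String name, String rows) throws IOException {
        Path finalPath = scratchDir.resolve(name);
        return Files.write(finalPath, rows.getBytes(StandardCharsets.UTF_8));
    }

    // Stand-in for MoveTask: copy from the scratch directory to the table location.
    static Path moveTask(Path finalPath, Path tableDir) throws IOException {
        Files.createDirectories(tableDir);
        Path target = tableDir.resolve(finalPath.getFileName());
        return Files.copy(finalPath, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("fsop-sketch");
        Path scratch = Files.createDirectories(root.resolve("scratch"));
        Path table = root.resolve("table");

        // Approach 2 end to end: one write, then one copy by the MoveTask stand-in.
        Path staged = writeDirect(scratch, "000000_0", "row1\nrow2\n");
        Path committed = moveTask(staged, table);
        System.out.println("committed to " + committed);
    }
}
```

With approach 2 the data is written once and copied once (by the MoveTask 
stand-in), whereas the current flow adds the extra {{outPaths}} to 
{{finalPaths}} step before that copy.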

For reasons stated in earlier comments, there are a number of issues with 
approach 1. Implementing approach 2 should be better, and should improve 
performance significantly.

> FileSinkOperator should not rename files to final paths when S3 is the 
> default destination
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-14271
>                 URL: https://issues.apache.org/jira/browse/HIVE-14271
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>
> FileSinkOperator does a rename of {{outPaths -> finalPaths}} when it finishes 
> writing all rows to a temporary path. The problem is that S3 does not support 
> renaming.
> Two options can be considered:
> a. Use a copy operation instead. After FileSinkOperator writes all rows to 
> outPaths, then the commit method will do a copy() call instead of move().
> b. Write row by row directly to the S3 path (see HIVE-1620). This may 
> perform better, but we should take care of the cleanup part in case 
> of writing errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
