[ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893788#action_12893788
 ] 

Ning Zhang commented on HIVE-1492:
----------------------------------

@Edward, this is a heuristic that should generally hold. The good news is 
that we are not aware of any cases that violate the rule (assuming 
multiple attempts of the same task produce deterministic results). 

The reason we are relying on a heuristic here is that the old Hadoop API 
does not support exception handling outside the Mapper's map() function. The bug 
shows up when an exception is thrown in Hadoop's RecordReader layer and is 
not propagated to the Mapper. When mapper.close() is called there is 
no way for the mapper to know whether an exception happened in the Hadoop 
code path. A better way to handle this is to use the new Hadoop API, which gives 
more control to the application layer. This heuristic is a workaround for 
the old Hadoop API. 
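
To make the size-based de-duplication concrete, here is a minimal sketch 
(not the attached patch; the class and method names DedupBySize, 
removeSmallerDuplicates, and taskIdOf are illustrative, and the task-id 
parsing from the file name is an assumption) of keeping only the largest 
file per task and deleting the smaller duplicates left by failed or 
speculative attempts:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DedupBySize {

  // Hypothetical parser: extract the task-id portion of an attempt file name,
  // e.g. "000001_0" and "000001_1" both map to task "000001".
  static String taskIdOf(FileStatus file) {
    String name = file.getPath().getName();
    int idx = name.lastIndexOf('_');
    return idx > 0 ? name.substring(0, idx) : name;
  }

  // For each task id found in dir, keep the largest file and delete the rest.
  public static void removeSmallerDuplicates(FileSystem fs, Path dir)
      throws IOException {
    Map<String, FileStatus> largest = new HashMap<String, FileStatus>();
    for (FileStatus file : fs.listStatus(dir)) {
      String taskId = taskIdOf(file);
      FileStatus kept = largest.get(taskId);
      if (kept == null) {
        largest.put(taskId, file);
      } else if (file.getLen() > kept.getLen()) {
        // This attempt's file is bigger: drop the previously kept one.
        fs.delete(kept.getPath(), false);
        largest.put(taskId, file);
      } else {
        // Duplicate from another attempt that is no larger: delete it.
        fs.delete(file.getPath(), false);
      }
    }
  }
}

Under the determinism assumption above, a truncated file from a failed or 
killed attempt can only be smaller than the complete one, so picking the 
largest file per task is safe.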


> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.