Larger files are not guaranteed to be the right ones. (For example, there could 
be user-defined transform scripts that freely access external resources and 
generate output over which we have no control.) But the largest file, rather 
than the first one, is much more likely to be the correct one. Until we use the 
new MapReduce API to fix the issue of wrong results being generated in 
MapReduce, this patch will help us fix the problem in most scenarios.
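
To make the idea concrete, here is a rough sketch of the selection rule in 
Python. It is not the code in the patch. The function name and the 
"<taskId>_<attemptId>" file naming scheme are assumptions made only for 
illustration; the point is just "group by task, keep the largest file".

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed (hypothetical) naming scheme: "<taskId>_<attemptId>", e.g. "000003_1".
_TASK_FILE = re.compile(r"^(\d+)_(\d+)$")


def pick_largest_per_task(
    files: Iterable[Tuple[str, int]]
) -> Tuple[Dict[str, str], List[str]]:
    """Given (file name, size in bytes) pairs, return (kept, dropped).

    kept maps each task id to the name of its largest file.
    dropped lists every other file, in the order it was rejected.
    """
    best: Dict[str, Tuple[str, int]] = {}
    dropped: List[str] = []
    for name, size in files:
        match = _TASK_FILE.match(name)
        if match is None:
            # Does not look like a task output under the assumed scheme,
            # so treat it as a temporary file.
            dropped.append(name)
            continue
        task_id = match.group(1)
        current = best.get(task_id)
        if current is None:
            best[task_id] = (name, size)
        elif size > current[1]:
            # A larger file from another attempt replaces the current pick.
            dropped.append(current[0])
            best[task_id] = (name, size)
        else:
            # Same size or smaller: keep the file seen first.
            dropped.append(name)
    kept = {task_id: name for task_id, (name, _) in best.items()}
    return kept, dropped
```

With a listing such as [("000000_0", 120), ("000000_1", 4096)], the second 
attempt's file is kept and the first is dropped. As noted above, size is only 
a heuristic: a larger file from a failed or speculative attempt is not 
guaranteed to be the complete one.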

-----Original Message-----
From: He Yongqiang (JIRA) [mailto:j...@apache.org] 
Sent: Thursday, July 29, 2010 12:12 PM
To: hive-dev@hadoop.apache.org
Subject: [jira] Commented: (HIVE-1492) FileSinkOperator should remove 
duplicated files from the same task based on file sizes


    [ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782
 ] 

He Yongqiang commented on HIVE-1492:
------------------------------------

The assumption of Map-reduce is that
if we give it the same input and the same m/r functions, the output should 
always be the same.

Otherwise the map-reduce fault tolerance mechanism is wrong.

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
