[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893788#action_12893788 ]
Ning Zhang commented on HIVE-1492:
----------------------------------

@Edward, this is a heuristic that should be generally true. The good news is that we are not aware of any exceptions that violate the rule (assuming multiple attempts of the same task give deterministic results). The reason we are relying on a heuristic here is that the old Hadoop API does not support exception handling outside the Mapper's map() function. The bug shows up when an exception is thrown in Hadoop's RecordReader layer and the message is not passed to the Mapper. When mapper.close() is called, there is no way for the mapper to know whether an exception happened in the Hadoop code path. A better way to handle this is to use the new Hadoop API, which gives more control to the application layer. This heuristic is a workaround based on the old Hadoop API.

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task.
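
For illustration, here is a minimal sketch of the keep-the-largest-file heuristic described above. This is not the actual Utilities.removeTempOrDuplicateFiles() implementation from the patch: the class name, the taskIdOf() helper, and the assumption that output files are named <taskId>_<attempt> (e.g. "000000_1") are all illustrative.

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DuplicateFilePruner {

  // Hypothetical helper: derive a task id from an output file name such as
  // "000000_1" (task 000000, attempt 1). The exact naming scheme is an assumption.
  private static String taskIdOf(FileStatus file) {
    String name = file.getPath().getName();
    int idx = name.indexOf('_');
    return idx > 0 ? name.substring(0, idx) : name;
  }

  // For each task id, keep only the largest file in the directory and delete
  // the rest, which presumably came from failed or speculative attempts.
  public static void keepLargestPerTask(FileSystem fs, Path dir) throws IOException {
    Map<String, FileStatus> largest = new HashMap<String, FileStatus>();

    for (FileStatus file : fs.listStatus(dir)) {
      if (file.isDir()) {
        continue;
      }
      String taskId = taskIdOf(file);
      FileStatus best = largest.get(taskId);
      if (best == null) {
        largest.put(taskId, file);
      } else if (file.getLen() > best.getLen()) {
        // This attempt produced a bigger file, so the smaller one is assumed
        // to be an incomplete duplicate: drop it and remember the bigger one.
        fs.delete(best.getPath(), false);
        largest.put(taskId, file);
      } else {
        fs.delete(file.getPath(), false);
      }
    }
  }
}
{code}

The sketch relies on the same assumption stated in the comment: multiple attempts of the same task produce deterministic output, so the largest file is the most complete one and the smaller duplicates can only come from partially written attempts.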