[ 
https://issues.apache.org/jira/browse/HIVE-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761153#comment-16761153
 ] 

Jason Dere commented on HIVE-21214:
-----------------------------------

I'm not totally sure about the decision to change duplicate filename resolution 
from file size to task attempt number. If you just fixed the file size logic to 
take directories into account, the existing logic would also work in the 
directory case. With task attempts, we have to worry about whether this breaks 
any existing cases. If we are convinced that we only need to worry about Tez 
execution then I guess this could work, but it does not work on M/R with 
speculative execution.

In terms of code comments, might be better with RB, but I'll add comments here:
 * For the comments at the top of compareTempOrDuplicateFiles(), add a comment 
that this breaks speculative execution.
 * getDirSize() may not be the best name - this is really getting the file 
size, and doing so recursively in the case that the file turns out to be a 
directory. So maybe getFileSizeRecursively() or something.
 * Log at debug level in getDirSize()
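
To illustrate the naming point, here is a minimal sketch of a size helper that 
recurses when the path turns out to be a directory. It is not the code from the 
patch: it uses java.nio.file instead of the Hadoop FileSystem API so that it 
runs standalone, and the class name RecursiveSize and the method name (chosen 
along the lines of the rename suggested above) are assumptions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class RecursiveSize {
    private RecursiveSize() {}

    /**
     * Returns the size in bytes of a regular file, or the summed size of
     * everything underneath the path if it is a directory.
     */
    static long getFileSizeRecursively(Path path) throws IOException {
        if (!Files.isDirectory(path)) {
            return Files.size(path);
        }
        long total = 0L;
        // DirectoryStream must be closed, hence try-with-resources.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
            for (Path child : children) {
                total += getFileSizeRecursively(child);
            }
        }
        return total;
    }
}
```

With a helper like this, the existing "largest size wins" comparison could be 
applied to directories as well as plain files.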

I still need to make sense of the parsing changes.

> MoveTask : Use attemptId instead of file size for deduplication of files 
> compareTempOrDuplicateFiles()
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21214
>                 URL: https://issues.apache.org/jira/browse/HIVE-21214
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Deepak Jaiswal
>            Assignee: Deepak Jaiswal
>            Priority: Major
>         Attachments: HIVE-21214.1.patch
>
>
> For a given task, if there is more than one attempt then deduplication logic 
> kicks in.
> {noformat}
> Utilities.compareTempOrDuplicateFiles(){noformat}
> The logic uses file size and picks the one with the largest size. This logic 
> is very fragile.
> Ideally, it should pick the successful attempt's file.
> However, a simpler solution is to pick the newest attempt and also check 
> that the file size for the newest attempt is the largest.
> If not, throw an exception.
>  
> cc [~gopalv] [~thejas] [~jdere] [~ekoifman]
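
The rule proposed in the description (pick the newest attempt, verify that it 
is also the largest, otherwise fail) can be sketched roughly as follows. This 
is not the patch itself: the Candidate type, the class and method names, and 
the assumption that file names end in "_<attemptId>" (for example 000000_1) 
are all hypothetical and chosen for illustration.

```java
import java.util.List;

// Hypothetical sketch of attempt-based deduplication, for illustration only.
final class AttemptDedup {
    private AttemptDedup() {}

    /** Stand-in for a file status: file name plus size in bytes. */
    static final class Candidate {
        final String name;
        final long size;

        Candidate(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    /** Assumes names of the form "<taskId>_<attemptId>". */
    static int attemptId(String name) {
        int idx = name.lastIndexOf('_');
        if (idx < 0) {
            throw new IllegalArgumentException("No attempt id in: " + name);
        }
        return Integer.parseInt(name.substring(idx + 1));
    }

    /**
     * Picks the candidate with the highest attempt id and throws if some
     * other candidate for the same task is larger.
     */
    static Candidate pickNewestAttempt(List<Candidate> duplicates) {
        if (duplicates.isEmpty()) {
            throw new IllegalArgumentException("No candidates given");
        }
        Candidate newest = null;
        long maxSize = -1L;
        for (Candidate c : duplicates) {
            if (newest == null || attemptId(c.name) > attemptId(newest.name)) {
                newest = c;
            }
            maxSize = Math.max(maxSize, c.size);
        }
        if (newest.size < maxSize) {
            throw new IllegalStateException(
                "Newest attempt " + newest.name + " is not the largest file");
        }
        return newest;
    }
}
```

Note that under M/R speculative execution the newest attempt is not 
necessarily the successful one, which is the concern raised in the comment 
above.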



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
