[ https://issues.apache.org/jira/browse/PIG-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161375#comment-16161375 ]

Erik Krogen commented on PIG-5290:
----------------------------------

Fantastic, thank you Rohini!

> User Cache upload contention can cause job failures
> ---------------------------------------------------
>
>                 Key: PIG-5290
>                 URL: https://issues.apache.org/jira/browse/PIG-5290
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>             Fix For: 0.18.0
>
>         Attachments: PIG-5290-1.patch, PIG-5290.patch
>
>
> We recently enabled the User Cache (PIG-2672) feature and found that 
> jobs would occasionally fail because of contention when uploading JARs into 
> the cache. Although the cache is designed to be fail-safe, i.e. to fall back 
> to normal behavior if anything goes wrong by catching every {{IOException}}, 
> the code which closes the output stream _is not_ wrapped within a 
> {{try}} statement, and thus an exception while closing that stream 
> causes the entire job to fail. If multiple jobs attempt to upload the 
> same JAR file simultaneously, the contention can cause this close 
> statement to fail.
> The current strategy also has two other flaws. First, consider the scenario 
> where job A begins uploading jar X. Job B also needs jar X, sees that the 
> file exists, and launches its tasks. However, job A has not yet finished 
> uploading jar X (perhaps it is large), so job B's tasks localize a 
> half-completed version of jar X. Second, the original design allowed the 
> same JAR (identical contents) to be shared between jobs even if a different 
> name was used. In PIG-3815, however, this ability was removed, and now JARs 
> are only shared if they have the same name.
> I propose we solve all of these issues simultaneously by returning to the 
> listStatus-based behavior (used prior to PIG-3815), but filtering out entries 
> ending in {{.tmp}}. When uploading, upload to {{randomNumber.tmp}}, then once 
> the upload completes, rename the file to the original name of the JAR file. This 
> ensures that incomplete files are never in a location that would be accessed 
> by other jobs, and the only write operation accessing a shared path is a 
> single rename operation.
> An alternative design is to use a single canonicalized name for all JAR files 
> (they will still be unique since they are inside of directories based on 
> their SHA1). Upload to a tmp file as previously described, then rename to the 
> canonical name. This removes the need to do a listStatus call; however, it 
> will result in classpaths that are not human-readable since the name of the JAR 
> file has been lost. I think it's worth it from a debugging standpoint to go 
> with the first design.
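
For illustration only, the first design described above (upload to a random
.tmp name, rename when complete, ignore .tmp entries when listing) might be
sketched roughly as follows. This is not the code from the attached patches.
It uses java.nio.file on a local directory as a stand-in for the Hadoop
FileSystem calls (create, rename, listStatus) that Pig would actually use, and
the names UserCacheSketch, findCompleted and getOrUpload are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: a local-filesystem analogue of the proposed cache
// upload, NOT the actual Pig/Hadoop implementation.
final class UserCacheSketch {

    // Lists the per-SHA1 directory and returns a completed entry, skipping
    // any in-progress upload whose name ends in ".tmp".
    static Optional<Path> findCompleted(Path shaDir) throws IOException {
        if (!Files.isDirectory(shaDir)) {
            return Optional.empty();
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(shaDir)) {
            for (Path entry : entries) {
                if (!entry.getFileName().toString().endsWith(".tmp")) {
                    return Optional.of(entry);
                }
            }
        }
        return Optional.empty();
    }

    // Returns the cached JAR, or empty if anything went wrong, in which case
    // the caller is expected to fall back to the normal (uncached) behavior.
    static Optional<Path> getOrUpload(Path shaDir, String jarName, InputStream jar) {
        Path tmp = null;
        try {
            Optional<Path> existing = findCompleted(shaDir);
            if (existing.isPresent()) {
                return existing; // same contents, possibly a different name
            }
            Files.createDirectories(shaDir);
            long random = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
            tmp = shaDir.resolve(random + ".tmp");
            // The write and the close of the output both happen inside this
            // try block, so a failure while closing is caught below as well.
            Files.copy(jar, tmp);
            Path dest = shaDir.resolve(jarName);
            // The only write touching a shared path is this single rename.
            Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
            return Optional.of(dest);
        } catch (IOException e) {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp); // best-effort cleanup
                } catch (IOException ignored) {
                    // nothing more to do; the cache stays fail-safe
                }
            }
            return Optional.empty();
        }
    }
}
```

A caller would treat an empty result as "cache unavailable" and ship the JAR
the normal way. How a rename onto an existing destination behaves depends on
the filesystem, so a real implementation would need to decide how to handle
two jobs finishing the same upload at nearly the same time.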



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
