[ https://issues.apache.org/jira/browse/PIG-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161375#comment-16161375 ]
Erik Krogen commented on PIG-5290: ---------------------------------- Fantastic, thank you Rohini! > User Cache upload contention can cause job failures > --------------------------------------------------- > > Key: PIG-5290 > URL: https://issues.apache.org/jira/browse/PIG-5290 > Project: Pig > Issue Type: Bug > Affects Versions: 0.13.0 > Reporter: Erik Krogen > Assignee: Erik Krogen > Fix For: 0.18.0 > > Attachments: PIG-5290-1.patch, PIG-5290.patch > > > We recently enabled the User Cache (PIG-2672) feature and found that > occasionally jobs would fail because of contention when uploading JARs into > the cache. Although the cache is designed to be fail-safe, i.e. to fall back > to normal behavior if anything goes wrong by catching all {{IOException}}, > the portion of code which closes the output stream _is not_ wrapped within a > {{try}} statement and thus an exception during the closing of that stream > causes the entire job to fail. If multiple jobs are attempting to upload the > same JAR failure simultaneously, the contention can cause this close > statement to fail. > The current strategy also has two other flaws. First, consider the scenario > where job A begins uploading jar X. Job B also needs jar X, sees that the > file exists, and launches its tasks. Yet, job A has not yet finished > uploading jar X (perhaps it is large). So, the tasks are localizing a > half-completed version of jar X. Second, the original design allowed for the > same JAR (identical contents) to be shared between jobs even if a different > name was used. In PIG-3815, however, this ability was removed, and now JARs > are only shared if they have the same name. > I propose we solve all of these issues simultaneously by returning to the > listStatus based behavior (used prior to PIG-3815), but filter out entries > ending in {{.tmp}}. When uploading, upload to {{randomNumber.tmp}}, then once > the file is completed, do a rename to the original name of the JAR file. This > ensures that incomplete files are never in a location that would be accessed > by other jobs, and the only write operation accessing a shared path is a > single rename operation. > An alternative design is to use a single canonicalized name for all JAR files > (they will still be unique since they are inside of directories based on > their SHA1). Upload to a tmp file as previously described, then rename to the > canonical name. This removes the need to do a listStatus call; however it > will result in classpaths that are human unreadable since the name of the JAR > file has been lost. I think it's worth it from a debugging standpoint to go > with the first design. -- This message was sent by Atlassian JIRA (v6.4.14#64029)