[
https://issues.apache.org/jira/browse/PIG-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125911#comment-16125911
]
Rohini Palaniswamy commented on PIG-5290:
-----------------------------------------
[~xkrogen],
Using randomNumber.tmp is simple and effective. +1 on the idea. Have added
you to the contributors list. You can now assign the jira to yourself if you
are going to work on the patch.
> User Cache upload contention can cause job failures
> ---------------------------------------------------
>
> Key: PIG-5290
> URL: https://issues.apache.org/jira/browse/PIG-5290
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Erik Krogen
>
> We recently enabled the User Cache (PIG-2672) feature and found that
> occasionally jobs would fail because of contention when uploading JARs into
> the cache. Although the cache is designed to be fail-safe, i.e. to fall back
> to normal behavior if anything goes wrong by catching all {{IOException}},
> the portion of code which closes the output stream _is not_ wrapped within a
> {{try}} statement and thus an exception during the closing of that stream
> causes the entire job to fail. If multiple jobs are attempting to upload the
> same JAR simultaneously, the contention can cause this close
> statement to fail.
> The current strategy also has two other flaws. First, consider the scenario
> where job A begins uploading jar X. Job B also needs jar X, sees that the
> file exists, and launches its tasks. Yet, job A has not yet finished
> uploading jar X (perhaps it is large). So, the tasks are localizing a
> half-completed version of jar X. Second, the original design allowed for the
> same JAR (identical contents) to be shared between jobs even if a different
> name was used. In PIG-3815, however, this ability was removed, and now JARs
> are only shared if they have the same name.
> I propose we solve all of these issues simultaneously by returning to the
> listStatus-based behavior (used prior to PIG-3815), but filtering out entries
> ending in {{.tmp}}. When uploading, upload to {{randomNumber.tmp}}, then once
> the file is completed, do a rename to the original name of the JAR file. This
> ensures that incomplete files are never in a location that would be accessed
> by other jobs, and the only write operation accessing a shared path is a
> single rename operation.
> An alternative design is to use a single canonicalized name for all JAR files
> (they will still be unique since they are inside of directories based on
> their SHA1). Upload to a tmp file as previously described, then rename to the
> canonical name. This removes the need to do a listStatus call; however it
> will result in classpaths that are human unreadable since the name of the JAR
> file has been lost. I think it's worth it from a debugging standpoint to go
> with the first design.
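
The first design described above (upload to a random .tmp name, rename when
complete, ignore .tmp entries when listing) could look roughly like the
following. This is only a sketch: it uses java.nio.file on a local filesystem
as a stand-in for the Hadoop FileSystem calls Pig would actually make, and the
class and method names (CacheUploadSketch, findCached, upload) are made up for
illustration rather than taken from the Pig code base.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ThreadLocalRandom;

public class CacheUploadSketch {

    /**
     * Returns a completed entry from the per-SHA1 cache directory, or null if
     * there is none. Entries ending in ".tmp" are uploads still in progress
     * and are skipped, so readers never see a half-written JAR.
     */
    static Path findCached(Path cacheDir) throws IOException {
        if (!Files.isDirectory(cacheDir)) {
            return null;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(cacheDir)) {
            for (Path entry : entries) {
                if (!entry.getFileName().toString().endsWith(".tmp")) {
                    return entry;
                }
            }
        }
        return null;
    }

    /**
     * Uploads the JAR to <random>.tmp and then renames it to jarName.
     * Returns the cached path, or null on any IOException so that the caller
     * can fall back to the non-cached behavior instead of failing the job.
     */
    static Path upload(InputStream jar, Path cacheDir, String jarName) {
        Path tmp = null;
        try {
            Files.createDirectories(cacheDir);
            Path existing = findCached(cacheDir);
            if (existing != null) {
                return existing; // same contents already cached, possibly under another name
            }
            long random = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
            tmp = cacheDir.resolve(random + ".tmp");
            // The write and the close of the output stream both happen inside
            // this call, so a failure in either is caught below.
            Files.copy(jar, tmp);
            Path dest = cacheDir.resolve(jarName);
            // The rename is the only write that touches a shared path.
            // Assumption: what happens when dest already exists differs
            // between filesystems, so real code would need to handle losing
            // the race to another job.
            Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
            return dest;
        } catch (IOException e) {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp); // best-effort cleanup of the partial upload
                } catch (IOException ignored) {
                    // nothing more to do; the caller falls back anyway
                }
            }
            return null;
        }
    }
}
```

A caller would treat a null return as "cache unavailable" and ship the JAR the
normal way, which keeps the cache fail-safe as originally intended.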