Erik Krogen created PIG-5290:
--------------------------------
Summary: User Cache upload contention can cause job failures
Key: PIG-5290
URL: https://issues.apache.org/jira/browse/PIG-5290
Project: Pig
Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Erik Krogen
We recently enabled the User Cache (PIG-2672) feature and found that
occasionally jobs would fail because of contention when uploading JARs into the
cache. Although the cache is designed to be fail-safe, i.e. to fall back to
normal behavior if anything goes wrong by catching all {{IOException}}s, the
portion of code which closes the output stream _is not_ wrapped within a
{{try}} statement and thus an exception during the closing of that stream
causes the entire job to fail. If multiple jobs attempt to upload the same JAR
simultaneously, the contention can cause this close statement to fail.
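To illustrate the close problem, here is a minimal sketch (not Pig's actual cache code) of an upload whose {{close()}} is covered by the same {{catch}} as the writes. The names {{FailSafeUpload}}, {{StreamOpener}}, {{tryUpload}}, and {{FailsOnClose}} are hypothetical, and plain {{java.io}} streams stand in for the real file system streams.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical illustration only; not the code in Pig's cache. */
final class FailSafeUpload {

    /** Assumed stand-in for however the cache opens its destination stream. */
    interface StreamOpener {
        OutputStream open() throws IOException;
    }

    /**
     * Copies src into the stream supplied by opener. Returns false if any
     * step, including close(), throws an IOException, so the caller can fall
     * back to the non-cached path instead of failing the job.
     */
    static boolean tryUpload(InputStream src, StreamOpener opener) {
        // try-with-resources runs close() inside the try statement, so an
        // exception thrown while closing reaches the catch clause below.
        try (OutputStream out = opener.open()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            return false;
        }
        return true;
    }

    /** Test double: accepts writes but fails on close, as under contention. */
    static final class FailsOnClose extends OutputStream {
        @Override
        public void write(int b) {
            // Discard the data; only the close() behavior matters here.
        }

        @Override
        public void close() throws IOException {
            throw new IOException("simulated close failure");
        }
    }
}
```

With this shape, {{tryUpload(in, FailsOnClose::new)}} returns {{false}} rather than propagating the exception, which matches the fail-safe intent described above.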
The current strategy also has two other flaws. First, consider the scenario
where job A begins uploading jar X. Job B also needs jar X, sees that the file
exists, and launches its tasks. However, job A has not yet finished uploading
jar X (perhaps it is large), so job B's tasks localize a partially uploaded
copy of jar X. Second, the original design allowed for the same JAR (identical
contents) to be shared between jobs even if a different name was used. In
PIG-3815, however, this ability was removed, and now JARs are only shared if
they have the same name.
I propose we solve both of these issues simultaneously by returning to the
{{listStatus}}-based behavior (used prior to PIG-3815), but filtering out
entries ending in {{.tmp}}. When uploading, upload to {{randomNumber.tmp}}, then
once the upload is complete, rename the file to the original name of the JAR.
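A rough sketch of the temp-file-then-rename idea follows. It uses {{java.nio.file}} on a local directory as a stand-in for the real file system API, so it is an assumption-laden illustration rather than a patch; the names {{TmpThenRenameCache}}, {{publish}}, and {{listCompleted}} are hypothetical. In particular, what happens when the destination already exists differs between file systems and would need to be handled explicitly in a real change.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical illustration on a local directory; not Pig's cache code. */
final class TmpThenRenameCache {

    static final String TMP_SUFFIX = ".tmp";

    /** Uploads to "<randomNumber>.tmp" in cacheDir, then renames to jarName. */
    static Path publish(InputStream jar, Path cacheDir, String jarName)
            throws IOException {
        long random = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        Path tmp = cacheDir.resolve(random + TMP_SUFFIX);
        Path dst = cacheDir.resolve(jarName);
        try {
            Files.copy(jar, tmp);
            // The final name only appears once the contents are complete.
            // Behavior when dst already exists is file-system specific.
            Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // Clean up the temp file if the copy or the rename failed.
            Files.deleteIfExists(tmp);
        }
        return dst;
    }

    /** Lists cache entries, skipping uploads that are still in progress. */
    static List<Path> listCompleted(Path cacheDir) throws IOException {
        List<Path> completed = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(cacheDir)) {
            for (Path entry : entries) {
                if (!entry.getFileName().toString().endsWith(TMP_SUFFIX)) {
                    completed.add(entry);
                }
            }
        }
        return completed;
    }
}
```

A job that lists the directory while another job is still uploading sees only the {{.tmp}} entry, which the filter skips, so it would upload its own copy instead of localizing a partial file.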
An alternative design is to use a single canonicalized name for all JAR files
(they will still be unique since they are inside of directories based on their
SHA1). Upload to a tmp file as previously described, then rename to the
canonical name. This removes the need to do a {{listStatus}} call; however, it
will result in classpaths that are not human-readable, since the name of the
JAR file is lost. I think it's worth it from a debugging standpoint to go with the
first design.
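For comparison, the alternative design could look roughly like the sketch below. The fixed file name {{cached.jar}}, the one-directory-per-SHA1 layout, and the names {{CanonicalJarPath}}, {{sha1Hex}}, and {{canonicalPath}} are assumptions made for illustration only.

```java
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of the canonical-name alternative. */
final class CanonicalJarPath {

    /** Assumed fixed name shared by every cached JAR. */
    static final String CANONICAL_NAME = "cached.jar";

    /** Returns the lowercase hex SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] contents) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(contents)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** The path depends only on the contents, so no listing is needed. */
    static Path canonicalPath(Path cacheRoot, byte[] contents) {
        return cacheRoot.resolve(sha1Hex(contents)).resolve(CANONICAL_NAME);
    }
}
```

Every classpath entry would then end in the same file name, which is the readability cost mentioned above.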