[ https://issues.apache.org/jira/browse/SAMZA-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jon Bringhurst updated SAMZA-2804:
----------------------------------
Description:
Several possible issues were identified in run-class.sh, including:
h2. Race condition in pathing jar manifest creation
A race condition exists when setting up the classpath during container launch.
During container launch using samza-yarn, run-class.sh creates a pathing jar
file (which holds the classpath for the container launch). However, neither the
temporary files created along the way nor the pathing jar itself is placed in a
location unique to the container. As a result, multiple containers write to the
same pathing jar and temporary file locations, which causes the race condition.
This race condition may show up in several ways, such as when YARN removes jars
from a finished container (other containers will then point to a classpath
which no longer exists) or when multiple run-class.sh scripts attempt to write
the manifest.txt or pathing jar at the same time.
Note that enabling host affinity makes this problem worse. The pathing.jar is
written to the usercache, so once the container which created the pathing.jar
has finished and been removed, any new container which launches on that host
will point to jar files which no longer exist. With host affinity enabled, the
container is not moved to a new host and simply keeps failing.
Typical symptoms are sporadic exceptions (a retry usually succeeds) like the
following:
{noformat}
java.io.IOException: line too long (line 511)
	at java.base/java.util.jar.Attributes.read(Attributes.java:380)
	at java.base/java.util.jar.Manifest.read(Manifest.java:290)
	at java.base/java.util.jar.Manifest.<init>(Manifest.java:100)
	at java.base/java.util.jar.Manifest.<init>(Manifest.java:76)
	at jdk.jartool/sun.tools.jar.Main.run(Main.java:277)
	at jdk.jartool/sun.tools.jar.Main.main(Main.java:1683)
{noformat}
and
{noformat}
Error: Could not find or load main class <snip job task process class name>
Caused by: java.lang.ClassNotFoundException: <snip job task process class name>
{noformat}
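One possible direction, shown only as a rough sketch and not as the actual run-class.sh change: stage the manifest and pathing jar in a directory that no other container can share. The helper name make_staging_dir and the samza-launch base path are hypothetical, and the sketch assumes YARN exports CONTAINER_ID into the container environment (it falls back to the shell PID otherwise).

```shell
# Hypothetical helper (not from run-class.sh): create a private staging
# directory for one container's manifest.txt and pathing jar.
make_staging_dir() {
  base="$1"
  # Assumption: CONTAINER_ID is exported by YARN; otherwise use the shell PID.
  id="${CONTAINER_ID:-pid-$$}"
  mkdir -p "$base" || return 1
  # mktemp -d creates the directory itself and fails rather than reusing an
  # existing one, so two launches never end up with the same path.
  mktemp -d "$base/pathing-$id.XXXXXX"
}

staging=$(make_staging_dir "${TMPDIR:-/tmp}/samza-launch") || exit 1
manifest="$staging/manifest.txt"

# Illustrative classpath only; the real script derives it from the job's libs.
printf 'Class-Path: %s\n' "lib/a.jar lib/b.jar" > "$manifest"

# Build the pathing jar next to its manifest when a JDK is available.
if command -v jar >/dev/null 2>&1; then
  jar cfm "$staging/pathing.jar" "$manifest"
fi
```

Because both files live under a path that only this launch knows, a concurrent launch on the same host cannot overwrite a half-written manifest, and the classpath would then point at "$staging/pathing.jar" instead of a shared location.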
h2. Container logging directory fallback is not unique for each container
The fallback log directory is shared by all containers running on the same
host. It should be unique per container.
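A minimal sketch of a per-container fallback, under assumptions: resolve_log_dir is a hypothetical helper, SAMZA_LOG_DIR is assumed to be the variable carrying an explicitly configured log directory, and CONTAINER_ID is assumed to be exported by YARN.

```shell
# Hypothetical helper: pick the log directory for this container.
# An explicitly configured directory wins; the fallback gets a
# container-specific suffix so containers on one host do not collide.
resolve_log_dir() {
  base="$1"
  if [ -n "${SAMZA_LOG_DIR:-}" ]; then
    printf '%s\n' "$SAMZA_LOG_DIR"
  else
    # Assumption: CONTAINER_ID is set by YARN; otherwise use the shell PID.
    printf '%s/%s\n' "$base" "${CONTAINER_ID:-pid-$$}"
  fi
}
```

For example, with SAMZA_LOG_DIR unset and CONTAINER_ID=container_a, calling resolve_log_dir /tmp/logs prints /tmp/logs/container_a.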
h2. Container tmp dir is not unique per-container
The JAVA_TMP_DIR directory is the same for all containers. We should verify
that sharing one directory across all containers is safe.
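If sharing turns out not to be safe, one option is to give each container its own directory and hand it to the JVM through the standard java.io.tmpdir system property. This is a sketch under assumptions: per_container_tmp_dir and the samza-tmp base path are hypothetical, and CONTAINER_ID is assumed to be exported by YARN.

```shell
# Hypothetical helper: create and print a tmp directory for this container.
per_container_tmp_dir() {
  base="$1"
  # Assumption: CONTAINER_ID is set by YARN; otherwise use the shell PID.
  dir="$base/${CONTAINER_ID:-pid-$$}"
  mkdir -p "$dir" && printf '%s\n' "$dir"
}

JAVA_TMP_DIR=$(per_container_tmp_dir "${TMPDIR:-/tmp}/samza-tmp")

# java.io.tmpdir is the JVM system property that controls where
# File.createTempFile and similar APIs place their files.
JAVA_OPTS="${JAVA_OPTS:-} -Djava.io.tmpdir=$JAVA_TMP_DIR"
```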
was:
Several possible issues were identified in run-class.sh, including:
h2. Race condition in pathing jar manifest creation
A race condition exists when setting up the classpath during container launch.
During container launch using samza-yarn, run-class.sh creates a pathing jar
file (which holds the classpath for the container launch). However, during the
creation of this pathing jar, temporary files, as well as the pathing jar
itself is not placed in a location unique to the container. This results in
multiple containers writing to the same pathing jar location and temporary file
location, which results in a race condition.
This race condition may show up in several ways, such as when Yarn removes jars
from a finished container (other containers will point to a classpath which no
longer exists) or when multiple run-class.sh scripts attempt to write the
manifest.txt or pathing jar at the same time.
Note that host affinity being enabled will make this problem worse. The
pathing.jar is written to the usercache, so when the container which created
the pathing.jar is finished and removed, any new container which launches on
that host will point to jar files which do not exist anymore. When host
affinity is enabled, it will not move to a new host and just keep failing.
h2. Container logging directory fallback is not unique for each container
The fallback log directory is the same among all containers running on the same
host. It should be unique per-container.
h2. Container tmp dir is not unique per-container
The JAVA_TMP_DIR directory is the same for all containers. We should make sure
that it's safe to use the same directory for all containers.
> run-class.sh concurrency issues when on samza-yarn
> --------------------------------------------------
>
> Key: SAMZA-2804
> URL: https://issues.apache.org/jira/browse/SAMZA-2804
> Project: Samza
> Issue Type: Bug
> Reporter: Jon Bringhurst
> Assignee: Jon Bringhurst
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
--
This message was sent by Atlassian Jira
(v8.20.10#820010)