[ 
https://issues.apache.org/jira/browse/SPARK-10878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993888#comment-15993888
 ] 

Josh Rosen commented on SPARK-10878:
------------------------------------

[~jeeyoungk], my understanding is that there are two possible sources of races 
here:

1. Racing on the module descriptor / POM that we write out. Your suggestion of 
using a dummy string would address that.
2. Racing on writes to other files in the Ivy resolution cache. This is more 
likely to occur when the cache is initially clean and would manifest itself as 
corruption in other files.

Given this, I think the most robust solution is to use separate resolution 
caches since that would solve both problems in one fell swoop. If that's going 
to be hard then the partial fix of randomizing the module descriptor version 
(say, to use a timestamp) isn't a bad idea because it's pretty cheap to 
implement and probably covers the majority of real-world races here.

> Race condition when resolving Maven coordinates via Ivy
> -------------------------------------------------------
>
>                 Key: SPARK-10878
>                 URL: https://issues.apache.org/jira/browse/SPARK-10878
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.0
>            Reporter: Ryan Williams
>            Priority: Minor
>
> I've recently been shell-scripting the creation of many concurrent 
> Spark-on-YARN apps and observing a fraction of them to fail with what I'm 
> guessing is a race condition in their Maven-coordinate resolution.
> For example, I might spawn an app for each path in file {{paths}} with the 
> following shell script:
> {code}
> cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}"
> {code}
> When doing this, I observe some fraction of the spawned jobs to fail with 
> errors like:
> {code}
> :: retrieving :: org.apache.spark#spark-submit-parent
>         confs: [default]
> Exception in thread "main" java.lang.RuntimeException: problem during 
> retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: 
> failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
>         at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
>         at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
>         at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
>         at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006)
>         at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.text.ParseException: failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
>         at 
> org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
>         at 
> org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
>         at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
>         ... 7 more
> Caused by: org.xml.sax.SAXParseException; Premature end of file.
>         at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>         at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> {code}
> The more apps I try to launch simultaneously, the greater fraction of them 
> seem to fail with this or similar errors; a batch of ~10 will usually work 
> fine, a batch of 15 will see a few failures, and a batch of ~60 will have 
> dozens of failures.
> [This gist shows 11 recent failures I 
> observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to