Hi folks,

We have a few Spark streaming jobs running on a YARN cluster, and from time to time a job needs to be restarted (for example, because it was killed for an external reason).
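For reference, the restart is just a plain resubmission with spark-submit, roughly like this (the class name, jar path and queue below are placeholders, not our exact values):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.OurStreamingJob \
      --queue default \
      /path/to/our-streaming-job.jar

and the driver side follows the usual checkpoint / StreamingContext.getOrCreate pattern, sketched very roughly here (the batch interval, checkpoint path and stream setup are placeholders rather than our real code):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object OurStreamingJob {
      def main(args: Array[String]): Unit = {
        val checkpointDir = "hdfs:///checkpoints/our-streaming-job"

        def createContext(): StreamingContext = {
          val conf = new SparkConf().setAppName("our-streaming-job")
          val ssc = new StreamingContext(conf, Seconds(30))
          ssc.checkpoint(checkpointDir)
          // ... DStream sources and transformations are set up here ...
          ssc
        }

        // On restart this recovers the previous StreamingContext from the
        // checkpoint directory if one exists, otherwise it builds a fresh one.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }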
Once we submit the new job, it fails with the following exception:

    ERROR spark.SparkContext: Failed to add /mnt/data1/yarn/nm/usercache/spark/appcache/application_1537885048149_15382/container_e82_1537885048149_15382_01_000001/__app__.jar to Spark environment
    java.io.FileNotFoundException: Jar /mnt/data1/yarn/nm/usercache/spark/appcache/application_1537885048149_15382/container_e82_1537885048149_15382_01_000001/__app__.jar not found
            at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1807)
            at org.apache.spark.SparkContext.addJar(SparkContext.scala:1835)
            at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:457)

We know that application_1537885048149_15382 corresponds to the previous job that was killed, and that our YARN is cleaning up the usercache directory frequently to avoid choking the filesystem with unused files. Given that, what would you recommend for long-running jobs that have to be restarted when the previous context is no longer available because of this cleanup?

I hope it is clear what I meant; if you need more information, just ask.

Thanks,
JC