Hi Hussam,

Have you looked at the stdout and stderr files from the worker process? You can find them in the “work” directory under SPARK_HOME on the slave node. They might have some information about why it crashed. Otherwise, I’d recommend inspecting the workers with tools like jmap (to see what objects take up memory) or jstack (to see what the threads are doing). A common cause of this kind of failure is a level of parallelism that is set too low, so that each task has to handle too much data at once.
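
To make the first suggestion concrete, here is a minimal Java sketch that walks a worker's work directory and prints the last lines of every stdout and stderr file it finds. The class name ExecutorLogTail and its two helper methods are made up for this example. It assumes the executor logs are plain files named stdout and stderr somewhere under the directory you pass in (falling back to $SPARK_HOME/work if no argument is given), so adjust it to your installation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Spark: prints the tail of executor logs.
public class ExecutorLogTail {

    // Collect every regular file named "stdout" or "stderr" under workDir.
    static List<Path> findExecutorLogs(Path workDir) throws IOException {
        try (Stream<Path> paths = Files.walk(workDir)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString();
                    return name.equals("stdout") || name.equals("stderr");
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Return the last maxLines lines of a file (the whole file if it is shorter).
    // Note: this reads the entire file into memory, which is fine for small logs only.
    static List<String> tail(Path file, int maxLines) throws IOException {
        List<String> lines = Files.readAllLines(file);
        int from = Math.max(0, lines.size() - maxLines);
        return new ArrayList<>(lines.subList(from, lines.size()));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the work directory is given as the first argument,
        // otherwise we guess $SPARK_HOME/work.
        String sparkHome = System.getenv("SPARK_HOME");
        Path workDir;
        if (args.length > 0) {
            workDir = Paths.get(args[0]);
        } else if (sparkHome != null) {
            workDir = Paths.get(sparkHome, "work");
        } else {
            System.err.println("usage: java ExecutorLogTail <work-dir>");
            return;
        }
        if (!Files.isDirectory(workDir)) {
            System.err.println("not a directory: " + workDir);
            return;
        }
        for (Path log : findExecutorLogs(workDir)) {
            System.out.println("==== " + log + " ====");
            for (String line : tail(log, 50)) {
                System.out.println(line);
            }
        }
    }
}
```

Run it on the slave node, for example as `java ExecutorLogTail /opt/spark-0.8.0/work` if that is where your work directory lives. Any exception or JVM error message from the failed executors should show up near the end of the corresponding stderr file.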

Matei

On Nov 12, 2013, at 8:53 AM, [email protected] wrote:

> Hi,
>
> Using spark 0.8 and hadoop 1.2.1 with cluster of 2 node each have 16 CPU and
> allocated 8G of RAM
>
> I am running into a use case that if I try to save a very large
> JavaRDD<String> that was created using paralleize from Java List<String> my
> job workers are failing as follows
>
> 13/11/11 19:23:48 INFO Worker: Executor app-20131111191414-0001/2 finished
> with state FAILED message Command exited with code 1 exitStatus 1
>
> Looks like the spark driver trying 5 times to execute the then decide to
> kill the process
>
> Any help on how to get more info on the reason of failure or what code 1
> existStatus 1 would means here?
>
> Any setting or configuration that I can use in spark that would dump more
> info on error?
>
> Here's my logs
>
> 13/11/11 19:14:50 INFO Worker: Asked to launch executor
> app-20131111190659-0000/0 for OMDBQueryService
> 13/11/11 19:14:50 INFO ExecutorRunner: Launch command: "java" "-cp"
> ":/opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
> "-Dspark.executor.memory=8g" "-Dspark.local.dir=/tmp/spark"
> "-XX:+UseParallelGC" "-XX:+UseParallelOldGC" "-XX:+DisableExplicitGC"
> "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.StandaloneExecutorBackend"
> "akka://spark@poc1:54482/user/StandaloneScheduler" "0" "poc3" "16"
> 13/11/11 19:16:47 INFO Worker: Executor app-20131111190659-0000/0 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 13/11/11 19:16:47 INFO Worker: Asked to launch executor
> app-20131111190659-0000/2 for OMDBQueryService
> 13/11/11 19:16:47 INFO ExecutorRunner: Launch command: "java" "-cp"
> ":/opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
> "-Dspark.executor.memory=8g" "-Dspark.local.dir=/tmp/spark"
> "-XX:+UseParallelGC" "-XX:+UseParallelOldGC" "-XX:+DisableExplicitGC"
> "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.StandaloneExecutorBackend"
> "akka://spark@poc1:54482/user/StandaloneScheduler" "2" "poc3" "16"
> 13/11/11 19:16:53 INFO Worker: Executor app-20131111190659-0000/2 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 13/11/11 19:16:53 INFO Worker: Asked to launch executor
> app-20131111190659-0000/4 for OMDBQueryService
> 13/11/11 19:16:53 INFO ExecutorRunner: Launch command: "java" "-cp"
> ":/opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
> "-Dspark.executor.memory=8g" "-Dspark.local.dir=/tmp/spark"
> "-XX:+UseParallelGC" "-XX:+UseParallelOldGC" "-XX:+DisableExplicitGC"
> "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.StandaloneExecutorBackend"
> "akka://spark@poc1:54482/user/StandaloneScheduler" "4" "poc3" "16"
> 13/11/11 19:17:02 INFO Worker: Executor app-20131111190659-0000/4 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 13/11/11 19:17:02 INFO Worker: Asked to launch executor
> app-20131111190659-0000/6 for OMDBQueryService
> 13/11/11 19:17:02 INFO ExecutorRunner: Launch command: "java" "-cp"
> ":/opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
> "-Dspark.executor.memory=8g" "-Dspark.local.dir=/tmp/spark"
> "-XX:+UseParallelGC" "-XX:+UseParallelOldGC" "-XX:+DisableExplicitGC"
> "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.StandaloneExecutorBackend"
> "akka://spark@poc1:54482/user/StandaloneScheduler" "6" "poc3" "16"
> 13/11/11 19:17:09 INFO Worker: Executor app-20131111190659-0000/6 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 13/11/11 19:17:09 INFO Worker: Asked to launch executor
> app-20131111190659-0000/8 for OMDBQueryService
> 13/11/11 19:17:09 INFO ExecutorRunner: Launch command: "java" "-cp"
> ":/opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
> "-Dspark.executor.memory=8g" "-Dspark.local.dir=/tmp/spark"
> "-XX:+UseParallelGC" "-XX:+UseParallelOldGC" "-XX:+DisableExplicitGC"
> "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.StandaloneExecutorBackend"
> "akka://spark@poc1:54482/user/StandaloneScheduler" "8" "poc3" "16"
> 13/11/11 19:17:17 INFO Worker: Executor app-20131111190659-0000/8 finished
> with state FAILED message Command exited with code 1 exitStatus 1
> 13/11/11 19:17:17 INFO Worker: Asked to launch executor
> app-20131111190659-0000/10 for OMDBQueryService
> 13/11/11 19:17:17 INFO ExecutorRunner: Launch command: "java" "-cp"
> ":/opt/spark-0.8.0/conf:/opt/spark-0.8.0/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
> "-Dspark.executor.memory=8g" "-Dspark.local.dir=/tmp/spark"
> "-XX:+UseParallelGC" "-XX:+UseParallelOldGC" "-XX:+DisableExplicitGC"
> "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Dspark.executor.memory=8g"
> "-Dspark.local.dir=/tmp/spark" "-XX:+UseParallelGC" "-XX:+UseParallelOldGC"
> "-XX:+DisableExplicitGC" "-XX:MaxPermSize=1024m" "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.StandaloneExecutorBackend"
> "akka://spark@poc1:54482/user/StandaloneScheduler" "10" "poc3" "16"
> 13/11/11 19:17:20 INFO Worker: Asked to kill executor
> app-20131111190659-0000/10
> 13/11/11 19:17:20 INFO ExecutorRunner: Killing process!
> 13/11/11 19:17:20 INFO ExecutorRunner: Runner thread for executor
> app-20131111190659-0000/10 interrupted
> 13/11/11 19:17:21 INFO Worker: Executor app-20131111190659-0000/10 finished
> with state KILLED
>
> Thanks,
> Hussam
