[
https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208244#comment-15208244
]
Sean Owen commented on SPARK-13514:
-----------------------------------
All of this looks like a knock-on error. It looks like the YarnShuffleService
failed to start, so there's no blockHandler, which ultimately causes the NPE.
Do you see an error log starting with "Failed to initialize external shuffle
service"? That would indicate why it didn't start.
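
One way to check for that message is to scan the NodeManager log for it and print the lines that follow, which would normally include the underlying exception. The sketch below is only an illustration: the function name and the number of context lines are made up here, and the log file location is not stated in this thread, so it has to be supplied by whoever runs it.

```python
from typing import Iterable, List, Tuple

# Message text quoted in the comment above. Where the log lives is NOT
# stated in this thread and depends on the cluster setup.
MARKER = "Failed to initialize external shuffle service"


def find_shuffle_init_failures(
    lines: Iterable[str], context: int = 5
) -> List[Tuple[int, List[str]]]:
    """Return (1-based line number, matching line plus `context` following
    lines) for every line that contains MARKER."""
    buffered = [line.rstrip("\n") for line in lines]
    hits = []
    for index, line in enumerate(buffered):
        if MARKER in line:
            hits.append((index + 1, buffered[index:index + 1 + context]))
    return hits


def scan_log_file(path: str, context: int = 5) -> List[Tuple[int, List[str]]]:
    """Convenience wrapper: `path` is a log file chosen by the caller."""
    with open(path, errors="replace") as handle:
        return find_shuffle_init_failures(handle, context)
```

An empty result would mean the message is not in that file, in which case the startup failure (if any) was logged somewhere else or under a different message.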
> Spark Shuffle Service 1.6.0 issue in Yarn
> ------------------------------------------
>
> Key: SPARK-13514
> URL: https://issues.apache.org/jira/browse/SPARK-13514
> Project: Spark
> Issue Type: Bug
> Reporter: Satish Kolli
>
> Spark shuffle service 1.6.0 in Yarn fails with an unknown exception. When I
> replace the Spark shuffle jar with the version 1.5.2 jar file, the following
> succeeds without any issues.
> Hadoop Version: 2.5.1 (Kerberos Enabled)
> Spark Version: 1.6.0
> Java Version: 1.7.0_79
> {code}
> $SPARK_HOME/bin/spark-shell \
> --master yarn \
> --deploy-mode client \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.minExecutors=5 \
> --conf spark.yarn.executor.memoryOverhead=2048 \
> --conf spark.shuffle.service.enabled=true \
> --conf spark.scheduler.mode=FAIR \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --executor-memory 6G \
> --driver-memory 8G
> {code}
> {code}
> scala> val df = sc.parallelize(1 to 50).toDF
> df: org.apache.spark.sql.DataFrame = [_1: int]
> scala> df.show(50)
> {code}
> {code}
> 16/02/26 08:20:53 INFO spark.SparkContext: Starting job: show at <console>:30
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Got job 0 (show at
> <console>:30) with 1 output partitions
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 0
> (show at <console>:30)
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting ResultStage 0
> (MapPartitionsRDD[2] at show at <console>:30), which has no missing parents
> 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0 stored as
> values in memory (estimated size 2.2 KB, free 2.2 KB)
> 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
> as bytes in memory (estimated size 1411.0 B, free 3.6 KB)
> 16/02/26 08:20:53 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 10.5.76.106:46683 (size: 1411.0 B, free: 5.5 GB)
> 16/02/26 08:20:53 INFO spark.SparkContext: Created broadcast 0 from broadcast
> at DAGScheduler.scala:1006
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks
> from ResultStage 0 (MapPartitionsRDD[2] at show at <console>:30)
> 16/02/26 08:20:53 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
> 16/02/26 08:20:53 INFO scheduler.FairSchedulableBuilder: Added task set
> TaskSet_0 tasks to pool default
> 16/02/26 08:20:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
> 0.0 (TID 0, XXXXXXXXXXXXXXXXXXXXXXXX, partition 0,PROCESS_LOCAL, 2031 bytes)
> 16/02/26 08:20:53 INFO cluster.YarnClientSchedulerBackend: Disabling executor
> 2.
> 16/02/26 08:20:54 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 0)
> 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Trying to remove
> executor 2 from BlockManagerMaster.
> 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Removing block
> manager BlockManagerId(2, XXXXXXXXXXXXXXXXXXXXXXXX, 48113)
> 16/02/26 08:20:54 INFO storage.BlockManagerMaster: Removed 2 successfully in
> removeExecutor
> 16/02/26 08:20:54 ERROR cluster.YarnScheduler: Lost executor 2 on
> XXXXXXXXXXXXXXXXXXXXXXXX: Container marked as failed:
> container_1456492687549_0001_01_000003 on host: XXXXXXXXXXXXXXXXXXXXXXXX.
> Exit status: 1. Diagnostics: Exception from container-launch:
> ExitCodeException exitCode=1:
> ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> {code}