We've seen this issue in production as well. We also aren't sure what
causes it, but we recently shaded some of the Spark code in
TaskSchedulerImpl so that an exception bubbles up from Spark instead of
the driver becoming a zombie in this situation. If you're interested I can
go into more detail about that. Otherwise I'm also keen to find out more
about how this might be happening.
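Roughly, the idea looks like this. This is a simplified sketch, not our actual patch, and names like StarvationWatchdog are made up; the real change lives inside a shaded copy of TaskSchedulerImpl, where stock Spark 1.4 only logs the "Initial job has not accepted any resources" warning on a timer:

```scala
import java.util.{Timer, TimerTask}

// Hypothetical sketch: bound the starvation check instead of warning forever.
// After maxWarnings consecutive starved checks, fail the job rather than hang.
class StarvationWatchdog(checkIntervalMs: Long, maxWarnings: Int)(isStarved: () => Boolean) {
  private val timer = new Timer("starvation-watchdog", /* daemon = */ true)
  private var warnings = 0 // only touched from the single timer thread

  // onFailure would typically stop the SparkContext or rethrow on the driver.
  def start(onFailure: Throwable => Unit): Unit = {
    timer.scheduleAtFixedRate(new TimerTask {
      override def run(): Unit = {
        if (isStarved()) {
          warnings += 1
          if (warnings >= maxWarnings) {
            timer.cancel()
            onFailure(new IllegalStateException(
              s"Initial job accepted no resources after $warnings checks; failing fast"))
          }
        } else {
          timer.cancel() // tasks have started running; nothing left to watch
        }
      }
    }, checkIntervalMs, checkIntervalMs)
  }
}
```

In our setup the failure callback tears down the driver, so the submission exits non-zero and our scheduler retries it, instead of the job zombie-ing for hours.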

On Fri, Jun 26, 2015 at 8:28 AM, Sjoerd Mulder <sjoerdmul...@gmail.com>
wrote:

> Hi,
>
> I have a really annoying issue that I cannot replicate consistently, yet
> it happens about once every 100 submissions (it's a job that runs every
> 3 minutes).
> I've already reported an issue for this:
> https://issues.apache.org/jira/browse/SPARK-8592
>
> Here are the thread dumps of the driver and the executor:
> https://docs.google.com/document/d/1x7ZwUzlvRqeJQ12FoGhpLV1zqDAmVsaF2HYhzkPNBKQ
>
> Any direction I should look into?
>
> Spark 1.4.0
> Java 1.8.0_45 (Oracle Corporation)
> Scala 2.11.6
>
> I already tried to resolve the NPE by not logging the ActorRef. This makes
> the NPE go away :)
>
> But I expect the root cause lies deeper, since the driver then still
> hangs with the "*WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources*" messages. But there are enough resources
> available in the cluster; it has plenty of CPU and memory left.
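[For what it's worth, the null guard you describe can be as simple as the following. This is a hypothetical illustration; `describeRef` is a made-up name, and simply not logging the ref at all, as you did, works just as well:]

```scala
// Hypothetical helper: render a possibly-null reference for a log message
// without risking an NPE from calling toString on null.
def describeRef(ref: AnyRef): String =
  Option(ref).map(_.toString).getOrElse("<null ref>")
```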
>
> Logs from Driver:
>
> 15/06/26 11:58:19 INFO Remoting: Starting remoting
> 15/06/26 11:58:19 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@172.17.0.123:51415]
> 15/06/26 11:58:19 INFO Utils: Successfully started service 'sparkDriver'
> on port 51415.
> 15/06/26 11:58:20 INFO SparkEnv: Registering MapOutputTracker
> 15/06/26 11:58:20 INFO SparkEnv: Registering BlockManagerMaster
> 15/06/26 11:58:20 INFO DiskBlockManager: Created local directory at
> /tmp/spark-ff1f5a88-4e1d-4fe0-9c54-890e4174f02a/blockmgr-92b1e974-53bb-45a3-b918-916759e14630
> 15/06/26 11:58:20 INFO MemoryStore: MemoryStore started with capacity
> 265.1 MB
> 15/06/26 11:58:20 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-ff1f5a88-4e1d-4fe0-9c54-890e4174f02a/httpd-f5894293-33aa-4eaa-9740-4a36c054b6c8
> 15/06/26 11:58:20 INFO HttpServer: Starting HTTP Server
> 15/06/26 11:58:20 INFO Utils: Successfully started service 'HTTP file
> server' on port 33176.
> 15/06/26 11:58:20 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/06/26 11:58:20 INFO Utils: Successfully started service 'SparkUI' on
> port 4040.
> 15/06/26 11:58:20 INFO SparkUI: Started SparkUI at
> http://172.17.0.123:4040
> 15/06/26 11:58:20 INFO SparkContext: Added JAR
> file:/opt/jar/spark/spark-job-1.0-SNAPSHOT.jar at
> http://172.17.0.123:33176/jars/spark-job-1.0-SNAPSHOT.jar with timestamp
> 1435319900257
> 15/06/26 11:58:20 INFO AppClient$ClientActor: Connecting to master
> akka.tcp://sparkMaster@172.17.42.1:7077/user/Master...
> 15/06/26 11:58:20 INFO SparkDeploySchedulerBackend: Connected to Spark
> cluster with app ID app-20150626115820-0917
> 15/06/26 11:58:20 INFO AppClient$ClientActor: Executor added:
> app-20150626115820-0917/0 on worker-20150625133752-10.0.7.171-47050 (
> 10.0.7.171:47050) with 1 cores
> 15/06/26 11:58:20 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20150626115820-0917/0 on hostPort 10.0.7.171:47050 with 1 cores, 2.0
> GB RAM
> 15/06/26 11:58:20 INFO TaskSchedulerImpl: Starting speculative execution
> thread
> 15/06/26 11:58:20 INFO AppClient$ClientActor: Executor updated:
> app-20150626115820-0917/0 is now LOADING
> 15/06/26 11:58:20 INFO AppClient$ClientActor: Executor updated:
> app-20150626115820-0917/0 is now RUNNING
> 15/06/26 11:58:20 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52000.
> 15/06/26 11:58:20 INFO NettyBlockTransferService: Server created on 52000
> 15/06/26 11:58:20 INFO BlockManagerMaster: Trying to register BlockManager
> 15/06/26 11:58:20 INFO BlockManagerMasterEndpoint: Registering block
> manager 172.17.0.123:52000 with 265.1 MB RAM, BlockManagerId(driver,
> 172.17.0.123, 52000)
> 15/06/26 11:58:20 INFO BlockManagerMaster: Registered BlockManager
> 15/06/26 11:58:20 INFO SparkDeploySchedulerBackend: SchedulerBackend is
> ready for scheduling beginning after reached minRegisteredResourcesRatio:
> 0.0
> 15/06/26 11:58:24 INFO Exchange: Using SparkSqlSerializer2.
> 15/06/26 11:58:24 INFO Exchange: Using SparkSqlSerializer2.
> 15/06/26 11:58:24 INFO SparkContext: Starting job: map at
> SparkProductEventAggregator.scala:144
> 15/06/26 11:58:24 INFO Version: Elasticsearch Hadoop v2.1.0.rc1
> [5cc3f53084]
> 15/06/26 11:58:24 INFO ScalaEsRowRDD: Reading from
> [675d42c8-9823-4d3c-8e86-5aa611d38770/events]
> 15/06/26 11:58:24 INFO ScalaEsRowRDD: Discovered mapping
> {675d42c8-9823-4d3c-8e86-5aa611d38770=[REMOVED]} for
> [675d42c8-9823-4d3c-8e86-5aa611d38770/events]
> 15/06/26 11:58:24 INFO DAGScheduler: Registering RDD 5 (map at
> SparkProductEventAggregator.scala:144)
> 15/06/26 11:58:24 INFO DAGScheduler: Got job 0 (map at
> SparkProductEventAggregator.scala:144) with 200 output partitions
> (allowLocal=false)
> 15/06/26 11:58:24 INFO DAGScheduler: Final stage: ResultStage 1(map at
> SparkProductEventAggregator.scala:144)
> 15/06/26 11:58:24 INFO DAGScheduler: Parents of final stage:
> List(ShuffleMapStage 0)
> 15/06/26 11:58:24 INFO DAGScheduler: Missing parents: List(ShuffleMapStage
> 0)
> 15/06/26 11:58:24 INFO DAGScheduler: Submitting ShuffleMapStage 0
> (MapPartitionsRDD[5] at map at SparkProductEventAggregator.scala:144),
> which has no missing parents
> 15/06/26 11:58:24 INFO MemoryStore: ensureFreeSpace(12384) called with
> curMem=0, maxMem=278019440
> 15/06/26 11:58:24 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 12.1 KB, free 265.1 MB)
> 15/06/26 11:58:24 INFO MemoryStore: ensureFreeSpace(5542) called with
> curMem=12384, maxMem=278019440
> 15/06/26 11:58:24 INFO MemoryStore: Block broadcast_0_piece0 stored as
> bytes in memory (estimated size 5.4 KB, free 265.1 MB)
> 15/06/26 11:58:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in
> memory on 172.17.0.123:52000 (size: 5.4 KB, free: 265.1 MB)
> 15/06/26 11:58:24 INFO SparkContext: Created broadcast 0 from broadcast at
> DAGScheduler.scala:874
> 15/06/26 11:58:24 INFO DAGScheduler: Submitting 5 missing tasks from
> ShuffleMapStage 0 (MapPartitionsRDD[5] at map at
> SparkProductEventAggregator.scala:144)
> 15/06/26 11:58:24 INFO TaskSchedulerImpl: Adding task set 0.0 with 5 tasks
> 15/06/26 11:58:39 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources
> 15/06/26 11:58:54 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources
> 15/06/26 11:59:09 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources
>
>
> Sjoerd
