Hi,

We observed strange behaviour of Spark 0.9.0 when using sc.stop().

We have a bunch of applications that perform some jobs and then issue
sc.stop() at the end of main. Most of the time everything works as
desired, but sometimes the applications get marked as "FAILED" by the
master and all remote executors get killed:

Executor Summary
> ExecutorID Worker Cores Memory State Logs
> 17 worker-20140520224948-10.240.75.212-39131 2 4096 KILLED stdout stderr
> 11 worker-20140520224947-10.240.121.104-40995 2 4096 KILLED stdout stderr
> 14 worker-20140520224948-10.240.10.39-57360 2 4096 KILLED stdout stderr
> 13 worker-20140520224855-10.240.124.170-41538 2 4096 KILLED stdout stderr
> 16 worker-20140520224802-10.240.110.72-51637 2 4096 KILLED stdout stderr
> 10 worker-20140520224948-10.240.146.198-53600 2 4096 KILLED stdout stderr
> 18 worker-20140520224948-10.240.109.20-49695 2 4096 KILLED stdout stderr
> 12 worker-20140520224950-10.240.238.138-50737 2 4096 KILLED stdout stderr
> 15 worker-20140520224947-10.240.255.168-57993 2 4096 KILLED stdout stderr


There are no errors in the logs or in stdout/stderr, except this message
from the master:

INFO [Thread-31] 2014-05-21 15:41:35,814 ProcessUtil.java (line 36)
> SparkMaster: 14/05/21 15:41:35 ERROR DseSparkMaster: Application
> TestRunner: count with ID app-20140521141832-0006 failed 10 times, removing
> it


Tail of logs from the application:

14/05/21 17:52:56.318 INFO SparkContext: Job finished: count at
> SchedulerThroughputTest.scala:32, took 6.429144266 s
> results: 18.316,8.017,7.032,6.836,6.882,6.416,6.413,6.592,6.299,6.435
> 14/05/21 17:52:59.543 INFO SparkDeploySchedulerBackend: Shutting down all
> executors
> 14/05/21 17:52:59.544 INFO SparkDeploySchedulerBackend: Asking each
> executor to shut down
> 14/05/21 17:53:00.607 INFO MapOutputTrackerMasterActor:
> MapOutputTrackerActor stopped!
> 14/05/21 17:53:00.661 INFO ConnectionManager: Selector thread was
> interrupted!
> 14/05/21 17:53:00.663 INFO ConnectionManager: ConnectionManager stopped
> 14/05/21 17:53:00.664 INFO MemoryStore: MemoryStore cleared
> 14/05/21 17:53:00.664 INFO BlockManager: BlockManager stopped
> 14/05/21 17:53:00.665 INFO BlockManagerMasterActor: Stopping
> BlockManagerMaster
> 14/05/21 17:53:00.666 INFO BlockManagerMaster: BlockManagerMaster stopped
> 14/05/21 17:53:00.669 INFO RemoteActorRefProvider$RemotingTerminator:
> Shutting down remote daemon.
> 14/05/21 17:53:00.670 INFO SparkContext: Successfully stopped SparkContext
> 14/05/21 17:53:00.672 INFO RemoteActorRefProvider$RemotingTerminator:
> Remote daemon shut down; proceeding with flushing remote transports.


Now if we do not call sc.stop() at the end of the application,
everything works fine, and Spark reports FINISHED every single time.

So should we keep calling sc.stop(), in which case the observed behaviour is
a Spark bug, or is this our bug and we shouldn't ever call sc.stop() at the
end of main?
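
For reference, here is a simplified sketch of the pattern our applications
follow. The master URL, data sizes and timing code below are made up for
illustration; the overall shape (create the context, run a few count jobs,
call sc.stop() at the very end of main) is what the real apps do:

import org.apache.spark.SparkContext

object SchedulerThroughputTest {
  def main(args: Array[String]) {
    // master URL and app name are placeholders for our real setup
    val sc = new SparkContext("spark://master:7077", "TestRunner: count")

    // run the same count job several times and record the durations
    val results = (1 to 10).map { _ =>
      val start = System.nanoTime()
      sc.parallelize(1 to 1000000, 100).count()
      (System.nanoTime() - start) / 1e9
    }
    println("results: " + results.mkString(","))

    // the call in question: after this the app sometimes ends up FAILED
    sc.stop()
  }
}

Commenting out the final sc.stop() is the only change between the FAILED and
FINISHED runs.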

Thanks,
Piotr

-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404
