[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048808#comment-14048808 ]
nigel commented on SPARK-1764:
------------------------------

I have seen this issue in a cut of the main branch taken on Friday 27th June (last commit https://github.com/srowen/spark/commit/18f29b96c7e0948f5f504e522e5aa8a8d1ab163e ). It does run successfully for a time (~800 requests) but then fails. This has happened on two different clusters (Mesos 0.18.0 and 0.18.2) at different sites.

One interesting side effect is that while this runs, the number of file handles in use on the slaves climbs to many thousands, so that may be the resource that is running out (a quick way to watch this is sketched after the loop below). One of the slaves (a different machine each time) always dies, and the job is eventually aborted.

This happens with a real app that is actually doing work with 18GB RDDs, but also with the simple but brutal "while True" loop below. A similar loop in Scala seems to run fine indefinitely. I would be interested to hear whether anyone else can run this loop for a significant amount of time, because success elsewhere would point to something in our environment.

Using Python version 2.7.6 (default, Jan 17 2014 10:13:17)
SparkContext available as sc.

In [1]: while True: sc.parallelize(range(100)).map(lambda n: n * 2).collect()
   ...:
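As a quick way to watch the handle growth mentioned above, a minimal sketch that can run on a slave. It assumes a Linux box with /proc mounted and that you find the executor's PID by hand (12345 below is a placeholder); it also has to run as the executor's user or as root:

    import os
    import time

    def open_fd_count(pid):
        # /proc/<pid>/fd holds one entry per open descriptor (files,
        # sockets, pipes), so the number of entries is the live handle
        # count for that process.
        return len(os.listdir("/proc/%d/fd" % pid))

    executor_pid = 12345  # placeholder: find the executor's PID with ps
    while True:
        print "%s fds=%d" % (time.strftime("%H:%M:%S"),
                             open_fd_count(executor_pid))
        time.sleep(10)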
The loop ran for a while, then:

14/06/27 13:59:08 INFO TaskSetManager: Finished TID 778 in 63 ms on guavcpt-ch2-a28p.sys.comcast.net (progress: 1/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 2)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 777 in 70 ms on guavcpt-ch2-a32p.sys.comcast.net (progress: 2/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 1)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 782 in 104 ms on guavcpt-ch2-a28p.sys.comcast.net (progress: 3/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 6)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 781 in 105 ms on guavcpt-ch2-a32p.sys.comcast.net (progress: 4/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 5)
14/06/27 13:59:08 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext
14/06/27 13:59:08 INFO TaskSchedulerImpl: Cancelling stage 97
14/06/27 13:59:08 INFO DAGScheduler: Could not cancel tasks for stage 97
java.lang.UnsupportedOperationException
        at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
        at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1066)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1052)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1052)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1052)
        at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1005)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:502)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1167)
        at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1162)
        at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295)
        at akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253)
        at akka.actor.ActorCell.handleFailure(ActorCell.scala:338)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
        at akka.dispatch.Mailbox.run(Mailbox.scala:218)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 776 in 151 ms on guavcpt-ch2-a31p.sys.comcast.net (progress: 5/8)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 780 in 147 ms on guavcpt-ch2-a31p.sys.comcast.net (progress: 6/8)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 779 in 188 ms on guavcpt-ch2-a27p.sys.comcast.net (progress: 7/8)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 783 in 193 ms on guavcpt-ch2-a27p.sys.comcast.net (progress: 8/8)
14/06/27 13:59:08 INFO TaskSchedulerImpl: Removed TaskSet 97.0, whose tasks have all completed, from pool
14/06/27 13:59:09 WARN QueuedThreadPool: 5 threads could not be stopped
14/06/27 13:59:09 INFO SparkUI: Stopped Spark web UI at http://guavcpt-ch2-a26p.sys.comcast.net:4041
14/06/27 13:59:09 INFO DAGScheduler: Stopping DAGScheduler
I0627 13:59:09.129179 14267 sched.cpp:730] Stopping framework '20140605-200137-607132588-5050-19211-0073'
14/06/27 13:59:09 INFO MesosSchedulerBackend: driver.run() returned with code DRIVER_STOPPED
14/06/27 13:59:10 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/06/27 13:59:10 INFO ConnectionManager: Selector thread was interrupted!
14/06/27 13:59:10 INFO ConnectionManager: ConnectionManager stopped
14/06/27 13:59:10 INFO MemoryStore: MemoryStore cleared
14/06/27 13:59:10 INFO BlockManager: BlockManager stopped
14/06/27 13:59:10 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/06/27 13:59:10 INFO BlockManagerMaster: BlockManagerMaster stopped
14/06/27 13:59:10 INFO SparkContext: Successfully stopped SparkContext
14/06/27 13:59:10 ERROR OneForOneStrategy: EOF reached before Python server acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:613)
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:584)
        at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:72)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:280)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:278)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.Accumulators$.add(Accumulators.scala:278)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1223)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/27 13:59:10 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/06/27 13:59:10 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/06/27 13:59:10 INFO Remoting: Remoting shut down
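For context on what the error means: both the trace above and the one in the original report below fail inside PythonAccumulatorParam.addInPlace, which (as of the 1.0-era PythonRDD.scala) sends accumulator updates from the JVM driver to a driver-local Python server and then waits for a one-byte acknowledgement; the exception fires when the socket hits EOF before that byte arrives. A simplified sketch of that write-then-ack pattern, not Spark's actual wire format (host, port, and payload are stand-ins):

    import socket

    def send_updates_and_wait_for_ack(host, port, payload):
        # Mirror of the JVM side of the exchange as I read it: connect
        # to the driver-local accumulator server, send the updates, then
        # block for a one-byte acknowledgement. An empty read here is
        # the "EOF reached before Python server acknowledged" condition:
        # the server end closed the socket before replying.
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(payload)
            ack = sock.recv(1)
            if not ack:
                raise IOError("EOF before server acknowledged")
            return ack
        finally:
            sock.close()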
> EOF reached before Python server acknowledged
> ---------------------------------------------
>
>                 Key: SPARK-1764
>                 URL: https://issues.apache.org/jira/browse/SPARK-1764
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, PySpark
>    Affects Versions: 1.0.0
>            Reporter: Bouke van der Bijl
>            Priority: Blocker
>              Labels: mesos, pyspark
>
> I'm getting "EOF reached before Python server acknowledged" while using PySpark on Mesos. The error manifests itself in multiple ways. One is:
> 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext
> And the other has a full stacktrace:
> 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged
> org.apache.spark.SparkException: EOF reached before Python server acknowledged
>         at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
>         at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
>         at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
>         at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
>         at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>         at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> This error causes the SparkContext to shut down. I have not been able to reliably reproduce this bug; it seems to happen randomly, but if you run enough tasks on a SparkContext it will happen eventually.
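For anyone trying to confirm outside the shell, here is the loop from the comment above as a self-contained script; a minimal sketch assuming 1.0-era PySpark, with a placeholder Mesos master URL and app name:

    from pyspark import SparkContext

    # Placeholders: the Mesos master URL and app name are site-specific.
    sc = SparkContext("mesos://master-host:5050", "spark-1764-repro")

    iterations = 0
    while True:
        # The same brutal loop as the shell session above; on the
        # affected clusters it fails after a few hundred iterations.
        sc.parallelize(range(100)).map(lambda n: n * 2).collect()
        iterations += 1
        if iterations % 100 == 0:
            print "completed %d iterations" % iterations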