[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048808#comment-14048808 ]
nigel commented on SPARK-1764:
------------------------------

I have seen this issue in a cut of the main branch taken on Friday 27th June (last commit https://github.com/srowen/spark/commit/18f29b96c7e0948f5f504e522e5aa8a8d1ab163e ). It does run successfully for a time (~800 requests) but then fails. This has happened on two different clusters (Mesos 0.18.0 and 0.18.2) at different sites.

One interesting side effect is that while this runs, the number of file handles in use on the slaves climbs to many thousands, so that may be the resource that is running out (a quick way to watch this is sketched after the loop below). One of the slaves (a different machine each time) always dies, and the job is eventually aborted.

This happens with a real app that is actually doing work with 18GB RDDs, but also with the simple but brutal "while True" loop below. A similar loop in Scala seems to run fine indefinitely. I would be interested to hear whether anyone else can run this loop for a significant amount of time, because success elsewhere would point to something in our environment.

Using Python version 2.7.6 (default, Jan 17 2014 10:13:17)
SparkContext available as sc.

In [1]: while True: sc.parallelize(range(100)).map(lambda n: n * 2).collect()
   ...:
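As a quick way to watch the handle growth mentioned above, a minimal sketch that can run on a slave. It assumes a Linux box with /proc mounted and that you find the executor's PID by hand (12345 below is a placeholder); it also has to run as the executor's user or as root:

    import os
    import time

    def open_fd_count(pid):
        # /proc/<pid>/fd holds one entry per open descriptor (files,
        # sockets, pipes), so the number of entries is the live handle
        # count for that process.
        return len(os.listdir("/proc/%d/fd" % pid))

    executor_pid = 12345  # placeholder: find the executor's PID with ps
    while True:
        print "%s fds=%d" % (time.strftime("%H:%M:%S"),
                             open_fd_count(executor_pid))
        time.sleep(10)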
The loop ran for a while, then:

14/06/27 13:59:08 INFO TaskSetManager: Finished TID 778 in 63 ms on guavcpt-ch2-a28p.sys.comcast.net (progress: 1/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 2)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 777 in 70 ms on guavcpt-ch2-a32p.sys.comcast.net (progress: 2/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 1)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 782 in 104 ms on guavcpt-ch2-a28p.sys.comcast.net (progress: 3/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 6)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 781 in 105 ms on guavcpt-ch2-a32p.sys.comcast.net (progress: 4/8)
14/06/27 13:59:08 INFO DAGScheduler: Completed ResultTask(97, 5)
14/06/27 13:59:08 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext
14/06/27 13:59:08 INFO TaskSchedulerImpl: Cancelling stage 97
14/06/27 13:59:08 INFO DAGScheduler: Could not cancel tasks for stage 97
java.lang.UnsupportedOperationException
        at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
        at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1066)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1052)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1052)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1052)
        at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1005)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:502)
        at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
        at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1167)
        at org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1162)
        at akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295)
        at akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253)
        at akka.actor.ActorCell.handleFailure(ActorCell.scala:338)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
        at akka.dispatch.Mailbox.run(Mailbox.scala:218)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 776 in 151 ms on guavcpt-ch2-a31p.sys.comcast.net (progress: 5/8)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 780 in 147 ms on guavcpt-ch2-a31p.sys.comcast.net (progress: 6/8)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 779 in 188 ms on guavcpt-ch2-a27p.sys.comcast.net (progress: 7/8)
14/06/27 13:59:08 INFO TaskSetManager: Finished TID 783 in 193 ms on guavcpt-ch2-a27p.sys.comcast.net (progress: 8/8)
14/06/27 13:59:08 INFO TaskSchedulerImpl: Removed TaskSet 97.0, whose tasks have all completed, from pool
14/06/27 13:59:09 WARN QueuedThreadPool: 5 threads could not be stopped
14/06/27 13:59:09 INFO SparkUI: Stopped Spark web UI at http://guavcpt-ch2-a26p.sys.comcast.net:4041
14/06/27 13:59:09 INFO DAGScheduler: Stopping DAGScheduler
I0627 13:59:09.129179 14267 sched.cpp:730] Stopping framework '20140605-200137-607132588-5050-19211-0073'
14/06/27 13:59:09 INFO MesosSchedulerBackend: driver.run() returned with code DRIVER_STOPPED
14/06/27 13:59:10 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/06/27 13:59:10 INFO ConnectionManager: Selector thread was interrupted!
14/06/27 13:59:10 INFO ConnectionManager: ConnectionManager stopped
14/06/27 13:59:10 INFO MemoryStore: MemoryStore cleared
14/06/27 13:59:10 INFO BlockManager: BlockManager stopped
14/06/27 13:59:10 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
14/06/27 13:59:10 INFO BlockManagerMaster: BlockManagerMaster stopped
14/06/27 13:59:10 INFO SparkContext: Successfully stopped SparkContext
14/06/27 13:59:10 ERROR OneForOneStrategy: EOF reached before Python server acknowledged
org.apache.spark.SparkException: EOF reached before Python server acknowledged
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:613)
        at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:584)
        at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:72)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:280)
        at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:278)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.Accumulators$.add(Accumulators.scala:278)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1223)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/27 13:59:10 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/06/27 13:59:10 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/06/27 13:59:10 INFO Remoting: Remoting shut down
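For context on what the error means: both the trace above and the one in the original report below fail inside PythonAccumulatorParam.addInPlace, which (as of the 1.0-era PythonRDD.scala) sends accumulator updates from the JVM driver to a driver-local Python server and then waits for a one-byte acknowledgement; the exception fires when the socket hits EOF before that byte arrives. A simplified sketch of that write-then-ack pattern, not Spark's actual wire format (host, port, and payload are stand-ins):

    import socket

    def send_updates_and_wait_for_ack(host, port, payload):
        # Mirror of the JVM side of the exchange as I read it: connect
        # to the driver-local accumulator server, send the updates, then
        # block for a one-byte acknowledgement. An empty read here is
        # the "EOF reached before Python server acknowledged" condition:
        # the server end closed the socket before replying.
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(payload)
            ack = sock.recv(1)
            if not ack:
                raise IOError("EOF before server acknowledged")
            return ack
        finally:
            sock.close()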
> EOF reached before Python server acknowledged
> ---------------------------------------------
>
>                 Key: SPARK-1764
>                 URL: https://issues.apache.org/jira/browse/SPARK-1764
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, PySpark
>    Affects Versions: 1.0.0
>            Reporter: Bouke van der Bijl
>            Priority: Blocker
>              Labels: mesos, pyspark
>
> I'm getting "EOF reached before Python server acknowledged" while using PySpark on Mesos. The error manifests itself in multiple ways. One is:
> 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext
> And the other has a full stacktrace:
> 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server acknowledged
> org.apache.spark.SparkException: EOF reached before Python server acknowledged
>         at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
>         at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
>         at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
>         at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
>         at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>         at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>         at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> This error causes the SparkContext to shut down. I have not been able to reliably reproduce this bug; it seems to happen randomly, but if you run enough tasks on a SparkContext it will happen eventually.
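For anyone trying to confirm outside the shell, here is the loop from the comment above as a self-contained script; a minimal sketch assuming 1.0-era PySpark, with a placeholder Mesos master URL and app name:

    from pyspark import SparkContext

    # Placeholders: the Mesos master URL and app name are site-specific.
    sc = SparkContext("mesos://master-host:5050", "spark-1764-repro")

    iterations = 0
    while True:
        # The same brutal loop as the shell session above; on the
        # affected clusters it fails after a few hundred iterations.
        sc.parallelize(range(100)).map(lambda n: n * 2).collect()
        iterations += 1
        if iterations % 100 == 0:
            print "completed %d iterations" % iterations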