Github user ivanwick commented on the pull request:

https://github.com/apache/spark/pull/311#issuecomment-39755988

This patch fixes a bug with PySpark shell running on Mesos. Without the spark.executor.uri property, PySpark reports lost tasks because the slave is looking for the spark-executor in the wrong path and can never start it. It logs several "Lost TID" and "Executor lost", while the scheduler re-queues the lost tasks. They again fail for the same reason, finally ending with:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/spark-0.9.0-incubating-bin-cdh4/python/pyspark/rdd.py", line 539, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/opt/spark/spark-0.9.0-incubating-bin-cdh4/python/pyspark/rdd.py", line 505, in reduce
    vals = self.mapPartitions(func).collect()
  File "/opt/spark/spark-0.9.0-incubating-bin-cdh4/python/pyspark/rdd.py", line 469, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/opt/spark/spark-0.9.0-incubating-bin-cdh4/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/spark/spark-0.9.0-incubating-bin-cdh4/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError14/04/05 14:10:48 INFO TaskSetManager: Re-queueing tasks for 201404020012-1174907072-5050-22936-8 from TaskSet 0.0
14/04/05 14:10:48 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
: An error occurred while calling o21.collect.
: org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4 times (most recent failure: unknown)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```

The stderr of each slave in the Mesos framework reports:

```
sh: 1: /opt/spark/spark-0.9.0-incubating-bin-cdh4/sbin/spark-executor: not found
```

because this path doesn't exist on the slave nodes (this happens to be the path where it's installed on the head node).
When spark.executor.uri is set, as it is with the Scala REPL, Mesos is able to download the Spark distribution package and run it from the framework's temporary directory on the slave.
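
To illustrate the idea, here is a minimal sketch of how a shell startup script might derive the spark.executor.uri property from the environment. It is not the code from this pull request: the helper name `executor_uri_properties` is hypothetical, reading the value from a `SPARK_EXECUTOR_URI` environment variable is an assumption, and the HDFS URI in the demo is a placeholder.

```python
import os
from typing import Dict, Mapping, Optional


def executor_uri_properties(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the Spark properties to set so Mesos slaves can fetch Spark.

    Hypothetical helper: assumes the executor package location is supplied
    through a SPARK_EXECUTOR_URI environment variable.
    """
    env = os.environ if env is None else env
    uri = env.get("SPARK_EXECUTOR_URI", "").strip()
    # Only emit the property when a non-empty URI was provided; otherwise
    # the caller leaves the Spark configuration untouched.
    return {"spark.executor.uri": uri} if uri else {}


if __name__ == "__main__":
    # Placeholder URI, for illustration only.
    demo_env = {"SPARK_EXECUTOR_URI": "hdfs://namenode/dist/spark.tgz"}
    for key, value in executor_uri_properties(demo_env).items():
        print(f"{key}={value}")
```

With PySpark available, each returned pair could then be applied to a `SparkConf` with `conf.set(key, value)` before the `SparkContext` is created, so that the slaves download the package instead of looking for the head node's install path.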