I have two compute nodes in Amazon EC2 and a third node that is used only to execute queries. The third node is not part of the cluster, and it is configured slightly differently: it runs Ubuntu 14.04, while the two cluster nodes run CentOS.
I launch pyspark on the third node using the following command:

    ./bin/pyspark --master spark://<ip of cluster master node>:7077

I'm attempting to execute the following Python snippet:

    data = sc.textFile('s3n://<keyhere>@test/data/y=2014/m=06/d=05/h=02/')
    data.count()

However, I get the following exception:

Py4JJavaError: An error occurred while calling o130.collectPartitions.
: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 123, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 180, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/rdd.py", line 856, in takeUpToNum
    yield next(iterator)
  File "<ipython-input-14-cf25bfc806aa>", line 1, in <lambda>
IndexError: list index out of range

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)

I have been able to narrow the problem down to something related to py4j, because the analogous code runs fine in the Scala shell. To make matters more confusing, if I ssh into the master node (CentOS) and run pyspark there, the exact same code snippet works fine. Additionally, if I execute data.collect() instead of data.count() on the third node, all of the data is written to the console as expected.

If anyone has ideas on how to continue troubleshooting this problem, the help would be appreciated.

Thanks,
Kris
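
P.S. For reference, here is a condensed recap of the exact sequence I'm running, with both actions side by side. It is the same snippet as above (the S3 path and credentials are abbreviated, as before); the comments just summarize the behavior I'm observing on each node.

    # In the pyspark shell on the third (non-cluster) node, started with:
    #   ./bin/pyspark --master spark://<ip of cluster master node>:7077
    data = sc.textFile('s3n://<keyhere>@test/data/y=2014/m=06/d=05/h=02/')

    data.collect()   # works on the third node: all records are printed to the console
    data.count()     # fails on the third node with the IndexError above,
                     # but works when run in pyspark on the master node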