I have two compute nodes in Amazon EC2 and a third node that is used only to execute queries. The third node is not part of the cluster, and it is configured slightly differently: it runs Ubuntu 14.04, while the two cluster nodes run CentOS.
I launch pyspark on the third node using the following command:

    ./bin/pyspark --master spark://<ip of cluster master node>:7077

I'm attempting to execute the following Python snippet:

    data = sc.textFile('s3n://<keyhere>@test/data/y=2014/m=06/d=05/h=02/')
    data.count()

However, I get the following exception:

Py4JJavaError: An error occurred while calling o130.collectPartitions.
: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 123, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 180, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/rdd.py", line 856, in takeUpToNum
    yield next(iterator)
  File "<ipython-input-14-cf25bfc806aa>", line 1, in <lambda>
IndexError: list index out of range

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)

I have been able to narrow the problem down to something related to py4j, because the analogous code runs fine in the Scala shell. To make matters more confusing, if I ssh into the master node (CentOS) and run pyspark there, the exact same code snippet works fine. Additionally, if I execute data.collect() instead of data.count() on the third node, all of the data is written to the console as expected.

If anyone has ideas on how to continue troubleshooting this problem, the help would be appreciated.

Thanks,
Kris
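
P.S. For reference, here is a condensed recap of the exact sequence I'm running, with both actions side by side. It is the same snippet as above (the S3 path and credentials are abbreviated, as before); the comments just summarize the behavior I'm observing on each node.

    # In the pyspark shell on the third (non-cluster) node, started with:
    #   ./bin/pyspark --master spark://<ip of cluster master node>:7077
    data = sc.textFile('s3n://<keyhere>@test/data/y=2014/m=06/d=05/h=02/')

    data.collect()   # works on the third node: all records are printed to the console
    data.count()     # fails on the third node with the IndexError above,
                     # but works when run in pyspark on the master node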