[jira] [Created] (SPARK-7682) Size of distributed grids still limited by cPickle

Toby Potter (JIRA) Sat, 16 May 2015 01:42:20 -0700

Toby Potter created SPARK-7682:
----------------------------------

             Summary: Size of distributed grids still limited by cPickle 
                 Key: SPARK-7682
                 URL: https://issues.apache.org/jira/browse/SPARK-7682
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.3.1
         Environment: Redhat Enterprise Linux 6.5, Spark 1.3.1 standalone in 
cluster mode, 2 nodes with 64 GB spark slaves, Python 2.7.6
            Reporter: Toby Potter
            Priority: Minor



I'm trying to explore the possibilities of writing a fault-tolerant distributed 
computing engine for multidimensional arrays. I'm finding that the Python 
cPickle serializer is limiting the size of Numpy arrays that I can distribute 
over the cluster.

My example code is below

#!/usr/bin/env python

#Python app to use spark
from pyspark import SparkContext, SparkConf
import numpy

appName="Spark Test App"

# Create a spark context
conf = SparkConf().setAppName(appName)

# Set memory
conf = SparkConf().set("spark.executor.memory", "32g")
sc = SparkContext(conf=conf)

# Make array
grid=numpy.zeros((1024,1024,1024))

# Now parallelise and persist the data
rdd = sc.parallelize([("srcw", grid)])

# Make the data persist in memory
rdd_rdd.persist()

When I run the code I get the following error

Traceback (most recent call last):
  File "test_app.py", line 20, in <module>
    rdd = sc.parallelize([("srcw", grid)])
  File "/spark/1.3.1/python/pyspark/context.py", line 341, in parallelize
    serializer.dump_stream(c, tempFile)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 208, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 127, in dump_stream
    self._write_with_length(obj, stream)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 137, in 
_write_with_length
    serialized = self.dumps(obj)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 403, in dumps
    return cPickle.dumps(obj, 2)
SystemError: error return without exception set






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-7682) Size of distributed grids still limited by cPickle

Reply via email to