Toby Potter created SPARK-7682:
----------------------------------

             Summary: Size of distributed grids still limited by cPickle
                 Key: SPARK-7682
                 URL: https://issues.apache.org/jira/browse/SPARK-7682
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.3.1
         Environment: Redhat Enterprise Linux 6.5, Spark 1.3.1 standalone in cluster mode, 2 nodes with 64 GB spark slaves, Python 2.7.6
            Reporter: Toby Potter
            Priority: Minor
I'm trying to explore the possibilities of writing a fault-tolerant distributed computing engine for multidimensional arrays. I'm finding that the Python cPickle serializer limits the size of the NumPy arrays that I can distribute over the cluster.

My example code is below:

#!/usr/bin/env python
# Python app to use Spark
from pyspark import SparkContext, SparkConf
import numpy

appName = "Spark Test App"

# Create a Spark context
conf = SparkConf().setAppName(appName)

# Set executor memory
conf = conf.set("spark.executor.memory", "32g")

sc = SparkContext(conf=conf)

# Make array
grid = numpy.zeros((1024, 1024, 1024))

# Now parallelise the data
rdd = sc.parallelize([("srcw", grid)])

# Make the data persist in memory
rdd.persist()

When I run the code I get the following error:

Traceback (most recent call last):
  File "test_app.py", line 20, in <module>
    rdd = sc.parallelize([("srcw", grid)])
  File "/spark/1.3.1/python/pyspark/context.py", line 341, in parallelize
    serializer.dump_stream(c, tempFile)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 208, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 127, in dump_stream
    self._write_with_length(obj, stream)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 137, in _write_with_length
    serialized = self.dumps(obj)
  File "/spark/1.3.1/python/pyspark/serializers.py", line 403, in dumps
    return cPickle.dumps(obj, 2)
SystemError: error return without exception set
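For reference, one possible workaround is sketched below (illustrative only, not verified against this setup): split the grid along its first axis so that no single pickled record approaches cPickle's per-object limit. The slice-per-record keying and the numSlices value are assumptions of the sketch, and whether this avoids the failure depends on how the serializer batches records.

#!/usr/bin/env python
# Hypothetical workaround sketch: distribute the grid as many small keyed
# slices instead of one huge record, so each pickled object stays small.
from pyspark import SparkContext, SparkConf
import numpy

conf = SparkConf().setAppName("Spark Test App").set("spark.executor.memory", "32g")
sc = SparkContext(conf=conf)

# ~8 GB of float64 in total, but each 1024x1024 slice is only ~8 MB
grid = numpy.zeros((1024, 1024, 1024))

# Key each slice by its index so the grid could be reassembled later
slices = [(("srcw", i), grid[i]) for i in range(grid.shape[0])]

rdd = sc.parallelize(slices, numSlices=64)
rdd.persist()

The slices could then be processed per-record on the executors, or collected and reassembled on the driver (for example with numpy.stack), at the cost of tracking the slice index in the key.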