In Spark 1.0.2, I have come across an error when I try to broadcast a fairly large numpy array (35 million elements). The java.lang.NegativeArraySizeException and its stack trace are listed below. When I broadcast a somewhat smaller numpy array (30 million elements), everything works fine, and such an array only takes about 230 MB of memory, which, in my opinion, is not very large. From what I have surveyed so far, the problem seems related to py4j, but I have no idea how to fix it. I would appreciate any hints.

------------
py4j.protocol.Py4JError: An error occurred while calling o23.broadcast. Trace:
java.lang.NegativeArraySizeException
        at py4j.Base64.decode(Base64.java:292)
        at py4j.Protocol.getBytes(Protocol.java:167)
        at py4j.Protocol.getObject(Protocol.java:276)
        at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
        at py4j.commands.CallCommand.execute(CallCommand.java:77)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
------------

And the test code is as follows:

from pyspark import SparkConf, SparkContext
import numpy as np

conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
conf.set('spark.executor.memory', '4000m')
conf.set('spark.akka.timeout', '100000')
conf.set('spark.ui.port', '8081')
conf.set('spark.cores.max', '150')
#conf.set('spark.rdd.compress', 'True')
conf.set('spark.default.parallelism', '300')
#configure the spark environment
sc = SparkContext(conf=conf, batchSize=1)
vec = np.random.rand(35000000)
a = sc.broadcast(vec)
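
In the meantime, the only workaround I can think of is to split the array into several smaller pieces and broadcast each piece separately, on the assumption that the failure is triggered by the size of a single broadcast payload crossing the py4j boundary. A rough sketch (the chunk count and helper names here are just illustrative, not anything from the Spark API):

import numpy as np

def broadcast_in_chunks(sc, arr, n_chunks=8):
    # Split the 1-D array into n_chunks pieces and broadcast each one,
    # keeping every individual payload well below the size that fails.
    return [sc.broadcast(chunk) for chunk in np.array_split(arr, n_chunks)]

def reassemble(parts):
    # Rebuild the full array inside a task from the chunk broadcasts.
    return np.concatenate([p.value for p in parts])

parts = broadcast_in_chunks(sc, vec)
result = sc.parallelize(range(10)).map(lambda i: reassemble(parts)[i]).collect()

If the cost of reassembling the array in every task is a concern, the concatenation could instead be done once per partition with mapPartitions. I have not verified whether this avoids the exception in all cases, though, so a proper fix would still be welcome.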