Yes, your broadcast should be about 300MB, much smaller than 2GB; I didn't read your post carefully.
Broadcast in Python has been improved a lot as of 1.1; I think it will work in 1.1 or the
upcoming 1.2 release. Could you upgrade to 1.1?

Davies

On Tue, Nov 11, 2014 at 8:37 PM, bliuab <bli...@cse.ust.hk> wrote:
> Dear Liu:
>
> Thank you very much for your help. I will apply that patch. By the way,
> since I have succeeded in broadcasting an array of size 30M, and the log
> said that such an array takes around 230MB of memory, I think the numpy
> array that leads to the error is much smaller than 2GB.
>
> On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User
> List] <[hidden email]> wrote:
>>
>> This PR fixes the problem: https://github.com/apache/spark/pull/2659
>>
>> cc @josh
>>
>> Davies
>>
>> On Tue, Nov 11, 2014 at 7:47 PM, bliuab <[hidden email]> wrote:
>> > In spark-1.0.2, I have come across an error when I try to broadcast a
>> > quite large numpy array (35M elements). The error is a
>> > java.lang.NegativeArraySizeException; the details are listed below.
>> > Moreover, when I broadcast a relatively smaller numpy array (30M
>> > elements), everything works fine. A 30M-element numpy array takes
>> > 230MB of memory, which, in my opinion, is not very large.
>> > As far as I have surveyed, it seems related to py4j. However, I have
>> > no idea how to fix this. I would appreciate any hints.
>> > ------------
>> > py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
>> > Trace:
>> > java.lang.NegativeArraySizeException
>> >     at py4j.Base64.decode(Base64.java:292)
>> >     at py4j.Protocol.getBytes(Protocol.java:167)
>> >     at py4j.Protocol.getObject(Protocol.java:276)
>> >     at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
>> >     at py4j.commands.CallCommand.execute(CallCommand.java:77)
>> >     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>> > -------------
>> > And the test code is as follows:
>> >
>> >     from pyspark import SparkConf, SparkContext
>> >     import numpy as np
>> >
>> >     conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
>> >     conf.set('spark.executor.memory', '4000m')
>> >     conf.set('spark.akka.timeout', '100000')
>> >     conf.set('spark.ui.port', '8081')
>> >     conf.set('spark.cores.max', '150')
>> >     #conf.set('spark.rdd.compress', 'True')
>> >     conf.set('spark.default.parallelism', '300')
>> >     # configure the Spark environment
>> >     sc = SparkContext(conf=conf, batchSize=1)
>> >
>> >     vec = np.random.rand(35000000)
>> >     a = sc.broadcast(vec)
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Error-when-broadcast-numpy-array-tp18662.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
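[Editor's note: the sizes quoted in the thread can be sanity-checked with numpy alone, independent of Spark. A minimal standalone sketch (np.zeros used here instead of np.random.rand purely for speed; both produce float64, 8 bytes per element):]

```python
import numpy as np

# Arrays of the sizes discussed in the thread (float64 by default, 8 bytes/element).
ok_vec = np.zeros(30_000_000)    # the size that broadcasts fine
bad_vec = np.zeros(35_000_000)   # the size that triggers NegativeArraySizeException

print(ok_vec.nbytes / 1024**2)   # ~228.9 MiB, matching the ~230MB in the log
print(bad_vec.nbytes / 1024**2)  # ~267.0 MiB, still far below the 2GB limit
```

This supports Davies' point: the failing payload is roughly 300MB, nowhere near 2GB, so the NegativeArraySizeException came from the py4j Base64 path fixed in the linked PR, not from the array genuinely exceeding Java's array-size limit.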
>
> --
> My Homepage: www.cse.ust.hk/~bliuab
> MPhil student in Hong Kong University of Science and Technology.
> Clear Water Bay, Kowloon, Hong Kong.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
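[Editor's note: if upgrading past 1.0.2 is not immediately possible, a common workaround (a hypothetical sketch, not from the thread) is to bypass sc.broadcast entirely: save the array to storage every worker can see, such as an NFS mount or HDFS pulled to a local path, and np.load it inside tasks. Sketched below without a cluster; process_partition is shaped like a callable you could pass to rdd.mapPartitions, and the demo array size is reduced from the thread's 35M elements:]

```python
import os
import tempfile
import numpy as np

# Save the large array to a path visible to all workers (here: a temp dir
# as a stand-in for shared storage). This avoids pushing the bytes through
# the Py4J gateway, which is where the Base64 decode failed.
path = os.path.join(tempfile.gettempdir(), "big_vec.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))  # stand-in for the 35M-element array

def process_partition(rows):
    vec = np.load(path)        # loaded once per task, off the broadcast path
    for r in rows:
        yield float(vec[r])    # stand-in for the real per-row computation

# Locally this mimics rdd.mapPartitions(process_partition):
print(list(process_partition([0, 10, 999_999])))  # [0.0, 10.0, 999999.0]
```

The trade-off is one file read per task instead of one broadcast per executor, so this only makes sense until the py4j fix (the PR linked above) is available in the running Spark version.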