Re: Pyspark Error when broadcast numpy array

2014-11-12 Thread bliuab
Dear Liu:

I have tested this issue under Spark 1.1.0, and the problem is resolved in
this newer version.


On Wed, Nov 12, 2014 at 3:18 PM, Bo Liu bli...@cse.ust.hk wrote:

 Dear Liu:

 Thank you for your reply. I will set up an experimental environment for
 Spark 1.1 and test it. [...]
Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
In Spark 1.0.2, I have come across an error when I try to broadcast a fairly
large numpy array (35M elements). The error, a
java.lang.NegativeArraySizeException, is listed with details below. Moreover,
when I broadcast a relatively smaller numpy array (30M elements), everything
works fine, and a 30M-element numpy array takes 230MB of memory, which, in my
opinion, is not very large.
As far as I have surveyed, it seems related to py4j. However, I have no idea
how to fix this. I would appreciate any hints.

py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
Trace:
java.lang.NegativeArraySizeException
at py4j.Base64.decode(Base64.java:292)
at py4j.Protocol.getBytes(Protocol.java:167)
at py4j.Protocol.getObject(Protocol.java:276)
    at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
at py4j.commands.CallCommand.execute(CallCommand.java:77)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
-
And the test code is as follows:

import numpy as np
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
conf.set('spark.executor.memory', '4000m')
conf.set('spark.akka.timeout', '10')
conf.set('spark.ui.port', '8081')
conf.set('spark.cores.max', '150')
#conf.set('spark.rdd.compress', 'True')
conf.set('spark.default.parallelism', '300')
# configure the spark environment
sc = SparkContext(conf=conf, batchSize=1)

vec = np.random.rand(3500)  # the text above describes a 35M-element array
a = sc.broadcast(vec)
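
For context on the exception name: java.lang.NegativeArraySizeException is what
the JVM raises when an array is allocated with a negative length, which typically
happens when a size computation wraps around a signed 32-bit int. A minimal sketch
of that wraparound in plain Python (the byte count here is illustrative only; as
Davies notes below, the payload in this thread is just a few hundred MB, so whether
py4j's Base64 decoder hits exactly this path here is not established):

    def to_int32(n):
        # wrap an unbounded Python int the way a Java signed 32-bit int would
        return (n + 2**31) % 2**32 - 2**31

    payload_len = 3 * 10**9            # a payload larger than 2**31 - 1 bytes
    print(to_int32(payload_len))       # -1294967296: an invalid, negative array size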









Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
This PR fixes the problem: https://github.com/apache/spark/pull/2659

cc @josh

Davies
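
While waiting for the fix, one way to sidestep the failure on Spark 1.0.x is to
keep the large array out of the py4j channel entirely: write it to storage every
executor can read and load it inside tasks. A rough sketch, not the approach the
PR itself takes; the path '/shared/vec.npy' is a placeholder for NFS- or
similarly shared storage:

    import numpy as np
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName('broadcast_workaround'))

    # Write the big vector once from the driver to a shared path.
    vec = np.random.rand(35 * 10**6)
    np.save('/shared/vec.npy', vec)

    def lookup(indices):
        # Each partition loads (or memory-maps) the vector from storage
        # instead of receiving it through the py4j gateway.
        v = np.load('/shared/vec.npy', mmap_mode='r')
        return [float(v[i]) for i in indices]

    print(sc.parallelize([0, 1, 2, 3], 2).mapPartitions(lookup).collect())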

On Tue, Nov 11, 2014 at 7:47 PM, bliuab bli...@cse.ust.hk wrote:

 In Spark 1.0.2, I have come across an error when I try to broadcast a
 fairly large numpy array (35M elements). [...]



Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu:

Thank you very much for your help. I will apply that patch. By the way, as
I succeeded in broadcasting an array of size 30M, the log said that such an
array takes around 230MB of memory. As a result, I think the numpy array that
leads to the error is much smaller than 2G.

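A quick back-of-the-envelope check of the sizes mentioned in this exchange (the
4/3 factor assumes the payload is Base64-encoded on its way through py4j, which
the stack trace above suggests):

    # float64: 8 bytes per element; Base64 inflates byte counts by roughly 4/3
    for n in (30 * 10**6, 35 * 10**6):
        raw = n * 8
        encoded = raw * 4 // 3
        print(n, int(raw / 2**20), 'MiB raw,', int(encoded / 2**20), 'MiB base64')

The 30M-element case comes out to about 229 MiB, matching the ~230MB in the log,
and even the Base64-encoded 35M-element case is only about 356 MiB, well under 2G.
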
On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User List]
ml-node+s1001560n18673...@n3.nabble.com wrote:

 This PR fixes the problem: https://github.com/apache/spark/pull/2659

 cc @josh

 Davies

--
My Homepage: www.cse.ust.hk/~bliuab
MPhil student in Hong Kong University of Science and Technology.
Clear Water Bay, Kowloon, Hong Kong.
Profile at LinkedIn: http://www.linkedin.com/pub/liu-bo/55/52b/10b

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
Yes, your broadcast should be about 300M, much smaller than 2G; I
didn't read your post carefully.

The broadcast in Python has been much improved since 1.1. I think it
will work in 1.1 or the upcoming 1.2 release; could you upgrade to 1.1?

Davies
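
A minimal check one might run after upgrading, assuming a working 1.1
installation (the app name is a placeholder, and the size is chosen to match
the array discussed above):

    import numpy as np
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName('broadcast_check'))

    vec = np.random.rand(35 * 10**6)   # ~267 MiB of float64
    b = sc.broadcast(vec)

    # Force the executors to deserialize and touch the broadcast value.
    print(sc.parallelize(range(4), 4).map(lambda _: float(b.value.sum())).collect())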

On Tue, Nov 11, 2014 at 8:37 PM, bliuab bli...@cse.ust.hk wrote:
 Dear Liu:

 Thank you very much for your help. I will apply that patch. By the way, as
 I succeeded in broadcasting an array of size 30M, the log said that such an
 array takes around 230MB of memory. [...]



Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu:

Thank you for your reply. I will set up an experimental environment for
Spark 1.1 and test it.
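
For a quicker check than a full cluster, the same broadcast can be exercised in
local mode first (a sketch, assuming the Spark 1.1 pyspark package is on
PYTHONPATH; the app name is a placeholder):

    import numpy as np
    from pyspark import SparkContext

    # 'local[2]' runs Spark in-process with two worker threads; no cluster needed.
    sc = SparkContext('local[2]', 'broadcast_test')
    vec = np.random.rand(35 * 10**6)
    b = sc.broadcast(vec)
    print(sc.parallelize([0]).map(lambda _: float(b.value[0])).first())
    sc.stop()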

On Wed, Nov 12, 2014 at 2:30 PM, Davies Liu-2 [via Apache Spark User List] 
ml-node+s1001560n1868...@n3.nabble.com wrote:

 Yes, your broadcast should be about 300M, much smaller than 2G; I
 didn't read your post carefully.

 The broadcast in Python has been much improved since 1.1. I think it
 will work in 1.1 or the upcoming 1.2 release; could you upgrade to 1.1?

 Davies