Re: Pyspark Error when broadcast numpy array
Dear Liu: I have tested this issue under Spark 1.1.0, and the problem is solved in this newer version.

On Wed, Nov 12, 2014 at 3:18 PM, Bo Liu <bli...@cse.ust.hk> wrote:
> Dear Liu: Thank you for your reply. I will set up an experimental environment for Spark 1.1 and test it.
Re: Pyspark Error when broadcast numpy array
This PR fixes the problem: https://github.com/apache/spark/pull/2659

cc @josh

Davies

On Tue, Nov 11, 2014 at 7:47 PM, bliuab <bli...@cse.ust.hk> wrote:
> In Spark 1.0.2, I have come across an error when I try to broadcast a fairly large numpy array (35M elements). An excerpt of the error, the java.lang.NegativeArraySizeException and its trace, is listed below. Broadcasting a somewhat smaller numpy array (30M elements) works fine, and that array takes about 230MB of memory, which, in my opinion, is not very large. As far as I have surveyed, the problem seems related to py4j, but I have no idea how to fix it. I would appreciate any hints.
>
> py4j.protocol.Py4JError: An error occurred while calling o23.broadcast.
> Trace:
> java.lang.NegativeArraySizeException
>         at py4j.Base64.decode(Base64.java:292)
>         at py4j.Protocol.getBytes(Protocol.java:167)
>         at py4j.Protocol.getObject(Protocol.java:276)
>         at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
>         at py4j.commands.CallCommand.execute(CallCommand.java:77)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>
> The test code is as follows:
>
>     import numpy as np
>     from pyspark import SparkConf, SparkContext
>
>     # configure the Spark environment
>     conf = SparkConf().setAppName('brodyliu_LR').setMaster('spark://10.231.131.87:5051')
>     conf.set('spark.executor.memory', '4000m')
>     conf.set('spark.akka.timeout', '10')
>     conf.set('spark.ui.port', '8081')
>     conf.set('spark.cores.max', '150')
>     #conf.set('spark.rdd.compress', 'True')
>     conf.set('spark.default.parallelism', '300')
>
>     sc = SparkContext(conf=conf, batchSize=1)
>     vec = np.random.rand(3500)
>     a = sc.broadcast(vec)
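Before a fixed release is available, one way to sidestep the gateway entirely is to ship the array through shared storage rather than sc.broadcast. The sketch below is not from this thread: the path, function name, and per-partition computation are hypothetical, and it assumes storage mounted at the same path on the driver and every worker.

    import numpy as np

    SHARED_PATH = '/shared/vec.npy'       # hypothetical: visible to driver and all workers
    np.save(SHARED_PATH, vec)             # write the large array once from the driver

    def rows_with_vec(partition):
        v = np.load(SHARED_PATH)          # each partition loads the array locally,
        for row in partition:             # so nothing large crosses the py4j gateway
            yield v.shape[0]

    print(sc.parallelize(range(4), 2).mapPartitions(rows_with_vec).collect())

This trades the convenience of a broadcast variable for an explicit I/O step, but it avoids Base64-encoding hundreds of megabytes through the gateway.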
Re: Pyspark Error when broadcast numpy array
Dear Liu: Thank you very much for your help. I will apply that patch. By the way, when I succeeded in broadcasting an array of 30M elements, the log said that the array takes around 230MB of memory. As a result, I think the numpy array that leads to the error is much smaller than 2G.

On Wed, Nov 12, 2014 at 12:29 PM, Davies Liu-2 [via Apache Spark User List] <ml-node+s1001560n18673...@n3.nabble.com> wrote:
> This PR fixes the problem: https://github.com/apache/spark/pull/2659
--
My Homepage: www.cse.ust.hk/~bliuab
MPhil student at the Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong.
Profile at LinkedIn: http://www.linkedin.com/pub/liu-bo/55/52b/10b
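The 230MB figure is consistent with float64 storage: 30M elements at 8 bytes each is 240,000,000 bytes, about 229MB. A quick way to check both the raw and the serialized size (a sketch; pickling a copy this size needs a few hundred MB of free memory):

    import pickle
    import numpy as np

    vec = np.random.rand(30000000)        # 30M float64 values
    print(vec.nbytes)                     # 240000000 bytes, roughly the 230MB seen in the log
    print(len(pickle.dumps(vec, 2)))      # pickled size is close to nbytes for a float64 array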
Re: Pyspark Error when broadcast numpy array
Yes, your broadcast should be about 300M, much smaller than 2G; I didn't read your post carefully. The broadcast implementation in Python has been much improved as of 1.1, so I think it will work in 1.1 or the upcoming 1.2 release. Could you upgrade to 1.1?

Davies

On Tue, Nov 11, 2014 at 8:37 PM, bliuab <bli...@cse.ust.hk> wrote:
> Dear Liu: Thank you very much for your help. I will apply that patch.
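The 300M estimate follows from 35M elements at 8 bytes each. Note also that the traceback points at py4j.Base64.decode, and Base64 encodes every 3 bytes as 4 characters, so the payload crossing the gateway is about a third larger than the pickled data. A back-of-the-envelope check (not from the thread):

    raw = 35000000 * 8        # 280000000 bytes of float64 data, about 280MB
    encoded = raw * 4 // 3    # about 373MB once Base64-encoded for the py4j gateway
    print(raw)
    print(encoded)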
Re: Pyspark Error when broadcast numpy array
Dear Liu: Thank you for your reply. I will set up an experimental environment for Spark 1.1 and test it.

On Wed, Nov 12, 2014 at 2:30 PM, Davies Liu-2 [via Apache Spark User List] <ml-node+s1001560n1868...@n3.nabble.com> wrote:
> Yes, your broadcast should be about 300M, much smaller than 2G. Could you upgrade to 1.1?
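A minimal test for the new environment might look like the following sketch, reusing `sc` and `vec` from the original repro: broadcast the array, then read it back from executor tasks to confirm the round trip.

    b = sc.broadcast(vec)
    # each task reads the broadcast value on the executor side
    lengths = sc.parallelize(range(4), 4).map(lambda _: b.value.shape[0]).collect()
    print(lengths)    # every entry should equal vec.shape[0]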