Re: pyspark sc.parallelize running OOM with smallish data
Continuing to debug with Scala, I tried this on local with enough memory (10g) and it is able to count the dataset. With more memory(for executor and driver) in a cluster it still fails. The data is about 2Gbytes. It is 30k * 4k doubles. On Sat, Jul 12, 2014 at 6:31 PM, Aaron Davidson ilike...@gmail.com wrote: I think this is probably dying on the driver itself, as you are probably materializing the whole dataset inside your python driver. How large is spark_data_array compared to your driver memory? On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi mohitja...@gmail.com wrote: I put the same dataset into scala (using spark-shell) and it acts weird. I cannot do a count on it, the executors seem to hang. The WebUI shows 0/96 in the status bar, shows details about the worker nodes but there is no progress. sc.parallelize does finish (takes too long for the data size) in scala. On Fri, Jul 11, 2014 at 2:00 PM, Mohit Jaggi mohitja...@gmail.com wrote: spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in the cluster and gave 48g to executors. also tried kyro serialization. traceback (most recent call last): File /mohit/./m.py, line 58, in module spark_data = sc.parallelize(spark_data_array) File /mohit/spark/python/pyspark/context.py, line 265, in parallelize jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices) File /mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745)
Re: pyspark sc.parallelize running OOM with smallish data
I think this is probably dying on the driver itself, as you are probably materializing the whole dataset inside your python driver. How large is spark_data_array compared to your driver memory? On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi mohitja...@gmail.com wrote: I put the same dataset into scala (using spark-shell) and it acts weird. I cannot do a count on it, the executors seem to hang. The WebUI shows 0/96 in the status bar, shows details about the worker nodes but there is no progress. sc.parallelize does finish (takes too long for the data size) in scala. On Fri, Jul 11, 2014 at 2:00 PM, Mohit Jaggi mohitja...@gmail.com wrote: spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in the cluster and gave 48g to executors. also tried kyro serialization. traceback (most recent call last): File /mohit/./m.py, line 58, in module spark_data = sc.parallelize(spark_data_array) File /mohit/spark/python/pyspark/context.py, line 265, in parallelize jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices) File /mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745)
Re: pyspark sc.parallelize running OOM with smallish data
I put the same dataset into scala (using spark-shell) and it acts weird. I cannot do a count on it, the executors seem to hang. The WebUI shows 0/96 in the status bar, shows details about the worker nodes but there is no progress. sc.parallelize does finish (takes too long for the data size) in scala. On Fri, Jul 11, 2014 at 2:00 PM, Mohit Jaggi mohitja...@gmail.com wrote: spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in the cluster and gave 48g to executors. also tried kyro serialization. traceback (most recent call last): File /mohit/./m.py, line 58, in module spark_data = sc.parallelize(spark_data_array) File /mohit/spark/python/pyspark/context.py, line 265, in parallelize jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices) File /mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745)