Hi all,

I'm wondering whether there are any settings I can use to reduce the memory PythonRDD needs when computing simple stats. I'm getting OutOfMemoryError exceptions while calculating count() on large, but not absurd, records. It looks like PythonRDD is trying to hold too many of these records in memory at once, when all that should be needed is to stream through them and count. Any tips for getting through this workload?
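To be concrete, this is the single streaming pass I have in mind: the same parse as in the code below, but counting before (or instead of) caching, on the assumption that an uncached count() never needs to hold a whole partition in memory at once:

    from json import loads

    lines = sc.textFile('s3://...json.gz')
    # Same parse as in the failing code below, but with no persist(),
    # so each record should be consumed and discarded as the count proceeds.
    records = lines.map(lambda l: l.split("\t", 1)) \
                   .map(lambda (y, s): (loads(y), loads(s)))
    n = records.count()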
Code:

    from json import loads
    from pyspark import StorageLevel

    # ~54 GB of gzip-compressed JSON; the largest individual text line is ~3 MB
    session = sc.textFile('s3://...json.gz')
    parsed = session.map(lambda l: l.split("\t", 1)) \
                    .map(lambda (y, s): (loads(y), loads(s)))
    parsed.persist(StorageLevel.MEMORY_AND_DISK)
    parsed.count()  # never finishes: "executor.Executor: Uncaught exception will FAIL all executors"

Incidentally, the whole application appears to be killed, but this error is never propagated to the shell.

Cluster: 15 m2.xlarges (17 GB memory, 17 GB swap, spark.executor.memory=10GB)

Exception:

    java.lang.OutOfMemoryError: Java heap space
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:132)
        at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:120)
        at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:113)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:113)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:94)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
        at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
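One other thing I'm wondering about, since the trace points at the cache path: would a serialized storage level shrink the cached partitions enough to fit? A sketch of what I mean, assuming MEMORY_AND_DISK_SER is available in this PySpark version and keeps cached records as compact byte strings rather than deserialized objects:

    from pyspark import StorageLevel

    # Serialized variant of the cache above (assumption: MEMORY_AND_DISK_SER
    # exists here; records stay serialized in memory, spilling to disk on overflow).
    parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)
    parsed.count()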