Hi all, sometimes you see "OutOfMemoryError: Java heap space" from an executor in Spark, and there are many ideas about how to work around it.
My question is: how does an executor execute tasks, from the point of view of memory usage and parallelism?

The picture in my mind is this. An executor is a JVM instance. The number of tasks that can run in parallel threads inside a single executor is controlled by the "--executor-cores" parameter of spark-submit (in the YARN case). Each executor owns "--executor-memory" of memory, which is divided into memory for the RDD cache and memory for task execution. I am not considering caching here; what interests me is how the task-execution memory is used while the executor is working.

Consider an example with only narrow, map-like operations, no joins / group / reduce and no caching (I use flatMap for the split step so that each token of a line becomes its own element):

    sc.textFile('test.txt') \
        .flatMap(lambda line: line.split()) \
        .map(lambda token: int(token) + 10) \
        .saveAsTextFile('out.txt')

How will the input RDD be processed in this case? I know RDDs are divided into P partitions by some rule (for example, by the HDFS block size). So we will have P partitions, P tasks and 1 stage (am I right?). Let --executor-cores be 2; then the executor will process two partitions in parallel. Will it try to load an entire partition into memory, or will it just call the map chain for each element of the partition? And what can trigger "OutOfMemoryError: Java heap space" in this case: a large partition, or a large amount of memory consumed while processing a single element of the RDD?

Please correct me and advise.

Serg.
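P.S. For concreteness, this is the kind of submission I have in mind (the script name and the resource sizes are just placeholders, not from a real job):

    spark-submit \
        --master yarn \
        --executor-memory 4G \
        --executor-cores 2 \
        my_job.py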
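P.P.S. To make the "per element" alternative concrete, here is a plain-Python sketch of how I imagine chained narrow transformations could be evaluated: each step wraps the previous iterator, so elements stream through one at a time and the whole partition would never need to be materialized. This is only my mental model of a task, not actual Spark code:

    # Sketch of my mental model: generators pull one element at a time
    # through the chain of transformations.
    def read_partition(path):
        # Stand-in for reading one HDFS block; yields one line at a time.
        with open(path) as f:
            for line in f:
                yield line.rstrip('\n')

    def run_task(path):
        lines = read_partition(path)
        tokens = (tok for line in lines for tok in line.split())   # "flatMap"
        results = (int(tok) + 10 for tok in tokens)                # "map"
        for value in results:
            print(value)  # stand-in for writing an output record

    run_task('test.txt')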
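P.P.P.S. If large partitions turn out to be the cause, I assume one workaround would be to ask for more, and therefore smaller, partitions when reading; textFile takes a minimum number of partitions as its second argument:

    # 100 is an arbitrary number I picked for illustration, chosen so that
    # each partition (and thus each task's working set) would be smaller.
    rdd = sc.textFile('test.txt', 100)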