Hi all,

sometimes you see "OutOfMemoryError: Java heap space" from an executor in
Spark. There are many ideas about how to work around it.

My question is: how does an executor execute tasks from the point of view of
memory usage and parallelism?

The picture in my mind is:
An executor is a JVM instance. The number of tasks that can be executed in
parallel threads inside a single executor is controlled by the
"--executor-cores" parameter of spark-submit in the case of YARN. Each
executor owns "--executor-memory" of memory, which is divided into memory
for the RDD cache and memory for task execution. I am not considering
caching here.
What really interests me is how the memory for task execution is used while
the executor is working.
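
For reference, here is a minimal sketch of how those flags are passed on
YARN (the concrete values, application name and paths are just
placeholders):

spark-submit \
    --master yarn \
    --executor-memory 4G \
    --executor-cores 2 \
    --num-executors 10 \
    my_app.py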

Let's consider an example where there are only "map" operations, no joins /
group / reduce, and no caching.

# split each line into tokens, then parse each token as an int and add 10
sc.textFile('test.txt') \
    .map(lambda line: line.split()) \
    .map(lambda items: [int(item) + 10 for item in items]) \
    .saveAsTextFile('out.txt')

How will the input RDD be processed in this case? I know RDDs are divided
into P partitions by some rule (for example by the HDFS block size). So we
will have P partitions, P tasks and 1 stage (am I right?). Let
--executor-cores be 2. In this case the executor will process two partitions
in parallel. Will it try to load the entire partitions into memory, or will
it just call the chain of map functions on each element of a partition?
What can cause "OutOfMemoryError: Java heap space" in this case: a large
partition, or a large amount of memory consumed while processing a single
element of the RDD?
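
To make the question concrete, this is how I check the partition count for
the example above (a small sketch; the file name is a placeholder and
getNumPartitions() is the standard RDD method):

rdd = sc.textFile('test.txt')
# the number of partitions equals the number of tasks in this map-only stage
print(rdd.getNumPartitions())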

Please correct me and advise.

Serg.


