Hi, I have some questions about how Spark runs jobs concurrently.
For example, suppose I set up standalone Spark on one test box, which has 24
cores and 64G of memory. I set the worker memory to 48G and the executor memory
to 4G, and use spark-shell to run some jobs. Here is what confuses me:
1) Does the above setting mean that I can have up to 12 executors running on
this box at the same time?
2) Let's assume that I want to do a line count of a
1280M HDFS file, which has 10 blocks of 128M each. In this case, when the Spark
program starts to run, will it kick off one executor that uses 10 threads to
read these 10 HDFS blocks, or 10 executors that each read one block? Or
something else? From the Apache Spark documentation, I know that this 1280M
HDFS file will be split into 10 partitions, but it is not clear to me how the
executors run them.
3) In my test case, I started one spark-shell to run a very expensive
job. In the Spark web UI, I saw 8 stages generated, with 200 to 400 tasks in
each stage, and the tasks started to run. At this point, I started another
spark-shell that connected to the same master and tried to run a small Spark
program. The second spark-shell showed that my new small program was waiting
for resources. Why, and what kind of resources is it waiting for? If it is
waiting for memory, does that mean there are 12 concurrent tasks running in the
first program, taking 12 * 4G = 48G of the memory given to the worker, so that
no more resources are available? If so, does that mean one running task is one
executor?
4) In MapReduce, the counts of map and reduce tasks are the resources
used by the cluster. My understanding is that Spark uses multiple threads
instead of individual JVM processes. In that case, does the executor use its 4G
heap to run multiple threads? My real question is whether each executor
corresponds to one RDD partition, or whether an executor can spawn a thread per
RDD partition. Also, how does the worker decide how many executors to create?
If there is any online document answering the above questions, please let me
know. I searched the Apache Spark site, but couldn't find one.
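For reference, by the memory settings above I mean roughly the following
(standalone mode; the values are from my test box):

```
# conf/spark-env.sh on the worker node
SPARK_WORKER_MEMORY=48g         # total memory this worker can give to executors

# conf/spark-defaults.conf (or passed via --conf to spark-shell)
spark.executor.memory   4g      # heap size of each executor JVM
```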
Thanks
Yong