Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
Hi All,


I am testing Spark on an EMR cluster. The environment is a one-node r3.8xlarge
cluster with 32 vCores and 244G of memory.


But the command line I use to start spark-shell does not work. For example:


~/spark/bin/spark-shell --jars
/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-executors 6
--executor-memory 10G


Neither the num-executors nor the executor-memory setting takes effect.


More interestingly, if I use this test code:

val lines = sc.parallelize(List(-240990|161327,9051480,0,2,30.48,75,

var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum


It starts 32 executors (so I assume it tries to start one executor for
every vCore).


But if I use some real data (the file size is 200M):

val lines = sc.textFile("s3://.../part-r-0")

var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

It starts only 4 executors, which matches the number of HDFS splits (200M
yields 4 splits).
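
The "200M yields 4 splits" arithmetic can be sketched as follows. This is a rough model of the split computation in Hadoop's FileInputFormat, not the library code itself. The 64 MB block size is an assumption (the thread does not state the block size in effect on this cluster), the minPartitions value of 2 reflects the usual default of sc.textFile, and the names SplitSketch, splitSize and numSplits are hypothetical.

```scala
object SplitSketch {
  // Hadoop lets the final split run up to 10% over the split size.
  val SplitSlop = 1.1

  // Split size = max(minBytes, min(goalSize, blockSize)),
  // where goalSize = total bytes / requested minimum partitions.
  def splitSize(totalBytes: Long, minPartitions: Int, blockBytes: Long, minBytes: Long = 1L): Long = {
    val goal = totalBytes / math.max(minPartitions, 1)
    math.max(minBytes, math.min(goal, blockBytes))
  }

  // Count the splits produced for one file of totalBytes.
  def numSplits(totalBytes: Long, minPartitions: Int, blockBytes: Long): Int = {
    val size = splitSize(totalBytes, minPartitions, blockBytes)
    var remaining = totalBytes
    var count = 0
    while (remaining.toDouble / size > SplitSlop) {
      remaining -= size
      count += 1
    }
    if (remaining > 0) count += 1 // the leftover tail becomes the last split
    count
  }
}
```

Under the assumed 64 MB block size, SplitSketch.numSplits(200L * 1024 * 1024, 2, 64L * 1024 * 1024) evaluates to 4, which is consistent with the 4 tasks observed for the 200M file.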


So I have two questions:

1, Why is the setup parameter ignored by Yarn? How can I limit the number of
executors that run?

2, Why does my much smaller test data set trigger 32 executors, while my real
200M data set gets only 4 executors?


So how should I control the executor setup in spark-shell? Also, when I print
the SparkConf, it contains far fewer entries than I expected, and I don't see
the parameters I passed in.


scala> sc.getConf.getAll.foreach(println)







(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70)

(spark.app.name,Spark shell)
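
To make the check above more targeted, a small helper can report only the entries that the command-line options are expected to populate. This is a sketch under assumptions: as far as I know, in Spark 1.x --executor-memory corresponds to spark.executor.memory and --num-executors (a YARN-only option) to spark.executor.instances, and spark.master shows which cluster manager the shell actually connected to. The name ConfCheck and its methods are hypothetical.

```scala
object ConfCheck {
  // Keys assumed to be populated by --master, --num-executors and --executor-memory.
  val keysOfInterest: Seq[String] =
    Seq("spark.master", "spark.executor.instances", "spark.executor.memory")

  // Takes the (key, value) pairs of a SparkConf and reports each key of interest,
  // marking the ones that are missing.
  def report(conf: Seq[(String, String)]): Seq[String] = {
    val lookup = conf.toMap
    keysOfInterest.map { key =>
      val value = lookup.getOrElse(key, "<not set>")
      s"$key = $value"
    }
  }
}
```

In spark-shell this could be called as ConfCheck.report(sc.getConf.getAll.toSeq).foreach(println). If spark.master is missing or is not a YARN master, that would be one possible reason for the YARN-only options having no effect.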









I searched the old threads; the attached email answers the question of why the
vCore setting doesn't work. But I don't think it is the same issue as mine.
Otherwise, would it mean the default Yarn Spark setup can't be adjusted at all?









If you are using the capacity scheduler in Yarn: by default, the Yarn capacity
scheduler uses DefaultResourceCalculator. DefaultResourceCalculator
considers only memory while allocating containers.
You can use DominantResourceCalculator, which considers both memory and CPU.
In capacity-scheduler.xml set

On 04/11/14 3:03 am, Gen gen.tan...@gmail.com wrote:


Well, I couldn't find the original documentation, but according to
the vcores are not physical CPU cores but virtual cores.
And I used the top command to monitor the CPU utilization during the Spark job.
Spark can use all CPUs even when I leave --executor-cores at the default (1).

Hope that it can be a help.

Gen wrote
 Maybe it is a stupid question, but I am running Spark on Yarn. I request
 the resources with the following command:
 ./spark-submit --master yarn-client --num-executors #number of worker
 --executor-cores #number of cores. ...
 However, after launching the task, I use
 yarn node -status ID
 to monitor the state of the cluster. It shows that the number of vcores
 used for each container is always 1 no matter what number I pass by
 Any ideas how to solve this problem? Thanks a lot in advance for your

Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


---End Message---

RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
I figured out the second question: if I don't pass in the number of
partitions for the test data, it by default assumes the maximum number of
executors (although I don't know what this default maximum is).


val lines = sc.parallelize(List(-240990|161327,9051480,0,2,30.48,75,

will only trigger 2 executors.
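
The effect of the partition count can be illustrated without a cluster. The sketch below is modeled on how sc.parallelize divides a local collection into numSlices contiguous slices; when numSlices is omitted, Spark uses sc.defaultParallelism, whose value on this cluster is not stated in the thread. The name SliceSketch is hypothetical, and this is a simplified model rather than Spark's own code.

```scala
object SliceSketch {
  // Divide data into numSlices contiguous slices using integer arithmetic.
  // Slices may be empty when there are fewer elements than slices.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = ((i * data.length.toLong) / numSlices).toInt
      val end = (((i + 1) * data.length.toLong) / numSlices).toInt
      data.slice(start, end)
    }
  }
}
```

For example, SliceSketch.slice(Seq("row"), 32) yields 32 slices of which only one is non-empty. Each slice still becomes a task, which may be why a tiny list appears to occupy every core unless the number of partitions is passed explicitly as the second argument of sc.parallelize.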


So I think the default number of executors is decided by the first RDD
operation that needs to be sent to the executors. This gives me a weird way to
control the number of executors (run a fake/test piece of code to kick off the
executors first, then run the real work, assuming the executors live for the
whole lifecycle of the application?), although this may not have any real
value in practice :)


But I still need help for my first question. 


Thanks a lot.






From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Thursday, January 15, 2015 4:03 PM
To: user@spark.apache.org
Subject: RE: Executor parameter doesn't work for Spark-shell on EMR Yarn


Forgot to mention: I use EMR AMI 3.3.1, Spark 1.2.0, and Yarn 2.4. Spark is
set up by the standard script:


