Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
Hi All,

 

I am testing Spark on an EMR cluster. The environment is a one-node cluster (r3.8xlarge), which has
32 vCores and 244 GB of memory.

 

But the command line I use to start spark-shell doesn't work as expected. For
example:

 

~/spark/bin/spark-shell --jars
/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-executors 6
--executor-memory 10G

 

Neither the num-executors nor the executor-memory setting takes effect.

 

More interestingly, if I use this test code:

val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75",
"-240990|161324,9051480,0,2,30.48,75"))

var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

 

It starts 32 executors (so I assume it tries to start one executor per
vCore).
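
As a quick sanity check in the same spark-shell session (lines here is the RDD defined above), the partition count, which determines how many tasks get launched, can be printed directly:

// lines is the RDD created by sc.parallelize above; each partition becomes one task.
println(lines.partitions.length)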

 

But if I use some real data instead (the file size is 200 MB):

val lines = sc.textFile("s3://.../part-r-0")

var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

It only starts 4 executors, which maps to the number of HDFS splits (a 200 MB
file has 4 splits).
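
For reference, textFile also accepts a minimum partition count as its second argument, so the split-derived default can be raised; the path below is only a placeholder for the real S3 location:

// Hypothetical path for illustration; the second argument asks Spark for at least 32 partitions.
val moreParts = sc.textFile("s3://some-bucket/part-r-0", 32)
println(moreParts.partitions.length)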

 

So I have two questions:

1. Why are the setup parameters ignored by Yarn? How can I limit the number of
executors that run?

2. Why does my much smaller test data set trigger 32 executors, while my real
200 MB data set only gets 4?

 

So how should I control the executor setup for spark-shell? I also printed
the SparkConf; it contains much less than I expected, and I don't see the
parameters I passed in.

 

scala> sc.getConf.getAll.foreach(println)

(spark.tachyonStore.folderName,spark-af0c4d42-fe4d-40b0-a3cf-25b6a9e16fa0)

(spark.app.id,local-1421353031552)

(spark.eventLog.enabled,true)

(spark.executor.id,driver)

(spark.repl.class.uri,http://10.181.82.38:58415)

(spark.driver.host,ip-10-181-82-38.ec2.internal)

(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70)

(spark.app.name,Spark shell)

(spark.fileserver.uri,http://10.181.82.38:54666)

(spark.jars,file:/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/aws-java-sdk
-1.9.14.jar)

(spark.eventLog.dir,hdfs:///spark-logs)

(spark.executor.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hado
op/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hado
op/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)

(spark.master,local[*])

(spark.driver.port,54191)

(spark.driver.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop
/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop
/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)
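
A minimal cross-check of which settings actually took effect in this shell (same session assumed):

// The effective master for this shell; the conf dump above lists local[*].
println(sc.master)
// Returns None if the --executor-memory flag was not picked up.
println(sc.getConf.getOption("spark.executor.memory"))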

 

I searched the old threads; the attached email answers the question about why the
vCore setting doesn't work, but I don't think that is the same issue as mine.
Otherwise the default Yarn Spark setup couldn't be adjusted at all?

 

Regards,

 

Shuai

 

 

 

 

---BeginMessage---
If you are using the capacity scheduler in YARN: by default, the YARN capacity
scheduler uses DefaultResourceCalculator, which considers only memory while
allocating containers. You can use DominantResourceCalculator instead; it
considers both memory and CPU. In capacity-scheduler.xml, set:
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator


On 04/11/14 3:03 am, Gen gen.tan...@gmail.com wrote:

Hi,

Well, I didn't find the original documentation, but according to
http://qnalist.com/questions/2791828/about-the-cpu-cores-and-cpu-usage,
the vcores are not physical CPU cores but virtual cores.
I also used the top command to monitor CPU utilization during the Spark
task: Spark can use all the CPUs even when I leave --executor-cores at its
default (1).

Hope this helps.
Cheers
Gen


Gen wrote
 Hi,
 
 Maybe it is a stupid question, but I am running Spark on YARN. I request
 the resources with the following command:
 {code}
 ./spark-submit --master yarn-client --num-executors #number of workers
 --executor-cores #number of cores ...
 {code}
 However, after launching the task, I use
 yarn node -status ID
 to monitor the situation of the cluster. It shows that the number of vcores
 used for each container is always 1, no matter what number I pass via
 --executor-cores.
 Any ideas how to solve this problem? Thanks a lot in advance for your
 help.
 
 Cheers
 Gen







---End Message---


RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
I figured out the second question: if I don't pass in the number of partitions
for the test data, it will by default assume the maximum number of executors
(although I don't know what this default maximum is).

 

val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75",
"-240990|161324,9051480,0,2,30.48,75"), 2)

will only trigger 2 executors.

 

So I think the default number of executors is decided by the first RDD
operation that is sent to the executors. This gives me a weird way to control
the number of executors (run a fake/test piece of code first to kick off the
executors, then run the real work, since the executors live for the whole
lifecycle of the application?), although this may not have any real value in
practice. :)
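
The default presumably comes from sc.defaultParallelism, which parallelize falls back to when no partition count is given; in the same shell session it can be printed directly:

// parallelize() without an explicit partition count uses this value;
// with master local[*] it equals the number of local cores.
println(sc.defaultParallelism)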

 

But I still need help with my first question.

 

Thanks a lot.

 

Regards,

 

Shuai

 

From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Thursday, January 15, 2015 4:03 PM
To: user@spark.apache.org
Subject: RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

 

Forgot to mention: I use EMR AMI 3.3.1, Spark 1.2.0, and Yarn 2.4. Spark is
set up by the standard script:
s3://support.elasticmapreduce/spark/install-spark

 

 
