Hi All,

 

I am testing Spark on an EMR cluster. The environment is a one-node cluster of
type r3.8xlarge, with 32 vCores and 244 GB of memory.

 

However, the command line I use to start spark-shell does not work as expected.
For example:

 

~/spark/bin/spark-shell --jars
/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-executors 6
--executor-memory 10G

 

Neither the num-executors nor the executor-memory setting takes effect.

 

More interestingly, if I use this test code:

val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75",
"-240990|161324,9051480,0,2,30.48,75"))

var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

 

It starts 32 executors (so I assume it tries to start one executor for every
vCore).
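
As far as I understand, sc.parallelize with no explicit slice count splits the
data into sc.defaultParallelism partitions, which on this shell seems to equal
the number of vCores, so that would explain the 32. A quick check in the same
shell (plain SparkContext/RDD API; the partition counts are just placeholders):

println(sc.defaultParallelism)      // expect 32 here, i.e. one per vCore
println(lines.partitions.length)    // expect 32 for the parallelize test data

// an explicit slice count caps the number of partitions for that RDD
val lines4 = sc.parallelize(List("a", "b", "c", "d"), 4)
println(lines4.partitions.length)   // 4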

 

But if I use some real data (the file size is 200 MB):

val lines = sc.textFile("s3://.../part-r-00000") 

var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

It only starts 4 executors, which matches the number of HDFS splits (a 200 MB
file has 4 splits).
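
If the goal is just more parallelism on the 200 MB file, the only knobs I know
of are the minPartitions argument of textFile and repartition on the resulting
RDD. A sketch, using the same S3 path as above and arbitrary partition counts:

// ask textFile for at least 16 input partitions instead of the default 4 splits
val lines16 = sc.textFile("s3://.../part-r-00000", 16)
println(lines16.partitions.length)

// or reshuffle the existing RDD to 32 partitions before the DynamoDB write
var count = lines.repartition(32).mapPartitions(dynamoDBBatchWriteFunc).collect.sum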

 

So I have two questions:

1. Why are the setup parameters ignored by YARN? How can I limit the number of
executors that run?

2. Why does my much smaller test data set trigger 32 executors while my real
200 MB data set only gets 4?

 

So how should I control the executor setup in spark-shell? When I print the
SparkConf, it contains much less than I expect, and I don't see the parameters
I passed in.

 

scala> sc.getConf.getAll.foreach(println)

(spark.tachyonStore.folderName,spark-af0c4d42-fe4d-40b0-a3cf-25b6a9e16fa0)

(spark.app.id,local-1421353031552)

(spark.eventLog.enabled,true)

(spark.executor.id,driver)

(spark.repl.class.uri,http://10.181.82.38:58415)

(spark.driver.host,ip-10-181-82-38.ec2.internal)

(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70)

(spark.app.name,Spark shell)

(spark.fileserver.uri,http://10.181.82.38:54666)

(spark.jars,file:/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/aws-java-sdk-1.9.14.jar)

(spark.eventLog.dir,hdfs:///spark-logs)

(spark.executor.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)

(spark.master,local[*])

(spark.driver.port,54191)

(spark.driver.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)
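
One thing I notice in the output above: spark.master is local[*] and
spark.app.id starts with local-, so this shell does not seem to be registered
with YARN at all, which might be why --num-executors and --executor-memory have
no visible effect. A sketch of what I plan to try instead (assuming yarn-client
is the correct client-mode master URL for this Spark/EMR version):

~/spark/bin/spark-shell --master yarn-client --num-executors 6 --executor-memory 10G --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar

and then, inside the shell, checking that the options were picked up:

scala> sc.master                                // expect yarn-client
scala> sc.getConf.get("spark.executor.memory")  // expect 10G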

 

I searched the old threads, and the attached email answers the question of why
the vCore setting doesn't take effect. But I don't think that is the same issue
as mine; otherwise, would the default YARN Spark setup not allow any adjustment
at all?

 

Regards,

 

Shuai

 

 

 

 

--- Begin Message ---
If you are using the capacity scheduler in YARN: by default, the YARN capacity
scheduler uses DefaultResourceCalculator, which considers only memory when
allocating containers. You can use DominantResourceCalculator instead; it
considers both memory and CPU.
In capacity-scheduler.xml set
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator


On 04/11/14 3:03 am, "Gen" <gen.tan...@gmail.com> wrote:

>Hi,
>
>Well, I didn't find the original documentation, but according to
>http://qnalist.com/questions/2791828/about-the-cpu-cores-and-cpu-usage
>the vcores are not physical CPU cores but "virtual" cores.
>I used the top command to monitor CPU utilization during the Spark task,
>and Spark can use all of the CPUs even if I leave --executor-cores at the
>default (1).
>
>Hope that it helps.
>Cheers
>Gen
>
>
>Gen wrote
>> Hi,
>>
>> Maybe it is a stupid question, but I am running Spark on YARN. I request
>> the resources with the following command:
>> {code}
>> ./spark-submit --master yarn-client --num-executors #number of workers
>> --executor-cores #number of cores ...
>> {code}
>> However, after launching the task, I use
>> yarn node -status ID
>> to monitor the state of the cluster. It shows that the number of vcores
>> used by each container is always 1, no matter what number I pass to
>> --executor-cores.
>> Any ideas how to solve this problem? Thanks a lot in advance for your
>> help.
>>
>> Cheers
>> Gen




--- End Message ---