Executor parameter doesn't work for Spark-shell on EMR Yarn
Hi All,

I am testing Spark on an EMR cluster. The environment is a one-node cluster of r3.8xlarge, which has 32 vCores and 244 GB of memory. But the command line I use to start spark-shell doesn't work. For example:

~/spark/bin/spark-shell --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-executors 6 --executor-memory 10G

Neither the num-executors nor the memory setting takes effect. More interesting, if I use test code:

val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"))
var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

it starts 32 executors (so I assume it tries to start one executor for every vCore). But if I run it on some real data (the file size is 200 MB):

val lines = sc.textFile("s3://.../part-r-0")
var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum

it starts only 4 executors, which maps to the number of HDFS splits (a 200 MB file gives 4 splits). So I have two questions:

1. Why are the setup parameters ignored by Yarn? How can I limit the number of executors I run?
2. Why does my much smaller test data set trigger 32 executors, while my real 200 MB data set gets only 4?

So how should I control the executor setup in spark-shell? I also printed the SparkConf; it contains much less than I expected, and I don't see the parameters I passed in:

scala> sc.getConf.getAll.foreach(println)
(spark.tachyonStore.folderName,spark-af0c4d42-fe4d-40b0-a3cf-25b6a9e16fa0)
(spark.app.id,local-1421353031552)
(spark.eventLog.enabled,true)
(spark.executor.id,driver)
(spark.repl.class.uri,http://10.181.82.38:58415)
(spark.driver.host,ip-10-181-82-38.ec2.internal)
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70)
(spark.app.name,Spark shell)
(spark.fileserver.uri,http://10.181.82.38:54666)
(spark.jars,file:/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/aws-java-sdk-1.9.14.jar)
(spark.eventLog.dir,hdfs:///spark-logs)
(spark.executor.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)
(spark.master,local[*])
(spark.driver.port,54191)
(spark.driver.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar)

I searched the old threads; the attached email answers a question about why the vCore setting doesn't take effect, but I don't think that is the same issue as mine. Otherwise, does it mean the default Yarn Spark setup can't be adjusted at all?
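One more observation: in the dump above, spark.master is local[*] and spark.app.id starts with "local-", so the shell appears to be running in local mode rather than on Yarn, in which case --num-executors would simply be ignored. A sketch of what I would try instead (assuming the EMR install script does not point spark-shell at Yarn by default):

~/spark/bin/spark-shell --master yarn-client \
  --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar \
  --num-executors 6 --executor-memory 10G

If the shell really is on Yarn, the executor parameters should then show up in sc.getConf.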
Regards,
Shuai

---BeginMessage---

If you are using the capacity scheduler in Yarn: by default, the Yarn capacity scheduler uses DefaultResourceCalculator, which considers only memory when allocating containers. You can use DominantResourceCalculator instead; it considers both memory and CPU. In capacity-scheduler.xml set:

yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

On 04/11/14 3:03 am, Gen <gen.tan...@gmail.com> wrote:

Hi,

Well, I didn't find the original documentation, but according to http://qnalist.com/questions/2791828/about-the-cpu-cores-and-cpu-usage, the vCores are not physical CPU cores but virtual cores. I used the top command to monitor CPU utilization during the Spark task. Spark can use all the CPUs even if I leave --executor-cores at its default (1). Hope that helps.

Cheers
Gen

Gen wrote:

Hi,

Maybe it is a stupid question, but I am running Spark on Yarn. I request the resources with the following command:

{code}
./spark-submit --master yarn-client --num-executors #number of workers --executor-cores #number of cores ...
{code}

However, after launching the task, I use "yarn node -status ID" to monitor the situation of the cluster. It shows that the number of vCores used for each container is always 1, no matter what number I pass with --executor-cores. Any ideas how to solve this problem? Thanks a lot in advance for your help.

Cheers
Gen

---End Message---
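For reference, the capacity-scheduler.xml change described in the attached message would look something like this (a sketch only; DominantResourceCalculator is the class the message recommends):

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>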
RE: Executor parameter doesn't work for Spark-shell on EMR Yarn
I figured out the second question: if I don't pass in the number of partitions for the test data, it will by default assume the maximum number of executors (although I don't know what this default maximum is).

val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"), 2)

will trigger only 2 executors. So I think the default number of executors is decided by the first RDD operation sent to the executors. This gives me a weird way to control the number of executors: run a fake/test piece of code first to kick off the executors, then run the real job, because the executors live for the whole lifecycle of the application? Although this may not have any real value in practice :)
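To make the partition behaviour concrete, here is a small sketch (an illustration only: it assumes the shell is in local mode as the conf dump above suggests, dynamoDBBatchWriteFunc is left out, and the s3 path stays elided):

val lines = sc.parallelize(List(
  "-240990|161327,9051480,0,2,30.48,75",
  "-240990|161324,9051480,0,2,30.48,75"), 2) // explicit numSlices
println(lines.partitions.length) // 2

// Without numSlices, parallelize falls back to sc.defaultParallelism,
// which in local[*] mode equals the number of cores (32 on r3.8xlarge).
println(sc.defaultParallelism)

// For files, a minPartitions hint can raise the split count above the
// HDFS-derived default of 4:
val fileLines = sc.textFile("s3://.../part-r-0", 32)
println(fileLines.partitions.length)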
But I still need help with my first question. Thanks a lot.

Regards,
Shuai

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Thursday, January 15, 2015 4:03 PM
To: user@spark.apache.org
Subject: RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

Forgot to mention: I use EMR AMI 3.3.1, Spark 1.2.0, Yarn 2.4. Spark is set up by the standard script: s3://support.elasticmapreduce/spark/install-spark