I figure out the second question, because if I don't pass in the num of partition for the test data, it will by default assume has max executors (although I don't know what is this default max num).
val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"),2) will only trigger 2 executors. So I think the default executors num will be decided by the first RDD operation need to send to executors. This give me a weird way to control the num of executors (a fake/test code piece run to kick off the executors first, then run the real behavior - because executor will run the whole lifecycle of the applications? Although this may not have any real value in practice J But I still need help for my first question. Thanks a lot. Regards, Shuai From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Thursday, January 15, 2015 4:03 PM To: user@spark.apache.org Subject: RE: Executor parameter doesn't work for Spark-shell on EMR Yarn Forget to mention, I use EMR AMI 3.3.1, Spark 1.2.0. Yarn 2.4. The spark is setup by the standard script: s3://support.elasticmapreduce/spark/install-spark From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Thursday, January 15, 2015 3:52 PM To: user@spark.apache.org Subject: Executor parameter doesn't work for Spark-shell on EMR Yarn Hi All, I am testing Spark on EMR cluster. Env is a one node cluster r3.8xlarge. Has 32 vCore and 244G memory. But the command line I use to start up spark-shell, it can't work. For example: ~/spark/bin/spark-shell --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar --num-executors 6 --executor-memory 10G Neither num-executors nor memory setup works. And more interesting, if I use test code: val lines = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75")) var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum It will start 32 executors (then I assume it try to start all executors for every vCore). But if I use some real data to do it (the file size is 200M): val lines = sc.textFile("s3://.../part-r-00000") var count = lines.mapPartitions(dynamoDBBatchWriteFunc).collect.sum It will only start 4 executors, which map to the number of HDFS split (200M will have 4 splits). So I have two questions: 1, Why the setup parameter is ignored by Yarn? How can I limit the number of executors I can run? 2, Why my much smaller test data set will trigger 32 executors but my real 200M data set will only have 4 executors? So how should I control the executor setup on the spark-shell? And I print the sparkConf, it looks like much less than I expect, and I don't see my pass in parameter show there. scala> sc.getConf.getAll.foreach(println) (spark.tachyonStore.folderName,spark-af0c4d42-fe4d-40b0-a3cf-25b6a9e16fa0) (spark.app.id,local-1421353031552) (spark.eventLog.enabled,true) (spark.executor.id,driver) (spark.repl.class.uri,http://10.181.82.38:58415) (spark.driver.host,ip-10-181-82-38.ec2.internal) (spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70) (spark.app.name,Spark shell) (spark.fileserver.uri,http://10.181.82.38:54666) (spark.jars,file:/home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/aws-java-sdk -1.9.14.jar) (spark.eventLog.dir,hdfs:///spark-logs) (spark.executor.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hado op/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hado op/.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar) (spark.master,local[*]) (spark.driver.port,54191) (spark.driver.extraClassPath,/home/hadoop/spark/classpath/emr/*:/home/hadoop /spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop /.versions/2.4.0/share/hadoop/common/lib/hadoop-lzo.jar) I search the old threads, attached email answer the question about why vCore setup doesn't work. But I think this is not same issue as me. Otherwise then default Yarn Spark setup can't do any adjustment? Regards, Shuai