Did you set the configuration "spark.dynamicAllocation.initialExecutors"?
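If not, one way is to pass it (together with the min/max bounds) on the spark-submit command line. A sketch based on the command from your earlier mail — the memory sizes and jar path are only illustrative and may need adjusting for your setup:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=50 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.dynamicAllocation.initialExecutors=50 \
  --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200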
You can set spark.dynamicAllocation.initialExecutors to 50 and try again. I guess you might be hitting this issue since you're running 1.5.0: https://issues.apache.org/jira/browse/SPARK-9092. But that still doesn't explain why 49 executors work.

On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> If YARN has only 50 cores then it can support at most 49 executors plus 1
> driver application master.
>
> Regards
> Sab
>
> On 24-Nov-2015 1:58 pm, "谢廷稳" <xieting...@gmail.com> wrote:
>
>> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>>
>> I have run it again; the command to run it is:
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>
>>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
>>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1447834709734_0120_000001
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
>>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@X.X.X.X]
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 61904.
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with capacity 1966.1 MB
>>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:14692
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file server' on port 14692.
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>>> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:15948
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' on port 15948.
>>> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>>> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
>>> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
>>> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 41830
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register BlockManager
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering block manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 10.12.30.2, 41830)
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Registered BlockManager
>>> 15/11/24 16:16:00 INFO scheduler.EventLoggingListener: Logging events to hdfs:///tmp/latest-spark-events/application_1447834709734_0120_1
>>> 15/11/24 16:16:00 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#293602859])
>>> 15/11/24 16:16:00 INFO client.RMProxy: Connecting to ResourceManager at X.X.X.X
>>> 15/11/24 16:16:00 INFO yarn.YarnRMClient: Registering the ApplicationMaster
>>> 15/11/24 16:16:00 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>> 15/11/24 16:16:29 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
>>> 15/11/24 16:16:29 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
>>> 15/11/24 16:16:29 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:36
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:36) with 200 output partitions
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(reduce at SparkPi.scala:36)
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Missing parents: List()
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32), which has no missing parents
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1888) called with curMem=0, maxMem=2061647216
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 1966.1 MB)
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1202) called with curMem=1888, maxMem=2061647216
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1202.0 B, free 1966.1 MB)
>>> 15/11/24 16:16:30 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on X.X.X.X:41830 (size: 1202.0 B, free: 1966.1 MB)
>>> 15/11/24 16:16:30 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
>>> 15/11/24 16:16:30 INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32)
>>> 15/11/24 16:16:30 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
>>> 15/11/24 16:16:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:15 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:30 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:18:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>
>> 2015-11-24 15:14 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>
>>> What about this configuration in YARN: "yarn.scheduler.maximum-allocation-mb"?
>>>
>>> I'm curious why 49 executors work but 50 fails. Could you provide your application master log? If a container request is issued, there will be log lines like:
>>>
>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>
>>> On Tue, Nov 24, 2015 at 2:56 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>
>>>> OK, the YARN conf is listed in the following:
>>>>
>>>> yarn.nodemanager.resource.memory-mb: 115200
>>>> yarn.nodemanager.resource.cpu-vcores: 50
>>>>
>>>> I think the YARN resources are sufficient. In the previous mail I said that I think the Spark application didn't request resources from YARN.
>>>>
>>>> Thanks
>>>>
>>>> 2015-11-24 14:30 GMT+08:00 cherrywayb...@gmail.com <cherrywayb...@gmail.com>:
>>>>
>>>>> Can you show your parameter values in your env?
>>>>> yarn.nodemanager.resource.cpu-vcores
>>>>> yarn.nodemanager.resource.memory-mb
>>>>>
>>>>> ------------------------------
>>>>> cherrywayb...@gmail.com
>>>>>
>>>>> From: 谢廷稳 <xieting...@gmail.com>
>>>>> Date: 2015-11-24 12:13
>>>>> To: Saisai Shao <sai.sai.s...@gmail.com>
>>>>> CC: spark users <user@spark.apache.org>
>>>>> Subject: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation
>>>>> OK, the YARN cluster is used only by myself. It has 6 nodes which can run over 100 executors, and the YARN RM logs showed that the Spark application did not request resources from it.
>>>>>
>>>>> Is this a bug? Should I create a JIRA for this problem?
>>>>>
>>>>> 2015-11-24 12:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>
>>>>>> OK, so it looks like your YARN cluster does not allocate the containers, which you expect to be 50. Does the YARN cluster have enough resources left after allocating the AM container? If not, that is the problem.
>>>>>>
>>>>>> From my reading of your description, the problem does not lie in dynamic allocation. As I said, I'm OK with setting the min and max executors to the same number.
>>>>>>
>>>>>> On Tue, Nov 24, 2015 at 11:54 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Saisai,
>>>>>>> I'm sorry I did not describe it clearly. The YARN debug log said I have 50 executors, but the ResourceManager showed that I only have 1 container, for the AppMaster.
>>>>>>> I have checked the YARN RM logs; after the AppMaster changed state from ACCEPTED to RUNNING, there were no more log entries about this job. So the problem is that I did not have any executors, but the ExecutorAllocationManager thinks I do. Would you mind running a test in your cluster environment?
>>>>>>> Thanks,
>>>>>>> Weber
>>>>>>>
>>>>>>> 2015-11-24 11:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>
>>>>>>>> I think this behavior is expected, since you already have 50 executors launched, so there is no need to acquire additional executors. Your change is not solid; it just hides the log message.
>>>>>>>>
>>>>>>>> Again, I think you should check the logs of YARN and Spark to see if the executors are started correctly, and why resources are still not enough when you already have 50 executors.
>>>>>>>>
>>>>>>>> On Tue, Nov 24, 2015 at 10:48 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi SaiSai,
>>>>>>>>> I have changed "if (numExecutorsTarget >= maxNumExecutors)" to "if (numExecutorsTarget > maxNumExecutors)" in the first line of ExecutorAllocationManager#addExecutors() and it runs well.
>>>>>>>>> In my opinion, when I set minExecutors equal to maxExecutors, then the first time executors are to be added, numExecutorsTarget already equals maxNumExecutors and it repeatedly prints "DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 50 (limit 50)".
>>>>>>>>> Thanks
>>>>>>>>> Weber
>>>>>>>>>
>>>>>>>>> 2015-11-23 21:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi Tingwen,
>>>>>>>>>>
>>>>>>>>>> Would you mind sharing your changes in ExecutorAllocationManager#addExecutors()?
>>>>>>>>>>
>>>>>>>>>> From my understanding and testing, dynamic allocation works when you set the min and max number of executors to the same number.
>>>>>>>>>>
>>>>>>>>>> Please check your Spark and YARN logs to make sure the executors are correctly started; the warning log means there are currently not enough resources to submit tasks.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Saisai
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 23, 2015 at 8:41 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>> I ran SparkPi on YARN with dynamic allocation enabled and set spark.dynamicAllocation.maxExecutors equal to spark.dynamicAllocation.minExecutors, then I submitted the application using:
>>>>>>>>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>>>>>>>>>>
>>>>>>>>>>> The application was submitted successfully, but the AppMaster kept saying "15/11/23 20:13:08 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources",
>>>>>>>>>>> and when I turned on DEBUG logging I found "15/11/23 20:24:00 DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 50 (limit 50)" in the console.
>>>>>>>>>>> I have fixed it by modifying code in ExecutorAllocationManager.addExecutors. Is this a bug, or was it designed so that we can't set maxExecutors equal to minExecutors?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Weber
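For readers following the addExecutors() discussion above, here is a simplified, self-contained Scala sketch of the guard in question. This is not the actual Spark 1.5 source; the names and surrounding bookkeeping are trimmed down purely for illustration of why, with minExecutors == maxExecutors, the target already sits at the upper bound and the method only logs instead of requesting containers.

// Hypothetical standalone model of the dynamic-allocation target check
// discussed in this thread (NOT the real ExecutorAllocationManager).
object AddExecutorsSketch {
  val maxNumExecutors = 50      // spark.dynamicAllocation.maxExecutors
  var numExecutorsTarget = 50   // starts at max when min == max

  def addExecutors(maxNeeded: Int): Int = {
    if (numExecutorsTarget >= maxNumExecutors) {
      // This is the branch the repeated DEBUG message comes from.
      println(s"Not adding executors because our current target total is " +
        s"already $numExecutorsTarget (limit $maxNumExecutors)")
      0
    } else {
      val oldTarget = numExecutorsTarget
      // Grow the target, but never past what is needed or allowed.
      numExecutorsTarget = math.min(numExecutorsTarget + 1, math.min(maxNeeded, maxNumExecutors))
      numExecutorsTarget - oldTarget
    }
  }

  def main(args: Array[String]): Unit = {
    // Even with 200 pending tasks, the delta is 0 when the target equals the
    // limit, mirroring the behaviour described in the thread.
    println(s"executors added: ${addExecutors(maxNeeded = 200)}")
  }
}

In this simplified model, changing ">=" to ">" only silences the log line when the target equals the limit; it does not cause any new container requests, which matches the earlier point that the real question is why YARN never granted the 50 executors in the first place.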