Did you set the configuration "spark.dynamicAllocation.initialExecutors"?
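If not, one way is to pass it (together with the min/max bounds) on the spark-submit command line. A sketch based on the command from your earlier mail — the memory sizes and jar path are only illustrative and may need adjusting for your setup:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=50 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.dynamicAllocation.initialExecutors=50 \
  --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200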
You can set spark.dynamicAllocation.initialExecutors to 50 and try again. I guess you might be hitting this issue since you're running 1.5.0: https://issues.apache.org/jira/browse/SPARK-9092. But that still doesn't explain why 49 executors work.

On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> If YARN has only 50 cores then it can support at most 49 executors plus 1
> driver application master.
>
> Regards
> Sab
>
> On 24-Nov-2015 1:58 pm, "谢廷稳" <xieting...@gmail.com> wrote:
>
>> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>>
>> I have run it again; the command to run it is:
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>
>>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
>>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1447834709734_0120_000001
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization
>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
>>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@X.X.X.X]
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 61904.
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with capacity 1966.1 MB
>>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:14692
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file server' on port 14692.
>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>>> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:15948
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' on port 15948.
>>> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>>> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
>>> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
>>> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 41830
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register BlockManager
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering block manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 10.12.30.2, 41830)
>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Registered BlockManager
>>> 15/11/24 16:16:00 INFO scheduler.EventLoggingListener: Logging events to hdfs:///tmp/latest-spark-events/application_1447834709734_0120_1
>>> 15/11/24 16:16:00 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#293602859])
>>> 15/11/24 16:16:00 INFO client.RMProxy: Connecting to ResourceManager at X.X.X.X
>>> 15/11/24 16:16:00 INFO yarn.YarnRMClient: Registering the ApplicationMaster
>>> 15/11/24 16:16:00 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>> 15/11/24 16:16:29 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
>>> 15/11/24 16:16:29 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
>>> 15/11/24 16:16:29 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:36
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:36) with 200 output partitions
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(reduce at SparkPi.scala:36)
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Missing parents: List()
>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32), which has no missing parents
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1888) called with curMem=0, maxMem=2061647216
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 1966.1 MB)
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1202) called with curMem=1888, maxMem=2061647216
>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1202.0 B, free 1966.1 MB)
>>> 15/11/24 16:16:30 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on X.X.X.X:41830 (size: 1202.0 B, free: 1966.1 MB)
>>> 15/11/24 16:16:30 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
>>> 15/11/24 16:16:30 INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32)
>>> 15/11/24 16:16:30 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
>>> 15/11/24 16:16:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:15 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:30 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:17:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>> 15/11/24 16:18:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>
>> 2015-11-24 15:14 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>
>>> What about this configuration in YARN: "yarn.scheduler.maximum-allocation-mb"?
>>>
>>> I'm curious why 49 executors work but 50 fails. Could you provide your application master log? If a container request is issued, there will be log lines like:
>>>
>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>
>>> On Tue, Nov 24, 2015 at 2:56 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>
>>>> OK, the YARN conf is listed in the following:
>>>>
>>>> yarn.nodemanager.resource.memory-mb: 115200
>>>> yarn.nodemanager.resource.cpu-vcores: 50
>>>>
>>>> I think the YARN resources are sufficient. In the previous mail I said that I think the Spark application didn't request resources from YARN.
>>>>
>>>> Thanks
>>>>
>>>> 2015-11-24 14:30 GMT+08:00 cherrywayb...@gmail.com <cherrywayb...@gmail.com>:
>>>>
>>>>> Can you show your parameter values in your env?
>>>>> yarn.nodemanager.resource.cpu-vcores
>>>>> yarn.nodemanager.resource.memory-mb
>>>>>
>>>>> ------------------------------
>>>>> cherrywayb...@gmail.com
>>>>>
>>>>> From: 谢廷稳 <xieting...@gmail.com>
>>>>> Date: 2015-11-24 12:13
>>>>> To: Saisai Shao <sai.sai.s...@gmail.com>
>>>>> CC: spark users <user@spark.apache.org>
>>>>> Subject: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation
>>>>> OK, the YARN cluster is used only by myself. It has 6 nodes which can run over 100 executors, and the YARN RM logs showed that the Spark application did not request resources from it.
>>>>>
>>>>> Is this a bug? Should I create a JIRA for this problem?
>>>>>
>>>>> 2015-11-24 12:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>
>>>>>> OK, so it looks like your YARN cluster does not allocate the containers, which you expect to be 50. Does the YARN cluster have enough resources left after allocating the AM container? If not, that is the problem.
>>>>>>
>>>>>> From my reading of your description, the problem does not lie in dynamic allocation. As I said, I'm OK with setting the min and max executors to the same number.
>>>>>>
>>>>>> On Tue, Nov 24, 2015 at 11:54 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Saisai,
>>>>>>> I'm sorry I did not describe it clearly. The YARN debug log said I have 50 executors, but the ResourceManager showed that I only have 1 container, for the AppMaster.
>>>>>>> I have checked the YARN RM logs; after the AppMaster changed state from ACCEPTED to RUNNING, there were no more log entries about this job. So the problem is that I did not have any executors, but the ExecutorAllocationManager thinks I do. Would you mind running a test in your cluster environment?
>>>>>>> Thanks,
>>>>>>> Weber
>>>>>>>
>>>>>>> 2015-11-24 11:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>
>>>>>>>> I think this behavior is expected, since you already have 50 executors launched, so there is no need to acquire additional executors. Your change is not solid; it just hides the log message.
>>>>>>>>
>>>>>>>> Again, I think you should check the logs of YARN and Spark to see if the executors are started correctly, and why resources are still not enough when you already have 50 executors.
>>>>>>>>
>>>>>>>> On Tue, Nov 24, 2015 at 10:48 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi SaiSai,
>>>>>>>>> I have changed "if (numExecutorsTarget >= maxNumExecutors)" to "if (numExecutorsTarget > maxNumExecutors)" in the first line of ExecutorAllocationManager#addExecutors() and it runs well.
>>>>>>>>> In my opinion, when I set minExecutors equal to maxExecutors, then the first time executors are to be added, numExecutorsTarget already equals maxNumExecutors and it repeatedly prints "DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 50 (limit 50)".
>>>>>>>>> Thanks
>>>>>>>>> Weber
>>>>>>>>>
>>>>>>>>> 2015-11-23 21:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi Tingwen,
>>>>>>>>>>
>>>>>>>>>> Would you mind sharing your changes in ExecutorAllocationManager#addExecutors()?
>>>>>>>>>>
>>>>>>>>>> From my understanding and testing, dynamic allocation works when you set the min and max number of executors to the same number.
>>>>>>>>>>
>>>>>>>>>> Please check your Spark and YARN logs to make sure the executors are correctly started; the warning log means there are currently not enough resources to submit tasks.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Saisai
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 23, 2015 at 8:41 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>> I ran SparkPi on YARN with dynamic allocation enabled and set spark.dynamicAllocation.maxExecutors equal to spark.dynamicAllocation.minExecutors, then I submitted the application using:
>>>>>>>>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>>>>>>>>>>
>>>>>>>>>>> The application was submitted successfully, but the AppMaster kept saying "15/11/23 20:13:08 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources",
>>>>>>>>>>> and when I turned on DEBUG logging I found "15/11/23 20:24:00 DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 50 (limit 50)" in the console.
>>>>>>>>>>> I have fixed it by modifying code in ExecutorAllocationManager.addExecutors. Is this a bug, or was it designed so that we can't set maxExecutors equal to minExecutors?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Weber
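For readers following the addExecutors() discussion above, here is a simplified, self-contained Scala sketch of the guard in question. This is not the actual Spark 1.5 source; the names and surrounding bookkeeping are trimmed down purely for illustration of why, with minExecutors == maxExecutors, the target already sits at the upper bound and the method only logs instead of requesting containers.

// Hypothetical standalone model of the dynamic-allocation target check
// discussed in this thread (NOT the real ExecutorAllocationManager).
object AddExecutorsSketch {
  val maxNumExecutors = 50      // spark.dynamicAllocation.maxExecutors
  var numExecutorsTarget = 50   // starts at max when min == max

  def addExecutors(maxNeeded: Int): Int = {
    if (numExecutorsTarget >= maxNumExecutors) {
      // This is the branch the repeated DEBUG message comes from.
      println(s"Not adding executors because our current target total is " +
        s"already $numExecutorsTarget (limit $maxNumExecutors)")
      0
    } else {
      val oldTarget = numExecutorsTarget
      // Grow the target, but never past what is needed or allowed.
      numExecutorsTarget = math.min(numExecutorsTarget + 1, math.min(maxNeeded, maxNumExecutors))
      numExecutorsTarget - oldTarget
    }
  }

  def main(args: Array[String]): Unit = {
    // Even with 200 pending tasks, the delta is 0 when the target equals the
    // limit, mirroring the behaviour described in the thread.
    println(s"executors added: ${addExecutors(maxNeeded = 200)}")
  }
}

In this simplified model, changing ">=" to ">" only silences the log line when the target equals the limit; it does not cause any new container requests, which matches the earlier point that the real question is why YARN never granted the 50 executors in the first place.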