The document is right. Because of a bug introduced in https://issues.apache.org/jira/browse/SPARK-9092, this configuration fails to work.
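For context, the check that short-circuits executor requests lives in ExecutorAllocationManager.addExecutors(). Below is a minimal, self-contained Scala sketch of that logic; it is paraphrased from the DEBUG output quoted later in this thread, not the exact 1.5.0 source, and the object name AddExecutorsSketch is mine:

  object AddExecutorsSketch {
    // Returns the number of additional executors requested
    // (0 here; in real Spark, the delta sent to YARN).
    def addExecutors(numExecutorsTarget: Int, maxNumExecutors: Int): Int = {
      if (numExecutorsTarget >= maxNumExecutors) {
        // This is the branch that produces the DEBUG message quoted below.
        println(s"Not adding executors because our current target total " +
          s"is already $numExecutorsTarget (limit $maxNumExecutors)")
        return 0
      }
      // ...otherwise Spark raises the target and requests the difference...
      1
    }

    def main(args: Array[String]): Unit = {
      // With minExecutors == maxExecutors == 50, the initial target is
      // already 50, so the guard fires and no containers are requested.
      addExecutors(numExecutorsTarget = 50, maxNumExecutors = 50)
    }
  }

On 1.5.0, the workaround reported later in this thread is to set spark.dynamicAllocation.initialExecutors explicitly. A sketch of the submit command, assuming the external shuffle service is already configured on your NodeManagers:

  ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --driver-memory 4g --executor-memory 8g \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=50 \
    --conf spark.dynamicAllocation.maxExecutors=50 \
    --conf spark.dynamicAllocation.initialExecutors=50 \
    lib/spark-examples*.jar 200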
It is fixed in https://issues.apache.org/jira/browse/SPARK-10790, so you could upgrade to a newer version of Spark.

On Tue, Nov 24, 2015 at 5:12 PM, 谢廷稳 <xieting...@gmail.com> wrote:

> @Sab Thank you for your reply, but the cluster has 6 nodes which contain 300 cores, and the Spark application did not request resources from YARN.
>
> @SaiSai I have run it successfully with "spark.dynamicAllocation.initialExecutors" set to 50, but http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation says that "spark.dynamicAllocation.initialExecutors" defaults to "spark.dynamicAllocation.minExecutors". So I think something is wrong here.
>
> Thanks.
>
> 2015-11-24 16:47 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>
>> Did you set the configuration "spark.dynamicAllocation.initialExecutors"?
>>
>> You can set spark.dynamicAllocation.initialExecutors to 50 and try again.
>>
>> I guess you might be hitting this issue since you're running 1.5.0: https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot explain why 49 executors worked.
>>
>> On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>
>>> If YARN has only 50 cores, then it can support at most 49 executors plus 1 driver application master.
>>>
>>> Regards
>>> Sab
>>>
>>> On 24-Nov-2015 1:58 pm, "谢廷稳" <xieting...@gmail.com> wrote:
>>>
>>>> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>>>>
>>>> I have run it again; the command to run it is:
>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>>>
>>>>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
>>>>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1447834709734_0120_000001
>>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
>>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization
>>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
>>>>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>>>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>>>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>>>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@X.X.X.X]
>>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 61904.
>>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>>>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>>>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with capacity 1966.1 MB
>>>>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>>>>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>>>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:14692
>>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file server' on port 14692.
>>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>>>>> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:15948
>>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' on port 15948.
>>>>> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>>>>> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
>>>>> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
>>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
>>>>> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 41830
>>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register BlockManager
>>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering block manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 10.12.30.2, 41830)
>>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Registered BlockManager
>>>>> 15/11/24 16:16:00 INFO scheduler.EventLoggingListener: Logging events to hdfs:///tmp/latest-spark-events/application_1447834709734_0120_1
>>>>> 15/11/24 16:16:00 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#293602859])
>>>>> 15/11/24 16:16:00 INFO client.RMProxy: Connecting to ResourceManager at X.X.X.X
>>>>> 15/11/24 16:16:00 INFO yarn.YarnRMClient: Registering the ApplicationMaster
>>>>> 15/11/24 16:16:00 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>> 15/11/24 16:16:29 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
>>>>> 15/11/24 16:16:29 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
>>>>> 15/11/24 16:16:29 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:36
>>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:36) with 200 output partitions
>>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(reduce at SparkPi.scala:36)
>>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32), which has no missing parents
>>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1888) called with curMem=0, maxMem=2061647216
>>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 1966.1 MB)
>>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1202) called with curMem=1888, maxMem=2061647216
>>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1202.0 B, free 1966.1 MB)
>>>>> 15/11/24 16:16:30 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on X.X.X.X:41830 (size: 1202.0 B, free: 1966.1 MB)
>>>>> 15/11/24 16:16:30 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
>>>>> 15/11/24 16:16:30 INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32)
>>>>> 15/11/24 16:16:30 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
>>>>> 15/11/24 16:16:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>>> 15/11/24 16:17:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>>> 15/11/24 16:17:15 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>>> 15/11/24 16:17:30 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>>> 15/11/24 16:17:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>>> 15/11/24 16:18:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>>
>>>> 2015-11-24 15:14 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>
>>>>> What about this configuration in YARN: "yarn.scheduler.maximum-allocation-mb"?
>>>>>
>>>>> I'm curious why 49 executors worked but 50 failed. Would you provide your application master log? If a container request is issued, there will be log lines like:
>>>>>
>>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>>>
>>>>> On Tue, Nov 24, 2015 at 2:56 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>
>>>>>> OK, the YARN conf is listed in the following:
>>>>>>
>>>>>> yarn.nodemanager.resource.memory-mb: 115200
>>>>>> yarn.nodemanager.resource.cpu-vcores: 50
>>>>>>
>>>>>> I think the YARN resource is sufficient. In my previous letter I said that I think the Spark application didn't request resources from YARN.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> 2015-11-24 14:30 GMT+08:00 cherrywayb...@gmail.com <cherrywayb...@gmail.com>:
>>>>>>
>>>>>>> Can you show your parameter values in your env?
>>>>>>> yarn.nodemanager.resource.cpu-vcores
>>>>>>> yarn.nodemanager.resource.memory-mb
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> cherrywayb...@gmail.com
>>>>>>>
>>>>>>> From: 谢廷稳 <xieting...@gmail.com>
>>>>>>> Date: 2015-11-24 12:13
>>>>>>> To: Saisai Shao <sai.sai.s...@gmail.com>
>>>>>>> CC: spark users <user@spark.apache.org>
>>>>>>> Subject: Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation
>>>>>>>
>>>>>>> OK, the YARN cluster was used only by myself; it has 6 nodes which can run over 100 executors, and the YARN RM logs showed that the Spark application did not request resources from it.
>>>>>>>
>>>>>>> Is this a bug? Should I create a JIRA for this problem?
>>>>>>>
>>>>>>> 2015-11-24 12:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>
>>>>>>>> OK, so this looks like your YARN cluster does not allocate the containers, which you supposed should be 50. Does the YARN cluster have enough resources left after allocating the AM container? If not, that is the problem.
>>>>>>>>
>>>>>>>> From my reading of your description, the problem does not lie in dynamic allocation. As I said, min and max executors set to the same number is fine.
>>>>>>>>
>>>>>>>> On Tue, Nov 24, 2015 at 11:54 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Saisai,
>>>>>>>>> I'm sorry I did not describe it clearly. The YARN debug log said I have 50 executors, but the ResourceManager showed that I only have 1 container, for the AppMaster.
>>>>>>>>>
>>>>>>>>> I have checked the YARN RM logs; after the AppMaster changed state from ACCEPTED to RUNNING, there were no more logs about this job. So the problem is that I did not have any executors, but the ExecutorAllocationManager thinks I do. Would you mind running a test in your cluster environment?
>>>>>>>>> Thanks,
>>>>>>>>> Weber
>>>>>>>>>
>>>>>>>>> 2015-11-24 11:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I think this behavior is expected: since you already have 50 executors launched, there is no need to acquire additional executors. Your change is not solid; it is just hiding the log.
>>>>>>>>>>
>>>>>>>>>> Again, I think you should check the logs of YARN and Spark to see if the executors are started correctly, and why resources are still not enough when you already have 50 executors.
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 24, 2015 at 10:48 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi SaiSai,
>>>>>>>>>>> I have changed "if (numExecutorsTarget >= maxNumExecutors)" to "if (numExecutorsTarget > maxNumExecutors)" in the first line of ExecutorAllocationManager#addExecutors() and it ran well.
>>>>>>>>>>>
>>>>>>>>>>> In my opinion, when I set minExecutors equal to maxExecutors, then the first time executors are to be added, numExecutorsTarget already equals maxNumExecutors, and it repeatedly prints "DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 50 (limit 50)".
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Weber
>>>>>>>>>>>
>>>>>>>>>>> 2015-11-23 21:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Tingwen,
>>>>>>>>>>>>
>>>>>>>>>>>> Would you mind sharing your changes in ExecutorAllocationManager#addExecutors()?
>>>>>>>>>>>>
>>>>>>>>>>>> From my understanding and testing, dynamic allocation works when you set the min and max number of executors to the same number.
>>>>>>>>>>>>
>>>>>>>>>>>> Please check your Spark and YARN logs to make sure the executors are correctly started; the warning log means resources are currently not enough to submit tasks.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Saisai
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 23, 2015 at 8:41 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> I ran a SparkPi on YARN with dynamic allocation enabled and set spark.dynamicAllocation.maxExecutors equal to spark.dynamicAllocation.minExecutors. Then I submitted an application using:
>>>>>>>>>>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>>>>>>>>>>>>
>>>>>>>>>>>>> The application was submitted successfully, but the AppMaster kept saying "15/11/23 20:13:08 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources", and when I turned on DEBUG logging, I found "15/11/23 20:24:00 DEBUG ExecutorAllocationManager: Not adding executors because our current target total is already 50 (limit 50)" in the console.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have fixed it by modifying code in ExecutorAllocationManager.addExecutors. Is this a bug, or was it designed so that we can't set maxExecutors equal to minExecutors?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Weber