[ 
https://issues.apache.org/jira/browse/YARN-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuchang resolved YARN-7474.
---------------------------
    Resolution: Fixed

A submitted application's containers made reservations on all NodeManagers, which made all NodeManagers unavailable.

I think I configured *yarn.scheduler.maximum-allocation-mb* far too large (about half of *yarn.nodemanager.resource.memory-mb*), so a badly configured application's containers could make reservations on all nodes without ever being switched to allocated, which results in a deadlock.
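
For reference, these are the two properties involved. The values below are only an illustrative sketch of the ratio described above (roughly 1100 GB spread over 15 NodeManagers, with the maximum allocation at about half of a node's memory); they are not confirmed settings from my cluster:

{quote}
<!-- yarn-site.xml (illustrative values, assuming ~72 GB of container memory per NodeManager) -->
<property>
  <!-- Memory a single NodeManager offers to containers -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>73728</value>
</property>
<property>
  <!-- Largest single container the scheduler will grant. At about half of the
       per-node memory, a large request can end up reserving space on every
       node without ever switching to allocated, blocking other requests. -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>36864</value>
</property>
{quote}

If that is indeed the cause, lowering yarn.scheduler.maximum-allocation-mb well below the per-node memory should leave enough headroom for reserved containers to eventually be allocated.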

> Yarn resourcemanager stop allocating container when cluster resource is 
> sufficient 
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-7474
>                 URL: https://issues.apache.org/jira/browse/YARN-7474
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: wuchang
>            Priority: Critical
>         Attachments: rm.log
>
>
> Hadoop Version: *2.7.2*
> My YARN cluster has *(1100 GB, 368 vCores)* in total, with 15 NodeManagers.
> My cluster uses the Fair Scheduler and I have 4 queues for different kinds of jobs:
>  
> {quote}
> <allocations>
>     <queue name="queue1">
>        <minResources>100000 mb, 30 vcores</minResources>
>        <maxResources>422280 mb, 132 vcores</maxResources>
>        <maxAMShare>0.5f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>     </queue>
>     <queue name="queue2">
>        <minResources>25000 mb, 20 vcores</minResources>
>        <maxResources>600280 mb, 150 vcores</maxResources>
>        <maxAMShare>0.6f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>     </queue>
>     <queue name="queue3">
>        <minResources>100000 mb, 30 vcores</minResources>
>        <maxResources>647280 mb, 132 vcores</maxResources>
>        <maxAMShare>0.8f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>     </queue>
>   
>     <queue name="queue4">
>        <minResources>80000 mb, 20 vcores</minResources>
>        <maxResources>120000 mb, 30 vcores</maxResources>
>        <maxAMShare>0.5f</maxAMShare>
>        <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
>        <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
>        <maxRunningApps>50</maxRunningApps>
>      </queue>
> </allocations>
>  {quote}
> From about 9:00 am, all newly submitted applications got stuck for nearly 5 hours, but the cluster resource usage was only about *(600 GB, 120 vCores)*; in other words, the cluster resources were still *sufficient* (roughly 500 GB of memory and more than 200 vCores free).
> *The resource usage of the whole YARN cluster AND of each single queue stayed unchanged for 5 hours*, which is really strange. Obviously, if it were a resource-insufficiency problem, it would be impossible for the used resources of all the queues to stay unchanged for 5 hours. So it is a problem in the ResourceManager.
> Since my cluster is not large, only 15 nodes with 1100 GB of memory, I exclude the possibility described in [YARN-4618].
>  
> Besides that, all the running applications seemed to never finish and the YARN RM seemed static: the RM log had no more state-change entries for the running applications, only entries for more and more newly submitted applications becoming ACCEPTED, but never moving from ACCEPTED to RUNNING.
> Again, *the resource usage of the whole YARN cluster AND of each single queue stayed unchanged for 5 hours*.
> The cluster seemed like a zombie.
>  
> I checked the ApplicationMaster log of some running but stuck applications:
>  
>  {quote}
> 2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899] 
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task 
> report for MAP job_1507795051888_183385. Report-size will be 4
> 2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899] 
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task 
> report for REDUCE job_1507795051888_183385. Report-size will be 0
> 2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before 
> Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0 
> AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 
> HostLocal:0 RackLocal:0
> 2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() 
> for application_1507795051888_183385: ask=6 release= 0 newContainers=0 
> finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
> 2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899] 
> org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job 
> job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11
>  {quote}
>  
> You can see that at *2017-11-11 09:04:56,061* it sent a resource request to the ResourceManager, but the RM allocated zero containers. Then there were no more logs for 5 hours. At 13:58, I had to kill it manually.
>  
> After 5 hours, I killed some pending applications and then everything recovered: the remaining cluster resources could be allocated again and the ResourceManager seemed to be alive again.
>  
> I have excluded the possibility of the maxRunningApps and maxAMShare restrictions, because they only affect a single queue, while my problem is that applications across the whole YARN cluster get stuck.
>  
>  
>  
> Also, I exclude the possibility of a ResourceManager full-GC problem, because I checked with gcutil: no full GC happened and the ResourceManager's memory is OK.
>  
> So, could anyone give me some suggestions?
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
