[ 
https://issues.apache.org/jira/browse/YARN-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15681783#comment-15681783
 ] 

Varun Saxena commented on YARN-5918:
------------------------------------

While adding null checks fixes the NPE, can or should something else be done? 
If we fix the code as above, we will return fewer nodes for scheduling 
opportunistic containers than the 
yarn.opportunistic-container-allocation.nodes-used configuration specifies, even 
though enough nodes are available. But this should be corrected the very next 
second (as per the default config), which may be fine.
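
To illustrate the trade-off, here is a minimal sketch of the null-check 
approach. It is not the actual YARN code: the class name, the method signature, 
and the use of plain strings for node ids and addresses are all hypothetical 
stand-ins. It only shows why skipping lost nodes after taking the sublist can 
return fewer nodes than configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the conversion step; not the real RM classes.
class NullCheckSketch {

  // Converts node ids to addresses, skipping ids that are no longer live.
  static List<String> convertToRemoteNodes(List<String> nodeIds,
      Map<String, String> liveNodes) {
    List<String> result = new ArrayList<>();
    for (String id : nodeIds) {
      String address = liveNodes.get(id); // null if the node was lost
      if (address != null) {              // the null check that avoids the NPE
        result.add(address);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Sorted list computed before n2 was lost (assumed example data).
    List<String> sorted = Arrays.asList("n1", "n2", "n3", "n4");
    Map<String, String> live = new HashMap<>();
    live.put("n1", "host1:1000");
    live.put("n3", "host3:1000");
    live.put("n4", "host4:1000");

    int nodesUsed = 3; // stands in for the nodes-used setting
    List<String> picked =
        convertToRemoteNodes(sorted.subList(0, nodesUsed), live);

    // Only two nodes come back, although n4 is live and could fill the gap.
    System.out.println(picked);
  }
}
```

The shortfall lasts only until the next sort refreshes the list.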

Cluster nodes are sorted in NodeQueueLoadMonitor every 1 second by default and 
stored in a list. Although we remove a node from cluster nodes when it is lost, 
we do not remove it from sorted nodes, because doing so would require iterating 
over the list. Can we keep a set instead? Also, when an allocate request comes 
in and we get the least loaded nodes, we simply create a sublist from the 
sorted nodes. We could iterate over the list and check whether each node is 
still running to avoid the NPE, but this would be slower than creating a 
sublist, especially if the number of nodes configured for scheduling 
opportunistic containers is much larger than the default of 10.
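
As a rough sketch of the iterate-and-check alternative (again with hypothetical 
names and plain strings instead of the real node types, so this is an 
assumption about shape, not the actual implementation): walk the sorted list, 
skip nodes that are no longer in the live set, and stop once enough nodes have 
been collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in; not the real NodeQueueLoadMonitor.
class LeastLoadedSketch {

  // Returns up to k node ids from the sorted list that are still live.
  static List<String> selectLeastLoaded(List<String> sortedNodes,
      Set<String> liveNodes, int k) {
    List<String> result = new ArrayList<>();
    for (String id : sortedNodes) {
      if (result.size() >= k) {
        break;                      // collected enough nodes
      }
      if (liveNodes.contains(id)) { // skip nodes lost since the last sort
        result.add(id);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Sorted list computed before n2 was lost (assumed example data).
    List<String> sorted = Arrays.asList("n1", "n2", "n3", "n4");
    Set<String> live = new HashSet<>(Arrays.asList("n1", "n3", "n4"));

    // Still yields three nodes, because n4 replaces the lost n2.
    System.out.println(selectLeastLoaded(sorted, live, 3));
  }
}
```

Unlike a sublist, which is a view of the sorted list, this copies up to k 
entries on every allocate request, which is the extra cost mentioned above.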

I guess we can check with the folks working on distributed scheduling before 
deciding on a fix.
cc [~asuresh] 

> Opportunistic scheduling allocate request failure when NM lost
> --------------------------------------------------------------
>
>                 Key: YARN-5918
>                 URL: https://issues.apache.org/jira/browse/YARN-5918
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-5918.0001.patch
>
>
> Allocate request failure during Opportunistic container allocation when 
> nodemanager is lost 
> {noformat}
> 2016-11-20 10:38:49,011 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     
> OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1479637990302_0002    
> CONTAINERID=container_e12_1479637990302_0002_01_000006  
> RESOURCE=<memory:1024, vCores:1>
> 2016-11-20 10:38:49,011 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Removed node docker2:38297 clusterResource: <memory:4096, vCores:8>
> 2016-11-20 10:38:49,434 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 7 on 8030, call Call#35 Retry#0 
> org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 172.17.0.2:51584
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNode(OpportunisticContainerAllocatorAMService.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNodes(OpportunisticContainerAllocatorAMService.java:412)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.getLeastLoadedNodes(OpportunisticContainerAllocatorAMService.java:402)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.allocate(OpportunisticContainerAllocatorAMService.java:236)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2539)
> 2016-11-20 10:38:50,824 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e12_1479637990302_0002_01_000002 Container Transitioned from 
> RUNNING to COMPLETED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
