[ 
https://issues.apache.org/jira/browse/YARN-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690057#comment-15690057
 ] 

Varun Saxena commented on YARN-5918:
------------------------------------

Thanks [~bibinchundatt] for the latest patch.

We did not actually need such an elaborate test; we could probably have 
achieved the same thing by mocking the scheduler to simulate the NPE. But it is 
not necessarily bad to have some sort of E2E test case simulating the complete 
scenario.

A few comments.
# This test does not seem to belong in TestNodeQueueLoadMonitor. We can move it 
to some other test class, or create a new one if it does not fit anywhere.
# We iterate 30 times and sleep for 100 ms in each iteration, which 
unnecessarily makes the test spend about 3 seconds in the loop. We can get rid 
of this loop and make the test run faster, as follows.
** Set YarnConfiguration.NM_CONTAINER_QUEUING_SORTING_NODES_INTERVAL_MS to 
100 ms in the configuration instead of the 1000 ms default.
** Move the node-added and node-update events sent to the AM service outside 
the loop.
** Get the OpportunisticContainerContext from FiCaSchedulerApp and check 
against its getNodeMap to make the test deterministic and reduce its run time.
** After sending the node add and update events, loop, say, 10-20 times, 
sending an allocate with a sleep of about 50 ms per iteration, and break out of 
the loop as soon as getNodeMap has 2 nodes. Then send the node-removed event to 
the scheduler and loop over allocate again until getNodeMap drops to 1.
# Not related to your patch: in NodeQueueLoadMonitor we have some LOG.debug 
statements without an isDebugEnabled guard. Maybe we can fix this here as well.
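
The deterministic wait in point 2 can be sketched generically. This is just an illustration, not YARN code: the {{waitFor}} helper and the simulated condition are hypothetical stand-ins for polling {{getNodeMap}} until it reaches the expected size, so the test exits as soon as the condition holds instead of always sleeping for the worst case.

```java
import java.util.function.BooleanSupplier;

public class WaitFor {
    // Poll a condition every intervalMs, up to maxAttempts times; return true
    // as soon as it holds, so the caller never sleeps longer than necessary.
    static boolean waitFor(BooleanSupplier condition, int maxAttempts,
                           long intervalMs) throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(intervalMs);
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a condition (e.g. "node map has 2 entries") that becomes
        // true after roughly 100 ms; the loop exits shortly after that.
        long start = System.currentTimeMillis();
        boolean ok = waitFor(() -> System.currentTimeMillis() - start > 100,
                             20, 50);
        System.out.println(ok);
    }
}
```

In the actual test the condition would be something like "the context's node map has 2 nodes", with an allocate call sent before each poll.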
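
For point 3, the fix is the usual guard pattern: check the log level before building the message so the argument string is never constructed when debug logging is off. Sketched below with java.util.logging so it is self-contained; NodeQueueLoadMonitor itself would use its own logger's isDebugEnabled(), and {{expensiveSummary}} is a hypothetical stand-in for the concatenation done in the log statement.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugGuard {
    private static final Logger LOG =
        Logger.getLogger(DebugGuard.class.getName());

    // Stands in for the string building done only to produce the log line.
    static String expensiveSummary() {
        return "node map: " + System.nanoTime();
    }

    public static void main(String[] args) {
        // Unguarded, expensiveSummary() would run even with debug disabled;
        // with the guard, the message is built only when the level is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(expensiveSummary());
        }
    }
}
```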

> Opportunistic scheduling allocate request failure when NM lost
> --------------------------------------------------------------
>
>                 Key: YARN-5918
>                 URL: https://issues.apache.org/jira/browse/YARN-5918
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: YARN-5918.0001.patch, YARN-5918.0002.patch
>
>
> Allocate request failure during Opportunistic container allocation when 
> nodemanager is lost 
> {noformat}
> 2016-11-20 10:38:49,011 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     
> OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1479637990302_0002    
> CONTAINERID=container_e12_1479637990302_0002_01_000006  
> RESOURCE=<memory:1024, vCores:1>
> 2016-11-20 10:38:49,011 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Removed node docker2:38297 clusterResource: <memory:4096, vCores:8>
> 2016-11-20 10:38:49,434 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 7 on 8030, call Call#35 Retry#0 
> org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 172.17.0.2:51584
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNode(OpportunisticContainerAllocatorAMService.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.convertToRemoteNodes(OpportunisticContainerAllocatorAMService.java:412)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.getLeastLoadedNodes(OpportunisticContainerAllocatorAMService.java:402)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.allocate(OpportunisticContainerAllocatorAMService.java:236)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2539)
> 2016-11-20 10:38:50,824 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e12_1479637990302_0002_01_000002 Container Transitioned from 
> RUNNING to COMPLETED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org