[ 
https://issues.apache.org/jira/browse/APEXCORE-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16118730#comment-16118730
 ] 

Vlad Rozov commented on APEXCORE-770:
-------------------------------------

Reproducing the issue requires several preconditions. It happens in the 
following scenario:

- there has been more than one application attempt (an application master 
restart)
- an attempt is made to kill a container that was started by an already 
terminated application master

In this case, {{StreamingAppMasterService.sendContainerAskToRM()}} invokes 
{{NMClientAsync.stopContainerAsync()}} for a {{containerId}} that was started 
by an already terminated application master rather than by the current one. 
This leads to {{onStopContainerError}} being raised by {{NMClientAsync}} 
(see {{NMClientAsyncImpl}}), because its {{containers}} map does not contain 
the requested {{containerId}}:
{noformat}
2017-07-25 11:24:51,681 WARN com.datatorrent.stram.StreamingAppMasterService: 
Failed to stop container container_e47_1499808956620_0716_01_000090
org.apache.hadoop.yarn.exceptions.YarnException: Container 
container_e47_1499808956620_0716_01_000090 is neither started nor scheduled to 
start
        at 
org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45)
        at 
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.stopContainerAsync(NMClientAsyncImpl.java:234)
        at 
com.datatorrent.stram.StreamingAppMasterService.sendContainerAskToRM(StreamingAppMasterService.java:1175)
        at 
com.datatorrent.stram.StreamingAppMasterService.execute(StreamingAppMasterService.java:865)
        at 
com.datatorrent.stram.StreamingAppMasterService.run(StreamingAppMasterService.java:671)
        at 
com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:106)
{noformat}
{{NMCallbackHandler.onStopContainerError()}} tries to recover the container, 
removes {{containerId}} from {{allocatedContainers}}, and sets the state of 
the corresponding PTContainer to {{PTContainer.State.KILLED}}. This leads to a 
shutdown request in the heartbeat response to the container, and the container 
terminates (normally). At that point the RM, which is entirely unaware that 
the container was requested to stop, reports that it terminated normally. 
Because {{containerId}} has already been removed from {{allocatedContainers}}, 
an NPE is raised when {{allocatedContainer}} is used.
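
The sequence above can be modelled with a small, self-contained sketch. This 
is not the Apex or YARN source: the class {{AppMasterSketch}}, its methods and 
the {{stopRequested}} field are hypothetical names chosen for illustration, 
and the null check shown is only one possible defensive guard, not necessarily 
the fix adopted for this issue.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the bookkeeping described above; all names are hypothetical. */
class AppMasterSketch {
    /** Stand-in for the per-container record kept by the application master. */
    static final class AllocatedContainer {
        final String containerId;
        boolean stopRequested;
        AllocatedContainer(String containerId) { this.containerId = containerId; }
    }

    // Containers the current application master keeps records for.
    private final Map<String, AllocatedContainer> allocatedContainers = new HashMap<>();
    // Containers started through this attempt's node manager client.
    private final Set<String> startedByThisAttempt = new HashSet<>();

    /** A container inherited from a previous attempt: recorded, but not started by this client. */
    void recoverFromPreviousAttempt(String id) {
        allocatedContainers.put(id, new AllocatedContainer(id));
    }

    /** A container started by the current attempt. */
    void startContainer(String id) {
        allocatedContainers.put(id, new AllocatedContainer(id));
        startedByThisAttempt.add(id);
    }

    /** Returns false when the client does not know the container and the error path runs. */
    boolean stopContainer(String id) {
        if (!startedByThisAttempt.contains(id)) {
            onStopContainerError(id);
            return false;
        }
        AllocatedContainer c = allocatedContainers.get(id);
        if (c != null) {
            c.stopRequested = true;
        }
        return true;
    }

    /** Error path: the record is dropped before the RM reports completion. */
    void onStopContainerError(String id) {
        allocatedContainers.remove(id);
    }

    /** Unguarded completion handling: dereferences the record without a null check. */
    boolean completedUnguarded(String id) {
        AllocatedContainer c = allocatedContainers.remove(id);
        return c.stopRequested; // throws NullPointerException if the record is gone
    }

    /** Guarded completion handling: tolerates a record that was already removed. */
    boolean completedGuarded(String id) {
        AllocatedContainer c = allocatedContainers.remove(id);
        if (c == null) {
            return false; // already handled by the stop-error path
        }
        return c.stopRequested;
    }

    public static void main(String[] args) {
        AppMasterSketch am = new AppMasterSketch();
        am.recoverFromPreviousAttempt("container_01_000090");
        am.stopContainer("container_01_000090"); // fails; record is removed
        try {
            am.completedUnguarded("container_01_000090");
        } catch (NullPointerException e) {
            System.out.println("unguarded completion handling threw NullPointerException");
        }
        System.out.println("guarded completion handling returned "
            + am.completedGuarded("container_01_000090"));
    }
}
```

In this model the unguarded path fails only for containers inherited from an 
earlier attempt, which matches the preconditions listed at the top of this 
comment.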

> Application is killed due to NPE in ApplicationMaster
> -----------------------------------------------------
>
>                 Key: APEXCORE-770
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-770
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Vinay Bangalore Srikanth
>            Assignee: Sandesh
>
> In my Apex application, I was trying to delete different containers (except 
> the app master) randomly. 
> The application got killed unexpectedly with the following exception:
> {noformat}
> 2017-07-25 11:24:51,681 WARN com.datatorrent.stram.StreamingAppMasterService: 
> Failed to stop container container_e47_1499808956620_0716_01_000090
> org.apache.hadoop.yarn.exceptions.YarnException: Container 
> container_e47_1499808956620_0716_01_000090 is neither started nor scheduled 
> to start
>       at 
> org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45)
>       at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.stopContainerAsync(NMClientAsyncImpl.java:234)
>       at 
> com.datatorrent.stram.StreamingAppMasterService.sendContainerAskToRM(StreamingAppMasterService.java:1175)
>       at 
> com.datatorrent.stram.StreamingAppMasterService.execute(StreamingAppMasterService.java:865)
>       at 
> com.datatorrent.stram.StreamingAppMasterService.run(StreamingAppMasterService.java:671)
>       at 
> com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:106)
> 2017-07-25 11:24:51,681 INFO com.datatorrent.stram.StreamingAppMasterService: 
> Requested stop container container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:51,681 INFO 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl: Processing 
> Event EventType: STOP_CONTAINER for Container 
> container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:51,681 INFO 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl: Container 
> container_e47_1499808956620_0716_01_000090 is already stopped or failed
> 2017-07-25 11:24:51,686 INFO com.datatorrent.stram.StreamingContainerManager: 
> Initiating recovery for 
> container_e47_1499808956620_0716_01_000...@node21.morado.com:8041
> 2017-07-25 11:24:51,686 INFO com.datatorrent.stram.StreamingContainerManager: 
> Affected operators [PTOperator[id=38,name=passthrough,state=ACTIVE], 
> PTOperator[id=105,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=97,name=console,state=ACTIVE], 
> PTOperator[id=106,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=103,name=console,state=ACTIVE], 
> PTOperator[id=107,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=100,name=console,state=ACTIVE], 
> PTOperator[id=108,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=99,name=console,state=ACTIVE], 
> PTOperator[id=109,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=101,name=console,state=ACTIVE], 
> PTOperator[id=110,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=102,name=console,state=ACTIVE], 
> PTOperator[id=111,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=98,name=console,state=ACTIVE], 
> PTOperator[id=112,name=passthrough.output#unifier,state=ACTIVE], 
> PTOperator[id=104,name=console,state=ACTIVE], 
> PTOperator[id=68,name=randomGenerator.out#unifier,state=ACTIVE]]
> 2017-07-25 11:24:52,260 ERROR 
> com.datatorrent.stram.StreamingContainerManager: Unknown container 
> container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:52,263 INFO com.datatorrent.stram.StreamingContainerParent: 
> child msg: [container_e47_1499808956620_0716_01_000090] Exiting heartbeat 
> loop.. context: 
> PTContainer[id=38(container_e47_1499808956620_0716_01_000090),state=KILLED,operators=[PTOperator[id=38,name=passthrough,state=PENDING_DEPLOY],
>  PTOperator[id=68,name=randomGenerator.out#unifier,state=PENDING_DEPLOY]]]
> 2017-07-25 11:24:52,697 INFO com.datatorrent.stram.ResourceRequestHandler: 
> Strict anti-affinity = [] for container with operators 
> PTOperator[id=38,name=passthrough,state=PENDING_DEPLOY],PTOperator[id=68,name=randomGenerator.out#unifier,state=PENDING_DEPLOY]
> 2017-07-25 11:24:52,698 INFO com.datatorrent.stram.ResourceRequestHandler: 
> Found host null
> 2017-07-25 11:24:52,698 INFO 
> com.datatorrent.stram.BlacklistBasedResourceRequestHandler: No node specific 
> request 
> 2017-07-25 11:24:53,710 INFO 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl: Replacing token for : 
> node18.morado.com:8041
> 2017-07-25 11:24:53,710 INFO com.datatorrent.stram.StreamingAppMasterService: 
> Got new container., containerId=container_e47_1499808956620_0716_02_000034, 
> containerNode=node18.morado.com:8041, 
> containerNodeURI=node18.morado.com:8042, containerResourceMemory4096, 
> priority32
> 2017-07-25 11:24:53,710 INFO com.datatorrent.stram.StreamingContainerManager: 
> Removing container agent container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:53,711 INFO com.datatorrent.stram.LaunchContainerRunnable: 
> Setting up container launch context for 
> containerid=container_e47_1499808956620_0716_02_000034
> 2017-07-25 11:24:53,711 INFO com.datatorrent.stram.LaunchContainerRunnable: 
> CLASSPATH: 
> ./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:.
> 2017-07-25 11:24:53,946 INFO 
> com.datatorrent.common.util.BasicContainerOptConfigurator: property map for 
> operator {Generic=null, -Xmx=1920m}
> 2017-07-25 11:24:53,947 INFO 
> com.datatorrent.common.util.BasicContainerOptConfigurator: property map for 
> operator {Generic=null, -Xmx=768m}
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.LaunchContainerRunnable: 
> Jvm opts  -Xmx3355443200  for container 
> container_e47_1499808956620_0716_02_000034
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.LaunchContainerRunnable: 
> Launching on node: node18.morado.com:8041 command: $JAVA_HOME/bin/java  
> -Xmx3355443200  
> -Ddt.attr.APPLICATION_PATH=hdfs://node18.morado.com:8020/user/vinay/datatorrent/apps/application_1499808956620_0716
>  -Djava.io.tmpdir=$PWD/tmp 
> -Ddt.cid=container_e47_1499808956620_0716_02_000034 
> -Dhadoop.root.logger=INFO,RFA -Dhadoop.log.dir=<LOG_DIR> 
> -Dapex.application.name=$'SlowConsumerTimeoutWindowCountSet.apa' 
> com.datatorrent.stram.engine.StreamingContainer 1><LOG_DIR>/stdout 
> 2><LOG_DIR>/stderr  
> 2017-07-25 11:24:53,947 INFO 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl: Processing 
> Event EventType: START_CONTAINER for Container 
> container_e47_1499808956620_0716_02_000034
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.StreamingAppMasterService: 
> Completed containerId=container_e47_1499808956620_0716_01_000090, 
> state=COMPLETE, exitStatus=0, diagnostics=
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.StreamingAppMasterService: 
> Container completed successfully., 
> containerId=container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:53,947 INFO 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: 
> Opening proxy : node18.morado.com:8041
> 2017-07-25 11:24:53,948 ERROR com.datatorrent.stram.StreamingAppMaster: 
> Exiting Application Master
> java.lang.NullPointerException
>       at 
> com.datatorrent.stram.StreamingAppMasterService$AllocatedContainer.access$1000(StreamingAppMasterService.java:1251)
>       at 
> com.datatorrent.stram.StreamingAppMasterService.execute(StreamingAppMasterService.java:1014)
>       at 
> com.datatorrent.stram.StreamingAppMasterService.run(StreamingAppMasterService.java:671)
>       at 
> com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:106)
> {noformat}



