[jira] [Created] (YARN-8984) OutstandingSchedRequests in AMRMClient could not be removed when AllocationTags is null or empty

2018-11-06 Thread Yang Wang (JIRA)
Yang Wang created YARN-8984:
---

 Summary: OutstandingSchedRequests in AMRMClient could not be 
removed when AllocationTags is null or empty
 Key: YARN-8984
 URL: https://issues.apache.org/jira/browse/YARN-8984
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


In AMRMClient, the corresponding entry in outstandingSchedRequests should be 
removed or decremented when a container is allocated. However, this does not 
happen when the allocation tags are null or empty, so such requests are never 
cleared.
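
The failure mode can be sketched as follows (simplified, hypothetical 
bookkeeping; the real logic lives in AMRMClientImpl):
{code:java}
// Sketch only: outstanding requests indexed by allocation tag.
Map<String, List<SchedulingRequest>> outstanding = new HashMap<>();

void removeFromOutstanding(Container allocated) {
  Set<String> tags = allocated.getAllocationTags();
  // BUG: when tags is null or empty, nothing below runs, so the matching
  // SchedulingRequest is never removed or decremented.
  if (tags == null) {
    return;
  }
  for (String tag : tags) {
    List<SchedulingRequest> reqs = outstanding.get(tag);
    if (reqs != null && !reqs.isEmpty()) {
      reqs.remove(0); // one outstanding ask satisfied for this tag
    }
  }
}
{code}
A fix needs some way to match allocated containers back to tag-less scheduling 
requests, for example by also indexing them under an empty-tag key (again, only 
a sketch).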






[jira] [Created] (YARN-8331) Race condition in NM container launched after done

2018-05-20 Thread Yang Wang (JIRA)
Yang Wang created YARN-8331:
---

 Summary: Race condition in NM container launched after done
 Key: YARN-8331
 URL: https://issues.apache.org/jira/browse/YARN-8331
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


While a container is launching in ContainerLaunch#launchContainer and is still 
in state SCHEDULED, a kill event is sent to the container, driving it through 
SCHEDULED -> KILLING -> DONE. ContainerLaunch then sends the CONTAINER_LAUNCHED 
event and starts the container processes anyway. Since the container is already 
DONE, these orphaned processes are never cleaned up.

 
{code:java}
2018-05-21 13:11:56,114 INFO  [Thread-11] nodemanager.NMAuditLogger 
(NMAuditLogger.java:logSuccess(94)) - USER=nobody   OPERATION=Start Container 
Request   TARGET=ContainerManageImpl  RESULT=SUCCESS  
APPID=application_0_CONTAINERID=container_0__01_00
2018-05-21 13:11:56,114 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application 
application_0_ transitioned from NEW to INITING
2018-05-21 13:11:56,114 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding 
container_0__01_00 to application application_0_
2018-05-21 13:11:56,118 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application 
application_0_ transitioned from INITING to RUNNING
2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container 
container_0__01_00 transitioned from NEW to SCHEDULED
2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] 
containermanager.AuxServices (AuxServices.java:handle(220)) - Got event 
CONTAINER_INIT for appId application_0_
2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] 
scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(504)) - 
Starting container [container_0__01_00]
2018-05-21 13:11:56,226 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container 
container_0__01_00 transitioned from SCHEDULED to KILLING
2018-05-21 13:11:56,227 INFO  [NM ContainerManager dispatcher] 
containermanager.TestContainerManager 
(BaseContainerManagerTest.java:delete(287)) - Psuedo delete: user - nobody, 
type - FILE
2018-05-21 13:11:56,227 INFO  [NM ContainerManager dispatcher] 
nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody 
 OPERATION=Container Finished - Killed   TARGET=ContainerImplRESULT=SUCCESS 
 APPID=application_0_CONTAINERID=container_0__01_00
2018-05-21 13:11:56,238 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container 
container_0__01_00 transitioned from KILLING to DONE
2018-05-21 13:11:56,238 INFO  [NM ContainerManager dispatcher] 
application.ApplicationImpl (ApplicationImpl.java:transition(489)) - Removing 
container_0__01_00 from application application_0_
2018-05-21 13:11:56,239 INFO  [NM ContainerManager dispatcher] 
monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:onStopMonitoringContainer(932)) - Stopping 
resource-monitoring for container_0__01_00
2018-05-21 13:11:56,239 INFO  [NM ContainerManager dispatcher] 
containermanager.AuxServices (AuxServices.java:handle(220)) - Got event 
CONTAINER_STOP for appId application_0_
2018-05-21 13:11:56,274 WARN  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2106)) - Can't handle this 
event at current state: Current: [DONE], eventType: [CONTAINER_LAUNCHED], 
container: [container_0__01_00]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_LAUNCHED at DONE
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2104)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:104)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1525)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1518)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
        ...
{code}
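
One way to close the race (a sketch of the idea, not necessarily the committed 
fix) is for the launch path to re-check atomically whether the container has 
already been killed before it spawns any process:
{code:java}
// Sketch only: field and method names here are illustrative.
private final AtomicBoolean containerKilled = new AtomicBoolean(false);

// kill path (NM dispatcher): SCHEDULED -> KILLING -> DONE
void markKilled() {
  containerKilled.set(true);
}

// launch path, called from ContainerLaunch#launchContainer
int launchContainer() throws IOException {
  if (containerKilled.get()) {
    // The container was killed while still SCHEDULED; starting processes
    // now would leave them orphaned, since CONTAINER_LAUNCHED is an
    // invalid event in state DONE.
    return ExitCode.TERMINATED.getExitCode();
  }
  // ... start the container processes and send CONTAINER_LAUNCHED ...
  return 0;
}
{code}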

[jira] [Created] (YARN-8153) Guaranteed containers always stay in SCHEDULED on NM after restart

2018-04-11 Thread Yang Wang (JIRA)
Yang Wang created YARN-8153:
---

 Summary: Guaranteed containers always stay in SCHEDULED on NM 
after restart
 Key: YARN-8153
 URL: https://issues.apache.org/jira/browse/YARN-8153
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


When NM recovery is enabled, some containers stay in SCHEDULED forever after an 
NM restart because the scheduler thinks there are insufficient resources.

The root cause is that utilizationTracker.addContainerResources is called twice 
for each recovered container during restart, so its resources are counted 
against the node twice.
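
The double count can be sketched like this (simplified and hypothetical method 
names; the real calls are made while the NM replays recovered containers):
{code:java}
// Sketch only: both paths run for a recovered container.
void recoverActiveContainer(Container container) {
  // recovery path charges the container against node utilization ...
  utilizationTracker.addContainerResources(container);
  enqueueContainer(container);
}

void enqueueContainer(Container container) {
  // ... and the normal scheduling path charges it a second time, so the
  // node looks fuller than it is and new containers stay in SCHEDULED.
  utilizationTracker.addContainerResources(container);
}
{code}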






[jira] [Created] (YARN-7661) NodeManager metrics return wrong values after updating node resource

2017-12-15 Thread Yang Wang (JIRA)
Yang Wang created YARN-7661:
---

 Summary: NodeManager metrics return wrong values after updating node resource
 Key: YARN-7661
 URL: https://issues.apache.org/jira/browse/YARN-7661
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


{code:title=NodeManagerMetrics.java}
  public void addResource(Resource res) {
availableMB = availableMB + res.getMemorySize();
availableGB.incr((int)Math.floor(availableMB/1024d));
availableVCores.incr(res.getVirtualCores());
  }
{code}
When the node resource is updated through the RM-NM heartbeat, the NM metrics 
report wrong values. 
The root cause is that availableMB already accumulates the change, yet 
availableGB is then incremented by the floor of the whole running total rather 
than recomputed from it, so availableGB grows by the full total on every update.
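
A sketch of a possible fix (assuming the gauge supports set(), as 
MutableGaugeInt does; the committed patch may differ): recompute availableGB 
from the accumulated total instead of incrementing it.
{code:title=NodeManagerMetrics.java (sketch)}
  public void addResource(Resource res) {
    availableMB = availableMB + res.getMemorySize();
    // recompute the gauge from the running total instead of incrementing
    // it by the whole total on every update
    availableGB.set((int) Math.floor(availableMB / 1024d));
    availableVCores.incr(res.getVirtualCores());
  }
{code}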






[jira] [Created] (YARN-7660) NodeManager metrics return wrong values after updating node resource

2017-12-15 Thread Yang Wang (JIRA)
Yang Wang created YARN-7660:
---

 Summary: NodeManager metrics return wrong values after updating node resource
 Key: YARN-7660
 URL: https://issues.apache.org/jira/browse/YARN-7660
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


{code:title=NodeManagerMetrics.java}
  public void addResource(Resource res) {
availableMB = availableMB + res.getMemorySize();
availableGB.incr((int)Math.floor(availableMB/1024d));
availableVCores.incr(res.getVirtualCores());
  }
{code}
When the node resource is updated through the RM-NM heartbeat, the NM metrics 
report wrong values. 
The root cause is that availableMB already accumulates the change, yet 
availableGB is then incremented by the floor of the whole running total rather 
than recomputed from it, so availableGB grows by the full total on every update.






[jira] [Created] (YARN-7659) NodeManager metrics return wrong values after updating resource

2017-12-15 Thread Yang Wang (JIRA)
Yang Wang created YARN-7659:
---

 Summary: NodeManager metrics return wrong values after updating resource
 Key: YARN-7659
 URL: https://issues.apache.org/jira/browse/YARN-7659
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


{code:title=NodeManagerMetrics.java}
  public void addResource(Resource res) {
availableMB = availableMB + res.getMemorySize();
availableGB.incr((int)Math.floor(availableMB/1024d));
availableVCores.incr(res.getVirtualCores());
  }
{code}
When the node resource is updated through the RM-NM heartbeat, the NM metrics 
report wrong values. 
The root cause is that availableMB already accumulates the change, yet 
availableGB is then incremented by the floor of the whole running total rather 
than recomputed from it, so availableGB grows by the full total on every update.






[jira] [Created] (YARN-7647) NM prints inappropriate error log when node-labels is enabled

2017-12-12 Thread Yang Wang (JIRA)
Yang Wang created YARN-7647:
---

 Summary: NM prints inappropriate error log when node-labels is enabled
 Key: YARN-7647
 URL: https://issues.apache.org/jira/browse/YARN-7647
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


{code:title=NodeStatusUpdaterImpl.java}
  ... ...
  if (response.getAreNodeLabelsAcceptedByRM() && LOG.isDebugEnabled()) {
    LOG.debug("Node Labels {" + StringUtils.join(",", previousNodeLabels)
        + "} were Accepted by RM ");
  } else {
    // case where updated labels from NodeLabelsProvider is sent to RM and
    // RM rejected the labels
    LOG.error(
        "NM node labels {" + StringUtils.join(",", previousNodeLabels)
            + "} were not accepted by RM and message from RM : "
            + response.getDiagnosticsMessage());
  }
  ... ...
{code}

When LOG.isDebugEnabled() is false, the NM always takes the else branch and 
prints the error log even though the RM accepted the labels. Coupling the 
acceptance check with LOG.isDebugEnabled() in one condition is an obvious error 
and very misleading.
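
A straightforward fix (a sketch; the committed patch may differ) is to nest the 
debug check inside the acceptance check so the error branch fires only on an 
actual rejection:
{code:java}
if (response.getAreNodeLabelsAcceptedByRM()) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Node Labels {" + StringUtils.join(",", previousNodeLabels)
        + "} were Accepted by RM ");
  }
} else {
  // updated labels from NodeLabelsProvider were rejected by RM
  LOG.error("NM node labels {" + StringUtils.join(",", previousNodeLabels)
      + "} were not accepted by RM and message from RM : "
      + response.getDiagnosticsMessage());
}
{code}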






[jira] [Created] (YARN-6966) NodeManager metrics may return wrong negative values after restart

2017-08-08 Thread Yang Wang (JIRA)
Yang Wang created YARN-6966:
---

 Summary: NodeManager metrics may return wrong negative values after restart
 Key: YARN-6966
 URL: https://issues.apache.org/jira/browse/YARN-6966
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


Just as in YARN-6212. However, I think it is not a duplicate of YARN-3933.
The primary cause of the negative values is that the metrics are not recovered 
properly when the NM restarts.
AllocatedContainers, ContainersLaunched, AllocatedGB, AvailableGB, 
AllocatedVCores and AvailableVCores also need to be recovered when the NM 
restarts.
This should be done in ContainerManagerImpl#recoverContainer; see the sketch 
after the JMX output below.

The scenario can be reproduced with the following steps:
# Make sure YarnConfiguration.NM_RECOVERY_ENABLED=true and 
YarnConfiguration.NM_RECOVERY_SUPERVISED=true on the NM
# Submit an application and keep it running
# Restart the NM
# Stop the application
# Now you get the negative values from the NM JMX endpoint:
{code}
/jmx?qry=Hadoop:service=NodeManager,name=NodeManagerMetrics
{code}
{code}
{
name: "Hadoop:service=NodeManager,name=NodeManagerMetrics",
modelerType: "NodeManagerMetrics",
tag.Context: "yarn",
tag.Hostname: "hadoop.com",
ContainersLaunched: 0,
ContainersCompleted: 0,
ContainersFailed: 2,
ContainersKilled: 0,
ContainersIniting: 0,
ContainersRunning: 0,
AllocatedGB: 0,
AllocatedContainers: -2,
AvailableGB: 160,
AllocatedVCores: -11,
AvailableVCores: 3611,
ContainerLaunchDurationNumOps: 2,
ContainerLaunchDurationAvgTime: 6,
BadLocalDirs: 0,
BadLogDirs: 0,
GoodLocalDirsDiskUtilizationPerc: 2,
GoodLogDirsDiskUtilizationPerc: 2
}
{code}
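
A sketch of the kind of recovery hook this suggests, assuming the existing 
NodeManagerMetrics methods launchedContainer() and allocateContainer(Resource) 
(the exact call site inside ContainerManagerImpl#recoverContainer may differ):
{code:java}
// Sketch: re-apply a recovered container's footprint to the metrics so
// that the eventual release does not drive the gauges negative.
private void recoverContainerMetrics(Container container) {
  metrics.launchedContainer();
  metrics.allocateContainer(container.getResource());
}
{code}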






[jira] [Created] (YARN-6951) Fix debug log when Resource handler chain is enabled

2017-08-04 Thread Yang Wang (JIRA)
Yang Wang created YARN-6951:
---

 Summary: Fix debug log when Resource handler chain is enabled
 Key: YARN-6951
 URL: https://issues.apache.org/jira/browse/YARN-6951
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


{code:title=LinuxContainerExecutor.java}
  ... ...
  if (LOG.isDebugEnabled()) {
LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain
== null));
  }
  ... ...
{code}
I think it is just a typo: the comparison should be != null, so that when 
resourceHandlerChain is not null the log prints 
"Resource handler chain enabled = true". The corrected check follows below.

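The corrected check, i.e. the one-character fix the description implies:
{code:java}
  if (LOG.isDebugEnabled()) {
    LOG.debug("Resource handler chain enabled = " + (resourceHandlerChain
        != null));
  }
{code}
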





[jira] [Created] (YARN-6630) Container worker dir could not be recovered when NM restarts

2017-05-22 Thread Yang Wang (JIRA)
Yang Wang created YARN-6630:
---

 Summary: Container worker dir could not be recovered when NM restarts
 Key: YARN-6630
 URL: https://issues.apache.org/jira/browse/YARN-6630
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


When ContainerRetryPolicy is NEVER_RETRY, the container work dir is not saved 
in the NM state store. When the NM then restarts, container.workDir is null, 
which may cause further exceptions, for example in 
ResourceLocalizedWhileRunningTransition shown below.

{code:title=ContainerLaunch.java}
...
  private void recordContainerWorkDir(ContainerId containerId,
  String workDir) throws IOException{
container.setWorkDir(workDir);
if (container.isRetryContextSet()) {
  context.getNMStateStore().storeContainerWorkDir(containerId, workDir);
}
  }
{code}

{code:title=ContainerImpl.java}
  static class ResourceLocalizedWhileRunningTransition
  extends ContainerTransition {
...
  String linkFile = new Path(container.workDir, link).toString();
...
{code}
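
A sketch of a possible fix (the committed patch may differ): persist the work 
dir for every container, not only those with a retry context, so it can be 
recovered after an NM restart.
{code:title=ContainerLaunch.java (sketch)}
  private void recordContainerWorkDir(ContainerId containerId,
      String workDir) throws IOException {
    container.setWorkDir(workDir);
    // store unconditionally so container.workDir survives an NM restart
    context.getNMStateStore().storeContainerWorkDir(containerId, workDir);
  }
{code}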






[jira] [Created] (YARN-6589) Recover all resources when NM restarts

2017-05-11 Thread Yang Wang (JIRA)
Yang Wang created YARN-6589:
---

 Summary: Recover all resources when NM restarts
 Key: YARN-6589
 URL: https://issues.apache.org/jira/browse/YARN-6589
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Yang Wang


When the NM restarts, containers are recovered. However, only the memory and 
vcores of the capability are restored; all resource types need to be recovered.
{code:title=ContainerImpl.java}
  // resource capability had been updated before NM was down
  this.resource = Resource.newInstance(recoveredCapability.getMemorySize(),
  recoveredCapability.getVirtualCores());
{code}

It should be like this.

{code:title=ContainerImpl.java}
  // resource capability had been updated before NM was down
  // need to recover all resource types, not only memory and vcores
  this.resource = Resources.clone(recoveredCapability);
{code}






[jira] [Created] (YARN-6578) Return container resource utilization from NM ContainerStatus call

2017-05-10 Thread Yang Wang (JIRA)
Yang Wang created YARN-6578:
---

 Summary: Return container resource utilization from NM 
ContainerStatus call
 Key: YARN-6578
 URL: https://issues.apache.org/jira/browse/YARN-6578
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Yang Wang


When the ApplicationMaster wants to change (increase/decrease) the resources of 
an allocated container, resource utilization is an important reference for that 
decision. So when the AM calls NMClient.getContainerStatus, resource 
utilization should be returned as well.
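
A sketch of the intended AM-side usage; the getResourceUtilization() accessor 
on ContainerStatus is hypothetical here, since it is exactly what this issue 
proposes to add:
{code:java}
// Poll the NM for the container's current utilization before deciding
// whether to ask the RM for a resource increase or decrease.
ContainerStatus status = nmClient.getContainerStatus(containerId, nodeId);
ResourceUtilization used = status.getResourceUtilization(); // proposed API
if (used.getPhysicalMemory() > memoryThresholdMB) {
  // request a container resource increase from the RM
}
{code}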


