[jira] [Created] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node

2017-03-26 Thread zhihai xu (JIRA)
zhihai xu created YARN-6396:
---

 Summary: Call verifyAndCreateRemoteLogDir at service 
initialization instead of application initialization to decrease load for name 
node
 Key: YARN-6396
 URL: https://issues.apache.org/jira/browse/YARN-6396
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 3.0.0-alpha2
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Call verifyAndCreateRemoteLogDir at service initialization instead of 
application initialization to decrease load on the NameNode.
Currently verifyAndCreateRemoteLogDir is called before log aggregation for every 
application on each node. This is a non-trivial overhead for the NameNode in a 
large cluster, since verifyAndCreateRemoteLogDir calls getFileStatus. Once the 
remote log directory is created successfully, it is not necessary to call it 
again. It will be better to call verifyAndCreateRemoteLogDir at 
LogAggregationService service initialization.
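A minimal sketch of the idea, using only public FileSystem APIs (the real 
LogAggregationService helper and call site may differ):
{code}
// Verify/create the remote log root once per NM lifetime instead of once per
// application, so the NameNode sees one getFileStatus/mkdirs per restart.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteLogDirInit {
  private volatile boolean remoteDirVerified = false;

  // intended to be called from LogAggregationService#serviceInit, not initApp
  void verifyAndCreateRemoteLogDir(Configuration conf, Path remoteRootLogDir)
      throws java.io.IOException {
    if (remoteDirVerified) {
      return;
    }
    FileSystem fs = remoteRootLogDir.getFileSystem(conf);
    if (!fs.exists(remoteRootLogDir)) {
      fs.mkdirs(remoteRootLogDir);
    }
    remoteDirVerified = true;
  }
}
{code}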






[jira] [Created] (YARN-6392) add submit time to Application Summary log

2017-03-26 Thread zhihai xu (JIRA)
zhihai xu created YARN-6392:
---

 Summary: add submit time to Application Summary log
 Key: YARN-6392
 URL: https://issues.apache.org/jira/browse/YARN-6392
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 3.0.0-alpha2
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Add submit time to the Application Summary log. The application submit time is 
passed to the Application Master in the environment variable 
"APP_SUBMIT_TIME_ENV". Since it is an important parameter, it will be useful to 
log it in the Application Summary as well.
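For context, a minimal, self-contained sketch of how an AM already reads this 
value from its environment (the env key matches 
ApplicationConstants.APP_SUBMIT_TIME_ENV); logging the same value in the 
Application Summary would make the two sides easy to correlate:
{code}
public class SubmitTimeExample {
  public static void main(String[] args) {
    // the RM exports the submit time to the AM container's environment
    String raw = System.getenv("APP_SUBMIT_TIME_ENV");
    long submitTime = (raw == null) ? -1L : Long.parseLong(raw);
    System.out.println("application submit time (ms since epoch): " + submitTime);
  }
}
{code}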






[jira] [Created] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.

2016-04-20 Thread zhihai xu (JIRA)
zhihai xu created YARN-4979:
---

 Summary: FSAppAttempt adds duplicate ResourceRequest to demand in 
updateDemand.
 Key: YARN-4979
 URL: https://issues.apache.org/jira/browse/YARN-4979
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.2, 2.8.0
Reporter: zhihai xu
Assignee: zhihai xu


FSAppAttempt adds duplicate ResourceRequests to demand in updateDemand. We 
should only count the ResourceRequest for ResourceRequest.ANY when calculating 
demand, because {{hasContainerForNode}} will return false if there is no 
container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} and 
{{allocateRackLocal}} also decrease the number of containers for 
ResourceRequest.ANY.
This issue may cause the current memory demand to overflow (integer), because 
duplicate requests can exist on multiple nodes.
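A hedged sketch of the proposed accounting (the helper shape is an assumption, 
not the actual FSAppAttempt#updateDemand code):
{code}
// Count only the ResourceRequest.ANY requests when accumulating demand;
// node-local and rack-local requests describe the same containers again.
import java.util.Collection;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DemandSketch {
  static void addToDemand(Resource demand, Collection<ResourceRequest> requests) {
    for (ResourceRequest rr : requests) {
      if (ResourceRequest.ANY.equals(rr.getResourceName())) {
        Resources.multiplyAndAddTo(demand, rr.getCapability(),
            rr.getNumContainers());
      }
    }
  }
}
{code}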





[jira] [Created] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.

2015-12-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-4458:
---

 Summary: Compilation error at branch-2.7 due to 
getNodeLabelExpression not defined in NMContainerStatusPBImpl.
 Key: YARN-4458
 URL: https://issues.apache.org/jira/browse/YARN-4458
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu








[jira] [Created] (YARN-4209) RMStateStore FENCED state doesn’t work

2015-09-28 Thread zhihai xu (JIRA)
zhihai xu created YARN-4209:
---

 Summary: RMStateStore FENCED state doesn’t work
 Key: YARN-4209
 URL: https://issues.apache.org/jira/browse/YARN-4209
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


RMStateStore FENCED state doesn't work. The reason is that the 
{{stateMachine.doTransition}} call from {{updateFencedState}} is nested inside 
the {{stateMachine.doTransition}} call from a public API 
(removeRMDelegationToken...) or {{ForwardingEventHandler#handle}}. So right 
after the internal state transition from {{updateFencedState}} changes the 
state to FENCED, the external state transition changes the state back to 
ACTIVE. The end result is that RMStateStore is still in the ACTIVE state even 
after notifyStoreOperationFailed is called. The only case where the FENCED 
state works is when {{notifyStoreOperationFailed}} is called from 
{{ZKRMStateStore#VerifyActiveStatusThread}}.
For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter 
external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => 
{{notifyStoreOperationFailed}} => {{updateFencedState}} => {{handleStoreEvent}} 
=> enter internal {{stateMachine.doTransition}} => exit internal 
{{stateMachine.doTransition}}, changing the state to FENCED => exit external 
{{stateMachine.doTransition}}, changing the state to ACTIVE.
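A self-contained toy that mirrors the shape of the problem (it is not the real 
StateMachineFactory implementation): the outer transition determines its target 
state and applies it after the nested call returns, clobbering the nested 
FENCED update.
{code}
public class NestedTransitionDemo {
  enum State { ACTIVE, FENCED }
  private State state = State.ACTIVE;

  void outerTransition() {
    State next = State.ACTIVE;   // outer transition's computed target state
    innerTransition();           // nested call sets FENCED...
    state = next;                // ...which is overwritten here
  }

  void innerTransition() {
    state = State.FENCED;
  }

  public static void main(String[] args) {
    NestedTransitionDemo d = new NestedTransitionDemo();
    d.outerTransition();
    System.out.println(d.state); // prints ACTIVE, not FENCED
  }
}
{code}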






[jira] [Resolved] (YARN-4190) missing container information in FairScheduler preemption log.

2015-09-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-4190.
-
Resolution: Later

> missing container information in FairScheduler preemption log.
> --
>
> Key: YARN-4190
> URL: https://issues.apache.org/jira/browse/YARN-4190
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
>
> Add container information in FairScheduler preemption log to help debug. 
> Currently the following log doesn't have container information
> {code}
> LOG.info("Preempting container (prio=" + 
> container.getContainer().getPriority() +
> "res=" + container.getContainer().getResource() +
> ") from queue " + queue.getName());
> {code}
> So it will be very difficult to debug preemption-related issues for the FairScheduler.
> Even though the container information is printed in the following code
> {code}
> LOG.info("Killing container" + container +
> " (after waiting for premption for " +
> (getClock().getTime() - time) + "ms)");
> {code}
> we still can't match these two log lines by container ID.
> It will be very useful to add the container information to the first log.





[jira] [Created] (YARN-4190) Add container information in FairScheduler preemption log to help debug.

2015-09-18 Thread zhihai xu (JIRA)
zhihai xu created YARN-4190:
---

 Summary: Add container information in FairScheduler preemption log 
to help debug.
 Key: YARN-4190
 URL: https://issues.apache.org/jira/browse/YARN-4190
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial


Add container information in FairScheduler preemption log to help debug. 
Currently the following log doesn't have container information
{code}
LOG.info("Preempting container (prio=" + container.getContainer().getPriority() 
+
"res=" + container.getContainer().getResource() +
") from queue " + queue.getName());
{code}
So it will be very difficult to debug preemption-related issues for the 
FairScheduler.
Even though the container information is printed in the following code
{code}
LOG.info("Killing container" + container +
" (after waiting for premption for " +
(getClock().getTime() - time) + "ms)");
{code}
we still can't match these two log lines by container ID.
It will be very useful to add the container information to the first log.
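A hedged sketch of what the first log line could look like (not the committed 
change), so both messages include the container and can be correlated:
{code}
LOG.info("Preempting container " + container +
    " (prio=" + container.getContainer().getPriority() +
    " res=" + container.getContainer().getResource() +
    ") from queue " + queue.getName());
{code}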





[jira] [Created] (YARN-4187) Yarn Client uses local address instead RM address as token renewer in a secure cluster when HA is enabled.

2015-09-18 Thread zhihai xu (JIRA)
zhihai xu created YARN-4187:
---

 Summary: Yarn Client uses local address instead RM address as 
token renewer in a secure cluster when HA is enabled.
 Key: YARN-4187
 URL: https://issues.apache.org/jira/browse/YARN-4187
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


Yarn Client uses the local address instead of the RM address as the token 
renewer in a secure cluster when HA is enabled. This will cause HDFS token 
renewal to fail for renewer "nobody" if the rules from 
{{hadoop.security.auth_to_local}} exclude the client address in the HDFS 
DelegationTokenIdentifier.
The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: 
PriviledgedActionException as:t...@example.com (auth:KERBEROS) 
cause:java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
java.io.IOException: Failed to run job : yarn tries to renew a token with 
renewer nobody
at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74

[jira] [Created] (YARN-4158) Remove duplicate close for LogWriter in AppLogAggregatorImpl#uploadLogsForContainers

2015-09-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-4158:
---

 Summary: Remove duplicate close for LogWriter in 
AppLogAggregatorImpl#uploadLogsForContainers
 Key: YARN-4158
 URL: https://issues.apache.org/jira/browse/YARN-4158
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial


Remove the duplicate {{close}} for {{LogWriter}} in 
{{AppLogAggregatorImpl#uploadLogsForContainers}}.
{{writer.close()}} was called twice in {{uploadLogsForContainers}}.
It will be better to close {{writer}} once.
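A minimal, self-contained illustration of the close-once pattern the issue asks 
for (the real code uses AggregatedLogFormat.LogWriter rather than 
java.io.Writer):
{code}
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class CloseOnce {
  public static void main(String[] args) throws IOException {
    Writer writer = new StringWriter();
    try {
      writer.write("container logs ...");
    } finally {
      writer.close(); // single close in a finally block, no second close later
    }
  }
}
{code}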





[jira] [Created] (YARN-4153) TestAsyncDispatcher failed at branch-2.7

2015-09-13 Thread zhihai xu (JIRA)
zhihai xu created YARN-4153:
---

 Summary: TestAsyncDispatcher failed at branch-2.7
 Key: YARN-4153
 URL: https://issues.apache.org/jira/browse/YARN-4153
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: zhihai xu
Assignee: zhihai xu


TestAsyncDispatcher failed at branch-2.7 because the change from YARN-3999 was 
not merged to branch-2.7 completely.





[jira] [Created] (YARN-4133) Containers to be preempted leaks in FairScheduler preemption logic.

2015-09-08 Thread zhihai xu (JIRA)
zhihai xu created YARN-4133:
---

 Summary: Containers to be preempted leaks in FairScheduler 
preemption logic.
 Key: YARN-4133
 URL: https://issues.apache.org/jira/browse/YARN-4133
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu


Containers to be preempted leak in the FairScheduler preemption logic. This may 
cause missed preemptions, because containers are wrongly removed from 
{{warnedContainers}}. The problem is in {{preemptResources}}:
There are two issues which can cause containers to be wrongly removed from 
{{warnedContainers}}.
First, the condition check is missing the container state 
{{RMContainerState.ACQUIRED}}:
{code}
(container.getState() == RMContainerState.RUNNING ||
  container.getState() == RMContainerState.ALLOCATED)
{code}
Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we shouldn't 
remove the container from {{warnedContainers}}. We should only remove a 
container from {{warnedContainers}} if it is not in state 
{{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}}, or 
{{RMContainerState.ACQUIRED}}.
{code}
  if ((container.getState() == RMContainerState.RUNNING ||
  container.getState() == RMContainerState.ALLOCATED) &&
  isResourceGreaterThanNone(toPreempt)) {
warnOrKillContainer(container);
Resources.subtractFrom(toPreempt, 
container.getContainer().getResource());
  } else {
warnedIter.remove();
  }
{code}
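A hedged sketch of the proposed condition (not the committed patch), built from 
the identifiers in the snippet above:
{code}
boolean stillPreemptable =
    container.getState() == RMContainerState.RUNNING ||
    container.getState() == RMContainerState.ALLOCATED ||
    container.getState() == RMContainerState.ACQUIRED;
if (stillPreemptable) {
  // keep the container in warnedContainers even when toPreempt is exhausted
  if (isResourceGreaterThanNone(toPreempt)) {
    warnOrKillContainer(container);
    Resources.subtractFrom(toPreempt, container.getContainer().getResource());
  }
} else {
  // only drop containers that can no longer be preempted
  warnedIter.remove();
}
{code}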





[jira] [Created] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.

2015-08-30 Thread zhihai xu (JIRA)
zhihai xu created YARN-4095:
---

 Summary: Avoid sharing AllocatorPerContext object in 
LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
 Key: YARN-4095
 URL: https://issues.apache.org/jira/browse/YARN-4095
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu


Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share the 
{{AllocatorPerContext}} object in {{LocalDirAllocator}} for the configuration 
{{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}}s are stored in a static 
TreeMap with the configuration name as the key:
{code}
  private static Map<String, AllocatorPerContext> contexts =
      new TreeMap<String, AllocatorPerContext>();
{code}
{{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a 
{{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the 
same {{Configuration}} object, they will use the same {{AllocatorPerContext}} 
object. Also, {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} 
value in its {{Configuration}} object to exclude full and bad local dirs, while 
{{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its 
{{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} is 
called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, the 
{{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} 
value has changed. This causes some overhead.
{code}
  String newLocalDirs = conf.get(contextCfgItemName);
  if (!newLocalDirs.equals(savedLocalDirs)) {
{code}
So it will be a good improvement to not share the same {{AllocatorPerContext}} 
instance between {{ShuffleHandler}} and {{LocalDirsHandlerService}}.
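One possible shape of the change (an assumption, not the committed patch): give 
ShuffleHandler its own context key so it no longer shares the NM_LOCAL_DIRS 
AllocatorPerContext. The key name below is hypothetical; nmConf is the 
NodeManager's Configuration.
{code}
// copy the NM local dirs under a ShuffleHandler-private key, then allocate
// through that key so LocalDirAllocator creates a separate AllocatorPerContext
Configuration shuffleConf = new Configuration(nmConf);
shuffleConf.setStrings("mapreduce.shuffle.local-dirs",        // hypothetical key
    nmConf.getStrings(YarnConfiguration.NM_LOCAL_DIRS));
LocalDirAllocator shuffleDirAllocator =
    new LocalDirAllocator("mapreduce.shuffle.local-dirs");
{code}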






[jira] [Resolved] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode

2015-08-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-3857.
-
Resolution: Fixed

> Memory leak in ResourceManager with SIMPLE mode
> ---
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.0
>Reporter: mujunchao
>Assignee: mujunchao
>Priority: Critical
>  Labels: patch
> Fix For: 2.7.2
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, 
> YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
>
>  We register the ClientTokenMasterKey to avoid the client holding an invalid 
> ClientToken after the RM restarts. In SIMPLE mode, we register the Pair, but 
> we never remove it from the HashMap; since unregistering only runs in secure 
> mode, the entries leak memory.





[jira] [Created] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.

2015-07-20 Thread zhihai xu (JIRA)
zhihai xu created YARN-3943:
---

 Summary: Use separate threshold configurations for disk-full 
detection and disk-not-full detection.
 Key: YARN-3943
 URL: https://issues.apache.org/jira/browse/YARN-3943
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu


Use separate threshold configurations to check when disks become full and when 
disks become good again. Currently the configurations 
"yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" 
and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are used 
to check both when disks become full and when disks become good. It would be 
better to use two configurations: one used when disks go from not-full to full 
and the other used when disks go from full back to not-full, so we can avoid 
frequent oscillation.
For example, we can set the threshold for disk-full detection higher than the 
one for disk-not-full detection.
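A self-contained sketch of the hysteresis this would enable (threshold names 
and values are illustrative only):
{code}
public class DiskFullHysteresis {
  private final float markFullPercent;  // e.g. 90.0f: disk becomes "full" above this
  private final float markGoodPercent;  // e.g. 85.0f: disk becomes "good" below this
  private boolean full;

  DiskFullHysteresis(float markFullPercent, float markGoodPercent) {
    this.markFullPercent = markFullPercent;
    this.markGoodPercent = markGoodPercent;
  }

  // utilization hovering around a single threshold can no longer flip the state
  boolean update(float usedPercent) {
    if (!full && usedPercent > markFullPercent) {
      full = true;
    } else if (full && usedPercent < markGoodPercent) {
      full = false;
    }
    return full;
  }
}
{code}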





[jira] [Created] (YARN-3925) ContainerLogsUtils#getContainerLogFile fails to read container log files from full disks.

2015-07-14 Thread zhihai xu (JIRA)
zhihai xu created YARN-3925:
---

 Summary: ContainerLogsUtils#getContainerLogFile fails to read 
container log files from full disks.
 Key: YARN-3925
 URL: https://issues.apache.org/jira/browse/YARN-3925
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


ContainerLogsUtils#getContainerLogFile fails to read files from full disks.
{{getContainerLogFile}} depends on {{LocalDirsHandlerService#getLogPathToRead}} 
to get the log file, but {{LocalDirsHandlerService#getLogPathToRead}} calls 
{{logDirsAllocator.getLocalPathToRead}} and {{logDirsAllocator}} uses 
configuration {{YarnConfiguration.NM_LOG_DIRS}}, which will be updated to not 
include full disks in {{LocalDirsHandlerService#checkDirs}}:
{code}
Configuration conf = getConfig();
List<String> localDirs = getLocalDirs();
conf.setStrings(YarnConfiguration.NM_LOCAL_DIRS,
    localDirs.toArray(new String[localDirs.size()]));
List<String> logDirs = getLogDirs();
conf.setStrings(YarnConfiguration.NM_LOG_DIRS,
    logDirs.toArray(new String[logDirs.size()]));
{code}

ContainerLogsUtils#getContainerLogFile is used by NMWebServices#getLogs and 
ContainerLogsPage.ContainersLogsBlock#render to read the log.





[jira] [Created] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after create them.

2015-07-02 Thread zhihai xu (JIRA)
zhihai xu created YARN-3882:
---

 Summary: AggregatedLogFormat should close aclScanner and 
ownerScanner after create them.
 Key: YARN-3882
 URL: https://issues.apache.org/jira/browse/YARN-3882
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


AggregatedLogFormat should close aclScanner and ownerScanner after creating 
them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in 
{{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. 
{{TFile.Reader.Scanner}} implements java.io.Closeable, so we should close them 
after using them.
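A hedged sketch of the fix shape (not the committed patch): since the scanner 
is Closeable, try-with-resources (or a finally block) guarantees it is closed.
{code}
import java.io.IOException;
import org.apache.hadoop.io.file.tfile.TFile;

public class ScannerCloseSketch {
  static void readOwner(TFile.Reader reader) throws IOException {
    try (TFile.Reader.Scanner ownerScanner = reader.createScanner()) {
      // locate and read the APPLICATION_OWNER entry here; the scanner is
      // closed automatically when this block exits
    }
  }
}
{code}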





[jira] [Resolved] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.

2015-06-14 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-3549.
-
Resolution: Duplicate

> use JNI-based FileStatus implementation from 
> io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
> from RawLocalFileSystem in checkLocalDir.
> 
>
> Key: YARN-3549
> URL: https://issues.apache.org/jira/browse/YARN-3549
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>
> Use JNI-based FileStatus implementation from 
> io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
> from RawLocalFileSystem in checkLocalDir.
> As discussed in YARN-3491, the shell-based getPermission implementation runs 
> the shell command "ls -ld" to get the permission, which takes 4 or 5 ms (very slow).
> We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation 
> in RawLocalFileSystem to get rid of the shell-based FileStatus implementation.





[jira] [Created] (YARN-3802) Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected.

2015-06-14 Thread zhihai xu (JIRA)
zhihai xu created YARN-3802:
---

 Summary: Two RMNodes for the same NodeId are used in RM sometimes 
after NM is reconnected.
 Key: YARN-3802
 URL: https://issues.apache.org/jira/browse/YARN-3802
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu


Two RMNodes for the same NodeId are sometimes used in the RM after an NM 
reconnects. The Scheduler and RMContext sometimes use different RMNode 
references for the same NodeId after an NM reconnects, which is not correct. 
The Scheduler and RMContext should always use the same RMNode reference for the 
same NodeId.





[jira] [Created] (YARN-3780) Should use equals when compare Resource in RMNodeImpl#ReconnectNodeTransition

2015-06-07 Thread zhihai xu (JIRA)
zhihai xu created YARN-3780:
---

 Summary: Should use equals when compare Resource in 
RMNodeImpl#ReconnectNodeTransition
 Key: YARN-3780
 URL: https://issues.apache.org/jira/browse/YARN-3780
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Should use equals when comparing Resource in RMNodeImpl#ReconnectNodeTransition 
to avoid an unnecessary NodeResourceUpdateSchedulerEvent.
The current code uses {{!=}} to compare the Resource totalCapability, which 
compares references rather than the actual values in Resource. So we should use 
equals to compare Resources.
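A minimal, runnable illustration of why {{!=}} is the wrong check for Resource 
values:
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public class ResourceCompare {
  public static void main(String[] args) {
    Resource a = Resource.newInstance(4096, 4);
    Resource b = Resource.newInstance(4096, 4);
    System.out.println(a != b);      // true: different references
    System.out.println(a.equals(b)); // true: same value
  }
}
{code}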





[jira] [Created] (YARN-3777) Move all reservation-related tests from TestFairScheduler to TestFairSchedulerReservations.

2015-06-05 Thread zhihai xu (JIRA)
zhihai xu created YARN-3777:
---

 Summary: Move all reservation-related tests from TestFairScheduler 
to TestFairSchedulerReservations.
 Key: YARN-3777
 URL: https://issues.apache.org/jira/browse/YARN-3777
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler, test
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


As discussed in YARN-3655, move all reservation-related tests from 
TestFairScheduler to TestFairSchedulerReservations.





[jira] [Created] (YARN-3776) FairScheduler code refactoring to separate out the code paths for assigning a reserved container and a non-reserved container

2015-06-05 Thread zhihai xu (JIRA)
zhihai xu created YARN-3776:
---

 Summary: FairScheduler code refactoring to separate out the code 
paths for assigning a reserved container and a non-reserved container
 Key: YARN-3776
 URL: https://issues.apache.org/jira/browse/YARN-3776
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu


FairScheduler code refactoring to separate out the code paths for assigning a 
reserved container and a non-reserved container.





[jira] [Created] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.

2015-05-27 Thread zhihai xu (JIRA)
zhihai xu created YARN-3727:
---

 Summary: For better error recovery, check if the directory exists 
before using it for localization.
 Key: YARN-3727
 URL: https://issues.apache.org/jira/browse/YARN-3727
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu


For better error recovery, check if the directory exists before using it for 
localization.
We saw the following localization failure happen due to existing cache 
directories.
{code}
2015-05-11 18:59:59,756 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, null 
}, Rename cannot overwrite non empty destination directory 
//8/yarn/nm/usercache//filecache/21637
2015-05-11 18:59:59,756 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
 Resource 
hdfs:///X/libjars/1234.jar(->//8/yarn/nm/usercache//filecache/21637/1234.jar)
 transitioned from DOWNLOADING to FAILED
{code}

The real cause of this failure may be a disk failure, a LevelDB operation 
failure in {{startResourceLocalization}}/{{finishResourceLocalization}}, or 
something else.

I wonder whether we can add error recovery code to avoid the localization 
failure by not using the existing cache directories for localization.

The exception happened at {{files.rename(dst_work, destDirPath, 
Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after the 
exception the existing cache directory used by the {{LocalizedResource}} will 
be deleted.
{code}
try {
 .
  files.rename(dst_work, destDirPath, Rename.OVERWRITE);
} catch (Exception e) {
  try {
files.delete(destDirPath, true);
  } catch (IOException ignore) {
  }
  throw e;
} finally {
{code}

Since the conflicting local directory will be deleted after the localization 
failure, I think it will be better to check whether the directory exists before 
using it for localization, to avoid the failure in the first place.
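A hedged sketch of the proposed pre-check (not the committed patch); it only 
shows the existence test, not where FSDownload would plug it in:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class LocalizationDirCheck {
  // an existing destination indicates stale state from a previous run, so the
  // caller should pick a fresh directory (or clean this one) before the
  // rename(..., Rename.OVERWRITE) that currently fails
  static boolean isUsable(FileContext files, Path destDirPath) throws IOException {
    return !files.util().exists(destDirPath);
  }
}
{code}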





[jira] [Created] (YARN-3713) Remove duplicate function call storeContainerDiagnostics in ContainerDiagnosticsUpdateTransition

2015-05-26 Thread zhihai xu (JIRA)
zhihai xu created YARN-3713:
---

 Summary: Remove duplicate function call storeContainerDiagnostics 
in ContainerDiagnosticsUpdateTransition
 Key: YARN-3713
 URL: https://issues.apache.org/jira/browse/YARN-3713
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Remove the duplicate function call {{storeContainerDiagnostics}} in 
ContainerDiagnosticsUpdateTransition. {{storeContainerDiagnostics}} is already 
called in ContainerImpl#addDiagnostics.
{code}
  private void addDiagnostics(String... diags) {
for (String s : diags) {
  this.diagnostics.append(s);
}
try {
  stateStore.storeContainerDiagnostics(containerId, diagnostics);
} catch (IOException e) {
  LOG.warn("Unable to update diagnostics in state store for "
  + containerId, e);
}
  }
{code} 
So we don't need to call {{storeContainerDiagnostics}} in 
ContainerDiagnosticsUpdateTransition#transition:
{code}
  container.addDiagnostics(updateEvent.getDiagnosticsUpdate(), "\n");
  try {
container.stateStore.storeContainerDiagnostics(container.containerId,
container.diagnostics);
  } catch (IOException e) {
LOG.warn("Unable to update state store diagnostics for "
+ container.containerId, e);
  }
{code}






[jira] [Created] (YARN-3710) FairScheduler: Should allocate more containers for assign-multiple after assignReservedContainer turns the reservation into an allocation.

2015-05-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-3710:
---

 Summary: FairScheduler: Should allocate more containers for 
assign-multiple after assignReservedContainer turns the reservation into an 
allocation.
 Key: YARN-3710
 URL: https://issues.apache.org/jira/browse/YARN-3710
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: zhihai xu
Assignee: zhihai xu


FairScheduler should allocate more containers for assign-multiple after 
assignReservedContainer turns the reservation into an allocation.
Currently FairScheduler#attemptScheduling does not assign more containers for 
assign-multiple after assignReservedContainer successfully turns the 
reservation into an allocation. We should try to assign more containers on the 
same node if assignMultiple is enabled after assignReservedContainer turns the 
reservation into an allocation.





[jira] [Created] (YARN-3697) FairScheduler: ContinuousSchedulingThread can't be shutdown after stop sometimes.

2015-05-21 Thread zhihai xu (JIRA)
zhihai xu created YARN-3697:
---

 Summary: FairScheduler: ContinuousSchedulingThread can't be 
shutdown after stop sometimes. 
 Key: YARN-3697
 URL: https://issues.apache.org/jira/browse/YARN-3697
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


FairScheduler: the ContinuousSchedulingThread sometimes can't be shut down 
after stop.
The reason is that the InterruptedException is swallowed in 
continuousSchedulingAttempt:
{code}
  try {
if (node != null && Resources.fitsIn(minimumAllocation,
node.getAvailableResource())) {
  attemptScheduling(node);
}
  } catch (Throwable ex) {
LOG.error("Error while attempting scheduling for node " + node +
": " + ex.toString(), ex);
  }
{code}
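For reference, a minimal sketch of one possible mitigation (an assumption, not 
the actual patch; the loop shape and getter name follow the 
continuous-scheduling thread only loosely). The stack trace below shows how the 
interrupt surfaces inside attemptScheduling.
{code}
// have the scheduling loop observe the interrupt flag so a swallowed
// InterruptedException cannot keep the ContinuousSchedulingThread alive
while (!Thread.currentThread().isInterrupted()) {
  try {
    continuousSchedulingAttempt();
    Thread.sleep(getContinuousSchedulingSleepMs());   // assumed getter name
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();               // exit on the next check
  }
}
{code}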

I saw the following exception after stop:
{code}
2015-05-17 23:30:43,065 WARN  [FairSchedulerContinuousScheduling] 
event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher 
thread interrupted
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at 
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285)
2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] 
fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - 
Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 
available= used=: 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.InterruptedException
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.ap

[jira] [Created] (YARN-3667) Fix findbugs warning Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS

2015-05-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-3667:
---

 Summary: Fix findbugs warning Inconsistent synchronization of 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS
 Key: YARN-3667
 URL: https://issues.apache.org/jira/browse/YARN-3667
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor








[jira] [Created] (YARN-3655) FairScheduler: potential deadlock due to maxAMShare limitation and container reservation

2015-05-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-3655:
---

 Summary: FairScheduler: potential deadlock due to maxAMShare 
limitation and container reservation 
 Key: YARN-3655
 URL: https://issues.apache.org/jira/browse/YARN-3655
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu


FairScheduler: potential deadlock due to the maxAMShare limitation and 
container reservation.
If a node is reserved by an application, no other application has any chance to 
get a new container on this node, unless the application which reserved the 
node either assigns a new container on it or releases the reserved container.
The problem is that if an application calls assignReservedContainer and fails 
to get a new container due to the maxAMShare limitation, it blocks all other 
applications from using the nodes it reserves. If all other running 
applications can't release their AM containers because they are blocked by 
these reserved containers, a deadlock can happen.
The following is the code in FSAppAttempt#assignContainer which can cause this 
potential deadlock:
{code}
// Check the AM resource usage for the leaf queue
if (!isAmRunning() && !getUnmanagedAM()) {
  List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests();
  if (ask.isEmpty() || !getQueue().canRunAppAM(
      ask.get(0).getCapability())) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping allocation because maxAMShare limit would " +
          "be exceeded");
    }
    return Resources.none();
  }
}
{code}
To fix this issue, we can unreserve the node if we can't allocate the AM 
container on it due to the maxAMShare limitation and the node is reserved by 
the application.
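A hedged sketch of that idea (helper names are assumptions, not the committed 
patch):
{code}
// when the maxAMShare check rejects the AM container, drop this app's own
// reservation on the node so other applications can use it again
if (ask.isEmpty() || !getQueue().canRunAppAM(ask.get(0).getCapability())) {
  RMContainer reserved = node.getReservedContainer();
  if (reserved != null && reserved.getApplicationAttemptId()
      .equals(getApplicationAttemptId())) {
    unreserve(reserved.getReservedPriority(), node);   // assumed helper
  }
  return Resources.none();
}
{code}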





[jira] [Created] (YARN-3628) The default value for yarn.nodemanager.container-metrics.period-ms should not be -1.

2015-05-11 Thread zhihai xu (JIRA)
zhihai xu created YARN-3628:
---

 Summary: The default value for 
yarn.nodemanager.container-metrics.period-ms should not be -1.
 Key: YARN-3628
 URL: https://issues.apache.org/jira/browse/YARN-3628
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


The default value for yarn.nodemanager.container-metrics.period-ms should not 
be -1.
The current default value for yarn.nodemanager.container-metrics.period-ms is 
-1, while the default value for yarn.nodemanager.container-metrics.enable is 
true. As a result, empty content is shown for an active container's metrics 
until the container finishes: flushOnPeriod is always false if flushPeriodMs is 
-1, so the content is only shown when the container is finished.
{code}
if (finished || flushOnPeriod) {
  registry.snapshot(collector.addRecord(registry.info()), all);
}
{code}






[jira] [Created] (YARN-3604) removeApplication in ZKRMStateStore should also disable watch.

2015-05-08 Thread zhihai xu (JIRA)
zhihai xu created YARN-3604:
---

 Summary: removeApplication in ZKRMStateStore should also disable 
watch.
 Key: YARN-3604
 URL: https://issues.apache.org/jira/browse/YARN-3604
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


removeApplication in ZKRMStateStore should also disable the watch.
The function removeApplication was added in YARN-3410, and YARN-3469 disabled 
the watch for all functions in ZKRMStateStore. So it looks like YARN-3410 
missed the change from YARN-3469 because YARN-3410 added removeApplication 
after YARN-3469 was committed.
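A hedged sketch of the change (the helper shape follows the existing 
ZKRMStateStore read helpers, which take a boolean watch flag; this is not the 
committed patch):
{code}
// pass watch=false, as the other ZKRMStateStore operations do after YARN-3469
if (existsWithRetries(nodeRemovePath, false) != null) {
  opList.add(Op.delete(nodeRemovePath, -1));
}
{code}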






[jira] [Created] (YARN-3602) TestResourceLocalizationService.testPublicResourceInitializesLocalDir fails Intermittently due to IOException from cleanup

2015-05-08 Thread zhihai xu (JIRA)
zhihai xu created YARN-3602:
---

 Summary: 
TestResourceLocalizationService.testPublicResourceInitializesLocalDir fails 
Intermittently due to IOException from cleanup
 Key: YARN-3602
 URL: https://issues.apache.org/jira/browse/YARN-3602
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


TestResourceLocalizationService.testPublicResourceInitializesLocalDir fails 
intermittently due to an IOException from cleanup. The stack trace is the 
following, from the test report at
https://builds.apache.org/job/PreCommit-YARN-Build/7729/testReport/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer/TestResourceLocalizationService/testPublicResourceInitializesLocalDir/
{code}
Error Message
Unable to delete directory 
target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/2/filecache.
Stacktrace
java.io.IOException: Unable to delete directory 
target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/2/filecache.
at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1541)
at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.cleanup(TestResourceLocalizationService.java:187)
{code}
It looks like we can safely ignore the IOException in cleanup, which is called 
after the test.
The IOException may be due to the test machine environment, because 
TestResourceLocalizationService/2/filecache is created by 
ResourceLocalizationService#initializeLocalDir.
testPublicResourceInitializesLocalDir creates 0/filecache, 1/filecache, 
2/filecache and 3/filecache:
{code}
for (int i = 0; i < 4; ++i) {
  localDirs.add(lfs.makeQualified(new Path(basedir, i + "")));
  sDirs[i] = localDirs.get(i).toString();
}
{code}
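A hedged sketch of the cleanup change (not the committed patch; the conf and 
basedir fields follow the test):
{code}
@After
public void cleanup() {
  conf.clear();
  try {
    FileUtils.deleteDirectory(new File(basedir.toString()));
  } catch (IOException e) {
    // ignore: the delete can fail for environment-specific reasons; the
    // directories were created by ResourceLocalizationService during the test
  }
}
{code}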






[jira] [Resolved] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2015-05-01 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-2873.
-
Resolution: Not A Problem

> improve LevelDB error handling for missing files DBException to avoid NM 
> start failure.
> ---
>
> Key: YARN-2873
> URL: https://issues.apache.org/jira/browse/YARN-2873
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> Improve LevelDB error handling for the missing-files DBException to avoid NM 
> start failure.
> We saw the following three LevelDB exceptions; all of them cause NM start 
> failure.
> DBException 1 in ShuffleHandler
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state STARTED; cause: 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
>   at 
> org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
>   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
>   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
>   at 
> org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   ... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager 
> org.apache.hadoop.service.ServiceStateException: 
> org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 
> missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
>  
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
>  
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: 
> Corruption: 1 missing files; e.g.: 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
> at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) 
> at org.

[jira] [Resolved] (YARN-3114) It would be better to consider integer(long) overflow when compare the time in DelegationTokenRenewer.

2015-05-01 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-3114.
-
Resolution: Not A Problem

> It would be better to consider integer(long) overflow when compare the time 
> in DelegationTokenRenewer.
> --
>
> Key: YARN-3114
> URL: https://issues.apache.org/jira/browse/YARN-3114
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: YARN-3114.000.patch
>
>
> It would be better to consider integer (long) overflow when comparing the 
> time in DelegationTokenRenewer.
> When comparing time in DelegationTokenRenewer#DelayedTokenRemovalRunnable to 
> cancel a token, there will be a problem when currentTimeMillis is close to 
> Long.MAX_VALUE.
> The safer way to compare times is to compare the time difference:
> change
> {code}
> if (e.getValue() < System.currentTimeMillis()) {
> {code}
> to 
> {code}
> if (e.getValue() - System.currentTimeMillis() < 0) {
> {code}





[jira] [Created] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.

2015-04-27 Thread zhihai xu (JIRA)
zhihai xu created YARN-3549:
---

 Summary: use JNI-based FileStatus implementation from 
io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from 
RawLocalFileSystem in checkLocalDir.
 Key: YARN-3549
 URL: https://issues.apache.org/jira/browse/YARN-3549
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu


Use JNI-based FileStatus implementation from 
io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from 
RawLocalFileSystem in checkLocalDir.
As discussed in YARN-3491, the shell-based getPermission implementation runs 
the shell command "ls -ld" to get the permission, which takes 4 or 5 ms.
We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation 
in RawLocalFileSystem to get rid of the shell-based FileStatus implementation.





[jira] [Resolved] (YARN-3190) NM can't aggregate logs: token can't be found in cache

2015-04-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-3190.
-
Resolution: Duplicate

This issue is fixed by YARN-2964.

> NM can't aggregate logs: token  can't be found in cache
> ---
>
> Key: YARN-3190
> URL: https://issues.apache.org/jira/browse/YARN-3190
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
> Environment: CDH 5.3.1
> HA HDFS
> Kerberos
>Reporter: Andrejs Dubovskis
>Priority: Minor
>
> In rare cases the node manager cannot aggregate logs, generating the following exception:
> {code}
> 2015-02-12 13:04:03,703 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Starting aggregate log-file for app application_1423661043235_2150 at 
> /tmp/logs/catalyst/logs/application_1423661043235_2150/catdn001.intrum.net_8041.tmp
> 2015-02-12 13:04:03,707 INFO 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting 
> absolute path : 
> /data5/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442
> 2015-02-12 13:04:03,707 INFO 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting 
> absolute path : 
> /data6/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442
> 2015-02-12 13:04:03,707 INFO 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting 
> absolute path : 
> /data7/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442
> 2015-02-12 13:04:03,709 INFO 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting 
> absolute path : 
> /data1/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150
> 2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: 
> PriviledgedActionException as:catalyst (auth:SIMPLE) 
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in 
> cache
> 2015-02-12 13:04:03,709 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in 
> cache
> 2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: 
> PriviledgedActionException as:catalyst (auth:SIMPLE) 
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in 
> cache
> 2015-02-12 13:04:03,712 WARN org.apache.hadoop.security.UserGroupInformation: 
> PriviledgedActionException as:catalyst (auth:SIMPLE) 
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in 
> cache
> 2015-02-12 13:04:03,712 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Cannot create writer for app application_1423661043235_2150. Disabling 
> log-aggregation for this app.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in 
> cache
> at org.apache.hadoop.ipc.Client.call(Client.java:1411)
> at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy19.getServerDefaults(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:259)
> at sun.reflect.GeneratedMethodAccessor114.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy20.getServerDefaults(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:966)
> at org.apache.hadoop.fs.Hdfs.getServerDefaults(Hdfs.java:159)
> at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:543)
> at 

[jira] [Created] (YARN-3516) killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status.

2015-04-20 Thread zhihai xu (JIRA)
zhihai xu created YARN-3516:
---

 Summary: killing ContainerLocalizer action doesn't take effect 
when private localizer receives FETCH_FAILURE status.
 Key: YARN-3516
 URL: https://issues.apache.org/jira/browse/YARN-3516
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu


The action to kill the ContainerLocalizer doesn't take effect when the private 
localizer receives a FETCH_FAILURE status. This is a typo from YARN-3024: with 
YARN-3024, the ContainerLocalizer will be killed only if {{action}} is set to 
{{LocalizerAction.DIE}}, so the value set by calling 
{{response.setLocalizerAction}} is overwritten. This is also a regression from 
the old code.
Also, it makes sense to kill the ContainerLocalizer on FETCH_FAILURE, because 
the container will send a CLEANUP_CONTAINER_RESOURCES event after the 
localization failure.





[jira] [Resolved] (YARN-3496) Add a configuration to disable/enable storing localization state in NMLeveldbStateStore

2015-04-17 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-3496.
-
Resolution: Not A Problem

> Add a configuration to disable/enable storing localization state in 
> NMLeveldbStateStore
> ---
>
> Key: YARN-3496
> URL: https://issues.apache.org/jira/browse/YARN-3496
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>
> Add a configuration to disable/enable storing localization state in 
> NMLeveldbStateStore.
> Storing localization state in LevelDB may have some overhead, which may 
> affect NM performance.
> It would be better to have a configuration to disable/enable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3496) Add a configuration to disable/enable storing localization state in NM StateStore

2015-04-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-3496:
---

 Summary: Add a configuration to disable/enable storing 
localization state in NM StateStore
 Key: YARN-3496
 URL: https://issues.apache.org/jira/browse/YARN-3496
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu


Add a configuration to disable/enable storing localization state in NM 
StateStore.
Storing localization state in LevelDB may have some overhead, which may 
affect NM performance.
It would be better to have a configuration to disable/enable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).

2015-04-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-3491:
---

 Summary: Improve the public resource localization to do both 
FSDownload submission to the thread pool and completed localization handling in 
one thread (PublicLocalizer).
 Key: YARN-3491
 URL: https://issues.apache.org/jira/browse/YARN-3491
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


Improve the public resource localization to do both FSDownload submission to 
the thread pool and completed localization handling in one thread 
(PublicLocalizer).
Currently, FSDownload submission to the thread pool is done in 
PublicLocalizer#addResource, which runs in the Dispatcher thread, while completed 
localization handling is done in PublicLocalizer#run, which runs in the 
PublicLocalizer thread.
Because FSDownload submission to the thread pool in the following code is time 
consuming, the thread pool can't be fully utilized. Instead of doing public 
resource localization in parallel (multithreading), public resource localization 
is effectively serialized most of the time.
{code}
synchronized (pending) {
  pending.put(queue.submit(new FSDownload(lfs, null, conf,
  publicDirDestPath, resource, 
request.getContext().getStatCache())),
  request);
}
{code}

There are also two more benefits from this change (see the sketch below):
1. The Dispatcher thread won't be blocked by the FSDownload submission above; 
the Dispatcher thread handles most of the time-critical events in the NodeManager.
2. No synchronization is needed on the HashMap (pending), 
because pending will only be accessed in the PublicLocalizer thread.
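For illustration only, here is a minimal, self-contained sketch of the threading shape 
described above: one worker thread both submits downloads to the pool and reaps their 
completions, while the dispatcher side only enqueues requests. All class and method 
names below are invented for the sketch; this is not the NodeManager code.
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SingleThreadLocalizer implements Runnable {
  private final BlockingQueue<Callable<String>> requests = new LinkedBlockingQueue<>();
  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final CompletionService<String> completion = new ExecutorCompletionService<>(pool);
  // Accessed only from the worker thread below, so no synchronization is needed.
  private final Map<Future<String>, Callable<String>> pending = new HashMap<>();

  // Called from the dispatcher thread: cheap, never blocks on submission work.
  public void addResource(Callable<String> download) {
    requests.add(download);
  }

  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        // 1) Drain newly enqueued requests and submit them to the pool.
        Callable<String> req;
        while ((req = requests.poll()) != null) {
          pending.put(completion.submit(req), req);
        }
        // 2) Reap any finished download, waiting only briefly so new requests
        //    enqueued in the meantime are picked up promptly.
        Future<String> done = completion.poll(100, TimeUnit.MILLISECONDS);
        if (done != null) {
          pending.remove(done);
          // handle the completed localization here
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdownNow();
    }
  }
}
{code}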



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3465) use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl

2015-04-08 Thread zhihai xu (JIRA)
zhihai xu created YARN-3465:
---

 Summary: use LinkedHashMap to keep the order of 
LocalResourceRequest in ContainerImpl
 Key: YARN-3465
 URL: https://issues.apache.org/jira/browse/YARN-3465
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu


use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread zhihai xu (JIRA)
zhihai xu created YARN-3464:
---

 Summary: Race condition in LocalizerRunner causes container 
localization timeout.
 Key: YARN-3464
 URL: https://issues.apache.org/jira/browse/YARN-3464
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


A race condition in LocalizerRunner causes container localization timeout.
Currently LocalizerRunner will kill the ContainerLocalizer when the pending list 
of LocalizerResourceRequestEvents is empty.
{code}
  } else if (pending.isEmpty()) {
action = LocalizerAction.DIE;
  }
{code}
If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the 
ContainerLocalizer because of an empty pending list, that 
LocalizerResourceRequestEvent will never be handled.
The container will stay in the LOCALIZING state until it is killed by the 
AM due to TASK_TIMEOUT.
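For illustration, a minimal, self-contained sketch of the check-then-act pattern that 
avoids this kind of race: the emptiness check and the DIE decision are made atomically 
under the same lock that guards additions. The class and method names are hypothetical; 
this is not the actual LocalizerRunner code.
{code}
import java.util.ArrayDeque;
import java.util.Queue;

public class CheckThenActRace {
  private final Queue<String> pending = new ArrayDeque<>();
  private boolean dead = false;

  // Caller thread: returns false if the localizer already decided to die, so the
  // caller knows it must start a new one instead of silently losing the event.
  public synchronized boolean addRequest(String event) {
    if (dead) {
      return false;
    }
    pending.add(event);
    return true;
  }

  // Localizer thread: the emptiness check and the DIE decision happen atomically
  // with respect to addRequest(), so no event can slip in between the two.
  public synchronized String nextActionOrDie() {
    if (pending.isEmpty()) {
      dead = true;
      return null;               // corresponds to LocalizerAction.DIE
    }
    return pending.poll();       // corresponds to handling the next request
  }
}
{code}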



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-04-03 Thread zhihai xu (JIRA)
zhihai xu created YARN-3446:
---

 Summary: FairScheduler HeadRoom calculation should exclude nodes 
in the blacklist.
 Key: YARN-3446
 URL: https://issues.apache.org/jira/browse/YARN-3446
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: zhihai xu
Assignee: zhihai xu


FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
MRAppMaster does not preempt the reducers because the headroom used for reducer 
preemption includes blacklisted nodes. This makes jobs hang 
forever (the ResourceManager does not assign any new containers on blacklisted nodes, 
but the availableResource the AM gets from the RM includes the available resources of 
blacklisted nodes).
This issue is similar to YARN-1680, which is for the Capacity Scheduler.
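A back-of-the-envelope illustration, with made-up numbers, of why the blacklisted node 
has to be excluded from the headroom the AM sees:
{code}
public class HeadroomExample {
  public static void main(String[] args) {
    long clusterAvailableMb = 8_192;     // cluster free space as reported today
    long blacklistedAvailableMb = 8_192; // all of it sits on a blacklisted node

    long reportedHeadroom = clusterAvailableMb;                         // today: 8192 MB
    long usableHeadroom = clusterAvailableMb - blacklistedAvailableMb;  // proposed: 0 MB

    System.out.println("reported=" + reportedHeadroom + " usable=" + usableHeadroom);
    // With 8192 MB reported, the MR AM believes there is room to reschedule work
    // without preempting a reducer, so it waits forever; with 0 MB it would
    // preempt a reducer and the job could make progress.
  }
}
{code}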



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_000001

2015-03-31 Thread zhihai xu (JIRA)
zhihai xu created YARN-3429:
---

 Summary: TestAMRMTokens.testTokenExpiry fails Intermittently with 
error message:Invalid AMRMToken from appattempt_1427804754787_0001_01
 Key: YARN-3429
 URL: https://issues.apache.org/jira/browse/YARN-3429
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: zhihai xu
Assignee: zhihai xu


TestAMRMTokens.testTokenExpiry fails intermittently with the error message: Invalid 
AMRMToken from appattempt_1427804754787_0001_01
The error log is at 
https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3395) Handle the user name correctly when submit application and use user name as default queue name.

2015-03-24 Thread zhihai xu (JIRA)
zhihai xu created YARN-3395:
---

 Summary: Handle the user name correctly when submit application 
and use user name as default queue name.
 Key: YARN-3395
 URL: https://issues.apache.org/jira/browse/YARN-3395
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: zhihai xu
Assignee: zhihai xu


Handle the user name correctly when submitting an application and when using the user 
name as the default queue name.
We should reject an application with an empty or whitespace-only user name, 
because such a user name doesn't make sense.
We should remove the leading and trailing whitespace of the user name when we 
use it as the default queue name; otherwise it will be rejected with an 
InvalidQueueNameException from QueueManager. I think this change makes sense, 
because it keeps the derived name compatible with the queue name convention, and we 
already do a similar thing for '.' in user names. A small illustrative sketch follows.
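The sketch below shows the two rules described above; the helper name is hypothetical 
and this is not the scheduler patch.
{code}
public class UserNameQueue {
  static String defaultQueueFor(String userName) {
    if (userName == null || userName.trim().isEmpty()) {
      // Rule 1: reject empty or whitespace-only user names outright.
      throw new IllegalArgumentException(
          "user name must not be empty or whitespace-only");
    }
    // Rule 2: trim before deriving the default queue name, so it follows the
    // queue-name convention; '.' in user names is assumed to be handled separately.
    return userName.trim();
  }

  public static void main(String[] args) {
    System.out.println("[" + defaultQueueFor("  alice ") + "]"); // prints [alice]
  }
}
{code}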




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).

2015-03-22 Thread zhihai xu (JIRA)
zhihai xu created YARN-3385:
---

 Summary: Race condition: KeeperException$NoNodeException will 
cause RM shutdown during ZK node deletion(Op.delete).
 Key: YARN-3385
 URL: https://issues.apache.org/jira/browse/YARN-3385
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


Race condition: KeeperException$NoNodeException will cause RM shutdown during 
ZK node deletion (Op.delete).
The race condition is similar to YARN-2721 and YARN-3023.
Since the race condition exists for ZK node creation, it should also exist for 
ZK node deletion.
We saw this issue with the following stack trace:
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container.

2015-03-17 Thread zhihai xu (JIRA)
zhihai xu created YARN-3363:
---

 Summary: add localization and container launch time to 
ContainerMetrics at NM to show these timing information for each active 
container.
 Key: YARN-3363
 URL: https://issues.apache.org/jira/browse/YARN-3363
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu


Add localization and container launch time to ContainerMetrics at the NM to show 
this timing information for each active container.
Currently ContainerMetrics has the container's actual memory usage (YARN-2984), 
actual CPU usage (YARN-3122), resource and pid (YARN-3022). It would be better to 
also have localization and container launch time in ContainerMetrics for each active 
container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3355) findbugs warning:Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocConf

2015-03-16 Thread zhihai xu (JIRA)
zhihai xu created YARN-3355:
---

 Summary: findbugs warning:Inconsistent synchronization of 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocConf
 Key: YARN-3355
 URL: https://issues.apache.org/jira/browse/YARN-3355
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: zhihai xu
Assignee: zhihai xu


findbugs warning:Inconsistent synchronization of 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocConf
The findbugs warning found two unsynchronized accesses:
1. FairScheduler.getPlanQueues.
It looks like we should add a lock in FairScheduler.getPlanQueues, 
because getPlanQueues will be called by AbstractReservationSystem.reinitialize.
2. FairScheduler.getAllocationConfiguration, which looks OK without a lock.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3349) treat all exceptions as failure in testFSRMStateStoreClientRetry

2015-03-15 Thread zhihai xu (JIRA)
zhihai xu created YARN-3349:
---

 Summary: treat all exceptions as failure in 
testFSRMStateStoreClientRetry
 Key: YARN-3349
 URL: https://issues.apache.org/jira/browse/YARN-3349
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Treat all exceptions as failures in testFSRMStateStoreClientRetry.
Currently the exception "could only be replicated to 0 nodes instead of 
minReplication (=1)" is not treated as a failure in testFSRMStateStoreClientRetry.
{code}
// TODO 0 datanode exception will not be retried by dfs client, fix
// that separately.
if (!e.getMessage().contains("could only be replicated" +
" to 0 nodes instead of minReplication (=1)")) {
assertionFailedInThread.set(true);
 }
{code}
With YARN-2820 (retry in FileSystemRMStateStore), we no longer need to treat this 
exception specially. We can remove the check and treat all exceptions as 
failures in testFSRMStateStoreClientRetry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3263) ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after credentials.readTokenStorageStream

2015-03-13 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-3263.
-
Resolution: Not a Problem

This is not an issue.
tokens.rewind() is called before credentials.readTokenStorageStream(buf), 
which has the same effect as rewinding after readTokenStorageStream.
Also, no place other than parseCredentials accesses the tokens.
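A standalone ByteBuffer illustration of this point, independent of the Hadoop classes 
involved: rewinding before each read leaves the buffer usable for the next reader, just 
as rewinding after the previous read would.
{code}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RewindDemo {
  public static void main(String[] args) {
    ByteBuffer tokens = ByteBuffer.wrap("abc".getBytes(StandardCharsets.UTF_8));

    tokens.rewind();                 // first reader: rewind, then consume everything
    while (tokens.hasRemaining()) {
      tokens.get();
    }
    // position is now at the limit; a read without rewinding would see nothing

    tokens.rewind();                 // second reader does the same and still reads "abc"
    byte[] out = new byte[tokens.remaining()];
    tokens.get(out);
    System.out.println(new String(out, StandardCharsets.UTF_8)); // prints abc
  }
}
{code}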

> ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after 
> credentials.readTokenStorageStream
> --
>
> Key: YARN-3263
> URL: https://issues.apache.org/jira/browse/YARN-3263
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: zhihai xu
>Assignee: zhihai xu
>
> ContainerManagerImpl#parseCredentials doesn't rewind the ByteBuffer after 
> credentials.readTokenStorageStream. So the next time we access the tokens, we 
> will get an EOFException.
> The following is the code for parseCredentials in ContainerManagerImpl.
> {code}
>   private Credentials parseCredentials(ContainerLaunchContext launchContext)
>   throws IOException {
> Credentials credentials = new Credentials();
> //  Parse credentials
> ByteBuffer tokens = launchContext.getTokens();
> if (tokens != null) {
>   DataInputByteBuffer buf = new DataInputByteBuffer();
>   tokens.rewind();
>   buf.reset(tokens);
>   credentials.readTokenStorageStream(buf);
>   if (LOG.isDebugEnabled()) {
> for (Token<? extends TokenIdentifier> tk : 
> credentials.getAllTokens()) {
>   LOG.debug(tk.getService() + " = " + tk.toString());
> }
>   }
> }
> //  End of parsing credentials
> return credentials;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3341) Fix findbugs warning:BC_UNCONFIRMED_CAST at FSSchedulerNode.reserveResource

2015-03-12 Thread zhihai xu (JIRA)
zhihai xu created YARN-3341:
---

 Summary: Fix findbugs warning:BC_UNCONFIRMED_CAST at 
FSSchedulerNode.reserveResource
 Key: YARN-3341
 URL: https://issues.apache.org/jira/browse/YARN-3341
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Fix findbugs warning:BC_UNCONFIRMED_CAST at FSSchedulerNode.reserveResource
The warning message is
{code}
Unchecked/unconfirmed cast from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt
 to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt 
in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode.reserveResource(SchedulerApplicationAttempt,
 Priority, RMContainer)
{code}
The code which cause the warning is
{code}
this.reservedAppSchedulable = (FSAppAttempt) application;
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-11 Thread zhihai xu (JIRA)
zhihai xu created YARN-3336:
---

 Summary: FileSystem memory leak in DelegationTokenRenewer
 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


FileSystem memory leak in DelegationTokenRenewer.
Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
FileSystem entry is added to FileSystem#CACHE and will never be garbage 
collected.
This is the implementation of obtainSystemTokensForUser:
{code}
  protected Token<?>[] obtainSystemTokensForUser(String user,
      final Credentials credentials) throws IOException, InterruptedException {
    // Get new hdfs tokens on behalf of this user
    UserGroupInformation proxyUser =
        UserGroupInformation.createProxyUser(user,
          UserGroupInformation.getLoginUser());
    Token<?>[] newTokens =
        proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
          @Override
          public Token<?>[] run() throws Exception {
            return FileSystem.get(getConfig()).addDelegationTokens(
              UserGroupInformation.getLoginUser().getUserName(), credentials);
          }
        });
    return newTokens;
  }
{code}

The memory leak happens when FileSystem.get(getConfig()) is called with a new 
proxy user, because createProxyUser always creates a new Subject.
{code}
  public static UserGroupInformation createProxyUser(String user,
      UserGroupInformation realUser) {
    if (user == null || user.isEmpty()) {
      throw new IllegalArgumentException("Null user");
    }
    if (realUser == null) {
      throw new IllegalArgumentException("Null real user");
    }
    Subject subject = new Subject();
    Set<Principal> principals = subject.getPrincipals();
    principals.add(new User(user));
    principals.add(new RealUser(realUser));
    UserGroupInformation result = new UserGroupInformation(subject);
    result.setAuthenticationMethod(AuthenticationMethod.PROXY);
    return result;
  }
{code}

FileSystem#Cache#Key.equals will compare the ugi
{code}
  Key(URI uri, Configuration conf, long unique) throws IOException {
scheme = uri.getScheme()==null?"":uri.getScheme().toLowerCase();
authority = 
uri.getAuthority()==null?"":uri.getAuthority().toLowerCase();
this.unique = unique;
this.ugi = UserGroupInformation.getCurrentUser();
  }
  public boolean equals(Object obj) {
if (obj == this) {
  return true;
}
if (obj != null && obj instanceof Key) {
  Key that = (Key)obj;
  return isEqual(this.scheme, that.scheme)
 && isEqual(this.authority, that.authority)
 && isEqual(this.ugi, that.ugi)
 && (this.unique == that.unique);
}
return false;
  }
{code}

UserGroupInformation.equals will compare subject by reference.
{code}
  public boolean equals(Object o) {
if (o == this) {
  return true;
} else if (o == null || getClass() != o.getClass()) {
  return false;
} else {
  return subject == ((UserGroupInformation) o).subject;
}
  }
{code}

So in this case, every time createProxyUser and FileSystem.get(getConfig()) are 
called, a new FileSystem will be created and a new entry will be added to 
FileSystem.CACHE.
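For illustration, here is a hedged sketch of one possible mitigation, assuming the tokens 
are obtained the same way as above: evict the FileSystem instances cached under the 
throw-away proxy UGI once the tokens have been fetched. This is not necessarily the 
patch that fixes this JIRA.
{code}
import java.io.IOException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class ObtainTokensSketch {
  static Token<?>[] obtainSystemTokensForUser(String user, final Credentials credentials,
      final Configuration conf) throws IOException, InterruptedException {
    UserGroupInformation proxyUser =
        UserGroupInformation.createProxyUser(user, UserGroupInformation.getLoginUser());
    try {
      return proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
        @Override
        public Token<?>[] run() throws Exception {
          return FileSystem.get(conf).addDelegationTokens(
              UserGroupInformation.getLoginUser().getUserName(), credentials);
        }
      });
    } finally {
      // Release every FileSystem cached under this one-off proxy UGI, so the
      // CACHE entry keyed on its fresh Subject does not accumulate.
      FileSystem.closeAllForUGI(proxyUser);
    }
  }
}
{code}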



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3263) ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after credentials.readTokenStorageStream

2015-02-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-3263:
---

 Summary: ContainerManagerImpl#parseCredentials don't rewind the 
ByteBuffer after credentials.readTokenStorageStream
 Key: YARN-3263
 URL: https://issues.apache.org/jira/browse/YARN-3263
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


ContainerManagerImpl#parseCredentials doesn't rewind the ByteBuffer after 
credentials.readTokenStorageStream. So the next time we access the tokens, we 
will get an EOFException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3247) TestQueueMappings failure for FairScheduler

2015-02-23 Thread zhihai xu (JIRA)
zhihai xu created YARN-3247:
---

 Summary: TestQueueMappings failure for FairScheduler
 Key: YARN-3247
 URL: https://issues.apache.org/jira/browse/YARN-3247
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial


TestQueueMappings is only supported by CapacityScheduler.
We should configure CapacityScheduler for this test. Otherwise if the default 
scheduler is set to FairScheduler, the test will fail with the following 
message:
{code}
Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.392 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings
testQueueMapping(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings)
  Time elapsed: 2.202 sec  <<< ERROR!
java.lang.ClassCastException: 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be 
cast to 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:118)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1266)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1319)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:558)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:989)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:255)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:108)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.<init>(MockRM.java:103)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings.testQueueMapping(TestQueueMappings.java:143)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.

2015-02-21 Thread zhihai xu (JIRA)
zhihai xu created YARN-3242:
---

 Summary: Old ZK client session watcher event messed up new ZK 
client session due to ZooKeeper asynchronously closing client session.
 Key: YARN-3242
 URL: https://issues.apache.org/jira/browse/YARN-3242
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical


A watcher event from an old ZK client session can mess up the new ZK client session, 
because ZooKeeper closes client sessions asynchronously.
A watcher event from the old ZK client session can still be delivered to 
ZKRMStateStore after the old ZK client session is closed.
This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper 
session.
We only have one ZKRMStateStore, but we can have multiple ZK client sessions.
Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher 
event is from the current session, so a watcher event from an old ZK client session 
that has just been closed will still be processed.
For example, if a Disconnected event is received from the old session after the new 
session is connected, the zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because 
the new session is already in the SyncConnected state and won't send a SyncConnected 
event until it is disconnected and connected again.
After that, all ZKRMStateStore operations fail with the IOException "Wait 
for ZKClient creation timed out" until the RM shuts down.
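For illustration, a hedged sketch of one way to guard against stale-session events: bind 
each watcher to the ZK client it was registered for, and let the store drop events whose 
source is no longer the active client. Names are illustrative; this is not the committed 
ZKRMStateStore patch.
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionAwareWatcher implements Watcher {
  // Set right after the ZooKeeper client is constructed, since the ZooKeeper
  // constructor needs a Watcher before the client object exists.
  private volatile ZooKeeper owner;
  private final ZkStateStore store;

  SessionAwareWatcher(ZkStateStore store) {
    this.store = store;
  }

  void setOwner(ZooKeeper owner) {
    this.owner = owner;
  }

  @Override
  public void process(WatchedEvent event) {
    // Forward the event together with the client it belongs to.
    store.processWatchEvent(owner, event);
  }

  interface ZkStateStore {
    // The store compares "source" with its current client and ignores the event
    // if they differ, so a Disconnected from an old, already-closed session can
    // no longer null out the active client.
    void processWatchEvent(ZooKeeper source, WatchedEvent event);
  }
}
{code}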



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3241) Leading space, trailing space and empty sub queue name may cause MetricsException for fair scheduler

2015-02-20 Thread zhihai xu (JIRA)
zhihai xu created YARN-3241:
---

 Summary: Leading space, trailing space and empty sub queue name 
may cause MetricsException for fair scheduler
 Key: YARN-3241
 URL: https://issues.apache.org/jira/browse/YARN-3241
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: zhihai xu
Assignee: zhihai xu


A leading space, trailing space or empty sub queue name may cause a 
MetricsException (Metrics source XXX already exists!) when adding an application to 
FairScheduler.
The reason is that QueueMetrics parses the queue name differently from the 
QueueManager.
QueueMetrics uses Q_SPLITTER to parse the queue name: it removes leading 
and trailing spaces in the sub queue names and also removes empty sub queue 
names.
{code}
  static final Splitter Q_SPLITTER =
  Splitter.on('.').omitEmptyStrings().trimResults(); 
{code}
But QueueManager won't remove leading spaces, trailing spaces or empty sub queue 
names.
This causes FSQueue and FSQueueMetrics to get out of sync: 
QueueManager will consider two queue names different, so it will try to create 
a new queue, 
but FSQueueMetrics will consider the two queue names the same, which triggers the 
"Metrics source XXX already exists!" MetricsException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.

2015-02-19 Thread zhihai xu (JIRA)
zhihai xu created YARN-3236:
---

 Summary: cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
 Key: YARN-3236
 URL: https://issues.apache.org/jira/browse/YARN-3236
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial


Clean up RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the 
code which used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We had better 
remove it to avoid confusion, since it was only introduced for a very short time 
and no one uses it now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3205) FileSystemRMStateStore should disable FileSystem Cache to avoid get a Filesystem with an old configuration.

2015-02-16 Thread zhihai xu (JIRA)
zhihai xu created YARN-3205:
---

 Summary: FileSystemRMStateStore should disable FileSystem Cache to 
avoid get a Filesystem with an old configuration.
 Key: YARN-3205
 URL: https://issues.apache.org/jira/browse/YARN-3205
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu


FileSystemRMStateStore should disable the FileSystem cache to avoid getting a 
FileSystem with an old configuration. The old configuration may not have all 
the customized DFS client configurations intended for FileSystemRMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3114) It would be better to consider integer(long) overflow when compare the time in DelegationTokenRenewer.

2015-01-29 Thread zhihai xu (JIRA)
zhihai xu created YARN-3114:
---

 Summary: It would be better to consider integer(long) overflow 
when compare the time in DelegationTokenRenewer.
 Key: YARN-3114
 URL: https://issues.apache.org/jira/browse/YARN-3114
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


It would be better to consider integer (long) overflow when comparing times in 
DelegationTokenRenewer.
When comparing times in DelegationTokenRenewer#DelayedTokenRemovalRunnable to 
cancel a token, there will be a problem when currentTimeMillis is close to 
Long.MAX_VALUE.
The safer way is to compare the time difference instead:
change
{code}
if (e.getValue() < System.currentTimeMillis()) {
{code}
to 
{code}
if (e.getValue() - System.currentTimeMillis() < 0) {
{code}
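A self-contained illustration, independent of DelegationTokenRenewer, of why the 
difference form stays correct near Long.MAX_VALUE while the direct comparison does not:
{code}
public class OverflowSafeCompare {
  public static void main(String[] args) {
    long now = Long.MAX_VALUE - 10;     // "current" time, near overflow
    long expiry = Long.MAX_VALUE - 5;   // cancel time, 5 ms later
    System.out.println(expiry < now);   // false: correctly "not yet expired"

    // Once the later value wraps past Long.MAX_VALUE, the direct form breaks
    // while the difference form keeps the right answer, because the
    // subtraction wraps back to the true distance (3 ms).
    long wrappedNow = Long.MAX_VALUE;
    long laterExpiry = wrappedNow + 3;                   // overflows to a negative value
    System.out.println(laterExpiry < wrappedNow);        // true  -> wrong: looks expired
    System.out.println(laterExpiry - wrappedNow < 0);    // false -> right: not yet expired
  }
}
{code}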




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3106) The message in IllegalArgumentException gave wrong information in NMTokenSecretManagerInRM.java and RMContainerTokenSecretManager.java

2015-01-28 Thread zhihai xu (JIRA)
zhihai xu created YARN-3106:
---

 Summary: The message in IllegalArgumentException gave wrong 
information in NMTokenSecretManagerInRM.java and 
RMContainerTokenSecretManager.java
 Key: YARN-3106
 URL: https://issues.apache.org/jira/browse/YARN-3106
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


The message in IllegalArgumentException gave wrong information in 
NMTokenSecretManagerInRM.java and RMContainerTokenSecretManager.java.
We saw this error message:
{code}
Error starting ResourceManager
java.lang.IllegalArgumentException: 
yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs should be more 
than 2 X yarn.nm.liveness-monitor.expiry-interval-ms
{code}
After checking the source code, I found this error message misleading.
The following is the code from NMTokenSecretManagerInRM.java:
{code}
rollingInterval = this.conf.getLong(
YarnConfiguration.RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS,
YarnConfiguration.DEFAULT_RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS)  * 1000;
this.activationDelay =(long) 
(conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS) * 1.5);
LOG.info("NMTokenKeyRollingInterval: " + this.rollingInterval
+ "ms and NMTokenKeyActivationDelay: " + this.activationDelay
+ "ms");
if (rollingInterval <= activationDelay * 2) {
  throw new IllegalArgumentException( 
YarnConfiguration.RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS  + " 
should be more than 2 X "
  + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS);
}
{code}
It should be 3 X, not 2 X: activationDelay is 1.5 times the expiry interval, so the 
check rollingInterval <= activationDelay * 2 actually requires the rolling interval to 
be more than 3 times the expiry interval. A small arithmetic check follows the next 
code block.
The same error also happens in RMContainerTokenSecretManager.java:
{code}
   this.rollingInterval = conf.getLong(   
YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS,   
YarnConfiguration.DEFAULT_RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS) 
* 1000;
this.activationDelay =
(long) (conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS) * 1.5);
LOG.info("ContainerTokenKeyRollingInterval: " + this.rollingInterval
+ "ms and ContainerTokenKeyActivationDelay: " + this.activationDelay
+ "ms");
if (rollingInterval <= activationDelay * 2) {
  throw new IllegalArgumentException(  
YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS   
   + " should be more than 2 X "
+YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS);
}
{code}
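A standalone arithmetic check of the claim above, using an assumed expiry interval of 
600,000 ms (the value and class name are illustrative only):
{code}
public class RollingIntervalCheck {
  public static void main(String[] args) {
    long expiryMs = 600_000L;                       // assumed NM expiry interval
    long activationDelay = (long) (expiryMs * 1.5); // 900,000 ms, as in the code above
    // The guard rejects rollingInterval <= 2 * activationDelay = 3 * expiryMs,
    // so the message should say "more than 3 X" the expiry interval.
    System.out.println(2 * activationDelay == 3 * expiryMs); // prints true
  }
}
{code}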



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3079) Scheduler should also update maximumAllocation when updateNodeResource.

2015-01-21 Thread zhihai xu (JIRA)
zhihai xu created YARN-3079:
---

 Summary: Scheduler should also update maximumAllocation when 
updateNodeResource.
 Key: YARN-3079
 URL: https://issues.apache.org/jira/browse/YARN-3079
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


The scheduler should also update maximumAllocation in updateNodeResource. 
Otherwise, even if the node resource is changed by AdminService#updateNodeResource, 
maximumAllocation won't be updated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3056) add verification for containerLaunchDuration in TestNodeManagerMetrics.

2015-01-13 Thread zhihai xu (JIRA)
zhihai xu created YARN-3056:
---

 Summary: add verification for containerLaunchDuration in 
TestNodeManagerMetrics.
 Key: YARN-3056
 URL: https://issues.apache.org/jira/browse/YARN-3056
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.6.0
Reporter: zhihai xu
Priority: Trivial


add verification for containerLaunchDuration in TestNodeManagerMetrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2679) Add metric for container launch duration

2015-01-13 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-2679.
-
Resolution: Fixed

> Add metric for container launch duration
> 
>
> Key: YARN-2679
> URL: https://issues.apache.org/jira/browse/YARN-2679
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>  Labels: metrics, supportability
> Fix For: 2.7.0
>
> Attachments: YARN-2679.000.patch, YARN-2679.001.patch, 
> YARN-2679.002.patch
>
>
> Add a metric in NodeManagerMetrics for the time to prepare and launch a container.
> The prepare time is the duration between sending the 
> ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving the 
> ContainerEventType.CONTAINER_LAUNCHED event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash

2015-01-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-3023.
-
Resolution: Duplicate

> Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM 
> crash 
> -
>
> Key: YARN-3023
> URL: https://issues.apache.org/jira/browse/YARN-3023
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>
> Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes an RM 
> crash.
> The sequence of the race condition is the following:
> 1. The RM stores the attempt state to ZK by calling createWithRetries:
> {code}
> 2015-01-06 12:37:35,343 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Storing attempt: AppId: application_1418914202950_42363 AttemptId: 
> appattempt_1418914202950_42363_01 MasterContainer: Container: 
> [ContainerId: container_1418914202950_42363_01_01,
> {code}
> 2. Unluckily, a ConnectionLoss for the ZK session happened at the same time as 
> the RM stored the attempt state to ZK.
> The ZooKeeper server created the node and stored the data successfully, but 
> due to the ConnectionLoss the RM didn't know the operation (createWithRetries) 
> had succeeded.
> {code}
> 2015-01-06 12:37:36,102 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> {code}
> 3. The RM retried storing the attempt state to ZK after one second:
> {code}
> 2015-01-06 12:37:36,104 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Retrying operation on ZK. Retry no. 1
> {code}
> 4. During the one-second interval, the ZK session was reconnected:
> {code}
> 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established initiating session
> 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated 
> timeout = 1
> {code}
> 5. Because the node was already created successfully at ZooKeeper in the first 
> try (runWithCheck), 
> the second try fails with a NodeExists KeeperException:
> {code}
> 2015-01-06 12:37:37,116 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists
> 2015-01-06 12:37:37,118 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
> out ZK retries. Giving up!
> {code}
> 6. This NodeExists KeeperException causes storing the AppAttempt to fail in 
> RMStateStore:
> {code}
> 2015-01-06 12:37:37,118 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
> storing appAttempt: appattempt_1418914202950_42363_01
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists
> {code}
> 7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to 
> the ResourceManager:
> {code}
>   protected void notifyStoreOperationFailed(Exception failureCause) {
> RMFatalEventType type;
> if (failureCause instanceof StoreFencedException) {
>   type = RMFatalEventType.STATE_STORE_FENCED;
> } else {
>   type = RMFatalEventType.STATE_STORE_OP_FAILED;
> }
> rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, 
> failureCause));
>   }
> {code}
> 8. ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED 
> RMFatalEvent:
> {code}
> 2015-01-06 12:37:37,128 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists
> 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash

2015-01-08 Thread zhihai xu (JIRA)
zhihai xu created YARN-3023:
---

 Summary: Race condition in ZKRMStateStore#createWithRetries from 
ZooKeeper cause RM crash 
 Key: YARN-3023
 URL: https://issues.apache.org/jira/browse/YARN-3023
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu


Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes an RM 
crash.

The sequence of the race condition is the following:
1. The RM stores the attempt state to ZK by calling createWithRetries:
{code}
2015-01-06 12:37:35,343 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Storing attempt: AppId: application_1418914202950_42363 AttemptId: 
appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: 
container_1418914202950_42363_01_01,
{code}

2. Unluckily, a ConnectionLoss for the ZK session happened at the same time as the RM 
stored the attempt state to ZK.
The ZooKeeper server created the node and stored the data successfully, but due 
to the ConnectionLoss the RM didn't know the operation (createWithRetries) had 
succeeded.
{code}
2015-01-06 12:37:36,102 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
{code}

3. The RM retried storing the attempt state to ZK after one second:
{code}
2015-01-06 12:37:36,104 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 1
{code}

4. During the one-second interval, the ZK session was reconnected:
{code}
2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established initiating session
2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated 
timeout = 1
{code}

5. Because the node was already created successfully at ZooKeeper in the first 
try (runWithCheck), 
the second try fails with a NodeExists KeeperException:
{code}
2015-01-06 12:37:37,116 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
2015-01-06 12:37:37,118 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
out ZK retries. Giving up!
{code}

6. This NodeExists KeeperException causes storing the AppAttempt to fail in 
RMStateStore:
{code}
2015-01-06 12:37:37,118 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
storing appAttempt: appattempt_1418914202950_42363_01
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
{code}

7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to 
the ResourceManager:
{code}
  protected void notifyStoreOperationFailed(Exception failureCause) {
RMFatalEventType type;
if (failureCause instanceof StoreFencedException) {
  type = RMFatalEventType.STATE_STORE_FENCED;
} else {
  type = RMFatalEventType.STATE_STORE_OP_FAILED;
}
rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
  }
{code}

8. ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED 
RMFatalEvent:
{code}
2015-01-06 12:37:37,128 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}
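For illustration, a hedged sketch of the usual way this kind of race is tolerated: when a 
create is retried after a connection loss, a NodeExists answer may simply mean the first 
attempt already succeeded, so it is treated as success rather than a fatal error. The 
helper below is illustrative only, not the ZKRMStateStore patch.
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CreateWithRetrySketch {
  static void createIdempotent(ZooKeeper zk, String path, byte[] data, int maxRetries)
      throws KeeperException, InterruptedException {
    for (int retry = 0; ; retry++) {
      try {
        zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        return;
      } catch (KeeperException.NodeExistsException e) {
        // Our earlier, "lost" attempt most likely created the node: nothing to do.
        return;
      } catch (KeeperException.ConnectionLossException e) {
        if (retry >= maxRetries) {
          throw e;
        }
        Thread.sleep(1000L); // simple fixed back-off before retrying
      }
    }
  }
}
{code}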




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.

2014-11-17 Thread zhihai xu (JIRA)
zhihai xu created YARN-2873:
---

 Summary: improve LevelDB error handling for missing files 
DBException to avoid NM start failure.
 Key: YARN-2873
 URL: https://issues.apache.org/jira/browse/YARN-2873
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


Improve LevelDB error handling for the "missing files" DBException to avoid NM start 
failure.
We saw the following three LevelDB exceptions; all of them cause NM start 
failure.
DBException 1 in ShuffleHandler
{code}
INFO org.apache.hadoop.service.AbstractService: Service 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
failed in state STARTED; cause: 
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing 
files; e.g.: 
/tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing 
files; e.g.: 
/tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 
1 missing files; e.g.: 
/tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst
at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at 
org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
at 
org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
at 
org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
... 10 more
{code}

DBException 2 in NMLeveldbStateStoreService:
{code}
Error starting NodeManager 
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing 
files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
 
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) 
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
 
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
 
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
 
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
 
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 
1 missing files; e.g.: 
/tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst 
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) 
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) 
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) 
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842)
 
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195)
 
at org.apache.hadoop.service.AbstractService.init(

[jira] [Created] (YARN-2831) NM should kill and cleanup the leaked containers.

2014-11-07 Thread zhihai xu (JIRA)
zhihai xu created YARN-2831:
---

 Summary: NM should kill and cleanup the leaked containers.
 Key: YARN-2831
 URL: https://issues.apache.org/jira/browse/YARN-2831
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu


NM should kill and clean up leaked containers. As discussed in YARN-2816, we 
should implement a function that kills and cleans up a leaked container: 
look for the pid file, try to kill the process if the file is found, and return a 
recovered container status of killed/lost or something similar.
This function can then be called whenever a leaked container is found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2820) Improve FileSystemRMStateStore update failure exception handling to not shutdown RM.

2014-11-06 Thread zhihai xu (JIRA)
zhihai xu created YARN-2820:
---

 Summary: Improve FileSystemRMStateStore update failure exception 
handling to not  shutdown RM.
 Key: YARN-2820
 URL: https://issues.apache.org/jira/browse/YARN-2820
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw 
the following IOException cause the RM to shut down.

{code}
FATAL
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause: 
java.io.IOException: Unable to close file because the last block does not have 
enough number of replicas. 
at 
org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) 
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) 
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
 
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
 
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
 
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) 
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
at java.lang.Thread.run(Thread.java:744) 
{code}

It would be better to improve FileSystemRMStateStore update failure exception 
handling to not shut down the RM, so that a single state write-out failure can't 
stop all jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2816) NM fail to start with NPE during container recovery

2014-11-05 Thread zhihai xu (JIRA)
zhihai xu created YARN-2816:
---

 Summary: NM fail to start with NPE during container recovery
 Key: YARN-2816
 URL: https://issues.apache.org/jira/browse/YARN-2816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


NM fails to start with an NPE during container recovery.
We saw the following crash happen:
2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
failed in state INITED; cause: java.lang.NullPointerException
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)

The reason is that some DB files used by NMLeveldbStateStoreService were accidentally 
deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. 
This leaves an incomplete container record which doesn't have a 
CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When the container is 
recovered in ContainerManagerImpl#recoverContainer, 
the NullPointerException at the following code causes the NM to shut down.
{code}
StartContainerRequest req = rcs.getStartRequest();
ContainerLaunchContext launchContext = req.getContainerLaunchContext();
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2802) add AM container launch and register delay metrics in QueueMetrics to help diagnose performance issue.

2014-11-03 Thread zhihai xu (JIRA)
zhihai xu created YARN-2802:
---

 Summary: add AM container launch and register delay metrics in 
QueueMetrics to help diagnose performance issue.
 Key: YARN-2802
 URL: https://issues.apache.org/jira/browse/YARN-2802
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


Add AM container launch and register delay metrics in QueueMetrics to help 
diagnose performance issues.
Two metrics are added in QueueMetrics:
aMLaunchDelay: the time spent from sending event AMLauncherEventType.LAUNCH to 
receiving event RMAppAttemptEventType.LAUNCHED in RMAppAttemptImpl.

aMRegisterDelay: the time waiting from receiving event 
RMAppAttemptEventType.LAUNCHED to receiving event 
RMAppAttemptEventType.REGISTERED(ApplicationMasterService#registerApplicationMaster)
 in RMAppAttemptImpl.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2799) cleanup TestLogAggregationService based on the change in YARN-90

2014-11-02 Thread zhihai xu (JIRA)
zhihai xu created YARN-2799:
---

 Summary: cleanup TestLogAggregationService based on the change in 
YARN-90
 Key: YARN-2799
 URL: https://issues.apache.org/jira/browse/YARN-2799
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Reporter: zhihai xu
Priority: Minor


Clean up TestLogAggregationService based on the change in YARN-90.
The following code was added to setUp in YARN-90:
{code}
dispatcher = createDispatcher();
appEventHandler = mock(EventHandler.class);
dispatcher.register(ApplicationEventType.class, appEventHandler);
{code}
Given that, we should remove all this code from each test function to avoid 
duplicate code.

The same applies to dispatcher.stop(), which is in tearDown: 
we can remove dispatcher.stop() from each test function as well, because it 
will always be called from tearDown for each test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2759) addToCluserNodeLabels should not change the value in labelCollections if the key already exists to avoid the Label.resource is reset.

2014-10-27 Thread zhihai xu (JIRA)
zhihai xu created YARN-2759:
---

 Summary: addToCluserNodeLabels should not change the value in 
labelCollections if the key already exists to avoid the Label.resource is reset.
 Key: YARN-2759
 URL: https://issues.apache.org/jira/browse/YARN-2759
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


addToCluserNodeLabels should not change the value in labelCollections if the 
key already exists, to avoid resetting Label.resource.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2757) potential NPE in checkNodeLabelExpression of SchedulerUtils for nodeLabels.

2014-10-27 Thread zhihai xu (JIRA)
zhihai xu created YARN-2757:
---

 Summary: potential NPE in checkNodeLabelExpression of 
SchedulerUtils for nodeLabels.
 Key: YARN-2757
 URL: https://issues.apache.org/jira/browse/YARN-2757
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu


Potential NPE in checkNodeLabelExpression of SchedulerUtils for nodeLabels.
Since we check nodeLabels for null at 
if (!str.trim().isEmpty()
&& (nodeLabels == null || !nodeLabels.contains(str.trim( {
  return false;
}
{code}
We should also check nodeLabels for null at 
{code}
  if (!nodeLabels.isEmpty()) {
return false;
  }
{code}
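A minimal illustrative version of a null-safe form of the second check (not the committed 
patch): a null label set on the node is treated the same as an empty one.
{code}
import java.util.Collections;
import java.util.Set;

public class NodeLabelCheck {
  // Null-safe form: a node whose label set is null is treated like one with no labels.
  static boolean matchesEmptyExpression(Set<String> nodeLabels) {
    return nodeLabels == null || nodeLabels.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(matchesEmptyExpression(null));                           // true, no NPE
    System.out.println(matchesEmptyExpression(Collections.<String>emptySet())); // true
    System.out.println(matchesEmptyExpression(Collections.singleton("gpu")));   // false
  }
}
{code}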



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2756) use static variable (Resources.none()) for not-running Node.resource in CommonNodeLabelsManager to save memory.

2014-10-27 Thread zhihai xu (JIRA)
zhihai xu created YARN-2756:
---

 Summary: use static variable (Resources.none()) for not-running 
Node.resource in CommonNodeLabelsManager to save memory.
 Key: YARN-2756
 URL: https://issues.apache.org/jira/browse/YARN-2756
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


Use the static variable (Resources.none()) for a not-running Node.resource in 
CommonNodeLabelsManager to save memory. When a Node is not activated, its 
resource is never used; when a Node is activated, a new resource is assigned to 
it in RMNodeLabelsManager#activateNode (nm.resource = resource;). So it would 
be better to use the static Resources.none() instead of allocating a new object 
(Resource.newInstance(0, 0)) for each node deactivation.
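A minimal sketch of the idea; the field access shown is illustrative and the 
real inner Node class may differ:
{code}
// Deactivated nodes can all share the immutable Resources.none() singleton
// instead of each holding its own Resource.newInstance(0, 0).
nm.resource = Resources.none();
// On activation, RMNodeLabelsManager#activateNode replaces it anyway:
// nm.resource = resource;
{code}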



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2754) addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java.

2014-10-27 Thread zhihai xu (JIRA)
zhihai xu created YARN-2754:
---

 Summary: addToCluserNodeLabels should be protected by writeLock in 
RMNodeLabelsManager.java.
 Key: YARN-2754
 URL: https://issues.apache.org/jira/browse/YARN-2754
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


addToCluserNodeLabels should be protected by the writeLock in 
RMNodeLabelsManager.java, because labelCollections in RMNodeLabelsManager needs 
to be protected.
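A minimal sketch of the standard lock pattern, assuming the reentrant 
read/write lock the labels manager already holds (names illustrative, not the 
actual patch):
{code}
// Illustrative only: mutate labelCollections under the write lock.
writeLock.lock();
try {
  super.addToCluserNodeLabels(labels);
} finally {
  writeLock.unlock();
}
{code}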



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2753) potential NPE in checkRemoveLabelsFromNode of CommonNodeLabelsManager

2014-10-27 Thread zhihai xu (JIRA)
zhihai xu created YARN-2753:
---

 Summary: potential NPE in checkRemoveLabelsFromNode of 
CommonNodeLabelsManager
 Key: YARN-2753
 URL: https://issues.apache.org/jira/browse/YARN-2753
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of 
CommonNodeLabelsManager.
This is because when a Node is created, Node.labels can be null; in that case 
nm.labels may be null.
So we need to check that originalLabels is not null before using it 
(originalLabels.containsAll).
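A minimal sketch of the suggested null guard; the parameter name labelsToRemove 
and the error handling are illustrative only:
{code}
Set<String> originalLabels = nm.labels;
// nm.labels can still be null for a freshly created Node, so guard before
// calling containsAll on it.
if (originalLabels == null || !originalLabels.containsAll(labelsToRemove)) {
  throw new IOException(
      "Cannot remove labels that are not assigned to the node");
}
{code}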



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection

2014-10-23 Thread zhihai xu (JIRA)
zhihai xu created YARN-2735:
---

 Summary: diskUtilizationPercentageCutoff and 
diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
 Key: YARN-2735
 URL: https://issues.apache.org/jira/browse/YARN-2735
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor


diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized 
twice in DirectoryCollection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.

2014-10-13 Thread zhihai xu (JIRA)
zhihai xu created YARN-2682:
---

 Summary: WindowsSecureContainerExecutor should not depend on 
DefaultContainerExecutor#getFirstApplicationDir. 
 Key: YARN-2682
 URL: https://issues.apache.org/jira/browse/YARN-2682
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu


DefaultContainerExecutor no longer uses getFirstApplicationDir, but we can't 
delete it from DefaultContainerExecutor because WindowsSecureContainerExecutor 
uses it.
We should move the getFirstApplicationDir function from DefaultContainerExecutor 
to WindowsSecureContainerExecutor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2679) add container launch prepare time metrics to NM.

2014-10-13 Thread zhihai xu (JIRA)
zhihai xu created YARN-2679:
---

 Summary: add container launch prepare time metrics to NM.
 Key: YARN-2679
 URL: https://issues.apache.org/jira/browse/YARN-2679
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


Add a metric to NodeManagerMetrics for the time spent preparing to launch a 
container.
The prepare time is the duration between sending the 
ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving the 
ContainerEventType.CONTAINER_LAUNCHED event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2675) the containersKilled metrics is not updated when the container is killed during localization.

2014-10-10 Thread zhihai xu (JIRA)
zhihai xu created YARN-2675:
---

 Summary: the containersKilled metrics is not updated when the 
container is killed during localization.
 Key: YARN-2675
 URL: https://issues.apache.org/jira/browse/YARN-2675
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


The containersKilled metric is not updated when the container is killed during 
localization. We should add the KILLING state to the finished handling in 
ContainerImpl.java so that the killed-container metric is updated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2641) improve node decommission latency in RM.

2014-10-02 Thread zhihai xu (JIRA)
zhihai xu created YARN-2641:
---

 Summary: improve node decommission latency in RM.
 Key: YARN-2641
 URL: https://issues.apache.org/jira/browse/YARN-2641
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


Improve node decommission latency in the RM.
Currently a node is only decommissioned after the RM receives a nodeHeartbeat 
from the NodeManager. The node heartbeat interval is configurable; the default 
value is 1 second.
It would be better to perform the decommission during the RM refresh 
(NodesListManager) instead of in nodeHeartbeat (ResourceTrackerService).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2623) Linux container executor only use the first local directory to copy token file in container-executor.c.

2014-09-29 Thread zhihai xu (JIRA)
zhihai xu created YARN-2623:
---

 Summary: Linux container executor only use the first local 
directory to copy token file in container-executor.c.
 Key: YARN-2623
 URL: https://issues.apache.org/jira/browse/YARN-2623
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
 Environment: Linux container executor only use the first local 
directory to copy token file in container-executor.c.
Reporter: zhihai xu
Assignee: zhihai xu


The Linux container executor only uses the first local directory to copy the 
token file in container-executor.c. If copying the token file to the first 
local directory fails, a localization failure event is raised even though the 
token file could be copied successfully to another local directory. The correct 
behavior is to try the next local directory when the first one fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

2014-09-17 Thread zhihai xu (JIRA)
zhihai xu created YARN-2566:
---

 Summary: IOException happen in startLocalizer of 
DefaultContainerExecutor due to not enough disk space for the first localDir.
 Key: YARN-2566
 URL: https://issues.apache.org/jira/browse/YARN-2566
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu


startLocalizer in DefaultContainerExecutor only uses the first localDir to copy 
the token file. If the copy fails for the first localDir because it does not 
have enough disk space, localization fails even though there is plenty of disk 
space in the other localDirs. We see the following error for this case:
{code}
2014-09-13 23:33:25,171 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
create app directory 
/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
java.io.IOException: mkdir of 
/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
at 
org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
2014-09-13 23:33:25,185 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Localizer failed
java.io.FileNotFoundException: File 
file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does 
not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
at 
org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
at 
org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
at 
org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
2014-09-13 23:33:25,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1410663092546_0004_01_01 transitioned from LOCALIZING 
to LOCALIZATION_FAILED
2014-09-13 23:33:25,187 WARN 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera 
OPERATION=Container Finished - Failed   TARGET=ContainerImplRESULT=FAILURE  
DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
APPID=application_1410663092546_0004
CONTAINERID=container_1410663092546_0004_01_01
2014-09-13 23:33:25,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1410663092546_0004_01_01 transitioned from 
LOCALIZATION_FAILED to DONE
2014-09-13 23:33:25,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Removing container_1410663092546_0004_01_01 from application 
application_1410663092546_0004
2014-09-13 23:33:25,187

[jira] [Created] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal

2014-09-10 Thread zhihai xu (JIRA)
zhihai xu created YARN-2534:
---

 Summary: FairScheduler: totalMaxShare is not calculated correctly 
in computeSharesInternal
 Key: YARN-2534
 URL: https://issues.apache.org/jira/browse/YARN-2534
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Fix For: 2.6.0


FairScheduler: totalMaxShare is not calculated correctly in 
computeSharesInternal in some cases.
If the sum of the max shares of all Schedulables exceeds Integer.MAX_VALUE, but 
no individual max share equals Integer.MAX_VALUE, then totalMaxShare overflows 
to a negative value, which causes all fair shares to be calculated incorrectly.
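A minimal sketch of an overflow-safe accumulation inside computeSharesInternal; 
the schedulables collection, the type variable, and the getResourceValue helper 
are assumed from context and are illustrative only:
{code}
// Saturate at Integer.MAX_VALUE instead of letting the int sum wrap negative.
int totalMaxShare = 0;
for (Schedulable sched : schedulables) {
  int maxShare = getResourceValue(sched.getMaxShare(), type); // assumed helper
  totalMaxShare = (int) Math.min((long) totalMaxShare + maxShare,
      Integer.MAX_VALUE);
}
{code}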



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler

2014-08-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-2453:
---

 Summary: TestProportionalCapacityPreemptionPolicy is failed for 
FairScheduler
 Key: YARN-2453
 URL: https://issues.apache.org/jira/browse/YARN-2453
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


TestProportionalCapacityPreemptionPolicy fails for FairScheduler.
The error message is the following:
Running 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy
Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy
testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy)
  Time elapsed: 1.61 sec  <<< FAILURE!
java.lang.AssertionError: Failed to find SchedulingMonitor service, please 
check what happened
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469)

This test should only work for the CapacityScheduler, because the following 
source code in ResourceManager.java shows it only works for the CapacityScheduler:
{code}
if (scheduler instanceof PreemptableResourceScheduler
  && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
  YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) {
{code}

CapacityScheduler is an instance of PreemptableResourceScheduler, while 
FairScheduler is not.
I will upload a patch to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler

2014-08-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-2452:
---

 Summary: TestRMApplicationHistoryWriter is failed for FairScheduler
 Key: YARN-2452
 URL: https://issues.apache.org/jira/browse/YARN-2452
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


TestRMApplicationHistoryWriter fails for FairScheduler. The failure is the 
following:
T E S T S
---
Running 
org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter
testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter)
  Time elapsed: 66.261 sec  <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<200>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430)
at 
org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391)





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2376) Too many threads blocking on the global JobTracker lock from getJobCounters, optimize getJobCounters to release global JobTracker lock before access the per job counter i

2014-07-31 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved YARN-2376.
-

Resolution: Duplicate

> Too many threads blocking on the global JobTracker lock from getJobCounters, 
> optimize getJobCounters to release global JobTracker lock before access the 
> per job counter in JobInProgress
> -
>
> Key: YARN-2376
> URL: https://issues.apache.org/jira/browse/YARN-2376
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2376.000.patch
>
>
> Too many threads blocking on the global JobTracker lock from getJobCounters; 
> optimize getJobCounters to release the global JobTracker lock before accessing 
> the per-job counter in JobInProgress. Many JobClients may call getJobCounters 
> on the JobTracker at the same time, and the current code locks the JobTracker, 
> blocking all those threads while they fetch counters from JobInProgress. It is 
> better to unlock the JobTracker when getting counters from JobInProgress 
> (job.getCounters(counters)), so all the threads can run in parallel when 
> accessing their own job counters.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2376) Too many threads blocking on the global JobTracker lock from getJobCounters, optimize getJobCounters to release global JobTracker lock before access the per job counter in

2014-07-31 Thread zhihai xu (JIRA)
zhihai xu created YARN-2376:
---

 Summary: Too many threads blocking on the global JobTracker lock 
from getJobCounters, optimize getJobCounters to release global JobTracker lock 
before access the per job counter in JobInProgress
 Key: YARN-2376
 URL: https://issues.apache.org/jira/browse/YARN-2376
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu


Too many threads blocking on the global JobTracker lock from getJobCounters; 
optimize getJobCounters to release the global JobTracker lock before accessing 
the per-job counter in JobInProgress. Many JobClients may call getJobCounters 
on the JobTracker at the same time, and the current code locks the JobTracker, 
blocking all those threads while they fetch counters from JobInProgress. It is 
better to unlock the JobTracker when getting counters from JobInProgress 
(job.getCounters(counters)), so all the threads can run in parallel when 
accessing their own job counters.
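A minimal sketch of the suggested pattern; the field and method names are 
recalled from the MR1 JobTracker and should be treated as illustrative:
{code}
// Hold the global JobTracker lock only to look up the job, then read the
// per-job counters outside it so other clients are not blocked.
JobInProgress job;
synchronized (this) {          // JobTracker lock
  job = jobs.get(jobid);
}
if (job != null) {
  job.getCounters(counters);   // only takes the per-job lock
}
{code}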



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine

2014-07-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-2361:
---

 Summary: remove duplicate entries (EXPIRE event) in the EnumSet of 
event type in RMAppAttempt state machine
 Key: YARN-2361
 URL: https://issues.apache.org/jira/browse/YARN-2361
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Priority: Minor
 Attachments: YARN-2361.000.patch

Remove duplicate entries in the EnumSet of event types in the RMAppAttempt 
state machine. The event RMAppAttemptEventType.EXPIRE is duplicated in the 
following code.
{code}
  EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
  RMAppAttemptEventType.EXPIRE,
  RMAppAttemptEventType.LAUNCHED,
  RMAppAttemptEventType.LAUNCH_FAILED,
  RMAppAttemptEventType.EXPIRE,
  RMAppAttemptEventType.REGISTERED,
  RMAppAttemptEventType.CONTAINER_ALLOCATED,
  RMAppAttemptEventType.UNREGISTERED,
  RMAppAttemptEventType.KILL,
  RMAppAttemptEventType.STATUS_UPDATE))
{code}




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)
zhihai xu created YARN-2359:
---

 Summary: Application is hung without timeout and retry after 
DNS/network is down. 
 Key: YARN-2359
 URL: https://issues.apache.org/jira/browse/YARN-2359
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu


The application hangs without timeout or retry after DNS/network goes down.
This happens when, right after the container is allocated for the AM, the 
DNS/network goes down on the node that holds the AM container.
The application attempt is in state RMAppAttemptState.SCHEDULED and receives 
the RMAppAttemptEventType.CONTAINER_ALLOCATED event, but because an 
IllegalArgumentException (due to the DNS error) is thrown, it stays in state 
RMAppAttemptState.SCHEDULED. In the state machine, only two events are 
processed in this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event, 
which is generated by the node and container timeout. So even after the node is 
removed, the application is still stuck in state RMAppAttemptState.SCHEDULED.
The only way to make the application leave this state is to send the 
RMAppAttemptEventType.KILL event, which is only generated when the application 
is killed manually from the job client via forceKillApplication.

To fix the issue, we should add an entry to the state machine table to handle 
the RMAppAttemptEventType.CONTAINER_FINISHED event in state 
RMAppAttemptState.SCHEDULED, by adding the following code to the 
StateMachineFactory:
{code}
.addTransition(RMAppAttemptState.SCHEDULED,
    RMAppAttemptState.FINAL_SAVING,
    RMAppAttemptEventType.CONTAINER_FINISHED,
    new FinalSavingTransition(
        new AMContainerCrashedBeforeRunningTransition(),
        RMAppAttemptState.FAILED))
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2337) remove duplication function call (setClientRMService) in resource manage class

2014-07-23 Thread zhihai xu (JIRA)
zhihai xu created YARN-2337:
---

 Summary: remove duplication function call (setClientRMService) in 
resource manage class
 Key: YARN-2337
 URL: https://issues.apache.org/jira/browse/YARN-2337
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: zhihai xu
Priority: Minor


Remove the duplicate function call (setClientRMService) in the ResourceManager 
class.
rmContext.setClientRMService(clientRM); is called twice in serviceInit of 
ResourceManager.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2325) need check whether node is null in nodeUpdate for FairScheduler

2014-07-20 Thread zhihai xu (JIRA)
zhihai xu created YARN-2325:
---

 Summary: need check whether node is null in nodeUpdate for 
FairScheduler 
 Key: YARN-2325
 URL: https://issues.apache.org/jira/browse/YARN-2325
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu


Need to check whether the node is null in nodeUpdate for FairScheduler.
If nodeUpdate is called after removeNode, getFSSchedulerNode will return null.
If the node is null, we should return with an error message.
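A minimal sketch of the suggested guard; the log message and surrounding 
variable names are illustrative:
{code}
FSSchedulerNode node = getFSSchedulerNode(nm.getNodeID());
if (node == null) {
  // The node was already removed; skip this heartbeat instead of NPE-ing.
  LOG.error("Node update for removed or unknown node " + nm.getNodeID());
  return;
}
{code}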



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2324) Race condition in continuousScheduling for FairScheduler

2014-07-20 Thread zhihai xu (JIRA)
zhihai xu created YARN-2324:
---

 Summary: Race condition in continuousScheduling for FairScheduler
 Key: YARN-2324
 URL: https://issues.apache.org/jira/browse/YARN-2324
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu


Race condition in continuousScheduling for FairScheduler.
removeNode can run while continuousScheduling is executing in schedulingThread. 
If the node is removed from nodes, nodes.get(n2) and getFSSchedulerNode(nodeId) 
will be null. So we need to add locking to remove the NPE/race condition.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.

2014-07-17 Thread zhihai xu (JIRA)
zhihai xu created YARN-2315:
---

 Summary: Should use setCurrentCapacity instead of setCapacity to 
configure used resource capacity for FairScheduler.
 Key: YARN-2315
 URL: https://issues.apache.org/jira/browse/YARN-2315
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu


Should use setCurrentCapacity instead of setCapacity to report the used 
resource capacity for FairScheduler.
In getQueueInfo of FSQueue.java, we call setCapacity twice with different 
parameters, so the first call is overridden by the second:
{code}
queueInfo.setCapacity((float) getFairShare().getMemory() /
    scheduler.getClusterResource().getMemory());
queueInfo.setCapacity((float) getResourceUsage().getMemory() /
    scheduler.getClusterResource().getMemory());
{code}
We should change the second setCapacity call to setCurrentCapacity to report 
the currently used capacity.
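A minimal sketch of the suggested change, assuming QueueInfo#setCurrentCapacity 
(which the QueueInfo record provides):
{code}
// Fair share reported as capacity, current usage reported as currentCapacity.
queueInfo.setCapacity((float) getFairShare().getMemory() /
    scheduler.getClusterResource().getMemory());
queueInfo.setCurrentCapacity((float) getResourceUsage().getMemory() /
    scheduler.getClusterResource().getMemory());
{code}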




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.

2014-07-04 Thread zhihai xu (JIRA)
zhihai xu created YARN-2254:
---

 Summary: change TestRMWebServicesAppsModification to support 
FairScheduler.
 Key: YARN-2254
 URL: https://issues.apache.org/jira/browse/YARN-2254
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)