[jira] [Created] (YARN-6396) Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
zhihai xu created YARN-6396:
--------------------------------

Summary: Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for name node
Key: YARN-6396
URL: https://issues.apache.org/jira/browse/YARN-6396
Project: Hadoop YARN
Issue Type: Improvement
Components: log-aggregation
Affects Versions: 3.0.0-alpha2
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor

Call verifyAndCreateRemoteLogDir at service initialization instead of application initialization to decrease load for the name node. Currently verifyAndCreateRemoteLogDir is called for every application at each node before log aggregation starts. This is a non-trivial overhead for the name node in a large cluster, since verifyAndCreateRemoteLogDir calls getFileStatus. Once the remote log directory has been created successfully, it is not necessary to call it again, so it would be better to call verifyAndCreateRemoteLogDir at LogAggregationService service initialization.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
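The once-per-service idea can be sketched as follows. This is a hypothetical illustration (the class and method bodies here are invented, not the actual LogAggregationService code): the NameNode-facing check collapses to a single call no matter how many applications initialize on the node.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the real LogAggregationService: the remote-dir
// check (one getFileStatus RPC to the NameNode, plus a possible mkdirs)
// runs at most once per service lifetime instead of once per application.
public class RemoteLogDirCheck {
    static int nameNodeCalls = 0;                         // stand-in for getFileStatus RPCs
    static final AtomicBoolean dirVerified = new AtomicBoolean(false);

    // Proposed: invoked once from service initialization; previously the
    // check ran from application initialization for every app on the node.
    static void verifyAndCreateRemoteLogDir() {
        if (dirVerified.compareAndSet(false, true)) {
            nameNodeCalls++;                              // the expensive RPC happens here
        }
    }

    // The per-application path becomes a no-op after the first call.
    static void initApp() {
        verifyAndCreateRemoteLogDir();
    }
}
```

With 100 applications initializing, the counter still ends at one, which is the load reduction the issue describes.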
[jira] [Created] (YARN-6392) add submit time to Application Summary log
zhihai xu created YARN-6392:
--------------------------------

Summary: add submit time to Application Summary log
Key: YARN-6392
URL: https://issues.apache.org/jira/browse/YARN-6392
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 3.0.0-alpha2
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor

Add submit time to the Application Summary log. The application submit time is passed to the Application Master in the environment variable "APP_SUBMIT_TIME_ENV". It is a very important parameter, so it will be useful to log it in the Application Summary.
[jira] [Created] (YARN-4979) FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
zhihai xu created YARN-4979:
--------------------------------

Summary: FSAppAttempt adds duplicate ResourceRequest to demand in updateDemand.
Key: YARN-4979
URL: https://issues.apache.org/jira/browse/YARN-4979
Project: Hadoop YARN
Issue Type: Bug
Components: fairscheduler
Affects Versions: 2.7.2, 2.8.0
Reporter: zhihai xu
Assignee: zhihai xu

FSAppAttempt adds duplicate ResourceRequests to demand in updateDemand. We should only count the ResourceRequest for ResourceRequest.ANY when calculating demand, because {{hasContainerForNode}} returns false if there is no container request for ResourceRequest.ANY, and both {{allocateNodeLocal}} and {{allocateRackLocal}} also decrease the number of containers for ResourceRequest.ANY. This issue may cause the current memory demand to overflow (integer), because duplicate requests can be on multiple nodes.
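A minimal sketch of the proposed counting rule, with invented names (`Request`, `demandMb`) rather than the real FSAppAttempt API: only the ResourceRequest.ANY entry contributes to demand, so the node-local and rack-local requests that describe the same containers are not double-counted.

```java
import java.util.List;

// Hypothetical sketch, not the real FSAppAttempt code: sum demand over
// ANY requests only; node-local and rack-local entries duplicate them.
public class DemandCalc {
    static final String ANY = "*";   // ResourceRequest.ANY resource name

    static class Request {
        final String resourceName;
        final int containers;
        final int memoryMb;
        Request(String name, int containers, int memoryMb) {
            this.resourceName = name;
            this.containers = containers;
            this.memoryMb = memoryMb;
        }
    }

    // Counting the node/rack duplicates as well could overflow an int
    // demand on a large cluster, as the issue describes.
    static long demandMb(List<Request> requests) {
        long demand = 0;
        for (Request r : requests) {
            if (ANY.equals(r.resourceName)) {
                demand += (long) r.containers * r.memoryMb;
            }
        }
        return demand;
    }
}
```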
[jira] [Created] (YARN-4458) Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
zhihai xu created YARN-4458:
--------------------------------

Summary: Compilation error at branch-2.7 due to getNodeLabelExpression not defined in NMContainerStatusPBImpl.
Key: YARN-4458
URL: https://issues.apache.org/jira/browse/YARN-4458
Project: Hadoop YARN
Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu
[jira] [Created] (YARN-4209) RMStateStore FENCED state doesn’t work
zhihai xu created YARN-4209:
--------------------------------

Summary: RMStateStore FENCED state doesn’t work
Key: YARN-4209
URL: https://issues.apache.org/jira/browse/YARN-4209
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical

The RMStateStore FENCED state doesn’t work. The reason is that the {{stateMachine.doTransition}} called from {{updateFencedState}} is embedded in the {{stateMachine.doTransition}} called from a public API (removeRMDelegationToken...) or from {{ForwardingEventHandler#handle}}. So right after the internal state transition from {{updateFencedState}} changes the state to FENCED, the external state transition changes the state back to ACTIVE. The end result is that RMStateStore is still in the ACTIVE state even after notifyStoreOperationFailed is called. The only working case for the FENCED state is {{notifyStoreOperationFailed}} called from {{ZKRMStateStore#VerifyActiveStatusThread}}.

For example: {{removeRMDelegationToken}} => {{handleStoreEvent}} => enter external {{stateMachine.doTransition}} => {{RemoveRMDTTransition}} => {{notifyStoreOperationFailed}} => {{updateFencedState}} => {{handleStoreEvent}} => enter internal {{stateMachine.doTransition}} => exit internal {{stateMachine.doTransition}}, change state to FENCED => exit external {{stateMachine.doTransition}}, change state to ACTIVE.
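The clobbering described above can be reduced to a toy state machine (hypothetical, not the real StateMachineFactory API): the outer transition stores its own target state after its body runs, so it erases the FENCED state set by the nested transition.

```java
// Toy sketch of why the nested FENCED transition is lost: doTransition
// runs the transition body first and assigns its own target state
// afterwards, overwriting whatever a nested doTransition set.
public class NestedTransition {
    enum State { ACTIVE, FENCED }

    State state = State.ACTIVE;

    // Body first, assignment last -- the inner transition's state change
    // does not survive the outer one.
    void doTransition(State target, Runnable body) {
        body.run();
        state = target;
    }

    // Mimics removeRMDelegationToken -> RemoveRMDTTransition failing and
    // calling updateFencedState from inside the outer transition.
    void removeRMDelegationToken() {
        doTransition(State.ACTIVE, () ->
            doTransition(State.FENCED, () -> { /* updateFencedState */ }));
    }
}
```

After the call the machine is back in ACTIVE, matching the chain at the end of the description.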
[jira] [Resolved] (YARN-4190) missing container information in FairScheduler preemption log.
[ https://issues.apache.org/jira/browse/YARN-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu resolved YARN-4190.
-----------------------------
Resolution: Later

> missing container information in FairScheduler preemption log.
> --------------------------------------------------------------
>
> Key: YARN-4190
> URL: https://issues.apache.org/jira/browse/YARN-4190
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: fairscheduler
> Affects Versions: 2.7.1
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Trivial
>
> Add container information in the FairScheduler preemption log to help debugging.
> Currently the following log doesn't have container information:
> {code}
> LOG.info("Preempting container (prio=" +
>     container.getContainer().getPriority() +
>     "res=" + container.getContainer().getResource() +
>     ") from queue " + queue.getName());
> {code}
> So it is very difficult to debug preemption-related issues for FairScheduler.
> Even though the container information is printed in the following code:
> {code}
> LOG.info("Killing container" + container +
>     " (after waiting for premption for " +
>     (getClock().getTime() - time) + "ms)");
> {code}
> we can't match these two logs based on the container ID.
> It would be very useful to add container information in the first log.
[jira] [Created] (YARN-4190) Add container information in FairScheduler preemption log to help debug.
zhihai xu created YARN-4190:
--------------------------------

Summary: Add container information in FairScheduler preemption log to help debug.
Key: YARN-4190
URL: https://issues.apache.org/jira/browse/YARN-4190
Project: Hadoop YARN
Issue Type: Improvement
Components: fairscheduler
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial

Add container information in the FairScheduler preemption log to help debugging. Currently the following log doesn't have container information:
{code}
LOG.info("Preempting container (prio=" +
    container.getContainer().getPriority() +
    "res=" + container.getContainer().getResource() +
    ") from queue " + queue.getName());
{code}
So it is very difficult to debug preemption-related issues for FairScheduler. Even though the container information is printed in the following code:
{code}
LOG.info("Killing container" + container +
    " (after waiting for premption for " +
    (getClock().getTime() - time) + "ms)");
{code}
we can't match these two logs based on the container ID. It would be very useful to add container information in the first log.
[jira] [Created] (YARN-4187) Yarn Client uses local address instead of RM address as token renewer in a secure cluster when HA is enabled.
zhihai xu created YARN-4187:
--------------------------------

Summary: Yarn Client uses local address instead of RM address as token renewer in a secure cluster when HA is enabled.
Key: YARN-4187
URL: https://issues.apache.org/jira/browse/YARN-4187
Project: Hadoop YARN
Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu

The Yarn Client uses the local address instead of the RM address as token renewer in a secure cluster when HA is enabled. This causes HDFS token renewal to fail for renewer "nobody" if the rules from {{hadoop.security.auth_to_local}} exclude the client address in the HDFS DelegationTokenIdentifier. The following is the exception which causes the job to fail:
{code}
15/09/12 16:27:24 WARN security.UserGroupInformation: PriviledgedActionException as:t...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

java.io.IOException: Failed to run job : yarn tries to renew a token with renewer nobody
    at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:464)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewDelegationToken(FSNamesystem.java:7109)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewDelegationToken(NameNodeRpcServer.java:512)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.renewDelegationToken(AuthorizationProviderProxyClientProtocol.java:648)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:975)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:300)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:438)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
    at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74
{code}
[jira] [Created] (YARN-4158) Remove duplicate close for LogWriter in AppLogAggregatorImpl#uploadLogsForContainers
zhihai xu created YARN-4158:
--------------------------------

Summary: Remove duplicate close for LogWriter in AppLogAggregatorImpl#uploadLogsForContainers
Key: YARN-4158
URL: https://issues.apache.org/jira/browse/YARN-4158
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial

Remove the duplicate {{close}} for {{LogWriter}} in {{AppLogAggregatorImpl#uploadLogsForContainers}}. {{writer.close()}} is called twice in {{uploadLogsForContainers}}; it would be better to close {{writer}} once.
[jira] [Created] (YARN-4153) TestAsyncDispatcher failed at branch-2.7
zhihai xu created YARN-4153:
--------------------------------

Summary: TestAsyncDispatcher failed at branch-2.7
Key: YARN-4153
URL: https://issues.apache.org/jira/browse/YARN-4153
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Reporter: zhihai xu
Assignee: zhihai xu

TestAsyncDispatcher failed at branch-2.7 because the change from YARN-3999 wasn't completely merged to branch-2.7.
[jira] [Created] (YARN-4133) Containers to be preempted leak in FairScheduler preemption logic.
zhihai xu created YARN-4133:
--------------------------------

Summary: Containers to be preempted leak in FairScheduler preemption logic.
Key: YARN-4133
URL: https://issues.apache.org/jira/browse/YARN-4133
Project: Hadoop YARN
Issue Type: Bug
Components: fairscheduler
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu

Containers to be preempted leak in the FairScheduler preemption logic. This may cause missed preemption because containers in {{warnedContainers}} are wrongly removed. The problem is in {{preemptResources}}, where two issues can cause containers to be wrongly removed from {{warnedContainers}}:

First, the container state {{RMContainerState.ACQUIRED}} is missing from the condition check:
{code}
(container.getState() == RMContainerState.RUNNING ||
    container.getState() == RMContainerState.ALLOCATED)
{code}
Second, if {{isResourceGreaterThanNone(toPreempt)}} returns false, we shouldn't remove the container from {{warnedContainers}}. We should only remove a container from {{warnedContainers}} if it is not in state {{RMContainerState.RUNNING}}, {{RMContainerState.ALLOCATED}} or {{RMContainerState.ACQUIRED}}.
{code}
if ((container.getState() == RMContainerState.RUNNING ||
    container.getState() == RMContainerState.ALLOCATED) &&
    isResourceGreaterThanNone(toPreempt)) {
  warnOrKillContainer(container);
  Resources.subtractFrom(toPreempt, container.getContainer().getResource());
} else {
  warnedIter.remove();
}
{code}
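A sketch of the proposed removal rule, using invented helper names rather than the real preemptResources code: removal from {{warnedContainers}} depends only on the container state, never on whether there is still resource left to preempt.

```java
import java.util.EnumSet;

// Hypothetical helpers, not the real preemptResources code: running,
// allocated, and acquired containers stay in warnedContainers even when
// toPreempt is exhausted; only dead containers are dropped.
public class WarnedContainers {
    enum State { NEW, ALLOCATED, ACQUIRED, RUNNING, COMPLETED, KILLED }

    static final EnumSet<State> LIVE =
        EnumSet.of(State.RUNNING, State.ALLOCATED, State.ACQUIRED);

    // Warn or kill only while there is still resource left to preempt.
    static boolean shouldWarnOrKill(State s, boolean resourceLeftToPreempt) {
        return LIVE.contains(s) && resourceLeftToPreempt;
    }

    // Remove only containers that are no longer live; a false
    // resourceLeftToPreempt alone must not drop a container.
    static boolean shouldRemove(State s) {
        return !LIVE.contains(s);
    }
}
```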
[jira] [Created] (YARN-4095) Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
zhihai xu created YARN-4095:
--------------------------------

Summary: Avoid sharing AllocatorPerContext object in LocalDirAllocator between ShuffleHandler and LocalDirsHandlerService.
Key: YARN-4095
URL: https://issues.apache.org/jira/browse/YARN-4095
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu

Currently {{ShuffleHandler}} and {{LocalDirsHandlerService}} share an {{AllocatorPerContext}} object in {{LocalDirAllocator}} for the configuration {{NM_LOCAL_DIRS}}, because {{AllocatorPerContext}}s are stored in a static TreeMap with the configuration name as key:
{code}
private static Map<String, AllocatorPerContext> contexts =
    new TreeMap<String, AllocatorPerContext>();
{code}
{{LocalDirsHandlerService}} and {{ShuffleHandler}} both create a {{LocalDirAllocator}} using {{NM_LOCAL_DIRS}}. Even though they don't use the same {{Configuration}} object, they will use the same {{AllocatorPerContext}} object. Also, {{LocalDirsHandlerService}} may change the {{NM_LOCAL_DIRS}} value in its {{Configuration}} object to exclude full and bad local dirs, while {{ShuffleHandler}} always uses the original {{NM_LOCAL_DIRS}} value in its {{Configuration}} object. So every time {{AllocatorPerContext#confChanged}} is called by {{ShuffleHandler}} after {{LocalDirsHandlerService}}, the {{AllocatorPerContext}} needs to be reinitialized because the {{NM_LOCAL_DIRS}} value has changed. This causes some overhead.
{code}
String newLocalDirs = conf.get(contextCfgItemName);
if (!newLocalDirs.equals(savedLocalDirs)) {
{code}
So it would be a good improvement to not share the same {{AllocatorPerContext}} instance between {{ShuffleHandler}} and {{LocalDirsHandlerService}}.
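The sharing problem can be sketched with a toy cache (hypothetical names; the real AllocatorPerContext does far more than store a string): because the cache key is only the configuration *name*, two callers holding different values for that name keep reinitializing one shared context.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of the static, name-keyed context cache described in the
// issue; each confChanged with a differing value stands in for a full
// AllocatorPerContext reinitialization.
public class SharedContextSketch {
    static class Context {
        String savedLocalDirs;
        int reinitCount = 0;

        void confChanged(String newLocalDirs) {
            if (!newLocalDirs.equals(savedLocalDirs)) {
                savedLocalDirs = newLocalDirs;   // full reinit in the real code
                reinitCount++;
            }
        }
    }

    // Keyed by configuration name only, so all callers share one Context.
    static final Map<String, Context> contexts = new TreeMap<>();

    static Context get(String configName) {
        return contexts.computeIfAbsent(configName, k -> new Context());
    }
}
```

Alternating callers with different dir lists (one pruned, one original) reinitialize the shared context on every single call, which is the overhead the issue wants to avoid.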
[jira] [Resolved] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu resolved YARN-3857.
-----------------------------
Resolution: Fixed

> Memory leak in ResourceManager with SIMPLE mode
> -----------------------------------------------
>
> Key: YARN-3857
> URL: https://issues.apache.org/jira/browse/YARN-3857
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.0
> Reporter: mujunchao
> Assignee: mujunchao
> Priority: Critical
> Labels: patch
> Fix For: 2.7.2
>
> Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, YARN-3857-4.patch, hadoop-yarn-server-resourcemanager.patch
>
> We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the Pair , but we never remove it from the HashMap, as unregistering only runs in Security mode, so a memory leak results.
[jira] [Created] (YARN-3943) Use separate threshold configurations for disk-full detection and disk-not-full detection.
zhihai xu created YARN-3943:
--------------------------------

Summary: Use separate threshold configurations for disk-full detection and disk-not-full detection.
Key: YARN-3943
URL: https://issues.apache.org/jira/browse/YARN-3943
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu

Use separate threshold configurations to check when disks become full and when disks become good. Currently the configurations "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" and "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb" are used to check both when disks become full and when disks become good. It would be better to use two configurations: one used when disks go from not-full to full, and the other used when disks go from full back to not-full, so we can avoid frequent oscillation. For example, we can set the one for disk-full detection higher than the one for disk-not-full detection.
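The two-threshold idea is classic hysteresis; a sketch with invented names (not the actual NodeManager configuration handling): the high-water mark must be crossed upward to mark the disk full, and a separate, lower mark must be crossed downward before it is marked good again.

```java
// Hypothetical sketch of the proposed hysteresis: utilization hovering
// near a single threshold no longer flips the disk state on every check.
public class DiskFullHysteresis {
    final float fullThresholdPct;     // e.g. 90.0f: not-full -> full
    final float notFullThresholdPct;  // e.g. 85.0f: full -> not-full
    boolean full = false;

    DiskFullHysteresis(float fullPct, float notFullPct) {
        fullThresholdPct = fullPct;
        notFullThresholdPct = notFullPct;
    }

    // Returns the disk state after observing the current utilization.
    boolean check(float utilizationPct) {
        if (!full && utilizationPct > fullThresholdPct) {
            full = true;                          // crossed the high mark
        } else if (full && utilizationPct < notFullThresholdPct) {
            full = false;                         // dropped below the low mark
        }
        return full;
    }
}
```

With thresholds 90/85, a disk oscillating between 88% and 91% changes state once instead of on every check.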
[jira] [Created] (YARN-3925) ContainerLogsUtils#getContainerLogFile fails to read container log files from full disks.
zhihai xu created YARN-3925:
--------------------------------

Summary: ContainerLogsUtils#getContainerLogFile fails to read container log files from full disks.
Key: YARN-3925
URL: https://issues.apache.org/jira/browse/YARN-3925
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.7.1
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical

ContainerLogsUtils#getContainerLogFile fails to read files from full disks. {{getContainerLogFile}} depends on {{LocalDirsHandlerService#getLogPathToRead}} to get the log file, but {{LocalDirsHandlerService#getLogPathToRead}} calls {{logDirsAllocator.getLocalPathToRead}}, and {{logDirsAllocator}} uses the configuration {{YarnConfiguration.NM_LOG_DIRS}}, which is updated to exclude full disks in {{LocalDirsHandlerService#checkDirs}}:
{code}
Configuration conf = getConfig();
List<String> localDirs = getLocalDirs();
conf.setStrings(YarnConfiguration.NM_LOCAL_DIRS,
    localDirs.toArray(new String[localDirs.size()]));
List<String> logDirs = getLogDirs();
conf.setStrings(YarnConfiguration.NM_LOG_DIRS,
    logDirs.toArray(new String[logDirs.size()]));
{code}
ContainerLogsUtils#getContainerLogFile is used by NMWebServices#getLogs and ContainerLogsPage.ContainersLogsBlock#render to read the log.
[jira] [Created] (YARN-3882) AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
zhihai xu created YARN-3882:
--------------------------------

Summary: AggregatedLogFormat should close aclScanner and ownerScanner after creating them.
Key: YARN-3882
URL: https://issues.apache.org/jira/browse/YARN-3882
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor

AggregatedLogFormat should close aclScanner and ownerScanner after creating them. {{aclScanner}} and {{ownerScanner}} are created by createScanner in {{getApplicationAcls}} and {{getApplicationOwner}} and are never closed. {{TFile.Reader.Scanner}} implements java.io.Closeable, so we should close them after using them.
[jira] [Resolved] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.
[ https://issues.apache.org/jira/browse/YARN-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu resolved YARN-3549.
-----------------------------
Resolution: Duplicate

> use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-3549
> URL: https://issues.apache.org/jira/browse/YARN-3549
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
>
> Use the JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of the shell-based implementation from RawLocalFileSystem in checkLocalDir.
> As discussed in YARN-3491, the shell-based implementation getPermission runs the shell command "ls -ld" to get permissions, which takes 4 or 5 ms (very slow).
> We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation in RawLocalFileSystem to get rid of the shell-based implementation of FileStatus.
[jira] [Created] (YARN-3802) Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected.
zhihai xu created YARN-3802:
--------------------------------

Summary: Two RMNodes for the same NodeId are used in RM sometimes after NM is reconnected.
Key: YARN-3802
URL: https://issues.apache.org/jira/browse/YARN-3802
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu

Two RMNodes for the same NodeId are sometimes used in the RM after an NM is reconnected: the scheduler and the RMContext use different RMNode references for the same NodeId, which is not correct. The scheduler and the RMContext should always use the same RMNode reference for a given NodeId.
[jira] [Created] (YARN-3780) Should use equals when comparing Resource in RMNodeImpl#ReconnectNodeTransition
zhihai xu created YARN-3780:
--------------------------------

Summary: Should use equals when comparing Resource in RMNodeImpl#ReconnectNodeTransition
Key: YARN-3780
URL: https://issues.apache.org/jira/browse/YARN-3780
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor

Should use equals when comparing Resource in RMNodeImpl#ReconnectNodeTransition, to avoid an unnecessary NodeResourceUpdateSchedulerEvent. The current code uses {{!=}} to compare the Resource totalCapability, which compares references, not the real values in the Resource objects. So we should use equals to compare Resources.
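A minimal sketch of the reference-vs-value issue, with a toy Resource class standing in for the real one: {{!=}} reports a "change" for two equal capabilities, while {{equals}} does not.

```java
import java.util.Objects;

// Toy value class standing in for the real Resource; two objects with the
// same memory and vcores are distinct references, so != reports a change
// even though the capability is identical.
public class ResourceCompare {
    static class Resource {
        final int memoryMb, vcores;
        Resource(int m, int v) { memoryMb = m; vcores = v; }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Resource)) return false;
            Resource r = (Resource) o;
            return r.memoryMb == memoryMb && r.vcores == vcores;
        }
        @Override public int hashCode() { return Objects.hash(memoryMb, vcores); }
    }

    // The buggy comparison: true whenever the references differ.
    static boolean changedByReference(Resource a, Resource b) { return a != b; }

    // The proposed comparison: true only when the values differ.
    static boolean changedByValue(Resource a, Resource b) { return !a.equals(b); }
}
```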
[jira] [Created] (YARN-3777) Move all reservation-related tests from TestFairScheduler to TestFairSchedulerReservations.
zhihai xu created YARN-3777:
--------------------------------

Summary: Move all reservation-related tests from TestFairScheduler to TestFairSchedulerReservations.
Key: YARN-3777
URL: https://issues.apache.org/jira/browse/YARN-3777
Project: Hadoop YARN
Issue Type: Improvement
Components: fairscheduler, test
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor

As discussed at YARN-3655, move all reservation-related tests from TestFairScheduler to TestFairSchedulerReservations.
[jira] [Created] (YARN-3776) FairScheduler code refactoring to separate out the code paths for assigning a reserved container and a non-reserved container
zhihai xu created YARN-3776:
--------------------------------

Summary: FairScheduler code refactoring to separate out the code paths for assigning a reserved container and a non-reserved container
Key: YARN-3776
URL: https://issues.apache.org/jira/browse/YARN-3776
Project: Hadoop YARN
Issue Type: Improvement
Components: fairscheduler
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu

FairScheduler code refactoring to separate out the code paths for assigning a reserved container and a non-reserved container.
[jira] [Created] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.
zhihai xu created YARN-3727:
--------------------------------

Summary: For better error recovery, check if the directory exists before using it for localization.
Key: YARN-3727
URL: https://issues.apache.org/jira/browse/YARN-3727
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu

For better error recovery, check if the directory exists before using it for localization. We saw the following localization failure happen due to existing cache directories:
{code}
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs:///X/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot overwrite non empty destination directory //8/yarn/nm/usercache//filecache/21637
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs:///X/libjars/1234.jar(->//8/yarn/nm/usercache//filecache/21637/1234.jar) transitioned from DOWNLOADING to FAILED
{code}
The real cause of this failure may be a disk failure, a LevelDB operation failure for {{startResourceLocalization}}/{{finishResourceLocalization}}, or something else. I wonder whether we can add error-recovery code to avoid the localization failure by not using the existing cache directories for localization.

The exception happened at {{files.rename(dst_work, destDirPath, Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after the exception, the existing cache directory used by {{LocalizedResource}} will be deleted:
{code}
try {
  .
  files.rename(dst_work, destDirPath, Rename.OVERWRITE);
} catch (Exception e) {
  try {
    files.delete(destDirPath, true);
  } catch (IOException ignore) {
  }
  throw e;
} finally {
{code}
Since the conflicting local directory will be deleted after the localization failure, I think it would be better to check if the directory exists before using it for localization, to avoid the localization failure.
[jira] [Created] (YARN-3713) Remove duplicate function call storeContainerDiagnostics in ContainerDiagnosticsUpdateTransition
zhihai xu created YARN-3713:
--------------------------------

Summary: Remove duplicate function call storeContainerDiagnostics in ContainerDiagnosticsUpdateTransition
Key: YARN-3713
URL: https://issues.apache.org/jira/browse/YARN-3713
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.7.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Minor

Remove the duplicate function call {{storeContainerDiagnostics}} in ContainerDiagnosticsUpdateTransition. {{storeContainerDiagnostics}} is already called in ContainerImpl#addDiagnostics:
{code}
private void addDiagnostics(String... diags) {
  for (String s : diags) {
    this.diagnostics.append(s);
  }
  try {
    stateStore.storeContainerDiagnostics(containerId, diagnostics);
  } catch (IOException e) {
    LOG.warn("Unable to update diagnostics in state store for "
        + containerId, e);
  }
}
{code}
So we don't need to call {{storeContainerDiagnostics}} in ContainerDiagnosticsUpdateTransition#transition:
{code}
container.addDiagnostics(updateEvent.getDiagnosticsUpdate(), "\n");
try {
  container.stateStore.storeContainerDiagnostics(container.containerId,
      container.diagnostics);
} catch (IOException e) {
  LOG.warn("Unable to update state store diagnostics for "
      + container.containerId, e);
}
{code}
[jira] [Created] (YARN-3710) FairScheduler: Should allocate more containers for assign-multiple after assignReservedContainer turns the reservation into an allocation.
zhihai xu created YARN-3710:
--------------------------------

Summary: FairScheduler: Should allocate more containers for assign-multiple after assignReservedContainer turns the reservation into an allocation.
Key: YARN-3710
URL: https://issues.apache.org/jira/browse/YARN-3710
Project: Hadoop YARN
Issue Type: Bug
Components: fairscheduler
Reporter: zhihai xu
Assignee: zhihai xu

Currently FairScheduler#attemptScheduling does not assign more containers for assign-multiple after assignReservedContainer successfully turns the reservation into an allocation. If assignMultiple is enabled, we should try to assign more containers on the same node after assignReservedContainer turns the reservation into an allocation.
[jira] [Created] (YARN-3697) FairScheduler: ContinuousSchedulingThread can't be shutdown after stop sometimes.
zhihai xu created YARN-3697: --- Summary: FairScheduler: ContinuousSchedulingThread can't be shutdown after stop sometimes. Key: YARN-3697 URL: https://issues.apache.org/jira/browse/YARN-3697 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu FairScheduler: ContinuousSchedulingThread can't be shutdown after stop sometimes. The reason is because the InterruptedException is blocked in continuousSchedulingAttempt {code} try { if (node != null && Resources.fitsIn(minimumAllocation, node.getAvailableResource())) { attemptScheduling(node); } } catch (Throwable ex) { LOG.error("Error while attempting scheduling for node " + node + ": " + ex.toString(), ex); } {code} I saw the following exception after stop: {code} 2015-05-17 23:30:43,065 WARN [FairSchedulerContinuousScheduling] event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher thread interrupted java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244) at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467) at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:387) at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:58) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.allocate(FSAppAttempt.java:357) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:516) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:649) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:803) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:334) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:173) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1082) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1014) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:285) 2015-05-17 23:30:43,066 ERROR [FairSchedulerContinuousScheduling] fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1017)) - Error while attempting scheduling for node host: 127.0.0.2:2 #containers=1 available= used=: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:249) at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:467) at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$ContainerStartedTransition.transition(RMContainerImpl.java:462) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.ap
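The swallowed-interrupt problem above can be sketched in isolation. This is an illustrative example, not the actual FairScheduler patch: it walks a throwable's cause chain for an InterruptedException and re-asserts the interrupt flag so the enclosing run() loop can observe it and exit.

```java
// Illustrative sketch (not the real FairScheduler fix): a catch (Throwable)
// block can swallow an InterruptedException wrapped in a runtime exception,
// so the thread never notices its interrupt. The remedy shown here is to
// detect interruption in the cause chain and restore the interrupt flag.
public class InterruptAwareCatch {

    // Returns true when the throwable or any of its causes is an
    // InterruptedException, i.e. the thread was asked to stop.
    static boolean isInterruption(Throwable t) {
        while (t != null) {
            if (t instanceof InterruptedException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    // Example catch-block body: log real errors, but re-assert the interrupt
    // so the run() loop's Thread.currentThread().isInterrupted() check works.
    static void handle(Throwable ex) {
        if (isInterruption(ex)) {
            Thread.currentThread().interrupt();
        } else {
            System.err.println("Error while attempting scheduling: " + ex);
        }
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(new InterruptedException("stop"));
        handle(wrapped);
        System.out.println(Thread.currentThread().isInterrupted()); // true
    }
}
```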
[jira] [Created] (YARN-3667) Fix findbugs warning Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS
zhihai xu created YARN-3667: --- Summary: Fix findbugs warning Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.isHDFS Key: YARN-3667 URL: https://issues.apache.org/jira/browse/YARN-3667 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3655) FairScheduler: potential deadlock due to maxAMShare limitation and container reservation
zhihai xu created YARN-3655: --- Summary: FairScheduler: potential deadlock due to maxAMShare limitation and container reservation Key: YARN-3655 URL: https://issues.apache.org/jira/browse/YARN-3655 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu FairScheduler: potential deadlock due to maxAMShare limitation and container reservation. If a node is reserved by an application, no other application has any chance to assign a new container on that node until the reserving application either assigns a new container on it or releases the reserved container. The problem is that if an application calls assignReservedContainer and fails to get a new container due to the maxAMShare limitation, it blocks all other applications from using the nodes it reserves. If all other running applications can't release their AM containers because they are blocked by these reserved containers, a deadlock can happen. The following is the code at FSAppAttempt#assignContainer which can cause this potential deadlock: {code} // Check the AM resource usage for the leaf queue if (!isAmRunning() && !getUnmanagedAM()) { List ask = appSchedulingInfo.getAllResourceRequests(); if (ask.isEmpty() || !getQueue().canRunAppAM( ask.get(0).getCapability())) { if (LOG.isDebugEnabled()) { LOG.debug("Skipping allocation because maxAMShare limit would " + "be exceeded"); } return Resources.none(); } } {code} To fix this issue, we can unreserve the node if we can't allocate the AM container on it due to the maxAMShare limitation and the node is reserved by the application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
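The proposed fix can be modeled with a toy example. All names below are illustrative, not the real FSAppAttempt API: the point is that failing the maxAMShare check must also release the reservation, otherwise the node stays blocked for every other application.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the deadlock fix (illustrative names only): when the AM
// container cannot be allocated because of the maxAMShare limit, also drop
// this attempt's reservation on the node so other applications can use it.
public class UnreserveOnAmLimit {
    static final Set<String> reservedNodes = new HashSet<>();

    // Returns the allocated memory in MB, or 0 (like Resources.none()) when
    // the maxAMShare check fails; in that case the reservation is released.
    static long assignContainer(String node, boolean amRunning,
                                boolean amShareExceeded) {
        if (!amRunning && amShareExceeded) {
            reservedNodes.remove(node);  // the fix: unreserve, don't just skip
            return 0;
        }
        return 1024;
    }

    public static void main(String[] args) {
        reservedNodes.add("node1");
        assignContainer("node1", false, true);
        System.out.println(reservedNodes.contains("node1")); // false
    }
}
```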
[jira] [Created] (YARN-3628) The default value for yarn.nodemanager.container-metrics.period-ms should not be -1.
zhihai xu created YARN-3628: --- Summary: The default value for yarn.nodemanager.container-metrics.period-ms should not be -1. Key: YARN-3628 URL: https://issues.apache.org/jira/browse/YARN-3628 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor The default value for yarn.nodemanager.container-metrics.period-ms should not be -1. The current default value for yarn.nodemanager.container-metrics.period-ms is -1, while the default value for yarn.nodemanager.container-metrics.enable is true. So empty content is shown for an active container's metrics until the container finishes: flushOnPeriod is always false when flushPeriodMs is -1, so the metrics content is only emitted when the container is finished. {code} if (finished || flushOnPeriod) { registry.snapshot(collector.addRecord(registry.info()), all); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3604) removeApplication in ZKRMStateStore should also disable watch.
zhihai xu created YARN-3604: --- Summary: removeApplication in ZKRMStateStore should also disable watch. Key: YARN-3604 URL: https://issues.apache.org/jira/browse/YARN-3604 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Priority: Minor removeApplication in ZKRMStateStore should also disable watch. The function removeApplication was added in YARN-3410, and YARN-3469 disabled watches for all functions in ZKRMStateStore. It looks like YARN-3410 missed the change from YARN-3469, because YARN-3410 added removeApplication after YARN-3469 was committed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3602) TestResourceLocalizationService.testPublicResourceInitializesLocalDir fails Intermittently due to IOException from cleanup
zhihai xu created YARN-3602: --- Summary: TestResourceLocalizationService.testPublicResourceInitializesLocalDir fails Intermittently due to IOException from cleanup Key: YARN-3602 URL: https://issues.apache.org/jira/browse/YARN-3602 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Priority: Minor TestResourceLocalizationService.testPublicResourceInitializesLocalDir fails intermittently due to an IOException from cleanup. The following stack trace is from the test report at https://builds.apache.org/job/PreCommit-YARN-Build/7729/testReport/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer/TestResourceLocalizationService/testPublicResourceInitializesLocalDir/ {code} Error Message Unable to delete directory target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/2/filecache. Stacktrace java.io.IOException: Unable to delete directory target/org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService/2/filecache. at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1541) at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.cleanup(TestResourceLocalizationService.java:187) {code} It looks like we can safely ignore the IOException in cleanup, which is called after each test. The IOException may be due to the test machine environment, because TestResourceLocalizationService/2/filecache is created by ResourceLocalizationService#initializeLocalDir. 
testPublicResourceInitializesLocalDir created 0/filecache, 1/filecache, 2/filecache and 3/filecache {code} for (int i = 0; i < 4; ++i) { localDirs.add(lfs.makeQualified(new Path(basedir, i + ""))); sDirs[i] = localDirs.get(i).toString(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
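The proposed "safely ignore the IOException in cleanup" change can be sketched as follows. This uses java.nio instead of commons-io so the example is self-contained; the helper name is illustrative, not the test's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch of tolerant test cleanup: deletion failures are swallowed so an
// environment-dependent IOException cannot fail the test itself.
public class TolerantCleanup {

    // Deletes the directory tree; every IOException is deliberately ignored.
    static void cleanupQuietly(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException ignored) {
                    // leftover files are a machine-environment issue, not a bug
                }
            });
        } catch (IOException ignored) {
            // directory may already be gone; that is fine for cleanup
        }
    }

    // Small self-check: cleanup removes the tree, and a repeated call on the
    // now-missing directory must not throw.
    static boolean demo() {
        try {
            Path tmp = Files.createTempDirectory("cleanup-demo");
            Files.createFile(tmp.resolve("file.txt"));
            cleanupQuietly(tmp);
            boolean gone = !Files.exists(tmp);
            cleanupQuietly(tmp);  // second call is a silent no-op
            return gone;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```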
[jira] [Resolved] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
[ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-2873. - Resolution: Not A Problem > improve LevelDB error handling for missing files DBException to avoid NM > start failure. > --- > > Key: YARN-2873 > URL: https://issues.apache.org/jira/browse/YARN-2873 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2873.000.patch, YARN-2873.001.patch > > > improve LevelDB error handling for missing files DBException to avoid NM > start failure. > We saw the following three LevelDB exceptions; all of them cause NM > start failure. > DBException 1 in ShuffleHandler > {code} > INFO org.apache.hadoop.service.AbstractService: Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl > failed in state STARTED; cause: > org.apache.hadoop.service.ServiceStateException: > org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 > missing files; e.g.: > /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst > org.apache.hadoop.service.ServiceStateException: > org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 > missing files; e.g.: > /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) > Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: > Corruption: 1 missing files; e.g.: > /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst > at > org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) > at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) > at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) > at > org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475) > at > org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443) > at > org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > ... 
10 more > {code} > DBException 2 in NMLeveldbStateStoreService: > {code} > Error starting NodeManager > org.apache.hadoop.service.ServiceStateException: > org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 > missing files; e.g.: > /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) > > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152) > > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190) > > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) > > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) > > Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: > Corruption: 1 missing files; e.g.: > /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst > at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) > at org.
[jira] [Resolved] (YARN-3114) It would be better to consider integer(long) overflow when compare the time in DelegationTokenRenewer.
[ https://issues.apache.org/jira/browse/YARN-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3114. - Resolution: Not A Problem > It would be better to consider integer(long) overflow when compare the time > in DelegationTokenRenewer. > -- > > Key: YARN-3114 > URL: https://issues.apache.org/jira/browse/YARN-3114 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Minor > Attachments: YARN-3114.000.patch > > > It would be better to consider integer (long) overflow when comparing times > in DelegationTokenRenewer. > When comparing times in DelegationTokenRenewer#DelayedTokenRemovalRunnable to > cancel a token, there is a problem when currentTimeMillis is close to > Long.MAX_VALUE. > The safer way to compare times is to compare the time difference: > change > {code} > if (e.getValue() < System.currentTimeMillis()) { > {code} > to > {code} > if (e.getValue() - System.currentTimeMillis() < 0) { > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
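The difference-based comparison above can be demonstrated directly. The following is a self-contained illustration of why the subtraction form survives overflow while the direct `<` comparison does not; it is not the DelegationTokenRenewer code itself.

```java
// Demonstrates the overflow-safe time comparison: subtracting and testing
// the sign is correct even when one operand is near Long.MAX_VALUE, while a
// direct '<' comparison wraps and gives the wrong answer.
public class TimeCompare {

    // Direct comparison: breaks when the deadline wraps past Long.MAX_VALUE.
    static boolean expiredDirect(long deadline, long now) {
        return deadline < now;
    }

    // Difference-based comparison: correct as long as the true distance
    // between the two instants fits in a long.
    static boolean expiredSafe(long deadline, long now) {
        return deadline - now < 0;
    }

    public static void main(String[] args) {
        long now = Long.MAX_VALUE - 10;  // clock near the overflow point
        long deadline = now + 20;        // wraps to a large negative value
        // Direct form wrongly reports the deadline as already passed:
        System.out.println(expiredDirect(deadline, now)); // true (wrong)
        // Difference form still sees 20 ms remaining:
        System.out.println(expiredSafe(deadline, now));   // false (correct)
    }
}
```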
[jira] [Created] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.
zhihai xu created YARN-3549: --- Summary: use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir. Key: YARN-3549 URL: https://issues.apache.org/jira/browse/YARN-3549 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Use the JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of the shell-based implementation from RawLocalFileSystem in checkLocalDir. As discussed in YARN-3491, the shell-based getPermission implementation runs the shell command "ls -ld" to get the permission, which takes 4 or 5 ms. We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation in RawLocalFileSystem to get rid of the shell-based FileStatus implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3190) NM can't aggregate logs: token can't be found in cache
[ https://issues.apache.org/jira/browse/YARN-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3190. - Resolution: Duplicate. The issue is fixed by YARN-2964 > NM can't aggregate logs: token can't be found in cache > --- > > Key: YARN-3190 > URL: https://issues.apache.org/jira/browse/YARN-3190 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.0 > Environment: CDH 5.3.1 > HA HDFS > Kerberos >Reporter: Andrejs Dubovskis >Priority: Minor > > In rare cases the node manager cannot aggregate logs, generating an exception: > {code} > 2015-02-12 13:04:03,703 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Starting aggregate log-file for app application_1423661043235_2150 at > /tmp/logs/catalyst/logs/application_1423661043235_2150/catdn001.intrum.net_8041.tmp > 2015-02-12 13:04:03,707 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /data5/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442 > 2015-02-12 13:04:03,707 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /data6/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442 > 2015-02-12 13:04:03,707 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /data7/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442 > 2015-02-12 13:04:03,709 INFO > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting > absolute path : > /data1/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150 > 2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: > PriviledgedActionException as:catalyst (auth:SIMPLE) > 
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in > cache > 2015-02-12 13:04:03,709 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in > cache > 2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: > PriviledgedActionException as:catalyst (auth:SIMPLE) > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in > cache > 2015-02-12 13:04:03,712 WARN org.apache.hadoop.security.UserGroupInformation: > PriviledgedActionException as:catalyst (auth:SIMPLE) > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in > cache > 2015-02-12 13:04:03,712 ERROR > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Cannot create writer for app application_1423661043235_2150. Disabling > log-aggregation for this app. 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in > cache > at org.apache.hadoop.ipc.Client.call(Client.java:1411) > at org.apache.hadoop.ipc.Client.call(Client.java:1364) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy19.getServerDefaults(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:259) > at sun.reflect.GeneratedMethodAccessor114.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy20.getServerDefaults(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:966) > at org.apache.hadoop.fs.Hdfs.getServerDefaults(Hdfs.java:159) > at > org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:543) > at
[jira] [Created] (YARN-3516) killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status.
zhihai xu created YARN-3516: --- Summary: killing ContainerLocalizer action doesn't take effect when private localizer receives FETCH_FAILURE status. Key: YARN-3516 URL: https://issues.apache.org/jira/browse/YARN-3516 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu The killing-ContainerLocalizer action doesn't take effect when the private localizer receives a FETCH_FAILURE status. This is a typo from YARN-3024: with YARN-3024, the ContainerLocalizer will be killed only if {{action}} is set to {{LocalizerAction.DIE}}, so a direct call to {{response.setLocalizerAction}} will be overwritten. This is also a regression from the old code. It also makes sense to kill the ContainerLocalizer on FETCH_FAILURE, because the container will send a CLEANUP_CONTAINER_RESOURCES event after the localization failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3496) Add a configuration to disable/enable storing localization state in NMLeveldbStateStore
[ https://issues.apache.org/jira/browse/YARN-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3496. - Resolution: Not A Problem > Add a configuration to disable/enable storing localization state in > NMLeveldbStateStore > --- > > Key: YARN-3496 > URL: https://issues.apache.org/jira/browse/YARN-3496 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu > > Add a configuration to disable/enable storing localization state in > NMLeveldbStateStore. > Storing localization state in LevelDB may add some overhead, which may > affect NM performance. > It would be better to have a configuration to disable/enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3496) Add a configuration to disable/enable storing localization state in NM StateStore
zhihai xu created YARN-3496: --- Summary: Add a configuration to disable/enable storing localization state in NM StateStore Key: YARN-3496 URL: https://issues.apache.org/jira/browse/YARN-3496 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Add a configuration to disable/enable storing localization state in the NM StateStore. Storing localization state in LevelDB may add some overhead, which may affect NM performance. It would be better to have a configuration to disable/enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).
zhihai xu created YARN-3491: --- Summary: Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Key: YARN-3491 URL: https://issues.apache.org/jira/browse/YARN-3491 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer). Currently, FSDownload submission to the thread pool is done in PublicLocalizer#addResource, which runs in the Dispatcher thread, while completed localization handling is done in PublicLocalizer#run, which runs in the PublicLocalizer thread. Because the FSDownload submission to the thread pool in the following code is time-consuming, the thread pool can't be fully utilized. Instead of doing public resource localization in parallel (multithreaded), public resource localization is effectively serialized most of the time. {code} synchronized (pending) { pending.put(queue.submit(new FSDownload(lfs, null, conf, publicDirDestPath, resource, request.getContext().getStatCache())), request); } {code} There are also two more benefits with this change: 1. The Dispatcher thread won't be blocked by the above FSDownload submission; the Dispatcher thread handles most time-critical events at the Node Manager. 2. No synchronization is needed on the HashMap (pending), because pending will only be accessed in the PublicLocalizer thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
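The proposed single-thread structure can be sketched with a toy example. All names below are illustrative (the lambda stands in for FSDownload): one consumer thread both submits tasks to the pool and harvests completions, so the dispatcher only needs a cheap enqueue and the pending state is single-threaded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy sketch of the proposed structure (names illustrative): one thread both
// submits download tasks and collects completed ones, so 'pending' needs no
// synchronization and the dispatcher thread is never blocked by submission.
public class SingleThreadLocalizer {

    // Stands in for the PublicLocalizer thread: submit everything to the
    // pool, then harvest completions, all from this single caller thread.
    static List<String> localizeAll(List<String> resources) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            CompletionService<String> pending =
                new ExecutorCompletionService<>(pool);
            for (String r : resources) {
                pending.submit(() -> "downloaded:" + r);  // FSDownload stand-in
            }
            List<String> completed = new ArrayList<>();
            for (int i = 0; i < resources.size(); i++) {
                completed.add(pending.take().get());
            }
            return completed;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(localizeAll(Arrays.asList("A", "B")).size()); // 2
    }
}
```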
[jira] [Created] (YARN-3465) use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl
zhihai xu created YARN-3465: --- Summary: use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl Key: YARN-3465 URL: https://issues.apache.org/jira/browse/YARN-3465 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu use LinkedHashMap to keep the order of LocalResourceRequest in ContainerImpl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
zhihai xu created YARN-3464: --- Summary: Race condition in LocalizerRunner causes container localization timeout. Key: YARN-3464 URL: https://issues.apache.org/jira/browse/YARN-3464 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition in LocalizerRunner causes container localization timeout. Currently, LocalizerRunner kills the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty. {code} } else if (pending.isEmpty()) { action = LocalizerAction.DIE; } {code} If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, that LocalizerResourceRequestEvent will never be handled. The container will stay in the LOCALIZING state until it is killed by the AM due to TASK_TIMEOUT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
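The race can be modeled in miniature. This is a toy model with illustrative names, not the LocalizerRunner code: the point is that the "pending is empty, so DIE" decision and the enqueue of a new request must be serialized on the same lock, and a request arriving after DIE must be visibly rejected rather than silently dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the race (illustrative names): if the emptiness check and the
// enqueue are not serialized, a request added just after the check is lost
// and the container waits at LOCALIZING until the AM's TASK_TIMEOUT.
public class LocalizerRace {
    private final Deque<String> pending = new ArrayDeque<>();
    private boolean dead = false;

    // Heartbeat handling: the DIE decision is made only while holding
    // the lock, so no event can slip in between "empty" and "kill".
    synchronized String processHeartbeat() {
        if (pending.isEmpty()) {
            dead = true;
            return "DIE";
        }
        return pending.poll();
    }

    // Enqueue under the same lock; a request arriving after DIE is rejected,
    // so the caller knows to start a fresh localizer instead of losing it.
    synchronized boolean addResource(String request) {
        if (dead) {
            return false;
        }
        pending.add(request);
        return true;
    }

    public static void main(String[] args) {
        LocalizerRace r = new LocalizerRace();
        r.addResource("jarA");
        System.out.println(r.processHeartbeat()); // jarA
        System.out.println(r.processHeartbeat()); // DIE
        System.out.println(r.addResource("jarB")); // false
    }
}
```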
[jira] [Created] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
zhihai xu created YARN-3446: --- Summary: FairScheduler HeadRoom calculation should exclude nodes in the blacklist. Key: YARN-3446 URL: https://issues.apache.org/jira/browse/YARN-3446 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: zhihai xu Assignee: zhihai xu FairScheduler headroom calculation should exclude nodes in the blacklist. MRAppMaster does not preempt reducers because the headroom used in the reducer preemption calculation includes blacklisted nodes. This can make jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResource the AM gets from the RM includes the blacklisted nodes' available resources). This issue is similar to YARN-1680, which covers the Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_000001
zhihai xu created YARN-3429: --- Summary: TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_000001 Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu TestAMRMTokens.testTokenExpiry fails intermittently with the error message: Invalid AMRMToken from appattempt_1427804754787_0001_000001. The error log is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3395) Handle the user name correctly when submit application and use user name as default queue name.
zhihai xu created YARN-3395: --- Summary: Handle the user name correctly when submit application and use user name as default queue name. Key: YARN-3395 URL: https://issues.apache.org/jira/browse/YARN-3395 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: zhihai xu Assignee: zhihai xu Handle the user name correctly when submitting an application and using the user name as the default queue name. We should reject applications with an empty or whitespace-only user name, because such a user name doesn't make sense. We should also remove the leading and trailing whitespace of the user name when we use it as the default queue name, otherwise it will be rejected with an InvalidQueueNameException from QueueManager. I think this change makes sense because it is compatible with the queue-name convention, and we already did a similar thing for '.' in user names. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
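The validate-then-trim handling described above can be sketched as follows. The helper names are illustrative, not the scheduler's actual API, and the special handling of '.' mentioned in the issue is deliberately left out.

```java
// Sketch of the proposed user-name handling (illustrative helper, not the
// real scheduler code): reject empty or whitespace-only user names, and trim
// the rest before deriving the default queue name so QueueManager's
// queue-name convention (no leading/trailing whitespace) is satisfied.
public class UserQueueName {

    // A user name must contain at least one non-whitespace character.
    static boolean isValidUser(String user) {
        return user != null && !user.trim().isEmpty();
    }

    // Derives the default queue name from a user name. The existing special
    // handling of '.' in user names is omitted here for brevity.
    static String defaultQueueFor(String user) {
        if (!isValidUser(user)) {
            throw new IllegalArgumentException(
                "empty or whitespace-only user name");
        }
        return user.trim();
    }

    public static void main(String[] args) {
        System.out.println(defaultQueueFor(" alice ")); // alice
    }
}
```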
[jira] [Created] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete).
zhihai xu created YARN-3385: --- Summary: Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion(Op.delete). Key: YARN-3385 URL: https://issues.apache.org/jira/browse/YARN-3385 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Race condition: a KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete). This race condition is similar to the ones in YARN-2721 and YARN-3023. Since the race condition exists for ZK node creation, it should also exist for ZK node deletion. We saw this issue with the following stack trace: {code} 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691) at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3363) add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container.
zhihai xu created YARN-3363: --- Summary: add localization and container launch time to ContainerMetrics at NM to show these timing information for each active container. Key: YARN-3363 URL: https://issues.apache.org/jira/browse/YARN-3363 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Add localization and container launch time to ContainerMetrics at the NM to expose this timing information for each active container. Currently ContainerMetrics has the container's actual memory usage (YARN-2984), actual CPU usage (YARN-3122), and resource and pid (YARN-3022). It would be better to also have localization and container launch time in ContainerMetrics for each active container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3355) findbugs warning:Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocConf
zhihai xu created YARN-3355: --- Summary: findbugs warning:Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocConf Key: YARN-3355 URL: https://issues.apache.org/jira/browse/YARN-3355 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: zhihai xu Assignee: zhihai xu findbugs warning: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocConf. The findbugs warning identified two unsynchronized accesses: 1. FairScheduler.getPlanQueues. It looks like we should add a lock in FairScheduler.getPlanQueues, because getPlanQueues will be called by AbstractReservationSystem.reinitialize. 2. FairScheduler.getAllocationConfiguration, which looks OK without a lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
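The fix direction for the first access can be sketched in plain Java. The class and field below are illustrative stand-ins for the FairScheduler members, not the real scheduler code: the point is simply that the read in the getter uses the same monitor as the writers, which is what makes findbugs' inconsistent-synchronization warning go away.

```java
import java.util.Collections;
import java.util.List;

public class AllocConfHolder {
    // Stand-in for FairScheduler.allocConf; written under a lock elsewhere.
    private List<String> allocConf = Collections.emptyList();

    // Fix direction: guard the read in getPlanQueues with the same monitor
    // used by the writers, so all access to allocConf is synchronized.
    public synchronized List<String> getPlanQueues() {
        return allocConf;
    }

    public synchronized void reinitialize(List<String> newConf) {
        allocConf = newConf;
    }

    public static void main(String[] args) {
        AllocConfHolder h = new AllocConfHolder();
        h.reinitialize(List.of("root.planQueue"));
        System.out.println(h.getPlanQueues()); // [root.planQueue]
    }
}
```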
[jira] [Created] (YARN-3349) treat all exceptions as failure in testFSRMStateStoreClientRetry
zhihai xu created YARN-3349: --- Summary: treat all exceptions as failure in testFSRMStateStoreClientRetry Key: YARN-3349 URL: https://issues.apache.org/jira/browse/YARN-3349 Project: Hadoop YARN Issue Type: Improvement Components: test Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Treat all exceptions as failures in testFSRMStateStoreClientRetry. Currently the exception "could only be replicated to 0 nodes instead of minReplication (=1)" is not treated as a failure in testFSRMStateStoreClientRetry. {code} // TODO 0 datanode exception will not be retried by dfs client, fix // that separately. if (!e.getMessage().contains("could only be replicated" + " to 0 nodes instead of minReplication (=1)")) { assertionFailedInThread.set(true); } {code} With YARN-2820 (Retry in FileSystemRMStateStore), we needn't treat this exception specially. We can remove the check and treat all exceptions as failures in testFSRMStateStoreClientRetry. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3263) ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after credentials.readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3263. - Resolution: Not a Problem This is not an issue: tokens.rewind() is called before credentials.readTokenStorageStream(buf), which has the same effect as rewinding after readTokenStorageStream. Also, nothing else accesses the tokens except parseCredentials. > ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after > credentials.readTokenStorageStream > -- > > Key: YARN-3263 > URL: https://issues.apache.org/jira/browse/YARN-3263 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: zhihai xu >Assignee: zhihai xu > > ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after > credentials.readTokenStorageStream. So the next time if we access Tokens, we > will have EOFException. > The following is the code for parseCredentials in ContainerManagerImpl. > {code} > private Credentials parseCredentials(ContainerLaunchContext launchContext) > throws IOException { > Credentials credentials = new Credentials(); > // Parse credentials > ByteBuffer tokens = launchContext.getTokens(); > if (tokens != null) { > DataInputByteBuffer buf = new DataInputByteBuffer(); > tokens.rewind(); > buf.reset(tokens); > credentials.readTokenStorageStream(buf); > if (LOG.isDebugEnabled()) { > for (Token tk : > credentials.getAllTokens()) { > LOG.debug(tk.getService() + " = " + tk.toString()); > } > } > } > // End of parsing credentials > return credentials; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
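The resolution can be illustrated with a minimal ByteBuffer sketch, with no YARN classes involved: rewinding *before* each read restores the buffer's full content, so it does not matter that the previous read left the position at the end.

```java
import java.nio.ByteBuffer;

public class RewindDemo {
    // Returns how many bytes a reader would see if the buffer is rewound
    // before reading, mirroring parseCredentials' tokens.rewind() call.
    static int readableAfterRewind(ByteBuffer buf) {
        buf.rewind();
        return buf.remaining();
    }

    public static void main(String[] args) {
        ByteBuffer tokens = ByteBuffer.wrap(new byte[]{1, 2, 3, 4});
        tokens.getInt();                        // a previous read consumed the buffer
        System.out.println(tokens.remaining()); // 0: reading again right now would hit EOF
        // Rewinding before the next read restores the full content, which is
        // why rewind-before-read is equivalent to rewind-after-read here.
        System.out.println(readableAfterRewind(tokens)); // 4
    }
}
```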
[jira] [Created] (YARN-3341) Fix findbugs warning:BC_UNCONFIRMED_CAST at FSSchedulerNode.reserveResource
zhihai xu created YARN-3341: --- Summary: Fix findbugs warning:BC_UNCONFIRMED_CAST at FSSchedulerNode.reserveResource Key: YARN-3341 URL: https://issues.apache.org/jira/browse/YARN-3341 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Fix findbugs warning BC_UNCONFIRMED_CAST at FSSchedulerNode.reserveResource. The warning message is {code} Unchecked/unconfirmed cast from org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode.reserveResource(SchedulerApplicationAttempt, Priority, RMContainer) {code} The code that causes the warning is {code} this.reservedAppSchedulable = (FSAppAttempt) application; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
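One common way to address BC_UNCONFIRMED_CAST is to confirm the runtime type before the downcast. The sketch below uses hypothetical stand-in classes (SchedulerApp/FSApp) rather than the real YARN types, and is one possible shape of the fix, not necessarily the one applied to the issue.

```java
public class CastCheck {
    static class SchedulerApp {}                 // stand-in for SchedulerApplicationAttempt
    static class FSApp extends SchedulerApp {}   // stand-in for FSAppAttempt

    static FSApp reserve(SchedulerApp application) {
        // Confirming the type before the downcast documents the assumption
        // and turns a surprise ClassCastException into a clear error.
        if (!(application instanceof FSApp)) {
            throw new IllegalArgumentException("expected an FSApp, got "
                + application.getClass().getName());
        }
        return (FSApp) application;
    }

    public static void main(String[] args) {
        System.out.println(reserve(new FSApp()) != null); // true
    }
}
```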
[jira] [Created] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer
zhihai xu created YARN-3336: --- Summary: FileSystem memory leak in DelegationTokenRenewer Key: YARN-3336 URL: https://issues.apache.org/jira/browse/YARN-3336 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Critical FileSystem memory leak in DelegationTokenRenewer. Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new FileSystem entry is added to FileSystem#CACHE and never garbage collected. This is the implementation of obtainSystemTokensForUser: {code} protected Token<?>[] obtainSystemTokensForUser(String user, final Credentials credentials) throws IOException, InterruptedException { // Get new hdfs tokens on behalf of this user UserGroupInformation proxyUser = UserGroupInformation.createProxyUser(user, UserGroupInformation.getLoginUser()); Token<?>[] newTokens = proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() { @Override public Token<?>[] run() throws Exception { return FileSystem.get(getConfig()).addDelegationTokens( UserGroupInformation.getLoginUser().getUserName(), credentials); } }); return newTokens; } {code} The memory leak happens when FileSystem.get(getConfig()) is called with a new proxy user, because createProxyUser always creates a new Subject. 
{code} public static UserGroupInformation createProxyUser(String user, UserGroupInformation realUser) { if (user == null || user.isEmpty()) { throw new IllegalArgumentException("Null user"); } if (realUser == null) { throw new IllegalArgumentException("Null real user"); } Subject subject = new Subject(); Set<Principal> principals = subject.getPrincipals(); principals.add(new User(user)); principals.add(new RealUser(realUser)); UserGroupInformation result = new UserGroupInformation(subject); result.setAuthenticationMethod(AuthenticationMethod.PROXY); return result; } {code} FileSystem#Cache#Key.equals compares the ugi {code} Key(URI uri, Configuration conf, long unique) throws IOException { scheme = uri.getScheme()==null?"":uri.getScheme().toLowerCase(); authority = uri.getAuthority()==null?"":uri.getAuthority().toLowerCase(); this.unique = unique; this.ugi = UserGroupInformation.getCurrentUser(); } public boolean equals(Object obj) { if (obj == this) { return true; } if (obj != null && obj instanceof Key) { Key that = (Key)obj; return isEqual(this.scheme, that.scheme) && isEqual(this.authority, that.authority) && isEqual(this.ugi, that.ugi) && (this.unique == that.unique); } return false; } {code} UserGroupInformation.equals compares the subject by reference. {code} public boolean equals(Object o) { if (o == this) { return true; } else if (o == null || getClass() != o.getClass()) { return false; } else { return subject == ((UserGroupInformation) o).subject; } } {code} So in this case, every time createProxyUser and FileSystem.get(getConfig()) are called, a new FileSystem is created and a new entry is added to FileSystem.CACHE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
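The reference-equality behavior described above can be reproduced with a tiny model. Ugi here is a hypothetical stand-in for UserGroupInformation, and the map stands in for FileSystem#CACHE; the point is that a cache keyed by reference-equal subjects misses on every fresh proxy user and grows without bound.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheLeakModel {
    // Simplified stand-in for UserGroupInformation: equality is reference
    // equality on the wrapped Subject, as in the UGI.equals shown above.
    static final class Ugi {
        final Object subject = new Object();   // fresh Subject per proxy user
        @Override public boolean equals(Object o) {
            return o instanceof Ugi && ((Ugi) o).subject == subject;
        }
        @Override public int hashCode() { return System.identityHashCode(subject); }
    }

    // Models FileSystem#CACHE keyed (in part) by the UGI: every call with a
    // fresh proxy UGI misses and inserts an entry that is never evicted.
    static int leakedEntries(int calls) {
        Map<Ugi, String> cache = new HashMap<>();
        for (int i = 0; i < calls; i++) {
            cache.computeIfAbsent(new Ugi(), k -> "fs-instance"); // always a miss
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(leakedEntries(1000)); // 1000: one leaked entry per call
    }
}
```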
[jira] [Created] (YARN-3263) ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after credentials.readTokenStorageStream
zhihai xu created YARN-3263: --- Summary: ContainerManagerImpl#parseCredentials don't rewind the ByteBuffer after credentials.readTokenStorageStream Key: YARN-3263 URL: https://issues.apache.org/jira/browse/YARN-3263 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu ContainerManagerImpl#parseCredentials doesn't rewind the ByteBuffer after credentials.readTokenStorageStream, so the next time we access the tokens we will get an EOFException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3247) TestQueueMappings failure for FairScheduler
zhihai xu created YARN-3247: --- Summary: TestQueueMappings failure for FairScheduler Key: YARN-3247 URL: https://issues.apache.org/jira/browse/YARN-3247 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial TestQueueMappings is only supported by CapacityScheduler. We should configure CapacityScheduler for this test. Otherwise if the default scheduler is set to FairScheduler, the test will fail with the following message: {code} Running org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.392 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings testQueueMapping(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings) Time elapsed: 2.202 sec <<< ERROR! java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be cast to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:118) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1266) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1319) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:558) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:989) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:255) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:108) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:103) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueMappings.testQueueMapping(TestQueueMappings.java:143) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
zhihai xu created YARN-3242: --- Summary: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical A watcher event from an old ZK client session can mess up the new ZK client session, because ZooKeeper closes client sessions asynchronously: a watcher event from the old session can still be delivered to ZKRMStateStore after the old session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so an event from an old, just-closed ZK client session will still be processed. For example, if a Disconnected event from the old session is received after the new session is connected, the zkClient will be set to null {code} case Disconnected: LOG.info("ZKRMStateStore Session disconnected"); oldZkClient = zkClient; zkClient = null; break; {code} Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and reconnected. Then all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM shuts down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
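A minimal sketch of the fix direction is to check that an event belongs to the current session before acting on it. Session ids here are plain longs and the class is a hypothetical model of ZKRMStateStore#processWatchEvent, not the real implementation.

```java
public class SessionGuard {
    long currentSessionId = 2L;   // the new, connected session
    boolean zkClientCleared = false;

    // Models processWatchEvent with a session check: a Disconnected event
    // from an already-closed session is ignored instead of clearing zkClient.
    void processDisconnected(long eventSessionId) {
        if (eventSessionId != currentSessionId) {
            return;               // stale event from the old session: drop it
        }
        zkClientCleared = true;   // genuine disconnect of the current session
    }

    public static void main(String[] args) {
        SessionGuard g = new SessionGuard();
        g.processDisconnected(1L);             // old session's Disconnected event
        System.out.println(g.zkClientCleared); // false: state store stays usable
        g.processDisconnected(2L);             // current session really disconnected
        System.out.println(g.zkClientCleared); // true
    }
}
```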
[jira] [Created] (YARN-3241) Leading space, trailing space and empty sub queue name may cause MetricsException for fair scheduler
zhihai xu created YARN-3241: --- Summary: Leading space, trailing space and empty sub queue name may cause MetricsException for fair scheduler Key: YARN-3241 URL: https://issues.apache.org/jira/browse/YARN-3241 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: zhihai xu Assignee: zhihai xu A leading space, trailing space or empty sub queue name may cause a MetricsException (Metrics source XXX already exists!) when adding an application to FairScheduler. The reason is that QueueMetrics parses the queue name differently from QueueManager. QueueMetrics uses Q_SPLITTER to parse the queue name, which removes leading and trailing spaces in sub queue names and also removes empty sub queue names. {code} static final Splitter Q_SPLITTER = Splitter.on('.').omitEmptyStrings().trimResults(); {code} But QueueManager won't remove leading spaces, trailing spaces or empty sub queue names. This causes FSQueue and FSQueueMetrics to get out of sync: QueueManager considers the two queue names different and tries to create a new queue, but FSQueueMetrics treats the two names as the same, which raises the "Metrics source XXX already exists!" MetricsException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
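The mismatch can be demonstrated without Guava by mimicking the two parsers in plain Java. metricsStyle and managerStyle are illustrative names, not YARN methods; metricsStyle reproduces what `Splitter.on('.').omitEmptyStrings().trimResults()` does, while managerStyle keeps the raw parts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueueNameParse {
    // Mimics QueueMetrics' Q_SPLITTER: split on '.', trim each part,
    // and drop empty parts.
    static List<String> metricsStyle(String name) {
        List<String> parts = new ArrayList<>();
        for (String p : name.split("\\.", -1)) {
            String t = p.trim();
            if (!t.isEmpty()) parts.add(t);
        }
        return parts;
    }

    // Mimics QueueManager: raw split, spaces and empty parts preserved.
    static List<String> managerStyle(String name) {
        return Arrays.asList(name.split("\\.", -1));
    }

    public static void main(String[] args) {
        String a = "root.queue1";
        String b = "root. queue1";  // leading space in the sub queue name
        // QueueManager sees two distinct queues...
        System.out.println(managerStyle(a).equals(managerStyle(b))); // false
        // ...but QueueMetrics collapses them to the same metrics source.
        System.out.println(metricsStyle(a).equals(metricsStyle(b))); // true
    }
}
```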
[jira] [Created] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
zhihai xu created YARN-3236: --- Summary: cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Clean up RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We had better remove it to avoid confusion, since it was only introduced for a very short time and nothing uses it now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3205) FileSystemRMStateStore should disable FileSystem Cache to avoid get a Filesystem with an old configuration.
zhihai xu created YARN-3205: --- Summary: FileSystemRMStateStore should disable FileSystem Cache to avoid get a Filesystem with an old configuration. Key: YARN-3205 URL: https://issues.apache.org/jira/browse/YARN-3205 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu FileSystemRMStateStore should disable the FileSystem cache to avoid getting a FileSystem with an old configuration. The old configuration may not have all the customized DFS_CLIENT configurations for FileSystemRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
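One standard way to get a FileSystem built from the store's own configuration is to disable the cache for the scheme, e.g. with the Hadoop setting below (shown for HDFS; this is a sketch of one option, not necessarily the fix applied to the issue — a code-level alternative is FileSystem.newInstance(conf), which also bypasses the cache).

```xml
<!-- Sketch: force a fresh FileSystem instance instead of reusing a cached
     one that was constructed from an older Configuration. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```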
[jira] [Created] (YARN-3114) It would be better to consider integer(long) overflow when compare the time in DelegationTokenRenewer.
zhihai xu created YARN-3114: --- Summary: It would be better to consider integer(long) overflow when compare the time in DelegationTokenRenewer. Key: YARN-3114 URL: https://issues.apache.org/jira/browse/YARN-3114 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Minor It would be better to consider integer (long) overflow when comparing times in DelegationTokenRenewer. When comparing times in DelegationTokenRenewer#DelayedTokenRemovalRunnable to cancel a token, there will be a problem when currentTimeMillis is close to Long.MAX_VALUE. The safer way is to compare the time difference: change {code} if (e.getValue() < System.currentTimeMillis()) { {code} to {code} if (e.getValue() - System.currentTimeMillis() < 0) { {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
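The difference-based comparison stays correct even when the deadline wraps past Long.MAX_VALUE, which the direct comparison does not. The class below is a self-contained demonstration of the two checks from the description.

```java
public class TimeCompare {
    // Overflow-safe "has the deadline passed" check, as suggested above.
    static boolean expired(long deadline, long now) {
        return deadline - now < 0;
    }

    // The original direct comparison, which breaks near Long.MAX_VALUE.
    static boolean expiredNaive(long deadline, long now) {
        return deadline < now;
    }

    public static void main(String[] args) {
        long now = Long.MAX_VALUE - 10;
        long deadline = now + 20; // wraps past Long.MAX_VALUE to a negative value
        System.out.println(expiredNaive(deadline, now)); // true  (wrong: deadline is in the future)
        System.out.println(expired(deadline, now));      // false (correct: deadline - now == 20)
    }
}
```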
[jira] [Created] (YARN-3106) The message in IllegalArgumentException gave wrong information in NMTokenSecretManagerInRM.java and RMContainerTokenSecretManager.java
zhihai xu created YARN-3106: --- Summary: The message in IllegalArgumentException gave wrong information in NMTokenSecretManagerInRM.java and RMContainerTokenSecretManager.java Key: YARN-3106 URL: https://issues.apache.org/jira/browse/YARN-3106 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Minor The message in the IllegalArgumentException gives wrong information in NMTokenSecretManagerInRM.java and RMContainerTokenSecretManager.java. We saw this error message: {code} Error starting ResourceManager java.lang.IllegalArgumentException: yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs should be more than 2 X yarn.nm.liveness-monitor.expiry-interval-ms {code} After checking the source code, I found this error message misleading. The following is the code from NMTokenSecretManagerInRM.java {code} rollingInterval = this.conf.getLong( YarnConfiguration.RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS, YarnConfiguration.DEFAULT_RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS) * 1000; this.activationDelay = (long) (conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS) * 1.5); LOG.info("NMTokenKeyRollingInterval: " + this.rollingInterval + "ms and NMTokenKeyActivationDelay: " + this.activationDelay + "ms"); if (rollingInterval <= activationDelay * 2) { throw new IllegalArgumentException( YarnConfiguration.RM_NMTOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS + " should be more than 2 X " + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS); } {code} It should be 3 X, not 2 X. The same error also occurs in RMContainerTokenSecretManager.java. 
{code} this.rollingInterval = conf.getLong( YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS, YarnConfiguration.DEFAULT_RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS) * 1000; this.activationDelay = (long) (conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS) * 1.5); LOG.info("ContainerTokenKeyRollingInterval: " + this.rollingInterval + "ms and ContainerTokenKeyActivationDelay: " + this.activationDelay + "ms"); if (rollingInterval <= activationDelay * 2) { throw new IllegalArgumentException( YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS + " should be more than 2 X " +YarnConfiguration.RM_CONTAINER_TOKEN_MASTER_KEY_ROLLING_INTERVAL_SECS); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
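Why 3 X: the check compares the rolling interval against twice the activation delay, and the activation delay is itself 1.5 × the expiry interval, so the effective requirement relative to the expiry interval is 3 ×. The small numeric check below mirrors the validation logic (valid is an illustrative helper, not a YARN method):

```java
public class RollingIntervalCheck {
    // Mirrors the validation above: rollingInterval must exceed
    // 2 * activationDelay, where activationDelay = 1.5 * expiryIntervalMs,
    // i.e. rollingInterval must exceed 3 * expiryIntervalMs.
    static boolean valid(long rollingIntervalMs, long expiryIntervalMs) {
        long activationDelay = (long) (expiryIntervalMs * 1.5);
        return rollingIntervalMs > activationDelay * 2;
    }

    public static void main(String[] args) {
        long expiry = 600_000L;                            // 10-minute expiry interval
        System.out.println(valid(2 * expiry, expiry));     // false: 2 X is rejected
        System.out.println(valid(3 * expiry, expiry));     // false: exactly 3 X is rejected too
        System.out.println(valid(3 * expiry + 1, expiry)); // true: strictly more than 3 X passes
    }
}
```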
[jira] [Created] (YARN-3079) Scheduler should also update maximumAllocation when updateNodeResource.
zhihai xu created YARN-3079: --- Summary: Scheduler should also update maximumAllocation when updateNodeResource. Key: YARN-3079 URL: https://issues.apache.org/jira/browse/YARN-3079 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu The scheduler should also update maximumAllocation when updateNodeResource is called. Otherwise, even if the node resource is changed by AdminService#updateNodeResource, maximumAllocation won't be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3056) add verification for containerLaunchDuration in TestNodeManagerMetrics.
zhihai xu created YARN-3056: --- Summary: add verification for containerLaunchDuration in TestNodeManagerMetrics. Key: YARN-3056 URL: https://issues.apache.org/jira/browse/YARN-3056 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0 Reporter: zhihai xu Priority: Trivial add verification for containerLaunchDuration in TestNodeManagerMetrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2679) Add metric for container launch duration
[ https://issues.apache.org/jira/browse/YARN-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-2679. - Resolution: Fixed > Add metric for container launch duration > > > Key: YARN-2679 > URL: https://issues.apache.org/jira/browse/YARN-2679 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Labels: metrics, supportability > Fix For: 2.7.0 > > Attachments: YARN-2679.000.patch, YARN-2679.001.patch, > YARN-2679.002.patch > > > add metrics in NodeManagerMetrics to get prepare time to launch container. > The prepare time is the duration between sending > ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving > ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3023. - Resolution: Duplicate > Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM > crash > - > > Key: YARN-3023 > URL: https://issues.apache.org/jira/browse/YARN-3023 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.6.0 >Reporter: zhihai xu >Assignee: zhihai xu > > Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM > crash. > The sequence for the Race condition is the following: > 1, RM Store attempt state to ZK by calling createWithRetries > {code} > 2015-01-06 12:37:35,343 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Storing attempt: AppId: application_1418914202950_42363 AttemptId: > appattempt_1418914202950_42363_01 MasterContainer: Container: > [ContainerId: container_1418914202950_42363_01_01, > {code} > 2. unluckily ConnectionLoss for the ZK session happened at the same time as > RM Stored attempt state to ZK. > The ZooKeeper server created the node and store the data successfully, But > due to ConnectionLoss, RM didn't know the operation (createWithRetries) is > succeeded. > {code} > 2015-01-06 12:37:36,102 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss > {code} > 3.RM did retry to store attempt state to ZK after one second > {code} > 2015-01-06 12:37:36,104 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Retrying operation on ZK. Retry no. 1 > {code} > 4. during the one second interval, the ZK session is reconnected. 
> {code} > 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established initiating session > 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated > timeout = 1 > {code} > 5. Because the node was created successfully at ZooKeeper in the first > try(runWithCheck), > For the second try, it will fail with NodeExists KeeperException > {code} > 2015-01-06 12:37:37,116 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > Exception while executing a ZK operation. > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists > 2015-01-06 12:37:37,118 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed > out ZK retries. Giving up! > {code} > 6.This NodeExists KeeperException will cause Storing AppAttempt failure in > RMStateStore > {code} > 2015-01-06 12:37:37,118 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error > storing appAttempt: appattempt_1418914202950_42363_01 > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists > {code} > 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to > ResourceManager > {code} > protected void notifyStoreOperationFailed(Exception failureCause) { > RMFatalEventType type; > if (failureCause instanceof StoreFencedException) { > type = RMFatalEventType.STATE_STORE_FENCED; > } else { > type = RMFatalEventType.STATE_STORE_OP_FAILED; > } > rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, > failureCause)); > } > {code} > 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED > RMFatalEvent. > {code} > 2015-01-06 12:37:37,128 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. 
Cause: > org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = > NodeExists > 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
zhihai xu created YARN-3023: --- Summary: Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu A race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes an RM crash. The sequence for the race condition is the following: 1. The RM stores the attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. Unluckily, a ConnectionLoss on the ZK session happened at the same time as the RM stored the attempt state to ZK. The ZooKeeper server created the node and stored the data successfully, but due to the ConnectionLoss, the RM didn't know the operation (createWithRetries) had succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3. The RM retried storing the attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. During the one-second interval, the ZK session reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. 
Because the node was created successfully at ZooKeeper on the first try (runWithCheck), the second try fails with a NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! {code} 6. This NodeExists KeeperException causes storing the AppAttempt to fail in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8. The ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent. {code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
zhihai xu created YARN-2873: --- Summary: improve LevelDB error handling for missing files DBException to avoid NM start failure. Key: YARN-2873 URL: https://issues.apache.org/jira/browse/YARN-2873 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Improve LevelDB error handling for the missing-files DBException to avoid NM start failure. We saw the following three LevelDB exceptions; all of them cause NM start failure. DBException 1 in ShuffleHandler {code} INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state STARTED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/05.sst at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475) at org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443) at org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
10 more {code} DBException 2 in NMLeveldbStateStoreService: {code} Error starting NodeManager org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/05.sst at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195) at org.apache.hadoop.service.AbstractService.init(
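All of these failures share one shape: a corrupt LevelDB store aborts NM startup. The improvement this issue proposes can be sketched as "open, and on a corruption error discard the store and retry on a fresh directory" — losing recovery state is preferable to the NM not starting at all. The sketch below is self-contained and illustrative; `Opener`, `openOrRecreate`, and `demo` are assumed names, not the actual NM API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class StoreRecovery {
  // Hypothetical opener: throws IOException("Corruption: ...") when the DB is damaged.
  interface Opener {
    Object open(Path dir) throws IOException;
  }

  static Object openOrRecreate(Opener opener, Path dir) throws IOException {
    try {
      return opener.open(dir);
    } catch (IOException e) {
      if (e.getMessage() != null && e.getMessage().contains("Corruption")) {
        deleteRecursively(dir);        // drop the damaged store
        Files.createDirectories(dir);  // start with a fresh, empty store dir
        return opener.open(dir);       // retry once on the clean directory
      }
      throw e;  // non-corruption errors still fail startup
    }
  }

  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) return;
    try (Stream<Path> walk = Files.walk(dir)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
    }
  }

  // Self-check: an opener that fails once with a "Corruption" message, then succeeds.
  static boolean demo() {
    try {
      Path dir = Files.createTempDirectory("nm-recovery-demo");
      final int[] calls = {0};
      Opener flaky = d -> {
        if (calls[0]++ == 0) throw new IOException("Corruption: 1 missing files");
        return "db";
      };
      return "db".equals(openOrRecreate(flaky, dir)) && calls[0] == 2;
    } catch (IOException e) {
      return false;
    }
  }
}
```

The real fix would also need to log loudly that recovery state was discarded, since recovered containers and shuffle state are lost in that path.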
[jira] [Created] (YARN-2831) NM should kill and cleanup the leaked containers.
zhihai xu created YARN-2831: --- Summary: NM should kill and cleanup the leaked containers. Key: YARN-2831 URL: https://issues.apache.org/jira/browse/YARN-2831 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu NM should kill and clean up leaked containers. As discussed in YARN-2816, we should implement a function that kills and cleans up a leaked container: look for the pid file, try to kill the process if the file is found, and return a recovered container status of killed/lost or something similar. This function can then be called whenever a leaked container is found. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
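The cleanup described above might look like the following self-contained sketch (the names `killLeakedContainer` and `RecoveredStatus` are illustrative, not the actual NM API); it reads the pid file, issues a best-effort kill via `ProcessHandle`, and reports a killed/lost status:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

public class LeakedContainerCleanup {
  enum RecoveredStatus { KILLED, LOST }

  static RecoveredStatus killLeakedContainer(Path pidFile) throws IOException {
    if (!Files.exists(pidFile)) {
      return RecoveredStatus.LOST;  // no pid file: nothing left to kill
    }
    long pid = Long.parseLong(Files.readString(pidFile).trim());
    Optional<ProcessHandle> handle = ProcessHandle.of(pid);
    handle.ifPresent(ProcessHandle::destroy);  // best-effort graceful kill
    return handle.isPresent() ? RecoveredStatus.KILLED : RecoveredStatus.LOST;
  }

  // Self-check: a missing pid file and a pid file pointing at a dead pid
  // should both report LOST.
  static boolean demo() {
    try {
      if (killLeakedContainer(Path.of("no-such-pid-file-yarn2831")) != RecoveredStatus.LOST) {
        return false;
      }
      Path pidFile = Files.createTempFile("container", ".pid");
      Files.writeString(pidFile, "2147480000\n");  // almost certainly no live process
      return killLeakedContainer(pidFile) == RecoveredStatus.LOST;
    } catch (IOException e) {
      return false;
    }
  }
}
```

Whether a live process is actually terminated depends on OS permissions; the NM would still need to reconcile the reported status with its recovered container state.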
[jira] [Created] (YARN-2820) Improve FileSystemRMStateStore update failure exception handling to not shutdown RM.
zhihai xu created YARN-2820: --- Summary: Improve FileSystemRMStateStore update failure exception handling to not shutdown RM. Key: YARN-2820 URL: https://issues.apache.org/jira/browse/YARN-2820 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown. {code} FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
at java.lang.Thread.run(Thread.java:744) {code} It would be better to improve FileSystemRMStateStore's update-failure exception handling so that it does not shut down the RM, and a single state write-out failure cannot stop all jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
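The proposed behavior — retry the write and log instead of raising a fatal event — can be sketched as follows (illustrative names, not the real RM API):

```java
import java.io.IOException;

public class StoreUpdateGuard {
  interface Update {
    void run() throws IOException;
  }

  // Retry a state-store update a few times; on persistent failure, log and
  // report false instead of raising a fatal event that shuts down the RM.
  static boolean tryUpdate(Update update, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        update.run();
        return true;
      } catch (IOException e) {
        System.err.println("state-store update failed (attempt " + attempt + "): "
            + e.getMessage());
      }
    }
    return false;  // caller logs/alerts; other applications keep running
  }
}
```

A transient HDFS error such as "Unable to close file because the last block does not have enough number of replicas" typically succeeds on retry, which is exactly the case the issue describes.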
[jira] [Created] (YARN-2816) NM fail to start with NPE during container recovery
zhihai xu created YARN-2816: --- Summary: NM fail to start with NPE during container recovery Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu NM fail to start with NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is some DB files used in NMLeveldbStateStoreService are accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When container is recovered at ContainerManagerImpl#recoverContainer, The NullPointerException at the following code cause NM shutdown. 
{code}
StartContainerRequest req = rcs.getStartRequest();
ContainerLaunchContext launchContext = req.getContainerLaunchContext();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
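A guarded version of that snippet could skip incomplete records instead of crashing. The self-contained sketch below uses minimal stand-in classes (assumptions, not the real Hadoop types) to show the null check:

```java
public class RecoveryGuard {
  // Minimal stand-ins for the Hadoop types in the snippet (assumptions, not the real API).
  static class ContainerLaunchContext {}

  static class StartContainerRequest {
    ContainerLaunchContext ctx = new ContainerLaunchContext();
    ContainerLaunchContext getContainerLaunchContext() { return ctx; }
  }

  static class RecoveredContainerState {
    StartContainerRequest req;  // null when the CONTAINER_REQUEST_KEY_SUFFIX entry was lost
    StartContainerRequest getStartRequest() { return req; }
  }

  // Returns false for incomplete records instead of throwing the
  // NullPointerException that aborted NM startup.
  static boolean recoverContainer(RecoveredContainerState rcs) {
    StartContainerRequest req = rcs.getStartRequest();
    if (req == null) {
      System.err.println("Incomplete recovered container record, skipping");
      return false;
    }
    ContainerLaunchContext launchContext = req.getContainerLaunchContext();
    return launchContext != null;
  }
}
```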
[jira] [Created] (YARN-2802) add AM container launch and register delay metrics in QueueMetrics to help diagnose performance issue.
zhihai xu created YARN-2802: --- Summary: add AM container launch and register delay metrics in QueueMetrics to help diagnose performance issue. Key: YARN-2802 URL: https://issues.apache.org/jira/browse/YARN-2802 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Add AM container launch and register delay metrics to QueueMetrics to help diagnose performance issues. Two metrics are added to QueueMetrics: aMLaunchDelay, the time from sending the AMLauncherEventType.LAUNCH event to receiving the RMAppAttemptEventType.LAUNCHED event in RMAppAttemptImpl; and aMRegisterDelay, the time from receiving RMAppAttemptEventType.LAUNCHED to receiving RMAppAttemptEventType.REGISTERED (ApplicationMasterService#registerApplicationMaster) in RMAppAttemptImpl. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
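The two delays reduce to timestamp differences between the three events. A minimal illustrative tracker (the actual patch would record these durations through QueueMetrics in RMAppAttemptImpl):

```java
public class AmDelayTracker {
  private long launchRequestedAt;  // AMLauncherEventType.LAUNCH sent
  private long launchedAt;         // RMAppAttemptEventType.LAUNCHED received
  private long registeredAt;       // RMAppAttemptEventType.REGISTERED received

  void onLaunchRequested(long now) { launchRequestedAt = now; }
  void onLaunched(long now) { launchedAt = now; }
  void onRegistered(long now) { registeredAt = now; }

  long amLaunchDelayMs() { return launchedAt - launchRequestedAt; }
  long amRegisterDelayMs() { return registeredAt - launchedAt; }
}
```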
[jira] [Created] (YARN-2799) cleanup TestLogAggregationService based on the change in YARN-90
zhihai xu created YARN-2799: --- Summary: cleanup TestLogAggregationService based on the change in YARN-90 Key: YARN-2799 URL: https://issues.apache.org/jira/browse/YARN-2799 Project: Hadoop YARN Issue Type: Improvement Components: test Reporter: zhihai xu Priority: Minor Clean up TestLogAggregationService based on the change in YARN-90. The following code was added to setup in YARN-90:
{code}
dispatcher = createDispatcher();
appEventHandler = mock(EventHandler.class);
dispatcher.register(ApplicationEventType.class, appEventHandler);
{code}
Given this, we should remove the same code from each test function to avoid duplication. The same applies to dispatcher.stop(), which is in tearDown: we can remove it from each test function as well, because tearDown always calls it after each test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2759) addToCluserNodeLabels should not change the value in labelCollections if the key already exists to avoid the Label.resource is reset.
zhihai xu created YARN-2759: --- Summary: addToCluserNodeLabels should not change the value in labelCollections if the key already exists to avoid the Label.resource is reset. Key: YARN-2759 URL: https://issues.apache.org/jira/browse/YARN-2759 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu addToCluserNodeLabels should not change the value in labelCollections if the key already exists, so that Label.resource is not reset. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
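The fix amounts to insert-if-absent semantics on labelCollections. A self-contained sketch with a minimal stand-in `Label` class (illustrative, not the real node-label type):

```java
import java.util.Map;

public class LabelCollections {
  // Minimal stand-in for the real label record (assumption, not the Hadoop type).
  static class Label {
    int resource;
    Label(int resource) { this.resource = resource; }
  }

  // Insert-if-absent: re-adding an existing label returns the stored Label
  // unchanged, so its resource is never reset.
  static Label addLabel(Map<String, Label> labelCollections, String name) {
    return labelCollections.computeIfAbsent(name, n -> new Label(0));
  }
}
```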
[jira] [Created] (YARN-2757) potential NPE in checkNodeLabelExpression of SchedulerUtils for nodeLabels.
zhihai xu created YARN-2757: --- Summary: potential NPE in checkNodeLabelExpression of SchedulerUtils for nodeLabels. Key: YARN-2757 URL: https://issues.apache.org/jira/browse/YARN-2757 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Potential NPE in checkNodeLabelExpression of SchedulerUtils for nodeLabels. Since we already check nodeLabels for null in
{code}
if (!str.trim().isEmpty()
    && (nodeLabels == null || !nodeLabels.contains(str.trim()))) {
  return false;
}
{code}
we should also check nodeLabels for null before
{code}
if (!nodeLabels.isEmpty()) {
  return false;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2756) use static variable (Resources.none()) for not-running Node.resource in CommonNodeLabelsManager to save memory.
zhihai xu created YARN-2756: --- Summary: use static variable (Resources.none()) for not-running Node.resource in CommonNodeLabelsManager to save memory. Key: YARN-2756 URL: https://issues.apache.org/jira/browse/YARN-2756 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Use the static variable Resources.none() for the not-running Node.resource in CommonNodeLabelsManager to save memory. When a Node is not activated, its resource is never used; when a Node is activated, a new resource is assigned to it in RMNodeLabelsManager#activateNode (nm.resource = resource;). So it would be better to use the static Resources.none() instead of allocating a new object (Resource.newInstance(0, 0)) on each node deactivation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2754) addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java.
zhihai xu created YARN-2754: --- Summary: addToCluserNodeLabels should be protected by writeLock in RMNodeLabelsManager.java. Key: YARN-2754 URL: https://issues.apache.org/jira/browse/YARN-2754 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu addToCluserNodeLabels should be protected by the writeLock in RMNodeLabelsManager, because labelCollections in RMNodeLabelsManager must be protected against concurrent modification. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
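The requested locking pattern, sketched with a minimal stand-in store (illustrative, not the real RMNodeLabelsManager):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LabelStore {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Integer> labelCollections = new HashMap<>();

  // Every mutation of labelCollections goes through the write lock.
  void addToClusterNodeLabels(String label) {
    lock.writeLock().lock();
    try {
      labelCollections.putIfAbsent(label, 0);
    } finally {
      lock.writeLock().unlock();  // always release, even if the body throws
    }
  }

  int size() {
    lock.readLock().lock();
    try {
      return labelCollections.size();
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

The lock/unlock-in-finally shape matches how the other read/write-locked methods in the labels manager are structured.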
[jira] [Created] (YARN-2753) potential NPE in checkRemoveLabelsFromNode of CommonNodeLabelsManager
zhihai xu created YARN-2753: --- Summary: potential NPE in checkRemoveLabelsFromNode of CommonNodeLabelsManager Key: YARN-2753 URL: https://issues.apache.org/jira/browse/YARN-2753 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Potential NPE (NullPointerException) in checkRemoveLabelsFromNode of CommonNodeLabelsManager. When a Node is created, Node.labels can be null; in that case nm.labels may be null as well. So we need to check that originalLabels is not null before using it (originalLabels.containsAll). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
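The null guard could look like this self-contained sketch (method name is illustrative; the real check lives inside checkRemoveLabelsFromNode):

```java
import java.util.Set;

public class RemoveLabelsCheck {
  // Node.labels may be null for a freshly created node, so check it before
  // calling containsAll (the call that throws the NPE described above).
  static boolean canRemove(Set<String> originalLabels, Set<String> labelsToRemove) {
    return originalLabels != null && originalLabels.containsAll(labelsToRemove);
  }
}
```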
[jira] [Created] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
zhihai xu created YARN-2735: --- Summary: diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
zhihai xu created YARN-2682: --- Summary: WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2679) add container launch prepare time metrics to NM.
zhihai xu created YARN-2679: --- Summary: add container launch prepare time metrics to NM. Key: YARN-2679 URL: https://issues.apache.org/jira/browse/YARN-2679 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Add a metric in NodeManagerMetrics for the time taken to prepare a container launch. The prepare time is the duration between sending the ContainersLauncherEventType.LAUNCH_CONTAINER event and receiving the ContainerEventType.CONTAINER_LAUNCHED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2675) the containersKilled metrics is not updated when the container is killed during localization.
zhihai xu created YARN-2675: --- Summary: the containersKilled metrics is not updated when the container is killed during localization. Key: YARN-2675 URL: https://issues.apache.org/jira/browse/YARN-2675 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu The containersKilled metric is not updated when a container is killed during localization. We should add the KILLING state to the states handled in ContainerImpl's finished() so that the killed-container count is updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2641) improve node decommission latency in RM.
zhihai xu created YARN-2641: --- Summary: improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Improve node decommission latency in the RM. Currently node decommission only happens after the RM receives a nodeHeartbeat from the NodeManager. The heartbeat interval is configurable, with a default of 1 second. It would be better to do the decommission during RM refresh (NodesListManager) instead of in nodeHeartbeat (ResourceTrackerService). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2623) Linux container executor only use the first local directory to copy token file in container-executor.c.
zhihai xu created YARN-2623: --- Summary: Linux container executor only use the first local directory to copy token file in container-executor.c. Key: YARN-2623 URL: https://issues.apache.org/jira/browse/YARN-2623 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu The Linux container executor only uses the first local directory when copying the token file in container-executor.c. If copying the token file to the first local directory fails, a localization-failure event is raised, even though the token file could have been copied successfully to another local directory. The correct behavior is to try the next local directory when the first one fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
zhihai xu created YARN-2566: --- Summary: IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 
INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImplRESULT=FAILURE 
DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZATION_FAILED to DONE 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1410663092546_0004_01_01 from application application_1410663092546_0004 2014-09-13 23:33:25,187
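Both this issue and YARN-2623 above point at the same fix: try each local directory in turn and fail localization only when every directory fails. A self-contained Java sketch of that fallback (the real fix lives in DefaultContainerExecutor#startLocalizer; names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TokenCopy {
  // Try each local dir in turn; fail only when every dir fails.
  static Path copyTokenFile(Path tokenFile, List<Path> localDirs, String appDir)
      throws IOException {
    IOException last = null;
    for (Path dir : localDirs) {
      try {
        Path target = dir.resolve(appDir);
        Files.createDirectories(target);  // may fail: disk full, bad permissions
        return Files.copy(tokenFile, target.resolve(tokenFile.getFileName()));
      } catch (IOException e) {
        last = e;  // remember the failure and try the next local dir
      }
    }
    throw new IOException("token copy failed in all local dirs", last);
  }

  // Self-check: the first "dir" is a path under a regular file (so it cannot
  // be created), the second is writable; the copy lands in the second.
  static boolean demo() {
    try {
      Path tmp = Files.createTempDirectory("yarn2566");
      Path token = tmp.resolve("container_tokens");
      Files.writeString(token, "tokens");
      Path badDir = Files.createTempFile("full-disk-standin", "").resolve("sub");
      Path goodDir = Files.createTempDirectory("good-local-dir");
      Path out = copyTokenFile(token, List.of(badDir, goodDir), "appcache/app_1");
      return out.startsWith(goodDir);
    } catch (IOException e) {
      return false;
    }
  }
}
```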
[jira] [Created] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal
zhihai xu created YARN-2534: --- Summary: FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal Key: YARN-2534 URL: https://issues.apache.org/jira/browse/YARN-2534 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal for some cases. If the sum of the max shares of all Schedulables exceeds Integer.MAX_VALUE, but no individual max share equals Integer.MAX_VALUE, then totalMaxShare overflows to a negative value, which causes all fair shares to be calculated incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
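One way to avoid the overflow (a sketch, not necessarily the committed patch) is to accumulate the max shares in a long and saturate at Integer.MAX_VALUE:

```java
public class MaxShareSum {
  // Accumulate in a long and saturate, so summing many large int max shares
  // cannot wrap around to a negative totalMaxShare.
  static int totalMaxShare(int[] maxShares) {
    long total = 0;
    for (int share : maxShares) {
      total += share;
      if (total >= Integer.MAX_VALUE) {
        return Integer.MAX_VALUE;  // saturate instead of overflowing
      }
    }
    return (int) total;
  }
}
```

With a plain int accumulator, two shares of 2,000,000,000 would sum to a negative value; the saturated version pins the total at Integer.MAX_VALUE instead.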
[jira] [Created] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
zhihai xu created YARN-2453: --- Summary: TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu TestProportionalCapacityPreemptionPolicy fails for FairScheduler. The error message is:
{code}
Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy
Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy
testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec <<< FAILURE!
java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened
at org.junit.Assert.fail(Assert.java:88)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469)
{code}
This test can only work with CapacityScheduler, because the following code in ResourceManager.java enables the scheduling monitor only for a PreemptableResourceScheduler:
{code}
if (scheduler instanceof PreemptableResourceScheduler
    && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) {
{code}
CapacityScheduler is an instance of PreemptableResourceScheduler, while FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
zhihai xu created YARN-2452: --- Summary: TestRMApplicationHistoryWriter is failed for FairScheduler Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec <<< FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2376) Too many threads blocking on the global JobTracker lock from getJobCounters, optimize getJobCounters to release global JobTracker lock before access the per job counter i
[ https://issues.apache.org/jira/browse/YARN-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-2376. - Resolution: Duplicate > Too many threads blocking on the global JobTracker lock from getJobCounters, > optimize getJobCounters to release global JobTracker lock before access the > per job counter in JobInProgress > - > > Key: YARN-2376 > URL: https://issues.apache.org/jira/browse/YARN-2376 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2376.000.patch > > > Too many threads blocking on the global JobTracker lock from getJobCounters, > optimize getJobCounters to release global JobTracker lock before access the > per job counter in JobInProgress. It may be a lot of JobClients to call > getJobCounters in JobTracker at the same time, Current code will lock the > JobTracker to block all the threads to get counter from JobInProgress. It is > better to unlock the JobTracker when get counter from > JobInProgress(job.getCounters(counters)). So all the theads can run parallel > when access its own job counter. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2376) Too many threads blocking on the global JobTracker lock from getJobCounters, optimize getJobCounters to release global JobTracker lock before access the per job counter in
zhihai xu created YARN-2376: --- Summary: Too many threads blocking on the global JobTracker lock from getJobCounters, optimize getJobCounters to release global JobTracker lock before access the per job counter in JobInProgress Key: YARN-2376 URL: https://issues.apache.org/jira/browse/YARN-2376 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Too many threads block on the global JobTracker lock in getJobCounters; optimize getJobCounters to release the global JobTracker lock before accessing the per-job counters in JobInProgress. Many JobClients may call getJobCounters on the JobTracker at the same time, and the current code holds the JobTracker lock while fetching counters from JobInProgress, blocking all other threads. It is better to release the JobTracker lock before fetching counters from JobInProgress (job.getCounters(counters)), so that threads accessing the counters of different jobs can run in parallel. -- This message was sent by Atlassian JIRA (v6.2#6252)
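The proposed optimization narrows the global critical section to the per-job lookup and does the slow counter aggregation under a per-job lock. A self-contained sketch (illustrative names, not the JobTracker API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CounterService {
  private final Object trackerLock = new Object();  // stand-in for the global JobTracker lock
  private final Map<String, long[]> jobs = new ConcurrentHashMap<>();

  void addJob(String jobId, long[] counters) {
    jobs.put(jobId, counters);
  }

  long getJobCounters(String jobId) {
    long[] job;
    synchronized (trackerLock) {  // short critical section: per-job lookup only
      job = jobs.get(jobId);
    }
    if (job == null) {
      return 0;
    }
    long sum = 0;
    synchronized (job) {  // slow aggregation runs under a per-job lock instead
      for (long c : job) {
        sum += c;
      }
    }
    return sum;
  }
}
```

Requests for different jobs now only contend on the brief lookup, not on the whole aggregation.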
[jira] [Created] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
zhihai xu created YARN-2361: --- Summary: remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine Key: YARN-2361 URL: https://issues.apache.org/jira/browse/YARN-2361 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Priority: Minor Attachments: YARN-2361.000.patch remove duplicate entries in the EnumSet of event type in RMAppAttempt state machine. The event RMAppAttemptEventType.EXPIRE is duplicated in the following code. {code} EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED, RMAppAttemptEventType.EXPIRE, RMAppAttemptEventType.LAUNCHED, RMAppAttemptEventType.LAUNCH_FAILED, RMAppAttemptEventType.EXPIRE, RMAppAttemptEventType.REGISTERED, RMAppAttemptEventType.CONTAINER_ALLOCATED, RMAppAttemptEventType.UNREGISTERED, RMAppAttemptEventType.KILL, RMAppAttemptEventType.STATUS_UPDATE)) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
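Note that EnumSet.of silently ignores duplicate arguments, so the repeated EXPIRE entry does not change runtime behavior; removing it is purely a readability cleanup. A minimal demonstration with a local enum (not the real RMAppAttemptEventType):

```java
import java.util.EnumSet;

public class EnumSetDedup {
  enum Ev { ATTEMPT_ADDED, EXPIRE, LAUNCHED }

  // EnumSet.of ignores duplicate arguments, so a repeated entry never appears twice.
  static int distinctEvents() {
    return EnumSet.of(Ev.ATTEMPT_ADDED, Ev.EXPIRE, Ev.LAUNCHED, Ev.EXPIRE).size();
  }
}
```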
[jira] [Created] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
zhihai xu created YARN-2359: --- Summary: Application is hung without timeout and retry after DNS/network is down. Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu The application hangs without timeout or retry after the DNS/network goes down. This happens when, right after the container is allocated for the AM, the DNS/network goes down on the node that holds the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, and because an IllegalArgumentException (due to the DNS error) occurs, it stays in RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event that the node and container timeouts generate, so even after the node is removed, the application remains hung in RMAppAttemptState.SCHEDULED. The only way to make the application leave this state is to send the RMAppAttemptEventType.KILL event, which is only generated when the application is killed manually from the Job Client via forceKillApplication. To fix the issue, we should add an entry to the state machine table to handle RMAppAttemptEventType.CONTAINER_FINISHED in state RMAppAttemptState.SCHEDULED by adding the following code in StateMachineFactory:
{code}
.addTransition(RMAppAttemptState.SCHEDULED,
    RMAppAttemptState.FINAL_SAVING,
    RMAppAttemptEventType.CONTAINER_FINISHED,
    new FinalSavingTransition(
        new AMContainerCrashedBeforeRunningTransition(),
        RMAppAttemptState.FAILED))
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2337) remove duplicate function call (setClientRMService) in ResourceManager class
zhihai xu created YARN-2337: --- Summary: remove duplicate function call (setClientRMService) in ResourceManager class Key: YARN-2337 URL: https://issues.apache.org/jira/browse/YARN-2337 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Priority: Minor Remove the duplicate function call (setClientRMService) in the ResourceManager class: rmContext.setClientRMService(clientRM); is called twice in serviceInit of ResourceManager.
[jira] [Created] (YARN-2325) need to check whether node is null in nodeUpdate for FairScheduler
zhihai xu created YARN-2325: --- Summary: need to check whether node is null in nodeUpdate for FairScheduler Key: YARN-2325 URL: https://issues.apache.org/jira/browse/YARN-2325 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu We need to check whether the node is null in nodeUpdate for FairScheduler. If nodeUpdate is called after removeNode, getFSSchedulerNode will return null. If the node is null, we should return with an error message.
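A minimal sketch of the proposed guard, using simplified stand-in types rather than the real FairScheduler classes (the actual method is FairScheduler#nodeUpdate in Hadoop YARN):
{code}
import java.util.HashMap;
import java.util.Map;

public class NodeUpdateGuardSketch {
    // Stand-in for the scheduler's node map.
    private final Map<String, Object> nodes = new HashMap<>();

    public Object getFSSchedulerNode(String nodeId) {
        return nodes.get(nodeId);
    }

    // Returns false (after logging an error) instead of throwing an NPE
    // when the node was already removed by removeNode.
    public boolean nodeUpdate(String nodeId) {
        Object node = getFSSchedulerNode(nodeId);
        if (node == null) {
            System.err.println("Skipping update for unknown node " + nodeId
                + "; it was likely removed concurrently");
            return false;
        }
        // ... normal heartbeat processing would go here ...
        return true;
    }
}
{code}
The early return keeps a late heartbeat from a removed node from crashing the scheduler thread.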
[jira] [Created] (YARN-2324) Race condition in continuousScheduling for FairScheduler
zhihai xu created YARN-2324: --- Summary: Race condition in continuousScheduling for FairScheduler Key: YARN-2324 URL: https://issues.apache.org/jira/browse/YARN-2324 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu There is a race condition in continuousScheduling for FairScheduler. removeNode can run while continuousScheduling is executing in schedulingThread. If the node is removed from nodes, nodes.get(n2) and getFSSchedulerNode(nodeId) will return null. So we need to add a lock to eliminate the NPE/race condition.
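A minimal sketch (with assumed, simplified names; not the actual FairScheduler code) of guarding the scheduling loop against a concurrent removeNode with a shared lock: iterate over a snapshot of the node ids and re-check each node under the lock before using it.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContinuousSchedulingSketch {
    private final Map<String, String> nodes = new HashMap<>();
    private final Object lock = new Object();

    public void addNode(String id) {
        synchronized (lock) { nodes.put(id, id); }
    }

    public void removeNode(String id) {
        synchronized (lock) { nodes.remove(id); }
    }

    // Returns the number of nodes actually scheduled. A node removed
    // between the snapshot and its turn in the loop is skipped instead
    // of dereferenced as null.
    public int continuousScheduling() {
        List<String> snapshot;
        synchronized (lock) {
            snapshot = new ArrayList<>(nodes.keySet());
        }
        int scheduled = 0;
        for (String id : snapshot) {
            synchronized (lock) {
                String node = nodes.get(id);
                if (node == null) {
                    continue; // removed concurrently; nothing to do
                }
                scheduled++; // ... attempt scheduling on this node ...
            }
        }
        return scheduled;
    }
}
{code}
Taking the lock per node, rather than around the whole loop, keeps removeNode from being blocked for the full scheduling pass.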
[jira] [Created] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
zhihai xu created YARN-2315: --- Summary: Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu We should use setCurrentCapacity instead of setCapacity to report the used resource capacity for FairScheduler. In getQueueInfo of FSQueue.java, setCapacity is called twice with different parameters, so the first call is overridden by the second:
{code}
queueInfo.setCapacity((float) getFairShare().getMemory() /
    scheduler.getClusterResource().getMemory());
queueInfo.setCapacity((float) getResourceUsage().getMemory() /
    scheduler.getClusterResource().getMemory());
{code}
The second setCapacity call should be changed to setCurrentCapacity to report the currently used capacity.
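A hypothetical stand-in for QueueInfo (not the real Hadoop class) illustrating why the second setCapacity call clobbers the first, while setCurrentCapacity keeps the two metrics in separate fields:
{code}
public class QueueInfoSketch {
    private float capacity;        // configured/fair-share capacity
    private float currentCapacity; // currently used capacity

    public void setCapacity(float c) { capacity = c; }
    public void setCurrentCapacity(float c) { currentCapacity = c; }
    public float getCapacity() { return capacity; }
    public float getCurrentCapacity() { return currentCapacity; }

    public static void main(String[] args) {
        QueueInfoSketch info = new QueueInfoSketch();
        // Proposed fix: report fair share via setCapacity and usage via
        // setCurrentCapacity, so neither value overwrites the other.
        info.setCapacity(0.5f);         // e.g. fairShare / clusterMemory
        info.setCurrentCapacity(0.25f); // e.g. usage / clusterMemory
        System.out.println(info.getCapacity() + " " + info.getCurrentCapacity());
    }
}
{code}
With the original code, both divisions land in the same capacity field and only the usage ratio survives.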
[jira] [Created] (YARN-2254) change TestRMWebServicesAppsModification to support FairScheduler.
zhihai xu created YARN-2254: --- Summary: change TestRMWebServicesAppsModification to support FairScheduler. Key: YARN-2254 URL: https://issues.apache.org/jira/browse/YARN-2254 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Priority: Minor