[jira] [Created] (YARN-11215) MRAppmaster fails due to file permission bugs
lujie created YARN-11215: Summary: MRAppmaster fails due to file permission bugs Key: YARN-11215 URL: https://issues.apache.org/jira/browse/YARN-11215 Project: Hadoop YARN Issue Type: Bug Reporter: lujie 2022-06-21 12:30:11,175 INFO [main] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state INITED org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user2, access=EXECUTE, inode="/tmp/hadoop-yarn/staging/history":user1:supergroup:drwx-- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:422) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:333) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:713) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1892) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1910) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:727) at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3350) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1208) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1042) at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976) at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceInit(JobHistoryEventHandler.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:494) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1760) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1757) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1691) 2022-06-23 05:56:04,936 ERROR org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Error while trying to scan the directory hdfs://dffdddc36db0:8020/tmp/hadoop-yarn/staging/history/done_intermediate/user2 
org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=READ_EXECUTE, inode="/tmp/hadoop-yarn/staging/history/done_intermediate/user2":user2:supergroup:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:352) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370) at
[jira] [Created] (YARN-11214) Fails to run the job due to file permission bugs
lujie created YARN-11214: Summary: Fails to run the job due to file permission bugs Key: YARN-11214 URL: https://issues.apache.org/jira/browse/YARN-11214 Project: Hadoop YARN Issue Type: Bug Reporter: lujie When user1 runs a job first, the staging dir /tmp/hadoop-yarn is created by user1 with permission 700. Hence user2 fails to run jobs. Configuring for multihomed network 2022-06-21 11:33:31,066 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at resourcemanager/172.22.0.5:8032 2022-06-21 11:33:31,296 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.22.0.4:10200 Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=user2, access=EXECUTE, inode="/tmp/hadoop-yarn":user1:supergroup:drwx-- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:422) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:333) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:713) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1892) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1910) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:727) at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3350) at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1208) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1042) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1741) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1753) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1750) at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1765) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:136) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:148) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1571) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1568) at java.security.AccessController.doPrivileged(Native Method) at
[jira] [Created] (YARN-11151) sensitive infor may leak due to crash
lujie created YARN-11151: Summary: sensitive infor may leak due to crash Key: YARN-11151 URL: https://issues.apache.org/jira/browse/YARN-11151 Project: Hadoop YARN Issue Type: Bug Reporter: lujie

We initialize LevelDBCacheTimelineStore and LeveldbTimelineStore like this:
{code:java}
try {
  localFS = FileSystem.getLocal(conf);
  if (!localFS.exists(dbPath)) {
    if (!localFS.mkdirs(dbPath)) {
      throw new IOException("Couldn't create directory for leveldb " + "timeline store " + dbPath);
    }
    localFS.setPermission(dbPath, LeveldbUtils.LEVELDB_DIR_UMASK);
  }
} finally {
  IOUtils.cleanupWithLogger(LOG, localFS);
}
{code}
If the node crashes after mkdirs but before setPermission, the directory keeps permission 755 forever: on the next start the directory already exists, so setPermission is never called. The code should be like:
{code:java}
try {
  localFS = FileSystem.getLocal(conf);
  if (!localFS.exists(dbPath)) {
    if (!localFS.mkdirs(dbPath)) {
      throw new IOException("Couldn't create directory for leveldb " + "timeline store " + dbPath);
    }
  }
  // Repair the permission whenever it differs, even if the directory already existed.
  if (!localFS.getFileStatus(dbPath).getPermission().equals(LeveldbUtils.LEVELDB_DIR_UMASK)) {
    localFS.setPermission(dbPath, LeveldbUtils.LEVELDB_DIR_UMASK);
  }
} finally {
  IOUtils.cleanupWithLogger(LOG, localFS);
}
{code}
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
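The "check and repair on every start" idea can be sketched with plain java.nio rather than Hadoop's FileSystem API. This is only an illustration under stated assumptions: SecureDirInit, OWNER_ONLY, ensureOwnerOnly and demo are hypothetical names, OWNER_ONLY stands in for LeveldbUtils.LEVELDB_DIR_UMASK, and the code needs a POSIX file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

final class SecureDirInit {
  // Hypothetical stand-in for LeveldbUtils.LEVELDB_DIR_UMASK (owner-only access).
  static final Set<PosixFilePermission> OWNER_ONLY =
      PosixFilePermissions.fromString("rwx------");

  /** Creates the directory if missing, then repairs its permissions on every call. */
  static void ensureOwnerOnly(Path dir) throws IOException {
    Files.createDirectories(dir);
    // This check also runs when the directory already existed, so a crash
    // between creation and chmod in an earlier run is healed on the next start.
    if (!Files.getPosixFilePermissions(dir).equals(OWNER_ONLY)) {
      Files.setPosixFilePermissions(dir, OWNER_ONLY);
    }
  }

  /** Simulates the state left by a crash (directory exists with 755) and repairs it. */
  static String demo() {
    try {
      Path dir = Files.createTempDirectory("leveldb-demo-").resolve("timeline");
      Files.createDirectories(dir);
      Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwxr-xr-x"));
      ensureOwnerOnly(dir);
      return PosixFilePermissions.toString(Files.getPosixFilePermissions(dir));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The permission check is deliberately outside the "does it exist" branch; that placement is the whole fix.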
[jira] [Created] (YARN-11150) sensitive inform may leak due to crash
lujie created YARN-11150: Summary: sensitive inform may leak due to crash Key: YARN-11150 URL: https://issues.apache.org/jira/browse/YARN-11150 Project: Hadoop YARN Issue Type: Bug Reporter: lujie
[jira] [Resolved] (YARN-10980) fix CVE-2020-8908
[ https://issues.apache.org/jira/browse/YARN-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lujie resolved YARN-10980.
Resolution: Fixed

> fix CVE-2020-8908
>
> Key: YARN-10980
> URL: https://issues.apache.org/jira/browse/YARN-10980
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Priority: Major
>
> see [https://www.cvedetails.com/cve/CVE-2020-8908/]
>
> A temp directory creation vulnerability exists in all versions of Guava,
> allowing an attacker with access to the machine to potentially access data in
> a temporary directory created by the Guava API
> com.google.common.io.Files.createTempDir(). By default, on unix-like systems,
> the created directory is world-readable (readable by an attacker with access
> to the system). The method in question has been marked @Deprecated in
> versions 30.0 and later and should not be used. For Android developers, we
> recommend choosing a temporary directory API provided by Android, such as
> context.getCacheDir(). For other Java developers, we recommend migrating to
> the Java 7 API java.nio.file.Files.createTempDirectory() which explicitly
> configures permissions of 700, or configuring the Java runtime's
> java.io.tmpdir system property to point to a location whose permissions are
> appropriately configured.
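The migration recommended in the quoted advisory can be sketched as follows. This is a minimal illustration, not the patch applied for YARN-10980; TempDirDemo and createAndDescribe are hypothetical names, and reading the permission string assumes a POSIX file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

final class TempDirDemo {
  /** Creates a temp directory via the Java 7 NIO API and returns its permission string. */
  static String createAndDescribe() {
    try {
      // Replacement for Guava's com.google.common.io.Files.createTempDir().
      Path dir = Files.createTempDirectory("yarn-demo-");
      String perms = PosixFilePermissions.toString(Files.getPosixFilePermissions(dir));
      Files.delete(dir); // clean up the empty directory again
      return perms;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(createAndDescribe());
  }
}
```

On a typical Unix-like system this should report owner-only access, matching the "permissions of 700" wording in the advisory.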
[jira] [Created] (YARN-10980) fix CVE-2020-8908
lujie created YARN-10980: Summary: fix CVE-2020-8908 Key: YARN-10980 URL: https://issues.apache.org/jira/browse/YARN-10980 Project: Hadoop YARN Issue Type: Bug Reporter: lujie
[jira] [Created] (YARN-10976) Resource Leak due to Files.walk
lujie created YARN-10976: Summary: Resource Leak due to Files.walk Key: YARN-10976 URL: https://issues.apache.org/jira/browse/YARN-10976 Project: Hadoop YARN Issue Type: Improvement Reporter: lujie

The stream created by Files.walk should be closed, as the JDK Javadoc says:
 * The returned stream encapsulates one or more {@link DirectoryStream}s.
 * If timely disposal of file system resources is required, the
 * {@code try}-with-resources construct should be used to ensure that the
 * stream's {@link Stream#close close} method is invoked after the stream
 * operations are completed. Operating on a closed stream will result in an
 * {@link java.lang.IllegalStateException}.
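The try-with-resources pattern that the Javadoc asks for looks roughly like this. It is a sketch, not the code touched by YARN-10976; WalkDemo, countRegularFiles and demo are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class WalkDemo {
  /** Counts regular files under root; the stream is closed even if an error occurs. */
  static long countRegularFiles(Path root) {
    // try-with-resources releases the DirectoryStreams held by the walk.
    try (Stream<Path> paths = Files.walk(root)) {
      return paths.filter(Files::isRegularFile).count();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds a small tree (two files, one subdirectory) and counts its files. */
  static long demo() {
    try {
      Path root = Files.createTempDirectory("walk-demo-");
      Files.createFile(root.resolve("a.txt"));
      Path sub = Files.createDirectory(root.resolve("sub"));
      Files.createFile(sub.resolve("b.txt"));
      return countRegularFiles(root);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```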
[jira] [Created] (YARN-10819) Security inconsistency for log file
lujie created YARN-10819: Summary: Security inconsistency for log file Key: YARN-10819 URL: https://issues.apache.org/jira/browse/YARN-10819 Project: Hadoop YARN Issue Type: Bug Reporter: lujie
[jira] [Created] (YARN-10555) missing security check before getAppAttempts
lujie created YARN-10555: Summary: missing security check before getAppAttempts Key: YARN-10555 URL: https://issues.apache.org/jira/browse/YARN-10555 Project: Hadoop YARN Issue Type: Bug Reporter: lujie
[jira] [Resolved] (YARN-10551) non-admin user can change the log level
[ https://issues.apache.org/jira/browse/YARN-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lujie resolved YARN-10551.
Resolution: Not A Problem

misconfiguration! see https://issues.apache.org/jira/secure/attachment/12832635/HADOOP-13707.001.patch

> non-admin user can change the log level
>
> Key: YARN-10551
> URL: https://issues.apache.org/jira/browse/YARN-10551
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Priority: Major
>
> reproduce:
> 1. login as user1 and do
> {code:java}
> yarn daemonlog -setlevel hadoop11:8088 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl DEBUG
> {code}
> 2. login as user2 and run wordcount
> 3. check the log of RM
> {code:java}
> 2020-12-27 10:54:15,917 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1609065586411_0003 of type START
> {code}
[jira] [Created] (YARN-10551) non-admin user can change the log level
lujie created YARN-10551: Summary: non-admin user can change the log level Key: YARN-10551 URL: https://issues.apache.org/jira/browse/YARN-10551 Project: Hadoop YARN Issue Type: Bug Reporter: lujie

To reproduce:
1. Log in as user1 and run
{code:java}
yarn daemonlog -setlevel hadoop11:8088 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl DEBUG
{code}
2. Log in as user2 and run wordcount.
3. Check the log of the RM:
{code:java}
2020-12-27 10:54:15,917 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1609065586411_0003 of type START
{code}
[jira] [Created] (YARN-9594) Unknown event arrived at ContainerScheduler: EventType: RECOVERY_COMPLETED
lujie created YARN-9594: Summary: Unknown event arrived at ContainerScheduler: EventType: RECOVERY_COMPLETED Key: YARN-9594 URL: https://issues.apache.org/jira/browse/YARN-9594 Project: Hadoop YARN Issue Type: Bug Reporter: lujie

{code:java}
case RECOVERY_COMPLETED:
  startPendingContainers(maxOppQueueLength <= 0);
  metrics.setQueuedContainers(queuedOpportunisticContainers.size(), queuedGuaranteedContainers.size());
default:
  LOG.error("Unknown event arrived at ContainerScheduler: " + event.toString());
{code}
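The report gives no explanation beyond the snippet, but the snippet suggests a likely reading: the RECOVERY_COMPLETED case has no break, so control falls through into default and logs the error. The sketch below reproduces that behaviour in isolation; FallThroughDemo, EventType, handleBuggy and handleFixed are hypothetical names, and the recorded strings only stand in for the real calls.

```java
import java.util.ArrayList;
import java.util.List;

final class FallThroughDemo {
  enum EventType { RECOVERY_COMPLETED, OTHER }

  /** Mirrors the reported shape: the case body is not terminated by break. */
  static List<String> handleBuggy(EventType type) {
    List<String> log = new ArrayList<>();
    switch (type) {
      case RECOVERY_COMPLETED:
        log.add("startPendingContainers");
        // no break here: execution continues into the default branch
      default:
        log.add("Unknown event arrived: " + type);
    }
    return log;
  }

  /** Same handler with the break added, so default is only hit for unknown events. */
  static List<String> handleFixed(EventType type) {
    List<String> log = new ArrayList<>();
    switch (type) {
      case RECOVERY_COMPLETED:
        log.add("startPendingContainers");
        break;
      default:
        log.add("Unknown event arrived: " + type);
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(handleBuggy(EventType.RECOVERY_COMPLETED));
    System.out.println(handleFixed(EventType.RECOVERY_COMPLETED));
  }
}
```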
[jira] [Created] (YARN-9589) NPE in HttpURLConnection.disconnect
lujie created YARN-9589: Summary: NPE in HttpURLConnection.disconnect Key: YARN-9589 URL: https://issues.apache.org/jira/browse/YARN-9589 Project: Hadoop YARN Issue Type: Bug Reporter: lujie

{code:java}
2019-05-30 10:17:59,869 ERROR [IPC Server handler 2 on default port 51923] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1559182584278_0001_r_00_0 - exited : java.lang.NullPointerException
    at sun.net.www.protocol.http.HttpURLConnection.disconnect(HttpURLConnection.java:2821)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.closeConnection(Fetcher.java:255)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.interrupt(Fetcher.java:216)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.shutDown(Fetcher.java:224)
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:147)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:377)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
{code}
[jira] [Created] (YARN-9588) InvalidToken: appattempt_XXX not found in AMRMTokenSecretManager while RM reboot
lujie created YARN-9588: Summary: InvalidToken: appattempt_XXX not found in AMRMTokenSecretManager while RM reboot Key: YARN-9588 URL: https://issues.apache.org/jira/browse/YARN-9588 Project: Hadoop YARN Issue Type: Bug Reporter: lujie Hi: when the application has succeeded but the RM reboots before the AM has unregistered, the following error happens: {code:java} 2019-05-29 18:55:11,112 ERROR [Thread-76] org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while unregistering org.apache.hadoop.security.token.SecretManager$InvalidToken: appattempt_1559127208490_0001_01 not found in AMRMTokenSecretManager. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy84.finishApplicationMaster(Unknown Source) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.doUnregistration(RMCommunicator.java:227) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:189) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStop(RMCommunicator.java:267) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStop(RMContainerAllocator.java:327) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStop(MRAppMaster.java:985) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1868) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1309) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:668) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:747) Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): appattempt_1559127208490_0001_01 not found in AMRMTokenSecretManager. 
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1547)
    at org.apache.hadoop.ipc.Client.call(Client.java:1493)
    at org.apache.hadoop.ipc.Client.call(Client.java:1392)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy83.finishApplicationMaster(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:92)
    ... 27 more
{code}
This sets isLastAMRetry = false, so the current AM won't clean up the staging dir:
{code:java}
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Skipping cleaning up the staging dir. assuming AM will be retried.
{code}
[jira] [Created] (YARN-9248) RMContainerImpl:Invalid event: ACQUIRED at KILLED
lujie created YARN-9248: --- Summary: RMContainerImpl:Invalid event: ACQUIRED at KILLED Key: YARN-9248 URL: https://issues.apache.org/jira/browse/YARN-9248 Project: Hadoop YARN Issue Type: Bug Reporter: lujie Assignee: lujie {code:java} 2019-01-29 11:46:53,596 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: ACQUIRED at KILLED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:475) at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:67) at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.handleNewContainers(OpportunisticContainerAllocatorAMService.java:351) at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.access$100(OpportunisticContainerAllocatorAMService.java:94) at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:197) at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830) {code}
[jira] [Created] (YARN-9238) An huge Data Race can make we get a wrong attempt by an appAttemptId
lujie created YARN-9238: Summary: An huge Data Race can make we get a wrong attempt by an appAttemptId Key: YARN-9238 URL: https://issues.apache.org/jira/browse/YARN-9238 Project: Hadoop YARN Issue Type: Bug Reporter: lujie

We have found a data race that can lead to an odd situation. See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
{code:java}
// Allocate OPPORTUNISTIC containers.
171. SchedulerApplicationAttempt appAttempt =
172.     ((AbstractYarnScheduler)rmContext.getScheduler())
173.         .getApplicationAttempt(appAttemptId);
174.
175. OpportunisticContainerContext oppCtx =
176.     appAttempt.getOpportunisticContainerContext();
177. oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If we crash the current AM (its attempt ID is appattempt_0) just before line 171, lines 171~173 still execute and look up the appAttempt for appattempt_0, so the returned appAttempt should represent the current (crashed) AM. But we found that the returned appAttempt represents the new AM, whose attempt ID is appattempt_1. This appAttempt has not initialized its oppCtx yet, so an NPE happens at line 177:
{code:java}
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
We have found the reason why looking up the old appattempt_0 returns the new appAttempt that represents the new AM. Below is the body of getApplicationAttempt, called at line 173:
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.   SchedulerApplication app = applications.get(
401.       applicationAttemptId.getApplicationId());
402.   return app == null ? null : app.getCurrentAppAttempt();
403. }
{code}
When the old AM crashes, the currentAppAttempt of the app is set to the new appAttempt that represents the new AM, so line 402 returns the new appAttempt. We should add a check that the returned appAttempt has the same ID as the given ID. Patch coming soon!
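The check proposed in YARN-9238 (return the current attempt only if its ID equals the requested one) can be sketched on a simplified model. This is not the actual patch: AttemptLookup, App, register, startNewAttempt and getAttempt are hypothetical names, and attempts are reduced to plain ID strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class AttemptLookup {
  /** Hypothetical, simplified stand-in for SchedulerApplication. */
  static final class App {
    volatile String currentAttemptId;

    App(String attemptId) {
      this.currentAttemptId = attemptId;
    }
  }

  private final Map<String, App> applications = new ConcurrentHashMap<>();

  void register(String appId, String attemptId) {
    applications.put(appId, new App(attemptId));
  }

  /** Models an AM restart: the application's current attempt is replaced. */
  void startNewAttempt(String appId, String attemptId) {
    applications.get(appId).currentAttemptId = attemptId;
  }

  /** Returns the current attempt only if it is the one that was asked for, else null. */
  String getAttempt(String appId, String requestedAttemptId) {
    App app = applications.get(appId);
    if (app == null) {
      return null;
    }
    String current = app.currentAttemptId; // read once to avoid a second race
    return requestedAttemptId.equals(current) ? current : null;
  }

  public static void main(String[] args) {
    AttemptLookup lookup = new AttemptLookup();
    lookup.register("app_1", "appattempt_0");
    lookup.startNewAttempt("app_1", "appattempt_1");
    // A stale request for appattempt_0 no longer receives the new attempt.
    System.out.println(lookup.getAttempt("app_1", "appattempt_0"));
  }
}
```

With this shape a stale caller gets null instead of a half-initialized new attempt, so callers still need their own null handling.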
[jira] [Created] (YARN-9226) NPE while YarnChild shudown
lujie created YARN-9226: Summary: NPE while YarnChild shudown Key: YARN-9226 URL: https://issues.apache.org/jira/browse/YARN-9226 Project: Hadoop YARN Issue Type: Bug Reporter: lujie

In YarnChild.main:
{code:java}
try {
  logSyncer = TaskLog.createLogSyncer(); // line 168
  taskFinal.run(job, umbilical); // line 178
} catch (Exception exception) { // line 187
  LOG.warn("Exception running child : " + StringUtils.stringifyException(exception));
  .
  task.taskCleanup(umbilical); // line 200
}
{code}
Line 178 initializes task.committer, but line 168 may throw an exception first, which skips that initialization, hence task.committer == null. Line 187 catches this exception and performs the cleanup (line 200). Line 200 uses task.committer without a null check, hence the NPE happens:
{code:java}
2019-01-23 16:59:42,864 INFO [main] org.apache.hadoop.mapred.YarnChild: Exception cleaning up: java.lang.NullPointerException
    at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:1458)
    at org.apache.hadoop.mapred.YarnChild$3.run(YarnChild.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:197)
{code}
So why may line 168 throw an exception? The log below gives an example:
{code:java}
2019-01-23 16:59:42,857 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook
    at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
    at org.apache.hadoop.mapred.TaskLog.createLogSyncer(TaskLog.java:340)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{code}
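One way to avoid this kind of NPE is a null guard in the cleanup path, sketched below on a toy model. The report does not say how the issue was eventually fixed, so this is only an assumption about a plausible remedy; CleanupDemo, Committer, Task and demo are hypothetical names.

```java
final class CleanupDemo {
  /** Hypothetical stand-in for the task's output committer. */
  interface Committer {
    void abortTask();
  }

  static final class Task {
    Committer committer; // stays null if the run never reached initialization
    boolean aborted;

    void initialize() {
      committer = () -> aborted = true;
    }

    /** Returns true if there was something to clean up, false if cleanup was skipped. */
    boolean taskCleanup() {
      if (committer == null) {
        return false; // nothing was set up, so there is nothing to abort
      }
      committer.abortTask();
      return true;
    }
  }

  /** Runs a task that fails either before or after its committer is initialized. */
  static boolean demo(boolean failBeforeInit) {
    Task task = new Task();
    try {
      if (failBeforeInit) {
        throw new IllegalStateException("Shutdown in progress, cannot add a shutdownHook");
      }
      task.initialize();
      throw new RuntimeException("task failed after initialization");
    } catch (Exception e) {
      return task.taskCleanup(); // safe in both cases thanks to the guard
    }
  }

  public static void main(String[] args) {
    System.out.println(demo(true));
    System.out.println(demo(false));
  }
}
```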
[jira] [Created] (YARN-9223) NPE happens in NM while loading recovery fails
lujie created YARN-9223:
---
Summary: NPE happens in NM while loading recovery fails
Key: YARN-9223
URL: https://issues.apache.org/jira/browse/YARN-9223
Project: Hadoop YARN
Issue Type: Bug
Reporter: lujie
Assignee: lujie

In org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit:
{code:java}
try {
  initAndStartRecoveryStore(conf);
} catch (IOException e) {
  String recoveryDirName = conf.get(YarnConfiguration.NM_RECOVERY_DIR);
  throw new YarnRuntimeException("Unable to initialize recovery directory at " + recoveryDirName, e);
}
this.context = createNMContext(containerTokenSecretManager, nmTokenSecretManager, nmStore, isDistSchedulingEnabled, conf);
{code}
When recovery fails, the context is still null, and the YarnRuntimeException causes serviceStop to run (in the trace below, AbstractService.init calls ServiceOperations.stopQuietly). serviceStop then dereferences the context:
{code:java}
// Cleanup ResourcePluginManager
ResourcePluginManager rpm = context.getResourcePluginManager();
{code}
hence the NPE:
{code:java}
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:530)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:984)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1064)
{code}
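The YARN-9223 report describes a stop method that runs after a partially completed init. A minimal sketch of that situation, assuming a simplified service lifecycle, follows; PartialInitSketch, Service and Context are hypothetical names and the null guard is an illustration, not the committed fix.

```java
/** Hypothetical miniature of a service whose stop() may run after a failed init(). */
final class PartialInitSketch {
    static final class Context {
        boolean cleaned;
        void cleanup() { cleaned = true; }
    }

    static final class Service {
        Context context;          // assigned only at the end of init()
        boolean stopCompleted;

        void init(boolean recoveryFails) {
            try {
                if (recoveryFails) {
                    throw new IllegalStateException("Unable to initialize recovery directory");
                }
                context = new Context();
            } catch (RuntimeException e) {
                stop();           // mirrors a framework stopping a service whose init failed
                throw e;
            }
        }

        void stop() {
            if (context != null) {   // guard: init may have failed before context was set
                context.cleanup();
            }
            stopCompleted = true;
        }
    }

    public static void main(String[] args) {
        Service service = new Service();
        try {
            service.init(true);
        } catch (IllegalStateException e) {
            System.out.println("init failed, stop completed: " + service.stopCompleted);
        }
    }
}
```

With the guard in place the original initialization error is the one that surfaces, instead of being masked by an NPE from the stop path.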
[jira] [Resolved] (YARN-9201) RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED
[ https://issues.apache.org/jira/browse/YARN-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lujie resolved YARN-9201.
-
Resolution: Duplicate
Assignee: lujie

> RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED
>
> Key: YARN-9201
> URL: https://issues.apache.org/jira/browse/YARN-9201
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Major
> Attachments: hadoop-hires-resourcemanager-hadoop11.log
>
> When a node is removed, the RM kills the application and marks it as failed. The AMLauncher then fails to launch because of a java.io.IOException and sends a LAUNCH_FAILED event to the application attempt, which produces this error:
>
> {code:java}
> 2019-01-16 12:55:33,334 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547614499484_0001_01 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at FAILED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Created] (YARN-9201) RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED
lujie created YARN-9201:
---
Summary: RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED
Key: YARN-9201
URL: https://issues.apache.org/jira/browse/YARN-9201
Project: Hadoop YARN
Issue Type: Bug
Reporter: lujie

When a node is removed, the RM kills the application and marks it as failed. The AMLauncher then fails to launch because of a java.io.IOException:
{code:java}
2019-01-16 12:55:33,334 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547614499484_0001_01 can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCH_FAILED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
{code}
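Several reports in this thread share one shape: an event arrives after the state machine has already reached a terminal state, and no transition is registered for it. The toy table-driven state machine below illustrates that mechanism. It is a sketch under stated assumptions: LateEventSketch and its states and events are hypothetical and much simpler than Hadoop's StateMachineFactory, and the self-transition shown is one common way to tolerate late events, not necessarily how these issues were resolved.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy table-driven state machine; hypothetical, far simpler than YARN's. */
final class LateEventSketch {
    enum State { LAUNCHING, RUNNING, FAILED }
    enum Event { LAUNCHED, LAUNCH_FAILED, KILL }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.LAUNCHING;

    /** Registers a transition: in state 'from', event 'on' moves to state 'to'. */
    LateEventSketch add(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
        return this;
    }

    /** Applies an event, rejecting any (state, event) pair that was never registered. */
    State handle(Event event) {
        Map<Event, State> arcs = table.get(current);
        State next = (arcs == null) ? null : arcs.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    /** Transitions that cover the normal lifecycle only. */
    static LateEventSketch base() {
        return new LateEventSketch()
            .add(State.LAUNCHING, Event.LAUNCHED, State.RUNNING)
            .add(State.LAUNCHING, Event.LAUNCH_FAILED, State.FAILED)
            .add(State.LAUNCHING, Event.KILL, State.FAILED)
            .add(State.RUNNING, Event.KILL, State.FAILED);
    }

    public static void main(String[] args) {
        // A self-transition at FAILED turns the late LAUNCH_FAILED into a no-op.
        LateEventSketch attempt = base().add(State.FAILED, Event.LAUNCH_FAILED, State.FAILED);
        attempt.handle(Event.KILL);
        System.out.println(attempt.handle(Event.LAUNCH_FAILED));
    }
}
```

Without the extra self-transition, the same KILL followed by LAUNCH_FAILED sequence raises an error analogous to the InvalidStateTransitionException in the log.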
[jira] [Resolved] (YARN-9193) NullPointerException happens in RM while shutdown a NM
[ https://issues.apache.org/jira/browse/YARN-9193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lujie resolved YARN-9193.
-
Resolution: Duplicate

It will be fixed in [YARN-9194|https://issues.apache.org/jira/browse/YARN-9194].

> NullPointerException happens in RM while shutdown a NM
>
> Key: YARN-9193
> URL: https://issues.apache.org/jira/browse/YARN-9193
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Major
> Attachments: hadoop-hires-resourcemanager-hadoop11.log
>
> While shutting down a NodeManager, the RM throws a NullPointerException:
>
> {code:java}
> 2019-01-13 08:52:20,299 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_ALLOCATED for applicationAttempt appattempt_1547340702286_0001_01
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:1210)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:1180)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Created] (YARN-9194) Invalid event: REGISTERED at FAILED
lujie created YARN-9194:
---
Summary: Invalid event: REGISTERED at FAILED
Key: YARN-9194
URL: https://issues.apache.org/jira/browse/YARN-9194
Project: Hadoop YARN
Issue Type: Bug
Reporter: lujie
Assignee: lujie

The REGISTERED event arrives after the attempt has already failed, hence the InvalidStateTransitionException:
{code:java}
2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_02 can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Created] (YARN-9193) NullPointerException happens in RM while shutdown a NM
lujie created YARN-9193:
---
Summary: NullPointerException happens in RM while shutdown a NM
Key: YARN-9193
URL: https://issues.apache.org/jira/browse/YARN-9193
Project: Hadoop YARN
Issue Type: Bug
Reporter: lujie

While shutting down a NodeManager, the RM throws a NullPointerException.
[jira] [Created] (YARN-9165) NPE which is similar to YARN-5918
lujie created YARN-9165: --- Summary: NPE which is similar to YARN-5918 Key: YARN-9165 URL: https://issues.apache.org/jira/browse/YARN-9165 Project: Hadoop YARN Issue Type: Bug Reporter: lujie {code:java} 2018-12-31 22:30:06,681 WARN org.apache.hadoop.ipc.Server: IPC Server handler 2 on default port 8030, call Call#23 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 10.3.1.15:52796 java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.createOpportunisticRmContainer(SchedulerUtils.java:576) at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.handleNewContainers(OpportunisticContainerAllocatorAMService.java:349) at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.access$100(OpportunisticContainerAllocatorAMService.java:94) at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:197) at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830) {code}
[jira] [Created] (YARN-9164) NullPointerException crash the ResourceManager
lujie created YARN-9164:
---
Summary: NullPointerException crash the ResourceManager
Key: YARN-9164
URL: https://issues.apache.org/jira/browse/YARN-9164
Project: Hadoop YARN
Issue Type: Bug
Reporter: lujie
Assignee: lujie

We have hit an NPE that can crash the whole cluster:
{code:java}
2018-12-31 22:18:11,924 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:696)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1123)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1827)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
    at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
    at java.lang.Thread.run(Thread.java:745)
{code}
This bug also happens in the latest trunk.
[jira] [Resolved] (YARN-8650) Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and Invalid event: CONTAINER_LAUNCHED at DONE
[ https://issues.apache.org/jira/browse/YARN-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lujie resolved YARN-8650.
-
Resolution: Duplicate

> Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and Invalid event: CONTAINER_LAUNCHED at DONE
>
> Key: YARN-8650
> URL: https://issues.apache.org/jira/browse/YARN-8650
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Priority: Major
> Attachments: hadoop-hires-nodemanager-hadoop11.log, hadoop-hires-nodemanager-hadoop15.log
>
> We tested Hadoop while a NodeManager was shutting down and encountered two InvalidStateTransitionExceptions:
> {code:java}
> 2018-08-04 14:29:33,025 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [DONE], eventType: [CONTAINER_KILLED_ON_REQUEST], container: [container_1533364185282_0001_01_01]
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> {code:java}
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_LAUNCHED at DONE
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> We analyzed these two bugs and found that shutdown sends a kill event, which causes both exceptions. We have tested our cluster many times and can reproduce them deterministically.
[jira] [Created] (YARN-8650) Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and Invalid event: CONTAINER_LAUNCHED at DONE
lujie created YARN-8650:
---
Summary: Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and Invalid event: CONTAINER_LAUNCHED at DONE
Key: YARN-8650
URL: https://issues.apache.org/jira/browse/YARN-8650
Project: Hadoop YARN
Issue Type: Bug
Reporter: lujie

We tested Hadoop while a NodeManager was shutting down and encountered two InvalidStateTransitionExceptions:
{code:java}
2018-08-04 14:29:33,025 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [DONE], eventType: [CONTAINER_KILLED_ON_REQUEST], container: [container_1533364185282_0001_01_01]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
{code}
{code:java}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_LAUNCHED at DONE
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
{code}
We analyzed these two bugs and found that shutdown sends a kill event, which causes both exceptions. We have tested our cluster many times and can reproduce them deterministically.
[jira] [Created] (YARN-8649) Same as YARN-4355:NPE while processing localizer heartbeat
lujie created YARN-8649:
---
Summary: Same as YARN-4355:NPE while processing localizer heartbeat
Key: YARN-8649
URL: https://issues.apache.org/jira/browse/YARN-8649
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.1.1
Reporter: lujie

I noticed that a NodeManager was getting NPEs while processing a localizer heartbeat. This is similar to [YARN-4355|https://issues.apache.org/jira/browse/YARN-4355], which was reported by Jason Lowe.
[jira] [Created] (YARN-8381) Job get stuck while node is unhealthy, but without log messages to indicate such case
lujie created YARN-8381:
---
Summary: Job get stuck while node is unhealthy, but without log messages to indicate such case
Key: YARN-8381
URL: https://issues.apache.org/jira/browse/YARN-8381
Project: Hadoop YARN
Issue Type: Improvement
Reporter: lujie

I started a fresh pseudo-distributed system on a single node and ran a job, but it got stuck. My first reaction was to check the logs to locate the problem, but they contained no error message. Only after reading the logs for a long time did I think to check the node health. The YARN web UI showed that the NodeManager was unhealthy because "{{local-dirs are bad: /tmp/hadoop-hduser/nm-local-dir}}". I reconfigured "{{yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage}}" to 98% and that solved the problem. I still strongly recommend adding error log messages for an unhealthy NodeManager.
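The improvement requested in YARN-8381 amounts to emitting an explicit error when a local directory crosses the utilization limit. The sketch below shows the kind of check and message that would have made the problem visible. It is an assumption-laden illustration: DiskHealthLogSketch and check are hypothetical names, the message wording is invented, and the real NodeManager disk checker is more involved.

```java
import java.io.File;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical utilization check that produces a loggable error; not NodeManager code. */
final class DiskHealthLogSketch {
    /** Returns an error message when used space exceeds maxPercent of the total, else empty. */
    static Optional<String> check(String dir, long usedBytes, long totalBytes, float maxPercent) {
        if (totalBytes <= 0) {
            return Optional.of("local-dir " + dir + " is bad: cannot determine capacity");
        }
        float usedPercent = 100.0f * usedBytes / totalBytes;
        if (usedPercent > maxPercent) {
            return Optional.of(String.format(Locale.ROOT,
                "local-dir %s is bad: used space %.1f%% is above the limit %.1f%%",
                dir, usedPercent, maxPercent));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The directory and the 98% limit are taken from the report; any path works here.
        File dir = new File(System.getProperty("java.io.tmpdir"));
        long total = dir.getTotalSpace();
        long used = total - dir.getUsableSpace();
        check(dir.getPath(), used, total, 98.0f)
            .ifPresent(message -> System.err.println("ERROR " + message));
    }
}
```

Raising the limit, as the reporter did, makes the same directory pass the check; an error line like the one above would point an operator at that setting directly.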
[jira] [Created] (YARN-8164) Fix an Potential NPE
lujie created YARN-8164:
---
Summary: Fix an Potential NPE
Key: YARN-8164
URL: https://issues.apache.org/jira/browse/YARN-8164
Project: Hadoop YARN
Issue Type: Bug
Reporter: lujie

We have developed a static analysis tool, [NPEDetector|https://github.com/lujiefsi/NPEDetector], to find potential NPEs. Our analysis shows that some callees may return null in corner cases (e.g. node crash, IO exception); some of their callers have a _!=null_ check but others do not. The callee FairScheduler#getAppsInQueue can return null:
{code:java}
public List getAppsInQueue(String queueName) {
  FSQueue queue = queueMgr.getQueue(queueName);
  if (queue == null) {
    return null; // here
  }
}
{code}
It has 4 callers; three of them have a null check and one does not. In this issue we post a patch that adds the missing != null check, modeled on the existing checks.
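The caller-side guard described in YARN-8164 can be sketched as follows. The class below is a hypothetical stand-in: NullableCalleeSketch, addApp and countApps are invented for illustration, and only the contract that getAppsInQueue returns null for an unknown queue is taken from the report.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a scheduler whose lookup returns null for unknown queues. */
final class NullableCalleeSketch {
    private final Map<String, List<String>> appsByQueue = new HashMap<>();

    void addApp(String queueName, String appId) {
        appsByQueue.computeIfAbsent(queueName, q -> new ArrayList<>()).add(appId);
    }

    /** Mirrors the callee contract from the report: null when the queue does not exist. */
    List<String> getAppsInQueue(String queueName) {
        return appsByQueue.get(queueName);
    }

    /** A caller that applies the != null guard before using the result. */
    int countApps(String queueName) {
        List<String> apps = getAppsInQueue(queueName);
        if (apps == null) {
            return 0;             // unknown queue: nothing to count
        }
        return apps.size();
    }

    public static void main(String[] args) {
        NullableCalleeSketch scheduler = new NullableCalleeSketch();
        scheduler.addApp("default", "application_1");
        System.out.println(scheduler.countApps("default"));
        System.out.println(scheduler.countApps("missing"));
    }
}
```

Returning an empty list from the callee would remove the need for guards altogether, but it changes the contract for every caller, which is presumably why the report adds the check at the one unguarded call site instead.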
[jira] [Created] (YARN-7786) NullPointerException while launching ApplicationMaster
lujie created YARN-7786: --- Summary: NullPointerException while launching ApplicationMaster Key: YARN-7786 URL: https://issues.apache.org/jira/browse/YARN-7786 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-beta1 Reporter: lujie Assignee: lujie
[jira] [Created] (YARN-7785) Invalid event: CONTAINER_RESOURCES_CLEANEDUP at DONE
lujie created YARN-7785:
---
Summary: Invalid event: CONTAINER_RESOURCES_CLEANEDUP at DONE
Key: YARN-7785
URL: https://issues.apache.org/jira/browse/YARN-7785
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.0.0-beta1, 2.8.0
Environment:
{code:java}
2017-11-25 18:51:30,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Can't handle this event at current state: Current: [DONE], eventType: [CONTAINER_RESOURCES_CLEANEDUP], container: [container_1511606993239_0001_01_11]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_RESOURCES_CLEANEDUP at DONE
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1704)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:96)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1490)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
{code}
Reporter: lujie
Assignee: lujie

Sending a kill command while a job is running causes an exception in the NodeManager (see the log in the Environment field).
[jira] [Created] (YARN-7726) RMAppImpl: can't handle APP_ACCEPTED at state ACCEPTED
lujie created YARN-7726:
---
Summary: RMAppImpl: can't handle APP_ACCEPTED at state ACCEPTED
Key: YARN-7726
URL: https://issues.apache.org/jira/browse/YARN-7726
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.8.3
Reporter: lujie
Priority: Minor

While adding a patch to TestRMAppTransitions, the patch triggered the error message "can't handle APP_ACCEPTED at state ACCEPTED" in the unit tests testAppAcceptedFailed and testAppRunningFailed.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-6950) Invalid event: LAUNCH_FAILED at FAILED
[ https://issues.apache.org/jira/browse/YARN-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lujie resolved YARN-6950.
-
Resolution: Duplicate

> Invalid event: LAUNCH_FAILED at FAILED
>
> Key: YARN-6950
> URL: https://issues.apache.org/jira/browse/YARN-6950
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 2.6.0
> Reporter: lujie
> Fix For: 2.7.0
>
> An RMAppAttemptImpl fails for some reason; meanwhile the AM fails to launch a container and sends a LAUNCH_FAILED event, which the state machine cannot handle:
> {code:java}
> 2017-07-05 03:33:09,013 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:834)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
[jira] [Created] (YARN-7563) Invalid event: FINISH_APPLICATION at NEW
lujie created YARN-7563:
---
Summary: Invalid event: FINISH_APPLICATION at NEW
Key: YARN-7563
URL: https://issues.apache.org/jira/browse/YARN-7563
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 3.0.0-beta1
Reporter: lujie

I sent a kill command to an application, and the NodeManager log shows:
{code:java}
2017-11-25 19:18:48,126 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: couldn't find container container_1511608703018_0001_01_01 while processing FINISH_CONTAINERS event
2017-11-25 19:18:48,146 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: FINISH_APPLICATION at NEW
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:627)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:75)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1508)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1501)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
    at java.lang.Thread.run(Thread.java:745)
2017-11-25 19:18:48,151 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1511608703018_0001 transitioned from NEW to INITING
{code}
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
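The YARN-7563 log shows FINISH_APPLICATION being rejected at NEW at 19:18:48,146, and the application moving from NEW to INITING at 19:18:48,151. A minimal, self-contained sketch of that shape follows. It is a toy transition table, not Hadoop's StateMachineFactory or ApplicationImpl: the class name InvalidEventSketch, the reduced state and event sets (including the hypothetical FINISHING state), and the choice to model the second step as an INIT_APPLICATION event are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Toy model only: a table-driven state machine fed by a single event queue.
class InvalidEventSketch {
    enum State { NEW, INITING, FINISHING }               // FINISHING is a hypothetical stand-in
    enum Event { INIT_APPLICATION, FINISH_APPLICATION }

    // Transition table: current state -> (event -> next state).
    static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);
    static {
        Map<Event, State> fromNew = new EnumMap<>(Event.class);
        fromNew.put(Event.INIT_APPLICATION, State.INITING);
        TABLE.put(State.NEW, fromNew);

        Map<Event, State> fromIniting = new EnumMap<>(Event.class);
        fromIniting.put(Event.FINISH_APPLICATION, State.FINISHING);
        TABLE.put(State.INITING, fromIniting);
        // No row for FINISHING: every event is invalid there.
    }

    // Looks up the transition; an unregistered (state, event) pair is an error.
    static State next(State current, Event event) {
        Map<Event, State> row = TABLE.get(current);
        State target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        return target;
    }

    // Dispatches events in queue order; a rejected event is logged and dropped.
    static State run(State start, List<String> log, Event... events) {
        Queue<Event> queue = new ArrayDeque<>(Arrays.asList(events));
        State current = start;
        while (!queue.isEmpty()) {
            Event event = queue.poll();
            try {
                State target = next(current, event);
                log.add("transitioned from " + current + " to " + target);
                current = target;
            } catch (IllegalStateException e) {
                log.add("Can't handle this event at current state: " + e.getMessage());
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // The finish request is dequeued before the init request.
        State end = run(State.NEW, log, Event.FINISH_APPLICATION, Event.INIT_APPLICATION);
        for (String line : log) {
            System.out.println(line);
        }
        System.out.println("final state: " + end);
    }
}
```

In this toy run the finish request is logged and dropped, so the application still ends in INITING. Whether the real NodeManager later applies the kill by another path is not something the report establishes.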
[jira] [Created] (YARN-7176) After kill command is send, the job hangs
lujie created YARN-7176:
---
Summary: After kill command is send, the job hangs
Key: YARN-7176
URL: https://issues.apache.org/jira/browse/YARN-7176
Project: Hadoop YARN
Issue Type: Bug
Components: RM
Affects Versions: 2.6.0
Reporter: lujie
Priority: Critical

I submitted a job but had to kill it immediately. The job then hung. Checking the ResourceManager log, I found an ArrayIndexOutOfBoundsException and a NullPointerException:
{code:java}
2017-09-08 02:34:37,967 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1504809243340_0001_01. Got exception: java.lang.ArrayIndexOutOfBoundsException: 3
    at java.util.ArrayList.add(ArrayList.java:441)
    at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:330)
    at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$Builder.addAllApplicationACLs(YarnProtos.java:39956)
    at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.addApplicationACLs(ContainerLaunchContextPBImpl.java:446)
    at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToBuilder(ContainerLaunchContextPBImpl.java:121)
    at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToProto(ContainerLaunchContextPBImpl.java:128)
    at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.getProto(ContainerLaunchContextPBImpl.java:70)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.convertToProtoFormat(StartContainerRequestPBImpl.java:156)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToBuilder(StartContainerRequestPBImpl.java:85)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToProto(StartContainerRequestPBImpl.java:95)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.getProto(StartContainerRequestPBImpl.java:57)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.convertToProtoFormat(StartContainersRequestPBImpl.java:137)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.addLocalRequestsToProto(StartContainersRequestPBImpl.java:97)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToBuilder(StartContainersRequestPBImpl.java:79)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToProto(StartContainersRequestPBImpl.java:72)
    at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.getProto(StartContainersRequestPBImpl.java:48)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:93)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2017-09-08 02:34:37,968 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating app: application_1504809243340_0001
java.lang.NullPointerException
    at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
    at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
    at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.getSerializedSize(YarnProtos.java:38512)
    at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
    at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
    at org.apache.hadoop.yarn.proto.YarnProtos$ApplicationSubmissionContextProto.getSerializedSize(YarnProtos.java:28481)
    at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
    at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
    at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationStateDataProto.getSerializedSize(YarnServerResourceManagerRecoveryProtos.java:816)
    at com.google.protobuf.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)
    at
[jira] [Created] (YARN-6950) Invalid event: LAUNCH_FAILED at FAILED
lujie created YARN-6950:
---
Summary: Invalid event: LAUNCH_FAILED at FAILED
Key: YARN-6950
URL: https://issues.apache.org/jira/browse/YARN-6950
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.6.0
Reporter: lujie

An RMAppAttemptImpl fails for some reason. Meanwhile, the AM fails to launch a container and sends a LAUNCH_FAILED event, which the state machine cannot handle:
{code:java}
2017-07-05 03:33:09,013 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Created] (YARN-6949) Invalid event: LOCALIZED at LOCALIZED
lujie created YARN-6949:
---
Summary: Invalid event: LOCALIZED at LOCALIZED
Key: YARN-6949
URL: https://issues.apache.org/jira/browse/YARN-6949
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.8.0
Reporter: lujie

While a job was running, I stopped the NodeManager on one machine. When I then checked the logs to see the running state, I found many InvalidStateTransitionExceptions:
{code:java}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LOCALIZATION_FAILED at LOCALIZED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:198)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:194)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:58)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.processHeartbeat(ResourceLocalizationService.java:1058)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:720)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:355)
    at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:48)
    at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:63)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
{code}
[jira] [Created] (YARN-6948) Invalid event: ATTEMPT_ADDED at FINAL_SAVING
lujie created YARN-6948:
---
Summary: Invalid event: ATTEMPT_ADDED at FINAL_SAVING
Key: YARN-6948
URL: https://issues.apache.org/jira/browse/YARN-6948
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.8.0
Reporter: lujie

After sending a kill command to a running job, I checked the logs and found this exception:
{code:java}
2017-08-03 01:35:20,485 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_ADDED at FINAL_SAVING
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
{code}