[jira] [Created] (YARN-11215) MRAppmaster fails due to file permission bugs

2022-07-22 Thread lujie (Jira)
lujie created YARN-11215:


 Summary: MRAppmaster fails due to file permission bugs
 Key: YARN-11215
 URL: https://issues.apache.org/jira/browse/YARN-11215
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


2022-06-21 12:30:11,175 INFO [main] org.apache.hadoop.service.AbstractService: 
Service JobHistoryEventHandler failed in state INITED
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=user2, access=EXECUTE, 
inode="/tmp/hadoop-yarn/staging/history":user1:supergroup:drwx------
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:422)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:333)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:713)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1892)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1910)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:727)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3350)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1208)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1042)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)

        at 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceInit(JobHistoryEventHandler.java:232)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
        at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:494)
        at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1760)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
        at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1757)
        at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1691)
        
        
        2022-06-23 05:56:04,936 ERROR 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager: Error while trying to 
scan the directory 
hdfs://dffdddc36db0:8020/tmp/hadoop-yarn/staging/history/done_intermediate/user2
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=mapred, access=READ_EXECUTE, 
inode="/tmp/hadoop-yarn/staging/history/done_intermediate/user2":user2:supergroup:drwxrwx---
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:352)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
        at 

[jira] [Created] (YARN-11214) Fails to run the job due to file permission bugs

2022-07-22 Thread lujie (Jira)
lujie created YARN-11214:


 Summary: Fails to run the job due to file permission bugs
 Key: YARN-11214
 URL: https://issues.apache.org/jira/browse/YARN-11214
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


When user1 runs a job first, the staging dir /tmp/hadoop-yarn is created by user1
with permission 700. Hence user2 fails to run jobs.
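The denial in the trace below follows from an ordinary owner/group/other mode check: resolving any path under /tmp/hadoop-yarn needs EXECUTE on that directory, and mode 700 grants it to the owner only. A minimal sketch of that check, assuming a simplified model (no ACLs, no superuser); `TraverseCheck` and `canTraverse` are hypothetical names, not Hadoop APIs:

```java
import java.util.Set;

// Simplified model of an owner/group/other mode check (no ACLs, no superuser).
final class TraverseCheck {
    static final int EXECUTE = 1; // the "x" bit of one rwx triple

    /** Returns true if `user` holds EXECUTE on a directory with the given owner, group and mode. */
    static boolean canTraverse(String user, Set<String> userGroups,
                               String owner, String group, int mode) {
        int bits;
        if (user.equals(owner)) {
            bits = (mode >> 6) & 7;   // owner triple
        } else if (userGroups.contains(group)) {
            bits = (mode >> 3) & 7;   // group triple
        } else {
            bits = mode & 7;          // other triple
        }
        return (bits & EXECUTE) != 0;
    }

    public static void main(String[] args) {
        // /tmp/hadoop-yarn as in the report: owner user1, group supergroup, mode 700.
        System.out.println(canTraverse("user1", Set.of(), "user1", "supergroup", 0700));
        System.out.println(canTraverse("user2", Set.of(), "user1", "supergroup", 0700));
    }
}
```

In this model user1 passes the check and user2 does not, which matches the report; a mode with the "other" execute bit set (for example 711) would let user2 traverse the directory.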

 

 

Configuring for multihomed network
2022-06-21 11:33:31,066 INFO client.DefaultNoHARMFailoverProxyProvider: 
Connecting to ResourceManager at resourcemanager/172.22.0.5:8032
2022-06-21 11:33:31,296 INFO client.AHSProxy: Connecting to Application History 
server at historyserver/172.22.0.4:10200
Exception in thread "main" org.apache.hadoop.security.AccessControlException: 
Permission denied: user=user2, access=EXECUTE, 
inode="/tmp/hadoop-yarn":user1:supergroup:drwx------
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:422)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:333)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:713)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1892)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1910)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:727)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:112)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3350)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1208)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1042)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
        at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1741)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1753)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1750)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1765)
        at 
org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:136)
        at 
org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:148)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1571)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1568)
        at java.security.AccessController.doPrivileged(Native Method)
        at 

[jira] [Created] (YARN-11151) sensitive infor may leak due to crash

2022-05-14 Thread lujie (Jira)
lujie created YARN-11151:


 Summary: sensitive infor may leak due to crash
 Key: YARN-11151
 URL: https://issues.apache.org/jira/browse/YARN-11151
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


we init LevelDBCacheTimelineStore and LeveldbTimelineStore like:

 
{code:java}
 try {
      localFS = FileSystem.getLocal(conf);
      if (!localFS.exists(dbPath)) {
        if (!localFS.mkdirs(dbPath)) {
          throw new IOException("Couldn't create directory for leveldb " +
              "timeline store " + dbPath);
        }
        localFS.setPermission(dbPath, LeveldbUtils.LEVELDB_DIR_UMASK);
      }
    } finally {
      IOUtils.cleanupWithLogger(LOG, localFS);
    } {code}
 

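If the node crashes after mkdirs but before setPermission, the permission stays 755 forever, because on the next start the directory already exists and setPermission is skipped.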

 

The code should be something like:
{code:java}
 try {
      localFS = FileSystem.getLocal(conf);
      if (!localFS.exists(dbPath)) {
        if (!localFS.mkdirs(dbPath)) {
          throw new IOException("Couldn't create directory for leveldb " +
              "timeline store " + dbPath);
        }
      }
      // Check on every start, so a crash between mkdirs and setPermission is repaired.
      if (!localFS.getFileStatus(dbPath).getPermission()
          .equals(LeveldbUtils.LEVELDB_DIR_UMASK)) {
        localFS.setPermission(dbPath, LeveldbUtils.LEVELDB_DIR_UMASK);
      }
    } finally {
      IOUtils.cleanupWithLogger(LOG, localFS);
    } {code}
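The same idea can be tried outside Hadoop. The sketch below is a stand-in written against java.nio.file rather than the Hadoop FileSystem API, and it assumes a POSIX file system; `PrivateDir` and `ensure` are hypothetical names. It requests the restrictive mode at creation time and also repairs the mode on every call:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Stand-in for the Hadoop code above, using java.nio.file on a POSIX file system.
final class PrivateDir {
    static final Set<PosixFilePermission> OWNER_ONLY =
        PosixFilePermissions.fromString("rwx------");

    /** Creates `dir` if needed and repairs its mode on every call, not only on first creation. */
    static void ensure(Path dir) {
        try {
            if (!Files.exists(dir)) {
                // Passing the attribute asks the file system to apply the mode at creation time.
                Files.createDirectories(dir, PosixFilePermissions.asFileAttribute(OWNER_ONLY));
            }
            // Covers a directory left with a wider mode by an earlier crash.
            if (!Files.getPosixFilePermissions(dir).equals(OWNER_ONLY)) {
                Files.setPosixFilePermissions(dir, OWNER_ONLY);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("leveldb-demo").resolve("timeline-store");
        ensure(dir);
        System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(dir)));
    }
}
```

Requesting the mode at creation narrows the crash window, while the unconditional check handles directories that an earlier run already left behind with a wider mode.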
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11150) sensitive inform may leak due to crash

2022-05-14 Thread lujie (Jira)
lujie created YARN-11150:


 Summary: sensitive inform may leak due to crash
 Key: YARN-11150
 URL: https://issues.apache.org/jira/browse/YARN-11150
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie









[jira] [Resolved] (YARN-10980) fix CVE-2020-8908

2021-10-18 Thread lujie (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie resolved YARN-10980.
--
Resolution: Fixed

> fix CVE-2020-8908
> -
>
> Key: YARN-10980
> URL: https://issues.apache.org/jira/browse/YARN-10980
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Priority: Major
>
> see  [https://www.cvedetails.com/cve/CVE-2020-8908/]
>  
> A temp directory creation vulnerability exists in all versions of Guava, 
> allowing an attacker with access to the machine to potentially access data in 
> a temporary directory created by the Guava API 
> com.google.common.io.Files.createTempDir(). By default, on unix-like systems, 
> the created directory is world-readable (readable by an attacker with access 
> to the system). The method in question has been marked @Deprecated in 
> versions 30.0 and later and should not be used. For Android developers, we 
> recommend choosing a temporary directory API provided by Android, such as 
> context.getCacheDir(). For other Java developers, we recommend migrating to 
> the Java 7 API java.nio.file.Files.createTempDirectory() which explicitly 
> configures permissions of 700, or configuring the Java runtime's 
> java.io.tmpdir system property to point to a location whose permissions are 
> appropriately configured.
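The migration recommended in the advisory text can be sketched as follows. This is an illustration only, not the actual YARN patch; `TempDirs` and `createPrivateTempDir` are hypothetical names, and the owner-only result is assumed to hold on POSIX file systems as the advisory describes:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempDirs {
    /** Uses the JDK call that the advisory recommends in place of Guava's Files.createTempDir(). */
    static Path createPrivateTempDir(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = createPrivateTempDir("yarn-demo-");
        // On a POSIX file system this is expected to list owner permissions only.
        System.out.println(Files.getPosixFilePermissions(dir));
    }
}
```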






[jira] [Created] (YARN-10980) fix CVE-2020-8908

2021-10-18 Thread lujie (Jira)
lujie created YARN-10980:


 Summary: fix CVE-2020-8908
 Key: YARN-10980
 URL: https://issues.apache.org/jira/browse/YARN-10980
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie









[jira] [Created] (YARN-10976) Resource Leak due to Files.walk

2021-10-13 Thread lujie (Jira)
lujie created YARN-10976:


 Summary: Resource Leak due to Files.walk
 Key: YARN-10976
 URL: https://issues.apache.org/jira/browse/YARN-10976
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: lujie


The Stream created by Files.walk should be closed, as the JDK Javadoc says:
 *  The returned stream encapsulates one or more {@link DirectoryStream}s.
 * If timely disposal of file system resources is required, the
 * {@code try}-with-resources construct should be used to ensure that the
 * stream's {@link Stream#close close} method is invoked after the stream
 * operations are completed. Operating on a closed stream will result in an
 * {@link java.lang.IllegalStateException}.
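A minimal sketch of the try-with-resources pattern that the Javadoc describes. `WalkCount` and `countRegularFiles` are hypothetical names chosen for illustration, not code from YARN:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class WalkCount {
    /** Counts regular files under `root`; the stream is closed when the try block exits. */
    static long countRegularFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(countRegularFiles(root));
    }
}
```

Because the stream is declared in the try header, its close method runs even if the pipeline throws, which releases the underlying directory handles.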






[jira] [Created] (YARN-10819) Security inconsistency for log file

2021-06-11 Thread lujie (Jira)
lujie created YARN-10819:


 Summary: Security inconsistency for log file
 Key: YARN-10819
 URL: https://issues.apache.org/jira/browse/YARN-10819
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie









[jira] [Created] (YARN-10555) missing security check before getAppAttempts

2020-12-30 Thread lujie (Jira)
lujie created YARN-10555:


 Summary:  missing security check before getAppAttempts
 Key: YARN-10555
 URL: https://issues.apache.org/jira/browse/YARN-10555
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie









[jira] [Resolved] (YARN-10551) non-admin user can change the log level

2020-12-30 Thread lujie (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie resolved YARN-10551.
--
Resolution: Not A Problem

misconfiguration! see 
https://issues.apache.org/jira/secure/attachment/12832635/HADOOP-13707.001.patch

> non-admin user can change the log level
> ---
>
> Key: YARN-10551
> URL: https://issues.apache.org/jira/browse/YARN-10551
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Priority: Major
>
> reproduce:
> 1. login as user1 and do
> {code:java}
> yarn daemonlog -setlevel hadoop11:8088 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl DEBUG
> {code}
> 2. login as user2 and run wordcount
> 3. check the log of RM
> {code:java}
> 2020-12-27 10:54:15,917 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing 
> event for application_1609065586411_0003 of type START
> {code}






[jira] [Created] (YARN-10551) non-admin user can change the log level

2020-12-27 Thread lujie (Jira)
lujie created YARN-10551:


 Summary: non-admin user can change the log level
 Key: YARN-10551
 URL: https://issues.apache.org/jira/browse/YARN-10551
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


reproduce:

1. login as user1 and do
{code:java}
yarn daemonlog -setlevel hadoop11:8088 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl DEBUG
{code}
2. login as user2 and run wordcount

3. check the log of RM
{code:java}
2020-12-27 10:54:15,917 DEBUG 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event 
for application_1609065586411_0003 of type START
{code}






[jira] [Created] (YARN-9594) Unknown event arrived at ContainerScheduler: EventType: RECOVERY_COMPLETED

2019-06-02 Thread lujie (JIRA)
lujie created YARN-9594:
---

 Summary: Unknown event arrived at ContainerScheduler: EventType: 
RECOVERY_COMPLETED
 Key: YARN-9594
 URL: https://issues.apache.org/jira/browse/YARN-9594
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


{code:java}
case RECOVERY_COMPLETED:
  startPendingContainers(maxOppQueueLength <= 0);
  metrics.setQueuedContainers(queuedOpportunisticContainers.size(),
 queuedGuaranteedContainers.size());
default:
  LOG.error("Unknown event arrived at ContainerScheduler: "
+ event.toString());
{code}
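Assuming the quoted snippet is complete, the RECOVERY_COMPLETED branch has no break, so control falls through into default and logs the "Unknown event" error named in the summary. A self-contained sketch of that behavior and of the one-line fix; `FallThroughDemo` and its members are hypothetical and only mimic the shape of the ContainerScheduler switch:

```java
import java.util.ArrayList;
import java.util.List;

final class FallThroughDemo {
    enum EventType { RECOVERY_COMPLETED, OTHER }

    /** Mirrors the reported shape: no break, so control continues into default. */
    static List<String> withoutBreak(EventType type) {
        List<String> log = new ArrayList<>();
        switch (type) {
            case RECOVERY_COMPLETED:
                log.add("handled RECOVERY_COMPLETED");
                // falls through
            default:
                log.add("Unknown event arrived: " + type);
        }
        return log;
    }

    /** Same switch with a break, so the handled event never reaches default. */
    static List<String> withBreak(EventType type) {
        List<String> log = new ArrayList<>();
        switch (type) {
            case RECOVERY_COMPLETED:
                log.add("handled RECOVERY_COMPLETED");
                break;
            default:
                log.add("Unknown event arrived: " + type);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(withoutBreak(EventType.RECOVERY_COMPLETED));
        System.out.println(withBreak(EventType.RECOVERY_COMPLETED));
    }
}
```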






[jira] [Created] (YARN-9589) NPE in HttpURLConnection.disconnect

2019-05-30 Thread lujie (JIRA)
lujie created YARN-9589:
---

 Summary: NPE in HttpURLConnection.disconnect
 Key: YARN-9589
 URL: https://issues.apache.org/jira/browse/YARN-9589
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


 

 
{code:java}
2019-05-30 10:17:59,869 ERROR [IPC Server handler 2 on default port 51923] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: 
attempt_1559182584278_0001_r_00_0 - exited : java.lang.NullPointerException
at 
sun.net.www.protocol.http.HttpURLConnection.disconnect(HttpURLConnection.java:2821)
at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.closeConnection(Fetcher.java:255)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.interrupt(Fetcher.java:216)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.shutDown(Fetcher.java:224)
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:147)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:377)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

{code}
 






[jira] [Created] (YARN-9588) InvalidToken: appattempt_XXX not found in AMRMTokenSecretManager while RM reboot

2019-05-29 Thread lujie (JIRA)
lujie created YARN-9588:
---

 Summary: InvalidToken: appattempt_XXX not found in 
AMRMTokenSecretManager while RM reboot
 Key: YARN-9588
 URL: https://issues.apache.org/jira/browse/YARN-9588
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


Hi,

When the application has succeeded but the RM reboots before the AM has
unregistered, the following error occurs:
{code:java}
2019-05-29 18:55:11,112 ERROR [Thread-76] 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Exception while 
unregistering
org.apache.hadoop.security.token.SecretManager$InvalidToken: 
appattempt_1559127208490_0001_01 not found in AMRMTokenSecretManager.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy84.finishApplicationMaster(Unknown Source)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.doUnregistration(RMCommunicator.java:227)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:189)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStop(RMCommunicator.java:267)
at 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStop(RMContainerAllocator.java:327)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStop(MRAppMaster.java:985)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158)
at 
org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStop(MRAppMaster.java:1868)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.stop(MRAppMaster.java:1309)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:668)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:747)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 appattempt_1559127208490_0001_01 not found in AMRMTokenSecretManager.
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1547)
at org.apache.hadoop.ipc.Client.call(Client.java:1493)
at org.apache.hadoop.ipc.Client.call(Client.java:1392)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy83.finishApplicationMaster(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:92)
... 27 more

{code}
This sets isLastAMRetry to false, so the current AM won't clean up the
staging dir:
{code:java}
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Skipping cleaning up the 
staging dir. assuming AM will be retried.
{code}




[jira] [Created] (YARN-9248) RMContainerImpl:Invalid event: ACQUIRED at KILLED

2019-01-29 Thread lujie (JIRA)
lujie created YARN-9248:
---

 Summary: RMContainerImpl:Invalid event: ACQUIRED at KILLED
 Key: YARN-9248
 URL: https://issues.apache.org/jira/browse/YARN-9248
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie
Assignee: lujie


{code:java}
2019-01-29 11:46:53,596 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
ACQUIRED at KILLED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:475)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:67)
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.handleNewContainers(OpportunisticContainerAllocatorAMService.java:351)
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.access$100(OpportunisticContainerAllocatorAMService.java:94)
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:197)
at 
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}






[jira] [Created] (YARN-9238) An huge Data Race can make we get a wrong attempt by an appAttemptId

2019-01-25 Thread lujie (JIRA)
lujie created YARN-9238:
---

 Summary: An huge Data Race can make we get a wrong attempt  by an 
appAttemptId 
 Key: YARN-9238
 URL: https://issues.apache.org/jira/browse/YARN-9238
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


We have found a data race that can lead to an odd situation.

See 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate:
{code:java}
 // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.((AbstractYarnScheduler)rmContext.getScheduler())
173.  .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If we crash the current AM (its attempt id is appattempt_0) just before
line 171, and lines 171-173 then continue to execute and look up the
appAttempt by appattempt_0, the appAttempt should represent the current AM.
But we found that the appAttempt represents the new AM, whose attempt id
is appattempt_1. This appAttempt, which represents the new AM, has not yet
initialized its oppCtx, so an NPE happens at line 177.

 
{code:java}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at 
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
We have found the reason why we pass the old appattempt_0 but get the new appAttempt
that represents the new AM. Below is the body of getApplicationAttempt,
which is called at line 173:

 
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.   SchedulerApplication app = applications.get(
401.       applicationAttemptId.getApplicationId());
402.   return app == null ? null : app.getCurrentAppAttempt();
403. }
{code}
When the old AM crashes, the CurrentAppAttempt of the app is set to the new
appAttempt that represents the new AM. So line 402 returns the new
appAttempt.

We should add a check that the returned appAttempt has the same id as the
given id.

A patch is coming soon!
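A minimal sketch of the proposed id check, not the actual patch. `AttemptLookup`, `Attempt` and `App` are hypothetical stand-ins for the scheduler classes, and attempt ids are plain strings here for brevity:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class AttemptLookup {
    // Hypothetical stand-in for SchedulerApplicationAttempt.
    static final class Attempt {
        final String attemptId;
        Attempt(String attemptId) { this.attemptId = attemptId; }
    }

    // Hypothetical stand-in for SchedulerApplication.
    static final class App {
        volatile Attempt current;
    }

    final Map<String, App> applications = new ConcurrentHashMap<>();

    /** Returns the current attempt only if its id matches the requested one, else null. */
    Attempt getApplicationAttempt(String appId, String attemptId) {
        App app = applications.get(appId);
        if (app == null) {
            return null;
        }
        Attempt current = app.current; // read once so the check and the return see the same object
        if (current == null || !current.attemptId.equals(attemptId)) {
            return null; // the app has moved on to another attempt
        }
        return current;
    }

    public static void main(String[] args) {
        AttemptLookup lookup = new AttemptLookup();
        App app = new App();
        app.current = new Attempt("appattempt_1"); // the new AM has already replaced appattempt_0
        lookup.applications.put("app", app);
        System.out.println(lookup.getApplicationAttempt("app", "appattempt_0") == null);
    }
}
```

With this shape a stale request for appattempt_0 gets null instead of the new attempt, so callers such as allocate would need to handle a null result explicitly.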

 






[jira] [Created] (YARN-9226) NPE while YarnChild shudown

2019-01-23 Thread lujie (JIRA)
lujie created YARN-9226:
---

 Summary: NPE while YarnChild shudown
 Key: YARN-9226
 URL: https://issues.apache.org/jira/browse/YARN-9226
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


In YarnChild.main
{code:java}
try{
 logSyncer = TaskLog.createLogSyncer();//line 168
 
 taskFinal.run(job, umbilical); //line 178
}catch (Exception exception) {//line 187
  LOG.warn("Exception running child : "
   + StringUtils.stringifyException(exception));
   .
   task.taskCleanup(umbilical);// line 200
}{code}
Line 178 initializes task.committer, but line 168 may throw an exception 
first. In that case the initialization is skipped, so task.committer == null. 
Line 187 catches the exception and performs cleanup (line 200), which uses 
task.committer without a null check, hence an NPE happens:
{code:java}
2019-01-23 16:59:42,864 INFO [main] org.apache.hadoop.mapred.YarnChild: 
Exception cleaning up: java.lang.NullPointerException
at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:1458)
at org.apache.hadoop.mapred.YarnChild$3.run(YarnChild.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:197)
{code}
Why might line 168 throw an exception? The log below gives an example:
{code:java}
2019-01-23 16:59:42,857 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : java.lang.IllegalStateException: Shutdown in 
progress, cannot add a shutdownHook
at 
org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:299)
at org.apache.hadoop.mapred.TaskLog.createLogSyncer(TaskLog.java:340)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{code}
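
One possible way to avoid the NPE is a null guard in the cleanup path. The sketch below is an assumption about what such a guard could look like, not the actual fix: CleanupGuardSketch, its nested Task class, and the returned status strings are hypothetical, and the committer is modeled as a plain Runnable.

```java
class CleanupGuardSketch {
    // Hypothetical stand-in for a task whose committer is set only when run() is reached.
    static final class Task {
        Runnable committer; // stays null if startup fails before run()

        void run() {
            committer = () -> { };
        }

        // Returns true if the committer was used, false if cleanup was skipped.
        boolean taskCleanup() {
            if (committer == null) {
                return false; // nothing was initialized, so there is nothing to abort
            }
            committer.run();
            return true;
        }
    }

    // Mimics the try/catch from the report: setup may throw before run()
    // has had a chance to initialize the committer.
    static String runChild(Task task, boolean failDuringSetup) {
        try {
            if (failDuringSetup) {
                throw new IllegalStateException(
                    "Shutdown in progress, cannot add a shutdownHook");
            }
            task.run();
            return "completed";
        } catch (Exception e) {
            return task.taskCleanup() ? "cleaned up" : "cleanup skipped";
        }
    }

    public static void main(String[] args) {
        System.out.println(runChild(new Task(), true));
        System.out.println(runChild(new Task(), false));
    }
}
```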






[jira] [Created] (YARN-9223) NPE happens in NM while loading recovery fails

2019-01-22 Thread lujie (JIRA)
lujie created YARN-9223:
---

 Summary: NPE happens in NM while loading recovery fails
 Key: YARN-9223
 URL: https://issues.apache.org/jira/browse/YARN-9223
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie
Assignee: lujie


In org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit:

 
{code:java}
try {
  initAndStartRecoveryStore(conf);
} catch (IOException e) {
  String recoveryDirName = conf.get(YarnConfiguration.NM_RECOVERY_DIR);
  throw new YarnRuntimeException(
      "Unable to initialize recovery directory at " + recoveryDirName, e);
}

this.context = createNMContext(containerTokenSecretManager,
    nmTokenSecretManager, nmStore, isDistSchedulingEnabled, conf);
{code}
When recovery fails, the context is still null, and the YarnRuntimeException 
causes serviceStop to run (due to the shutdown hook):
{code:java}
// Cleanup ResourcePluginManager
ResourcePluginManager rpm = context.getResourcePluginManager();
{code}
Hence an NPE happens:
{code:java}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:530)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:984)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1064)
{code}
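
The failure pattern, and a guard that would avoid it, can be sketched as below. This is a simplified model under stated assumptions rather than NodeManager code: ServiceStopGuardSketch, its Context class, and the init wrapper are hypothetical, and the wrapper only imitates the "stop on failed init" behavior visible in the stack trace.

```java
class ServiceStopGuardSketch {
    // Hypothetical stand-in for the NM context.
    static final class Context {
        boolean cleanedUp;

        void cleanup() {
            cleanedUp = true;
        }
    }

    private Context context; // assigned only if init gets past the recovery step

    void serviceInit(boolean recoveryFails) {
        if (recoveryFails) {
            throw new RuntimeException("Unable to initialize recovery directory");
        }
        context = new Context();
    }

    // Returns true if cleanup ran, false if there was no context to clean up.
    boolean serviceStop() {
        if (context == null) {
            return false; // init failed early, so skip the context cleanup
        }
        context.cleanup();
        return true;
    }

    // Imitates the init path in the stack trace: a failed init triggers stop,
    // then the original exception is rethrown.
    void init(boolean recoveryFails) {
        try {
            serviceInit(recoveryFails);
        } catch (RuntimeException e) {
            serviceStop();
            throw e;
        }
    }

    public static void main(String[] args) {
        ServiceStopGuardSketch nm = new ServiceStopGuardSketch();
        try {
            nm.init(true);
        } catch (RuntimeException e) {
            // The original failure is reported instead of being masked by an NPE.
            System.out.println(e.getMessage());
        }
    }
}
```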
 

 






[jira] [Resolved] (YARN-9201) RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED

2019-01-16 Thread lujie (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie resolved YARN-9201.
-
Resolution: Duplicate
  Assignee: lujie

> RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED
> 
>
> Key: YARN-9201
> URL: https://issues.apache.org/jira/browse/YARN-9201
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Major
> Attachments: hadoop-hires-resourcemanager-hadoop11.log
>
>
> When a node is removed, the RM kills the application and changes its state to 
> failed. AMLauncher then cannot launch due to a java.io.IOException and sends 
> a LAUNCH_FAILED event to the application, so an error happens:
>  
> {code:java}
> 2019-01-16 12:55:33,334 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> App attempt: appattempt_1547614499484_0001_01 can't handle this event at 
> current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> LAUNCH_FAILED at FAILED
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (YARN-9201) RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED

2019-01-16 Thread lujie (JIRA)
lujie created YARN-9201:
---

 Summary: RMAppAttemptImpl: Invalid event: LAUNCH_FAILED at FAILED
 Key: YARN-9201
 URL: https://issues.apache.org/jira/browse/YARN-9201
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


When a node is removed, the RM kills the application and changes its state to 
failed. AMLauncher then cannot launch due to a java.io.IOException:

 
{code:java}
2019-01-16 12:55:33,334 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
App attempt: appattempt_1547614499484_0001_01 can't handle this event at 
current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LAUNCH_FAILED at FAILED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)

{code}






[jira] [Resolved] (YARN-9193) NullPointerException happens in RM while shutdown a NM

2019-01-14 Thread lujie (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie resolved YARN-9193.
-
Resolution: Duplicate

It will be fixed in 
[YARN-9194|https://issues.apache.org/jira/browse/YARN-9194].

> NullPointerException happens in RM while shutdown a NM
> --
>
> Key: YARN-9193
> URL: https://issues.apache.org/jira/browse/YARN-9193
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Major
> Attachments: hadoop-hires-resourcemanager-hadoop11.log
>
>
> While shutting down a NodeManager, a NullPointerException occurs in the RM:
>  
> {code:java}
> 2019-01-13 08:52:20,299 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_ALLOCATED for applicationAttempt 
> appattempt_1547340702286_0001_01
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:1210)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:1180)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
>  






[jira] [Created] (YARN-9194) Invalid event: REGISTERED at FAILED

2019-01-12 Thread lujie (JIRA)
lujie created YARN-9194:
---

 Summary: Invalid event: REGISTERED at FAILED
 Key: YARN-9194
 URL: https://issues.apache.org/jira/browse/YARN-9194
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie
Assignee: lujie


The REGISTERED event arrives after the attempt has already failed, hence the 
InvalidStateTransitionException happens.

 
{code:java}
2019-01-13 00:41:57,127 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
App attempt: appattempt_1547311267249_0001_02 can't handle this event at 
current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
REGISTERED at FAILED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
{code}
 






[jira] [Created] (YARN-9193) NullPointerException happens in RM while shutdown a NM

2019-01-12 Thread lujie (JIRA)
lujie created YARN-9193:
---

 Summary: NullPointerException happens in RM while shutdown a NM
 Key: YARN-9193
 URL: https://issues.apache.org/jira/browse/YARN-9193
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


While shutting down a NodeManager, a NullPointerException occurs in the RM.






[jira] [Created] (YARN-9165) NPE which is similar to YARN-5918

2018-12-31 Thread lujie (JIRA)
lujie created YARN-9165:
---

 Summary: NPE which is similar to YARN-5918
 Key: YARN-9165
 URL: https://issues.apache.org/jira/browse/YARN-9165
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


{code:java}
2018-12-31 22:30:06,681 WARN org.apache.hadoop.ipc.Server: IPC Server handler 2 
on default port 8030, call Call#23 Retry#0 
org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
10.3.1.15:52796
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.createOpportunisticRmContainer(SchedulerUtils.java:576)
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.handleNewContainers(OpportunisticContainerAllocatorAMService.java:349)
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.access$100(OpportunisticContainerAllocatorAMService.java:94)
at 
org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:197)
at 
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)

{code}






[jira] [Created] (YARN-9164) NullPointerException crash the ResourceManager

2018-12-31 Thread lujie (JIRA)
lujie created YARN-9164:
---

 Summary: NullPointerException crash the ResourceManager
 Key: YARN-9164
 URL: https://issues.apache.org/jira/browse/YARN-9164
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie
Assignee: lujie


We have encountered an NPE that can crash the whole cluster:
{code:java}
2018-12-31 22:18:11,924 FATAL org.apache.hadoop.yarn.event.EventDispatcher: 
Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:696)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1123)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1827)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
at java.lang.Thread.run(Thread.java:745)

{code}
This bug also happens in the latest trunk!

 






[jira] [Resolved] (YARN-8650) Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and Invalid event: CONTAINER_LAUNCHED at DONE

2018-08-13 Thread lujie (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie resolved YARN-8650.
-
Resolution: Duplicate

> Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and  Invalid event: 
> CONTAINER_LAUNCHED at DONE
> -
>
> Key: YARN-8650
> URL: https://issues.apache.org/jira/browse/YARN-8650
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Priority: Major
> Attachments: hadoop-hires-nodemanager-hadoop11.log, 
> hadoop-hires-nodemanager-hadoop15.log
>
>
> We tested Hadoop while a NodeManager was shutting down and encountered two 
> InvalidStateTransitionExceptions:
> {code:java}
> 2018-08-04 14:29:33,025 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Can't handle this event at current state: Current: [DONE], eventType: 
> [CONTAINER_KILLED_ON_REQUEST], container: 
> [container_1533364185282_0001_01_01]
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CONTAINER_KILLED_ON_REQUEST at DONE
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {code:java}
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CONTAINER_LAUNCHED at DONE
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> We have analyzed these two bugs and found that shutdown sends a kill event, 
> which causes both exceptions. We have tested our cluster many times and can 
> reproduce them deterministically. 






[jira] [Created] (YARN-8650) Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and Invalid event: CONTAINER_LAUNCHED at DONE

2018-08-10 Thread lujie (JIRA)
lujie created YARN-8650:
---

 Summary: Invalid event: CONTAINER_KILLED_ON_REQUEST at DONE and  
Invalid event: CONTAINER_LAUNCHED at DONE
 Key: YARN-8650
 URL: https://issues.apache.org/jira/browse/YARN-8650
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


We tested Hadoop while a NodeManager was shutting down and encountered two 
InvalidStateTransitionExceptions:
{code:java}
2018-08-04 14:29:33,025 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Can't handle this event at current state: Current: [DONE], eventType: 
[CONTAINER_KILLED_ON_REQUEST], container: 
[container_1533364185282_0001_01_01]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
CONTAINER_KILLED_ON_REQUEST at DONE
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
{code}
{code:java}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
CONTAINER_LAUNCHED at DONE
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2084)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:103)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1476)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
{code}
We have analyzed these two bugs and found that shutdown sends a kill event, 
which causes both exceptions. We have tested our cluster many times and can 
reproduce them deterministically. 
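
To illustrate how an "Invalid event: X at Y" error arises, here is a minimal table-driven state machine. It is a sketch only and does not use YARN's StateMachineFactory API: the class LateEventSketch, its two states, and the FINISHED event are hypothetical. It also shows one assumed way to tolerate late events, by registering a self-transition on the terminal state; whether that is the right fix for this issue is not established here.

```java
import java.util.EnumMap;
import java.util.Map;

class LateEventSketch {
    enum State { RUNNING, DONE }

    enum Event { CONTAINER_LAUNCHED, CONTAINER_KILLED_ON_REQUEST, FINISHED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.RUNNING;

    void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    // Looks up the arc for (current, event); a missing arc is an invalid event.
    State handle(Event event) {
        Map<Event, State> arcs = table.get(current);
        State next = (arcs == null) ? null : arcs.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        LateEventSketch container = new LateEventSketch();
        container.addTransition(State.RUNNING, Event.FINISHED, State.DONE);
        container.handle(Event.FINISHED);

        // A kill event sent during shutdown arrives after the container is DONE.
        try {
            container.handle(Event.CONTAINER_KILLED_ON_REQUEST);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Registering a self-transition makes the late event a no-op.
        container.addTransition(State.DONE, Event.CONTAINER_KILLED_ON_REQUEST, State.DONE);
        System.out.println(container.handle(Event.CONTAINER_KILLED_ON_REQUEST));
    }
}
```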






[jira] [Created] (YARN-8649) Same as YARN-4355:NPE while processing localizer heartbeat

2018-08-10 Thread lujie (JIRA)
lujie created YARN-8649:
---

 Summary: Same as YARN-4355:NPE while processing localizer heartbeat
 Key: YARN-8649
 URL: https://issues.apache.org/jira/browse/YARN-8649
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: lujie


I noticed that a NodeManager was getting NPEs while processing a heartbeat. 
This is similar to [YARN-4355|https://issues.apache.org/jira/browse/YARN-4355], 
which was reported by Jason Lowe.






[jira] [Created] (YARN-8381) Job get stuck while node is unhealthy, but without log messages to indicate such case

2018-05-30 Thread lujie (JIRA)
lujie created YARN-8381:
---

 Summary: Job get stuck while node is unhealthy, but without log 
messages to indicate such case
 Key: YARN-8381
 URL: https://issues.apache.org/jira/browse/YARN-8381
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: lujie


I started a fresh pseudo-distributed system on a node, then ran a job, but it 
got stuck. My first reaction was to check the log messages to locate the 
problem, but I found no error message. Only after reading log messages for a 
long time did I think to check the node health. The YARN web UI showed that the 
NodeManager was unhealthy due to "{{local-dirs are bad: 
/tmp/hadoop-hduser/nm-local-dir}}". I reconfigured 
"{{yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage}}" 
to 98% and this solved the problem. But I still strongly recommend adding error 
log messages for an unhealthy NodeManager.
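
As a rough illustration of the requested behavior, the sketch below compares a directory's disk utilization against a cutoff percentage and logs an error when the directory is considered bad. It is an assumption-laden sketch, not the NodeManager's disk health checker: DiskHealthLogSketch, isOverUtilized, and checkAndLog are hypothetical names, and treating a directory with unknown size as bad is a choice made only for this example.

```java
import java.io.File;
import java.util.logging.Logger;

class DiskHealthLogSketch {
    private static final Logger LOG = Logger.getLogger("DiskHealthLogSketch");

    // Returns true when the used share of the disk exceeds the cutoff percentage.
    static boolean isOverUtilized(long totalBytes, long usableBytes, float cutoffPercent) {
        if (totalBytes <= 0) {
            return true; // size unknown or directory missing: treated as bad in this sketch
        }
        float usedPercent = 100f * (totalBytes - usableBytes) / totalBytes;
        return usedPercent > cutoffPercent;
    }

    // Checks one local dir and logs an error if it is bad; returns true if healthy.
    static boolean checkAndLog(File dir, float cutoffPercent) {
        boolean bad = isOverUtilized(dir.getTotalSpace(), dir.getUsableSpace(), cutoffPercent);
        if (bad) {
            LOG.severe("Node is unhealthy: local dir is bad: " + dir
                + " (utilization above " + cutoffPercent + "%)");
        }
        return !bad;
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("healthy = " + checkAndLog(dir, 98f));
    }
}
```

An error-level message like this in the NodeManager log would have pointed to the cause directly, without a visit to the web UI.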






[jira] [Created] (YARN-8164) Fix an Potential NPE

2018-04-15 Thread lujie (JIRA)
lujie created YARN-8164:
---

 Summary: Fix an Potential NPE 
 Key: YARN-8164
 URL: https://issues.apache.org/jira/browse/YARN-8164
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: lujie


We have developed a static analysis tool, 
[NPEDetector|https://github.com/lujiefsi/NPEDetector], to find potential 
NPEs. Our analysis shows that some callees may return null in corner cases (e.g. 
node crash, IO exception); some of their callers have a _!=null_ check but some 
do not.

The callee FairScheduler#getAppsInQueue can return null:
{code:java}
public List getAppsInQueue(String queueName) {
  FSQueue queue = queueMgr.getQueue(queueName);
  if (queue == null) {
    return null; // here
  }
}
{code}
It has four callers; three of them have a null check and one does not. In this 
issue we post a patch that adds a != null check modeled on the existing != null 
checks.
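
A caller-side check of this kind might look like the sketch below. It is not the patch itself: QueueAppsSketch, countApps, and the string-based queue and application names are hypothetical, and the callee is reduced to a map lookup that returns null for an unknown queue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueueAppsSketch {
    private final Map<String, List<String>> queues = new HashMap<>();

    void addApp(String queueName, String appId) {
        queues.computeIfAbsent(queueName, q -> new ArrayList<>()).add(appId);
    }

    // Like the callee in the report: returns null when the queue does not exist.
    List<String> getAppsInQueue(String queueName) {
        return queues.get(queueName);
    }

    // Caller with a != null check, so an unknown queue does not cause an NPE.
    int countApps(String queueName) {
        List<String> apps = getAppsInQueue(queueName);
        if (apps == null) {
            return 0;
        }
        return apps.size();
    }

    public static void main(String[] args) {
        QueueAppsSketch scheduler = new QueueAppsSketch();
        scheduler.addApp("root.a", "app_1");
        System.out.println(scheduler.countApps("root.a"));
        System.out.println(scheduler.countApps("root.missing"));
    }
}
```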






[jira] [Created] (YARN-7786) NullPointerException while launching ApplicationMaster

2018-01-22 Thread lujie (JIRA)
lujie created YARN-7786:
---

 Summary: NullPointerException while launching ApplicationMaster
 Key: YARN-7786
 URL: https://issues.apache.org/jira/browse/YARN-7786
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-beta1
Reporter: lujie
Assignee: lujie









[jira] [Created] (YARN-7785) Invalid event: CONTAINER_RESOURCES_CLEANEDUP at DONE

2018-01-22 Thread lujie (JIRA)
lujie created YARN-7785:
---

 Summary: Invalid event: CONTAINER_RESOURCES_CLEANEDUP at DONE
 Key: YARN-7785
 URL: https://issues.apache.org/jira/browse/YARN-7785
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-beta1, 2.8.0
 Environment: 
{code:java}
2017-11-25 18:51:30,234 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Can't handle this event at current state: Current: [DONE], eventType: 
[CONTAINER_RESOURCES_CLEANEDUP], container: 
[container_1511606993239_0001_01_11]
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
CONTAINER_RESOURCES_CLEANEDUP at DONE
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1704)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:96)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1490)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1483)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
{code}

Reporter: lujie
Assignee: lujie


Sending a kill command while a job is running causes an exception in the 
NodeManager (see the log in the Environment field above).







[jira] [Created] (YARN-7726) RMAppImpl: can't handle APP_ACCEPTED at state ACCEPTED

2018-01-09 Thread lujie (JIRA)
lujie created YARN-7726:
---

 Summary: RMAppImpl: can't handle APP_ACCEPTED at state ACCEPTED
 Key: YARN-7726
 URL: https://issues.apache.org/jira/browse/YARN-7726
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.8.3
Reporter: lujie
Priority: Minor


While adding a patch to TestRMAppTransitions, the patch triggered the error 
message "can't handle APP_ACCEPTED at state ACCEPTED" in the unit tests 
testAppAcceptedFailed and testAppRunningFailed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6950) Invalid event: LAUNCH_FAILED at FAILED

2017-12-15 Thread lujie (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie resolved YARN-6950.
-
Resolution: Duplicate

> Invalid event: LAUNCH_FAILED at FAILED
> --
>
> Key: YARN-6950
> URL: https://issues.apache.org/jira/browse/YARN-6950
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.6.0
>Reporter: lujie
> Fix For: 2.7.0
>
>
> An RMAppAttemptImpl fails for some reason; meanwhile the AM fails to launch a 
> container and sends the LAUNCH_FAILED event, which the StateMachine cannot 
> handle:
> {code:java}
> 2017-07-05 03:33:09,013 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> LAUNCH_FAILED at FAILED
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:834)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (YARN-7563) Invalid event: FINISH_APPLICATION at NEW

2017-11-27 Thread lujie (JIRA)
lujie created YARN-7563:
---

 Summary: Invalid event: FINISH_APPLICATION at NEW
 Key: YARN-7563
 URL: https://issues.apache.org/jira/browse/YARN-7563
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0-beta1
Reporter: lujie


I sent a kill command to an application; the NodeManager log shows:

{code:java}
2017-11-25 19:18:48,126 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 couldn't find container container_1511608703018_0001_01_01 while 
processing FINISH_CONTAINERS event
2017-11-25 19:18:48,146 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
FINISH_APPLICATION at NEW
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:627)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:75)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1508)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1501)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:745)
2017-11-25 19:18:48,151 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
 Application application_1511608703018_0001 transitioned from NEW to INITING
{code}
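
For readers unfamiliar with this class of error, the following is a minimal sketch of a table-driven state machine that rejects an event with no registered transition. It is not Hadoop's StateMachineFactory; the class ToyAppStateMachine, its states and events, and the use of IllegalStateException are all assumptions made for illustration. It only mirrors the pattern visible in the log above: a FINISH_APPLICATION event arrives while the application is still in NEW, is rejected, and the application then moves from NEW to INITING.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy transition table. Illustrative only; NOT Hadoop's StateMachineFactory. */
public class ToyAppStateMachine {
    enum State { NEW, INITING, FINISHING }
    enum Event { INIT_APPLICATION, FINISH_APPLICATION }

    // For each state, the events it accepts and the state each event leads to.
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    ToyAppStateMachine() {
        // Hypothetical transitions; note that NEW has no entry for FINISH_APPLICATION.
        addTransition(State.NEW, Event.INIT_APPLICATION, State.INITING);
        addTransition(State.INITING, Event.FINISH_APPLICATION, State.FINISHING);
    }

    private void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
    }

    /** Applies the event, or throws if the current state has no transition for it. */
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            // The state is left unchanged when the event is rejected.
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    State current() {
        return current;
    }

    public static void main(String[] args) {
        ToyAppStateMachine app = new ToyAppStateMachine();
        try {
            app.handle(Event.FINISH_APPLICATION); // the kill arrives before init
        } catch (IllegalStateException e) {
            System.out.println("Can't handle this event at current state: " + e.getMessage());
        }
        System.out.println("Transitioned from NEW to " + app.handle(Event.INIT_APPLICATION));
    }
}
```

In this sketch the rejected event is simply dropped, so the toy application keeps initializing even though a finish was requested. Whether that matches the real NodeManager behavior is not established by the log alone.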







[jira] [Created] (YARN-7176) After kill command is send, the job hangs

2017-09-07 Thread lujie (JIRA)
lujie created YARN-7176:
---

 Summary: After kill command is send, the job hangs 
 Key: YARN-7176
 URL: https://issues.apache.org/jira/browse/YARN-7176
 Project: Hadoop YARN
  Issue Type: Bug
  Components: RM
Affects Versions: 2.6.0
Reporter: lujie
Priority: Critical


I submitted a job but needed to kill it immediately. The job then hung.
I checked the logs and found an ArrayIndexOutOfBoundsException and a
NullPointerException in the RM log:

{code:java}
2017-09-08 02:34:37,967 INFO 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
launching appattempt_1504809243340_0001_01. Got exception: 
java.lang.ArrayIndexOutOfBoundsException: 3
at java.util.ArrayList.add(ArrayList.java:441)
at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:330)
at 
org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$Builder.addAllApplicationACLs(YarnProtos.java:39956)
at 
org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.addApplicationACLs(ContainerLaunchContextPBImpl.java:446)
at 
org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToBuilder(ContainerLaunchContextPBImpl.java:121)
at 
org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToProto(ContainerLaunchContextPBImpl.java:128)
at 
org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.getProto(ContainerLaunchContextPBImpl.java:70)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.convertToProtoFormat(StartContainerRequestPBImpl.java:156)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToBuilder(StartContainerRequestPBImpl.java:85)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToProto(StartContainerRequestPBImpl.java:95)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.getProto(StartContainerRequestPBImpl.java:57)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.convertToProtoFormat(StartContainersRequestPBImpl.java:137)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.addLocalRequestsToProto(StartContainersRequestPBImpl.java:97)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToBuilder(StartContainersRequestPBImpl.java:79)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToProto(StartContainersRequestPBImpl.java:72)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.getProto(StartContainersRequestPBImpl.java:48)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:93)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

2017-09-08 02:34:37,968 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
updating app: application_1504809243340_0001
java.lang.NullPointerException
at 
com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
at 
com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
at 
org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.getSerializedSize(YarnProtos.java:38512)
at 
com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
at 
com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
at 
org.apache.hadoop.yarn.proto.YarnProtos$ApplicationSubmissionContextProto.getSerializedSize(YarnProtos.java:28481)
at 
com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
at 
com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
at 
org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationStateDataProto.getSerializedSize(YarnServerResourceManagerRecoveryProtos.java:816)
at 
com.google.protobuf.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)
at 

[jira] [Created] (YARN-6950) Invalid event: LAUNCH_FAILED at FAILED

2017-08-04 Thread lujie (JIRA)
lujie created YARN-6950:
---

 Summary: Invalid event: LAUNCH_FAILED at FAILED
 Key: YARN-6950
 URL: https://issues.apache.org/jira/browse/YARN-6950
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.6.0
Reporter: lujie


An RMAppAttemptImpl fails for some reason; meanwhile the AM fails to launch a 
container and sends the LAUNCH_FAILED event, which the StateMachine cannot handle:

{code:java}
2017-07-05 03:33:09,013 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
LAUNCH_FAILED at FAILED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:834)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
{code}
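
One way reports of this shape are sometimes handled is to let a terminal state explicitly ignore events that can legitimately arrive late. The sketch below shows that idea on a toy lifecycle. It is an assumption-laden illustration, not the actual fix for this issue and not RMAppAttemptImpl: the class ToyAttempt, its states and events, and the IGNORED_WHEN_FAILED set are all hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy attempt lifecycle. Illustrative names only; NOT RMAppAttemptImpl. */
public class ToyAttempt {
    enum State { LAUNCHING, RUNNING, FAILED }
    enum Event { LAUNCHED, LAUNCH_FAILED, ATTEMPT_FAILED }

    // Hypothetical: launcher events that may still arrive after the attempt failed.
    private static final Set<Event> IGNORED_WHEN_FAILED =
        EnumSet.of(Event.LAUNCHED, Event.LAUNCH_FAILED);

    private State current = State.LAUNCHING;

    /** Applies the event; late launcher events in FAILED are accepted as no-ops. */
    State handle(Event event) {
        switch (current) {
            case LAUNCHING:
                if (event == Event.LAUNCHED) {
                    current = State.RUNNING;
                    return current;
                }
                if (event == Event.LAUNCH_FAILED || event == Event.ATTEMPT_FAILED) {
                    current = State.FAILED;
                    return current;
                }
                break;
            case RUNNING:
                if (event == Event.ATTEMPT_FAILED) {
                    current = State.FAILED;
                    return current;
                }
                break;
            case FAILED:
                if (IGNORED_WHEN_FAILED.contains(event)) {
                    return current; // self-transition: stay FAILED, no exception
                }
                break;
        }
        // Anything not matched above has no transition from the current state.
        throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }

    public static void main(String[] args) {
        ToyAttempt attempt = new ToyAttempt();
        attempt.handle(Event.ATTEMPT_FAILED);                   // attempt fails first
        System.out.println(attempt.handle(Event.LAUNCH_FAILED)); // late launcher event
    }
}
```

The trade-off in this design is that ignoring an event hides it from the log, so the ignored set should contain only events known to be harmless in that state.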







[jira] [Created] (YARN-6949) Invalid event: LOCALIZED at LOCALIZED

2017-08-04 Thread lujie (JIRA)
lujie created YARN-6949:
---

 Summary: Invalid event: LOCALIZED at LOCALIZED
 Key: YARN-6949
 URL: https://issues.apache.org/jira/browse/YARN-6949
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.8.0
Reporter: lujie


While a job was running, I stopped the NodeManager on one machine. 
When I then checked the logs to see the running state, I found many 
InvalidStateTransitionException entries:

{code:java}
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
LOCALIZATION_FAILED at LOCALIZED
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:198)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:194)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:58)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.processHeartbeat(ResourceLocalizationService.java:1058)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:720)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:355)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:48)
at 
org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:63)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
{code}







[jira] [Created] (YARN-6948) Invalid event: ATTEMPT_ADDED at FINAL_SAVING

2017-08-04 Thread lujie (JIRA)
lujie created YARN-6948:
---

 Summary: Invalid event: ATTEMPT_ADDED at FINAL_SAVING
 Key: YARN-6948
 URL: https://issues.apache.org/jira/browse/YARN-6948
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.8.0
Reporter: lujie


When I sent a kill command to a running job, I checked the logs and found this 
exception:

{code:java}
2017-08-03 01:35:20,485 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
ATTEMPT_ADDED at FINAL_SAVING
at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757)
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:834)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:815)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
{code}



