[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-10-10 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792121#comment-13792121
 ] 

Jian He commented on YARN-1058:
---

I believe this has been fixed, so it can be closed.

 Recovery issues on RM Restart with FileSystemRMStateStore
 -

 Key: YARN-1058
 URL: https://issues.apache.org/jira/browse/YARN-1058
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 App recovery doesn't work as expected using FileSystemRMStateStore.
 Steps to reproduce:
 - Ran sleep job with a single map and sleep time of 2 mins
 - Restarted RM while the map task is still running
 - The first attempt fails with the following error
 {noformat}
 Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Password not found for ApplicationAttempt appattempt_1376294441253_0001_01
   at org.apache.hadoop.ipc.Client.call(Client.java:1404)
   at org.apache.hadoop.ipc.Client.call(Client.java:1357)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at $Proxy28.finishApplicationMaster(Unknown Source)
   at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
 {noformat}
 - The second attempt fails with a different error:
 {noformat}
 Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hadoop-yarn/staging/kasha/.staging/job_1376294441253_0001/job_1376294441253_0001_2.jhist: File does not exist. Holder DFSClient_NONMAPREDUCE_389533538_1 does not have any open files.
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2454)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:534)
   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:48073)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-10-10 Thread Karthik Kambatla (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792179#comment-13792179
 ] 

Karthik Kambatla commented on YARN-1058:


I have also noticed that this was fixed in my testing of RM HA, but I haven't 
figured out what change has fixed this. [~jianhe], any idea which JIRA might 
have fixed this? 


[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-10-10 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792190#comment-13792190
 ] 

Jian He commented on YARN-1058:
---

YARN-1116 fixed the AMRMToken part; MAPREDUCE-5476 fixed the staging dir part.


[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-13 Thread Karthik Kambatla (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738934#comment-13738934
 ] 

Karthik Kambatla commented on YARN-1058:


I was expecting the first one, and Bikas is right about the second one.

When I kill the job client, the job does finish successfully. However, the AM 
for the recovered attempt fails to write the history. 
{noformat}
2013-08-13 13:57:32,440 ERROR [eventHandlingThread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[eventHandlingThread,5,main] threw an Exception.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hadoop-yarn/staging/kasha/.staging/job_1376427059607_0002/job_1376427059607_0002_2.jhist: File does not exist. Holder DFSClient_NONMAPREDUCE_416024880_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2737)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2543)
    ...
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2037)

    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler$1.run(JobHistoryEventHandler.java:276)
    at java.lang.Thread.run(Thread.java:662)
{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-13 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739013#comment-13739013
 ] 

Bikas Saha commented on YARN-1058:
--

It could be that the history service was not properly shut down in the first AM. 
Earlier, the AM would receive a proper reboot command from the RM and would 
shut down properly based on the reboot flag being set. Now the AM is getting an 
exception from the RM and so is not shutting down properly. This should get fixed 
when we refresh the AM RM token from the saved value.


[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-12 Thread Bikas Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737021#comment-13737021
 ] 

Bikas Saha commented on YARN-1058:
--

The first one is expected because the RM is currently not preserving AMRMTokens.
The second one may be because the job client deletes the staging dir when it 
thinks the job has failed after the first attempt fails. Can you try 
terminating the sleep job client after it has launched the job, so that it 
cannot take further action?
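
To illustrate the staging dir hypothesis, here is a minimal local-filesystem sketch in Java. It is only an analogy: it uses java.nio on a temp directory rather than HDFS, so it does not reproduce lease handling or the LeaseExpiredException itself. The class name, method names, and the file name are hypothetical and not taken from the Hadoop code base. It shows that a later writer cannot create its history file once another party has deleted the directory it expects to write into.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StagingDirRaceSketch {

    /** Tries to create a history file under stagingDir; returns false on any I/O failure. */
    public static boolean tryWriteHistory(Path stagingDir, String fileName) {
        Path history = stagingDir.resolve(fileName);
        try (OutputStream out = Files.newOutputStream(history)) {
            out.write("job history event\n".getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (IOException e) {
            // The parent directory is gone, so the file cannot be created.
            return false;
        }
    }

    /**
     * Simulates a second attempt writing history, optionally after a
     * (hypothetical) client cleanup has removed the staging directory.
     */
    public static boolean simulate(boolean clientCleansUp) {
        try {
            Path staging = Files.createTempDirectory("staging-sketch");
            if (clientCleansUp) {
                // Stand-in for the job client deleting the staging dir
                // because it believes the job has failed.
                Files.delete(staging);
            }
            boolean written = tryWriteHistory(staging, "job_0001_2.jhist");
            // Remove whatever the sketch left behind.
            Files.deleteIfExists(staging.resolve("job_0001_2.jhist"));
            Files.deleteIfExists(staging);
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("write after client cleanup succeeded: " + simulate(true));
        System.out.println("write without client cleanup succeeded: " + simulate(false));
    }
}
```

Terminating the job client right after submission corresponds to calling simulate(false): nothing removes the directory, so the second attempt can write.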


[jira] [Commented] (YARN-1058) Recovery issues on RM Restart with FileSystemRMStateStore

2013-08-12 Thread Jian He (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737100#comment-13737100
 ] 

Jian He commented on YARN-1058:
---

As Bikas said, the first exception is expected: although AMRMTokens are 
currently stored along with AppAttemptState, they are not yet populated back 
into AMRMTokenSecretManager when the RM comes back. The MR AM currently handles 
this exception by simply ignoring it (MAPREDUCE-5436), so the AM process will 
hang and wait to be killed by the NM instead of rebooting itself.
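
The gap described above can be sketched with a toy model in Java. This is not the RM code: StateStore, SecretManager, restart, and the attempt id used in main are hypothetical stand-ins, written only to show why a password that was persisted can still be "not found" after a restart if nothing copies it back into the in-memory manager.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenRecoverySketch {

    /** Hypothetical stand-in for the state persisted with each app attempt. */
    public static class StateStore {
        private final Map<String, byte[]> savedPasswords = new HashMap<>();

        public void save(String attemptId, byte[] password) {
            savedPasswords.put(attemptId, password.clone());
        }

        public Map<String, byte[]> load() {
            return new HashMap<>(savedPasswords);
        }
    }

    /** Hypothetical stand-in for the RM's in-memory token secret manager. */
    public static class SecretManager {
        private final Map<String, byte[]> passwords = new HashMap<>();

        public void register(String attemptId, byte[] password) {
            passwords.put(attemptId, password.clone());
        }

        public boolean knows(String attemptId) {
            return passwords.containsKey(attemptId);
        }
    }

    /**
     * Models an RM restart: the in-memory manager always starts empty.
     * Only when repopulate is true are the saved passwords copied back in.
     */
    public static SecretManager restart(StateStore store, boolean repopulate) {
        SecretManager fresh = new SecretManager();
        if (repopulate) {
            store.load().forEach(fresh::register);
        }
        return fresh;
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();
        store.save("attempt_1", new byte[] {42});

        System.out.println("known without repopulation: "
                + restart(store, false).knows("attempt_1"));
        System.out.println("known with repopulation: "
                + restart(store, true).knows("attempt_1"));
    }
}
```

In this model, restart(store, false) corresponds to the behaviour reported in the issue, and restart(store, true) to the behaviour expected once the saved tokens are loaded back on recovery.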

