[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And

2015-02-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311512#comment-14311512
 ] 

Rohith commented on YARN-933:
-

I do not see any possibility of *Invalid event: LAUNCH_FAILED at FAILED* in 
the trunk version. But I see a potential problem of *Invalid event: LAUNCHED at 
FINAL_SAVING* or *Invalid event: LAUNCH_FAILED at FINAL_SAVING* when a 
LAUNCHED or LAUNCH_FAILED event is triggered after the transition 
ALLOCATED --(kill event)--> FINAL_SAVING.
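
The race described above can be reproduced in a toy state machine. The sketch 
below is not the RMAppAttemptImpl transition table and not the attached patch. 
The class name FinalSavingSketch, the reduced set of states and events, and 
the tolerateLateLaunchEvents flag are all assumptions made for illustration. 
It only shows why a late LAUNCHED or LAUNCH_FAILED event is rejected once the 
attempt has moved from ALLOCATED to FINAL_SAVING, and how registering 
self-transitions for those events would be one possible way to absorb them.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy model only: a hand-picked subset of states and events (hypothetical). */
public class FinalSavingSketch {
    enum State { ALLOCATED, LAUNCHED, FINAL_SAVING }
    enum Event { KILL, LAUNCHED, LAUNCH_FAILED }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.ALLOCATED;

    FinalSavingSketch(boolean tolerateLateLaunchEvents) {
        register(State.ALLOCATED, Event.KILL, State.FINAL_SAVING);
        register(State.ALLOCATED, Event.LAUNCHED, State.LAUNCHED);
        register(State.ALLOCATED, Event.LAUNCH_FAILED, State.FINAL_SAVING);
        if (tolerateLateLaunchEvents) {
            // Assumed remedy: treat late launcher events as no-ops.
            register(State.FINAL_SAVING, Event.LAUNCHED, State.FINAL_SAVING);
            register(State.FINAL_SAVING, Event.LAUNCH_FAILED, State.FINAL_SAVING);
        }
    }

    private void register(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    /** Applies one event, or throws if no transition is registered for it. */
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    /** Kills an ALLOCATED attempt, then delivers a late launcher event. */
    static String replay(boolean tolerateLateLaunchEvents, Event lateEvent) {
        FinalSavingSketch attempt = new FinalSavingSketch(tolerateLateLaunchEvents);
        attempt.handle(Event.KILL); // ALLOCATED -> FINAL_SAVING
        try {
            return attempt.handle(lateEvent).toString();
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay(false, Event.LAUNCHED));      // Invalid event: LAUNCHED at FINAL_SAVING
        System.out.println(replay(false, Event.LAUNCH_FAILED)); // Invalid event: LAUNCH_FAILED at FINAL_SAVING
        System.out.println(replay(true, Event.LAUNCHED));       // FINAL_SAVING
    }
}
```

With the flag off, both late events fail with the same message shape as in the 
RM log. With it on, the attempt simply stays in FINAL_SAVING.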

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
Assignee: Rohith
 Attachments: YARN-933.patch


 AM max retries is configured as 3 on both the client and the RM side.
 Step 1: Install a cluster with NMs on 2 machines.
 Step 2: Make ping by IP from the RM machine to the NM1 machine succeed, but 
 make ping by hostname fail.
 Step 3: Execute a job.
 Step 4: After the AM [ AppAttempt_1 ] is allocated to the NM1 machine, the 
 connection is lost.
 Observation:
 ==
 After AppAttempt_1 moves to the failed state, the release of its container 
 and the application removal succeed. A new AppAttempt_2 is spawned.
 1. Then a retry for AppAttempt_1 happens again.
 2. The RM side tries to launch AppAttempt_1 again, which fails with 
 InvalidStateTransitonException.
 3. The client exits after AppAttempt_1 finishes [ although the job is 
 still running ], even though 3 app attempts are configured and the remaining 
 app attempts are all spawned and running.
 RMLogs:
 ==
 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED
 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45
 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs
 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED
 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02
 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED
 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35
 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 
 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED
 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45
 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45
 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. Got exception: java.lang.reflect.UndeclaredThrowableException
 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED
  at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
  at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
  at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
  at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
 

[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And

2015-02-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311814#comment-14311814
 ] 

Hadoop QA commented on YARN-933:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12697353/0001-YARN-933.patch
  against trunk revision 1382ae5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6552//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6552//console

This message is automatically generated.

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
Assignee: Rohith
 Attachments: 0001-YARN-933.patch, YARN-933.patch



[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And

2015-02-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311762#comment-14311762
 ] 

Rohith commented on YARN-933:
-

[~jianhe] Could you have a look at this issue, please?
The flow reported in the description has since changed. But since there is a 
potential issue in the FINAL_SAVING state, I have attached the patch to this 
JIRA instead of creating a new one. Is that fine?
Also, HadoopQA has not run since last night. I am not sure whether there was a 
problem.

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
Assignee: Rohith
 Attachments: 0001-YARN-933.patch, YARN-933.patch



[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And

2015-02-08 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311793#comment-14311793
 ] 

Jian He commented on YARN-933:
--

Sure. Mind editing the title to reflect the problem?
Nit: {{// verify for both launched and launch_failed transitions in failed}}, 
the comment should say "in killed state".

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
Assignee: Rohith
 Attachments: 0001-YARN-933.patch, YARN-933.patch



[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And

2013-07-22 Thread rohithsharma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715145#comment-13715145
 ] 

rohithsharma commented on YARN-933:
---

If the container expiry happens before the app master is launched or fails to 
launch at the NodeManager (because the IPC connection retry time is greater 
than the container expiry interval), then the RM app attempt transitions to the 
FAILED state. In the FAILED state of the RM app attempt, the LAUNCHED and 
LAUNCH_FAILED events are not defined, which in turn causes 
InvalidStateTransitonException.
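
The timing claim can be checked roughly against the RM log quoted in the issue 
description. The sketch below is only an estimate under stated assumptions: it 
takes the timestamps of the "Already tried 38" and "Already tried 44" lines, 
assumes the IPC retries are evenly spaced, and multiplies the per-retry gap by 
maxRetries=45 to approximate the total retry window. The class and method 
names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Back-of-the-envelope timing from the RM log; names are illustrative only. */
public class LaunchTimingSketch {
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static LocalDateTime ts(String s) {
        return LocalDateTime.parse(s, LOG_TS);
    }

    /** Average gap between retries, from "Already tried 38" to "Already tried 44". */
    static Duration perRetry() {
        Duration sixRetries = Duration.between(
                ts("2013-07-17 16:36:30,179"), ts("2013-07-17 16:38:36,203"));
        return sixRetries.dividedBy(6);
    }

    /** Assumption: all 45 retries (maxRetries=45) are spaced like the sampled ones. */
    static Duration estimatedRetryWindow() {
        return perRetry().multipliedBy(45);
    }

    /** From the log line "Timed out after 600 secs". */
    static Duration expiryInterval() {
        return Duration.ofSeconds(600);
    }

    /** Gap between the container expiry and the launcher's "Error launching" line. */
    static Duration launchFailureLag() {
        return Duration.between(
                ts("2013-07-17 16:36:07,091"), ts("2013-07-17 16:38:56,207"));
    }

    public static void main(String[] args) {
        System.out.println("per retry (ms):          " + perRetry().toMillis());
        System.out.println("est. retry window (ms):  " + estimatedRetryWindow().toMillis());
        System.out.println("expiry interval (ms):    " + expiryInterval().toMillis());
        System.out.println("launch failure lag (ms): " + launchFailureLag().toMillis());
    }
}
```

Under these assumptions the retry window comes out at roughly 945 seconds 
against a 600 second expiry, and the launcher reports its failure about 169 
seconds after the container has already expired, which is when the 
LAUNCH_FAILED event reaches an attempt that is already FAILED.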

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina


[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And

2013-07-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715166#comment-13715166
 ] 

Hadoop QA commented on YARN-933:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12593496/YARN-933.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1544//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1544//console

This message is automatically generated.

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
 Attachments: YARN-933.patch

