[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311512#comment-14311512 ] Rohith commented on YARN-933: - I do not see any possibility of *Invalid event: LAUNCH_FAILED at FAILED* in trunk version. But I see potential problem of *Invalid event: LAUNCHED at FINAL_SAVING* or *Invalid event: LAUNCH_FAILED at FINAL_SAVING* when launched/launch_failed event are triggered after transition ALLOCATED---kill event---FINAL_SAVING. After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over -- Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina Assignee: Rohith Attachments: YARN-933.patch am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. Got exception: java.lang.reflect.UndeclaredThrowableException 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311814#comment-14311814 ] Hadoop QA commented on YARN-933: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12697353/0001-YARN-933.patch against trunk revision 1382ae5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6552//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6552//console This message is automatically generated. After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over -- Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina Assignee: Rohith Attachments: 0001-YARN-933.patch, YARN-933.patch am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38
[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311762#comment-14311762 ] Rohith commented on YARN-933: - [~jianhe] Could you have look at this issue please? The issue reported in descriiption flow has been changed and which . But since there is potential issue at final_saving state. I have attached patch to this jira insted of creating new jira. Is it fine ? And HadoopQA does not run since yesterday night. I am not sure is there was any problem? After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over -- Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina Assignee: Rohith Attachments: 0001-YARN-933.patch, YARN-933.patch am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. Got exception: java.lang.reflect.UndeclaredThrowableException 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at
[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14311793#comment-14311793 ] Jian He commented on YARN-933: -- sure, mind editing the title to reflect the problem ? nit: {{// verify for both launched and launch_failed transitions in failed}}, the comment should say in killed state. After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over -- Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina Assignee: Rohith Attachments: 0001-YARN-933.patch, YARN-933.patch am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. Got exception: java.lang.reflect.UndeclaredThrowableException 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630) at
[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715145#comment-13715145 ] rohithsharma commented on YARN-933: --- If the continer expiry happens before app master is launched/failed to launch at nodemanager ( because of ipc connection retry time is greater then container expiry interval ) , then rm app attempt is transitioned to FAILED state. At rm app attempt FAILED state, LAUNCHED or LAUNCH_FAILED events are not defined which intern causes InvalidStateTransitonException. After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over -- Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45 2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. Got exception: java.lang.reflect.UndeclaredThrowableException 2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at
[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715166#comment-13715166 ] Hadoop QA commented on YARN-933: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12593496/YARN-933.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/1544//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1544//console This message is automatically generated. After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over -- Key: YARN-933 URL: https://issues.apache.org/jira/browse/YARN-933 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: J.Andreina Attachments: YARN-933.patch am max retries configured as 3 at client and RM side. Step 1: Install cluster with NM on 2 Machines Step 2: Make Ping using ip from RM machine to NM1 machine as successful ,But using Hostname should fail Step 3: Execute a job Step 4: After AM [ AppAttempt_1 ] allocation to NM1 machine is done , connection loss happened. Observation : == After AppAttempt_1 has moved to failed state ,release of container for AppAttempt_1 and Application removal are successful. New AppAttempt_2 is sponed. 1. Then again retry for AppAttempt_1 happens. 2. Again RM side it is trying to launch AppAttempt_1, hence fails with InvalidStateTransitonException 3. Client got exited after AppAttempt_1 is been finished [But actually job is still running ], while the appattempts configured is 3 and rest appattempts are all sponed and running. RMLogs: == 2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45 2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED 2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED 2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35 2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02, 2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: