[jira] [Commented] (YARN-403) Node Manager throws java.io.IOException: Verification of the hashReply failed
[ https://issues.apache.org/jira/browse/YARN-403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724186#comment-13724186 ]

rohithsharma commented on YARN-403:
---

Hash verification is an authentication step between the Reducer and the NodeManager (NM) for the HTTP request/response used to fetch map output from NodeManagers.

One specific scenario in which I have observed the above exception:
bq. The app master is killed after the map phase has completed. The 2nd app master attempt starts the reduce phase, in which the reducers request the map output.

In the above scenario:
1. The 1st-attempt AM has a SecretKey. This key is sent to the NodeManager when the container (Map Task) is started. The NodeManager stores it in memory, i.e. in *ShuffleHandler.secretManager*, and removes it only after the application has finished. After the Map Task completes, the app master is killed.
2. When the 2nd-attempt app master starts, a new SecretKey is generated. This SecretKey is sent along with the startContainer request to the NodeManager, and the Reducer JVM is started by the NM. The Fetcher at the reducer hashes the request URL with the new SecretKey and sends the fetch request to the NodeManager where the Map Task ran (and where the old SecretKey is still present). At that NodeManager, the hash is verified against the old 1st-attempt SecretKey. The verification fails because the keys are different.
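The key mismatch described in the comment above can be reproduced in a few lines. This is only a sketch, not Hadoop code: it assumes the request hash is an HMAC-SHA1 over the request URL, and the key values, the application id, the URL and the function names (sign_url, verify_url, fetch_fails) are all made up for illustration.

```python
import hashlib
import hmac


def sign_url(secret_key: bytes, url: str) -> str:
    """Return a hex HMAC-SHA1 of the URL (assumed scheme, for illustration only)."""
    return hmac.new(secret_key, url.encode("utf-8"), hashlib.sha1).hexdigest()


def verify_url(secret_key: bytes, url: str, url_hash: str) -> None:
    """Raise IOError when the hash was not produced with secret_key."""
    expected = sign_url(secret_key, url)
    if not hmac.compare_digest(expected, url_hash):
        raise IOError("Verification of the hashReply failed")


# NodeManager side: the key registered when the 1st attempt started the map container.
# It stays in memory until the application finishes (hypothetical values).
nm_secret_by_app = {"application_0001": b"key-from-am-attempt-1"}

# Reducer side: the 2nd AM attempt generated a new key and passed it to the reducer.
reducer_key = b"key-from-am-attempt-2"

url = "/mapOutput?job=job_0001&reduce=0&map=attempt_0001_m_000000_0"
request_hash = sign_url(reducer_key, url)


def fetch_fails() -> bool:
    """Return True when the NodeManager rejects the reducer's fetch request."""
    try:
        verify_url(nm_secret_by_app["application_0001"], url, request_hash)
    except IOError:
        return True
    return False
```

Under these assumptions fetch_fails() returns True for as long as the NodeManager keeps the 1st-attempt key, which matches the behaviour reported in the comment.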
Node Manager throws java.io.IOException: Verification of the hashReply failed
-
Key: YARN-403
URL: https://issues.apache.org/jira/browse/YARN-403
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.2-alpha, 0.23.6
Reporter: Devaraj K
Assignee: Omkar Vinit Joshi

{code:xml}
2013-02-09 22:59:47,490 WARN org.apache.hadoop.mapred.ShuffleHandler: Shuffle failure
java.io.IOException: Verification of the hashReply failed
    at org.apache.hadoop.mapreduce.security.SecureShuffleUtils.verifyReply(SecureShuffleUtils.java:98)
    at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.verifyRequest(ShuffleHandler.java:436)
    at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:383)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
    at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:148)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
    at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:116)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndfireMessageReceived(ReplayingDecoder.java:522)
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:506)
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:443)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:540)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
    at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at
[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715145#comment-13715145 ]

rohithsharma commented on YARN-933:
---

If the container expires before the app master is launched, or fails to launch, at the NodeManager (because the IPC connection retry time is greater than the container expiry interval), the RM app attempt transitions to the FAILED state. In the FAILED state, the RM app attempt defines no transitions for the LAUNCHED or LAUNCH_FAILED events, which in turn causes an InvalidStateTransitonException.

After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over
--
Key: YARN-933
URL: https://issues.apache.org/jira/browse/YARN-933
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina

AM max retries is configured as 3 at both the client and the RM side.

Step 1: Install a cluster with NMs on 2 machines.
Step 2: Ensure that ping by IP from the RM machine to the NM1 machine succeeds, but ping by hostname fails.
Step 3: Execute a job.
Step 4: After the AM [ AppAttempt_1 ] has been allocated to the NM1 machine, the connection is lost.

Observation:
==
After AppAttempt_1 has moved to the failed state, the release of the container for AppAttempt_1 and the application removal are successful. A new AppAttempt_2 is spawned.
1. Then a retry for AppAttempt_1 happens again.
2. The RM side again tries to launch AppAttempt_1, and hence fails with InvalidStateTransitonException.
3. The client exited after AppAttempt_1 finished [but the job is actually still running], even though the number of app attempts configured is 3 and the remaining app attempts are all spawned and running.
RMLogs:
==
2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED
2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45
2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs
2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED
2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02
2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED
2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35
2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02,
2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED
2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45
2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45
2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. Got exception: java.lang.reflect.UndeclaredThrowableException
2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
    at
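The failure mode named in the comment (an event arriving in a state that defines no transition for it) can be illustrated with a small table-driven state machine. This is a simplified sketch, not the RMAppAttemptImpl state machine: the CONTAINER_EXPIRED event name, the class names and the two "ignore" transitions are assumptions made for illustration, and the extra transitions are only one plausible remedy, not necessarily what the attached patch does.

```python
class InvalidStateTransitionError(Exception):
    """Raised when no transition is registered for (state, event)."""


class AttemptStateMachine:
    def __init__(self, transitions, state):
        self.transitions = dict(transitions)
        self.state = state

    def handle(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            raise InvalidStateTransitionError(
                f"Invalid event: {event} at {self.state}")
        self.state = self.transitions[key]
        return self.state


# Simplified transition table; CONTAINER_EXPIRED is a hypothetical event name.
BASE_TRANSITIONS = {
    ("ALLOCATED", "LAUNCHED"): "LAUNCHED",
    ("ALLOCATED", "LAUNCH_FAILED"): "FAILED",
    ("ALLOCATED", "CONTAINER_EXPIRED"): "FAILED",
}

# Possible remedy (assumption): treat late launcher events as no-ops once FAILED.
IGNORE_LATE_EVENTS = {
    ("FAILED", "LAUNCHED"): "FAILED",
    ("FAILED", "LAUNCH_FAILED"): "FAILED",
}


def replay(transitions):
    """Container expires first, then the slow launcher reports LAUNCH_FAILED."""
    attempt = AttemptStateMachine(transitions, "ALLOCATED")
    attempt.handle("CONTAINER_EXPIRED")
    try:
        return attempt.handle("LAUNCH_FAILED")
    except InvalidStateTransitionError as exc:
        return str(exc)
```

Calling replay(BASE_TRANSITIONS) yields the same message as the RM log ("Invalid event: LAUNCH_FAILED at FAILED"), while replay({**BASE_TRANSITIONS, **IGNORE_LATE_EVENTS}) leaves the attempt in FAILED without an exception.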
[jira] [Updated] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And cl
[ https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rohithsharma updated YARN-933:
--
Attachment: YARN-933.patch

Attached the patch to fix this issue. Please review it.

After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And client exited before appattempt retries got over
--
Key: YARN-933
URL: https://issues.apache.org/jira/browse/YARN-933
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
Attachments: YARN-933.patch

AM max retries is configured as 3 at both the client and the RM side.

Step 1: Install a cluster with NMs on 2 machines.
Step 2: Ensure that ping by IP from the RM machine to the NM1 machine succeeds, but ping by hostname fails.
Step 3: Execute a job.
Step 4: After the AM [ AppAttempt_1 ] has been allocated to the NM1 machine, the connection is lost.

Observation:
==
After AppAttempt_1 has moved to the failed state, the release of the container for AppAttempt_1 and the application removal are successful. A new AppAttempt_2 is spawned.
1. Then a retry for AppAttempt_1 happens again.
2. The RM side again tries to launch AppAttempt_1, and hence fails with InvalidStateTransitonException.
3. The client exited after AppAttempt_1 finished [but the job is actually still running], even though the number of app attempts configured is 3 and the remaining app attempts are all spawned and running.

RMLogs:
==
2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED
2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45
2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_01 Timed out after 600 secs
2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED to EXPIRED
2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_02
2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_01 is done. finalState=FAILED
2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35
2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_02,
2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED
2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45
2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45
2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_01. Got exception: java.lang.reflect.UndeclaredThrowableException
2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
    at
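The timestamps in the RM log quoted above are enough to check the timing argument (IPC retry window longer than the 600-second container expiry). The sketch below is a rough estimate only: it assumes the retries are evenly spaced, and it takes the two "Already tried" entries (36 and 44) and maxRetries=45 straight from the log; the variable names are made up.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


# Values copied from the RM log entries.
retry_36_at = "2013-07-17 16:35:48,171"
retry_44_at = "2013-07-17 16:38:36,203"
max_retries = 45
container_expiry_secs = 600

# Average spacing of the IPC retries, assuming they are evenly spaced.
secs_per_retry = seconds_between(retry_36_at, retry_44_at) / (44 - 36)

# Approximate time the launcher spends retrying before it gives up.
retry_window_secs = secs_per_retry * max_retries

launcher_outlives_container = retry_window_secs > container_expiry_secs
```

With these log values the retries are about 21 seconds apart, so the launcher keeps retrying for roughly 945 seconds, well past the 600-second expiry. That is consistent with the LAUNCH_FAILED event arriving after the attempt had already been marked FAILED.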
[jira] [Created] (YARN-929) 2 MRAppMaster spawned for same Job Id
rohithsharma created YARN-929:
-
Summary: 2 MRAppMaster spawned for same Job Id
Key: YARN-929
URL: https://issues.apache.org/jira/browse/YARN-929
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: rohithsharma

Configuration: yarn.resourcemanager.am.max-retries = 3

Scenario: the NodeManager is killed forcefully, i.e. using kill -9 NM_PID. After node expiry, the RM kills all the containers running on this NodeManager, but the MRAppMaster JVM is still running. The RM spawns the 2nd-attempt MRAppMaster, since AM retry is configured as 3.

The problem with running 2 MRAppMasters is that the 1st-attempt app master deletes the job information from HDFS, which causes a FileNotFoundException for the 2nd-attempt MRAppMaster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-929) 2 MRAppMaster running parallely for same Job Id
[ https://issues.apache.org/jira/browse/YARN-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rohithsharma updated YARN-929:
--
Summary: 2 MRAppMaster running parallely for same Job Id (was: 2 MRAppMaster spawned for same Job Id)

2 MRAppMaster running parallely for same Job Id
---
Key: YARN-929
URL: https://issues.apache.org/jira/browse/YARN-929
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: rohithsharma

Configuration: yarn.resourcemanager.am.max-retries = 3

Scenario: the NodeManager is killed forcefully, i.e. using kill -9 NM_PID. After node expiry, the RM kills all the containers running on this NodeManager, but the MRAppMaster JVM is still running. The RM spawns the 2nd-attempt MRAppMaster, since AM retry is configured as 3. At this point, 2 MRAppMasters are running in parallel for the same job id.

The problem with running 2 MRAppMasters is that the 1st-attempt app master deletes the job information from HDFS, which causes a FileNotFoundException for the 2nd-attempt MRAppMaster.
[jira] [Created] (YARN-907) MRAppMaster failed to initialize second attempt when first attempt is FAILED.
rohithsharma created YARN-907:
-
Summary: MRAppMaster failed to initialize second attempt when first attempt is FAILED.
Key: YARN-907
URL: https://issues.apache.org/jira/browse/YARN-907
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.0.5-alpha
Environment: SuSE Linux, HDFS HA cluster.
Reporter: rohithsharma

Configuration: yarn.resourcemanager.am.max-retries = 3

It is observed that:
1. The MRAppMaster failed to start its services and exited, running its shutdown hook. As part of the shutdown hook execution, the staging directory is deleted from HDFS.
2. A new attempt is spawned by the ResourceManager, but the second attempt fails with a FileNotFoundException because the staging directory was deleted by the first attempt's shutdown hook.
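The two observations above can be mimicked on a local filesystem. This is a sketch under stated assumptions, not MRAppMaster code: a temporary directory stands in for the HDFS staging directory, job.xml for the job information, and the guard_cleanup flag shows one hypothetical remedy (clean up only on success or on the last attempt), which is not confirmed as the actual fix.

```python
import shutil
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 3  # mirrors yarn.resourcemanager.am.max-retries = 3


def run_attempt(staging_dir, attempt, fail_startup, guard_cleanup):
    """Run one app master attempt; the finally block plays the shutdown hook."""
    succeeded = False
    try:
        # Raises FileNotFoundError if an earlier attempt deleted the directory.
        (staging_dir / "job.xml").read_text()
        if fail_startup:
            raise RuntimeError("failed to start services")
        succeeded = True
    finally:
        last_attempt = attempt >= MAX_ATTEMPTS
        # Unguarded: always delete. Guarded (assumption): keep the directory
        # while further attempts may still need it.
        if not guard_cleanup or succeeded or last_attempt:
            shutil.rmtree(staging_dir, ignore_errors=True)


def simulate(guard_cleanup):
    """First attempt fails at startup, second attempt should be able to run."""
    root = Path(tempfile.mkdtemp())
    staging_dir = root / "staging"
    staging_dir.mkdir()
    (staging_dir / "job.xml").write_text("<configuration/>")
    outcomes = []
    for attempt, fail_startup in [(1, True), (2, False)]:
        try:
            run_attempt(staging_dir, attempt, fail_startup, guard_cleanup)
            outcomes.append("SUCCEEDED")
        except FileNotFoundError:
            outcomes.append("FileNotFoundError")
        except RuntimeError:
            outcomes.append("FAILED")
    shutil.rmtree(root, ignore_errors=True)
    return outcomes
```

simulate(False) reproduces the reported sequence (first attempt FAILED, second attempt FileNotFoundError), while simulate(True) lets the second attempt find its job files.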
[jira] [Commented] (YARN-901) Active users field in Resourcemanager scheduler UI gives negative values
[ https://issues.apache.org/jira/browse/YARN-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700445#comment-13700445 ]

rohithsharma commented on YARN-901:
---

Active users shows a negative value when the RM is restarted. The active-user value is calculated on the APP_ADDED event and recalculated on the APP_REMOVED event. If the RM is restarted after a job has been submitted, the calculation leads to a negative value. The problem is that the user info of each queue is stored in memory and is reset during RM startup.

Active users field in Resourcemanager scheduler UI gives negative values
--
Key: YARN-901
URL: https://issues.apache.org/jira/browse/YARN-901
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Nishan Shetty
Priority: Minor

Active users field in Resourcemanager scheduler UI gives negative values on Resourcemanager restart when job is in progress
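The counter behaviour described in the comment can be sketched as follows. This is a deliberately naive model, not the scheduler's code: the QueueUserInfo class and its methods are hypothetical, and the sketch assumes that after the restart the running application is not re-added to the in-memory state while its removal event is still processed.

```python
class QueueUserInfo:
    """In-memory per-queue user info; a new instance models an RM restart."""

    def __init__(self):
        self.active_users = 0

    def app_added(self):
        # APP_ADDED: the submitting user becomes active.
        self.active_users += 1

    def app_removed(self):
        # APP_REMOVED: recalculated without checking what was added before.
        self.active_users -= 1


def active_users_after_restart():
    queue = QueueUserInfo()
    queue.app_added()          # job submitted before the restart
    queue = QueueUserInfo()    # RM restart: in-memory state is reset to 0
    queue.app_removed()        # job finishes after the restart
    return queue.active_users
```

Under these assumptions active_users_after_restart() returns -1, the kind of negative value reported for the scheduler UI.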