[jira] [Commented] (YARN-403) Node Manager throws java.io.IOException: Verification of the hashReply failed

2013-07-30 Thread rohithsharma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724186#comment-13724186
 ] 

rohithsharma commented on YARN-403:
---

Hash verification is an authentication step at the Reducer/NM for the HTTP 
request/response, i.e. when fetching map output from NodeManagers. 

One specific scenario where I have observed the above exception is: 
 bq. The app master is killed after the map phase is completed. The 2nd app master 
attempt starts the reduce phase, in which the reducers send requests to fetch map 
output. 

In the above scenario, 
1. The 1st attempt AM has a SecretKey. This key is sent to the NodeManager during 
container start (Map Task). The NodeManager stores it in memory, i.e. in 
*ShuffleHandler.secretManager*, and it is removed only after the application is 
finished. 

After the Map Task is completed, the app master is killed.

2. When the 2nd attempt app master starts, a new SecretKey is generated. This 
SecretKey is sent along with the startContainer request to the NodeManager, and the 
Reducer JVM is started by the NM. The Fetcher in the reducer hashes (signs) the 
request URL using the new SecretKey and sends the fetch request to the NodeManager 
where the Map Task ran (where the old SecretKey is still present). 
At that NodeManager, the hash is verified against the old 1st attempt app master 
SecretKey. This verification fails since the keys are different.
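
To make the key mismatch concrete, below is a minimal, self-contained sketch using plain javax.crypto. It is not the actual SecureShuffleUtils/ShuffleHandler code, and the class, variable names and URL are illustrative only: the reducer signs the fetch URL with the 2nd attempt's secret, the NodeManager recomputes the hash with the 1st attempt's secret it still holds, and the two values never match.

{code:java}
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ShuffleHashMismatchSketch {

  // HMAC-SHA1 over the request URL with the given shuffle secret.
  static String hash(String url, byte[] secret) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(new SecretKeySpec(secret, "HmacSHA1"));
    return Base64.getEncoder()
        .encodeToString(mac.doFinal(url.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    String fetchUrl = "/mapOutput?job=job_x&reduce=0&map=attempt_x_m_000000_0";

    byte[] firstAttemptKey  = "secret-of-AM-attempt-1".getBytes(StandardCharsets.UTF_8);
    byte[] secondAttemptKey = "secret-of-AM-attempt-2".getBytes(StandardCharsets.UTF_8);

    // The reducer launched by AM attempt 2 signs the URL with the new key ...
    String replyFromReducer = hash(fetchUrl, secondAttemptKey);

    // ... but the NM where the map ran still verifies with attempt 1's key.
    String expectedAtNM = hash(fetchUrl, firstAttemptKey);

    if (!expectedAtNM.equals(replyFromReducer)) {
      // This mismatch is what surfaces as "Verification of the hashReply failed".
      System.out.println("Hash mismatch -> IOException in ShuffleHandler.verifyRequest()");
    }
  }
}
{code}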

 Node Manager throws java.io.IOException: Verification of the hashReply failed
 -

 Key: YARN-403
 URL: https://issues.apache.org/jira/browse/YARN-403
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.2-alpha, 0.23.6
Reporter: Devaraj K
Assignee: Omkar Vinit Joshi

 {code:xml}
 2013-02-09 22:59:47,490 WARN org.apache.hadoop.mapred.ShuffleHandler: Shuffle failure 
 java.io.IOException: Verification of the hashReply failed
   at org.apache.hadoop.mapreduce.security.SecureShuffleUtils.verifyReply(SecureShuffleUtils.java:98)
   at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.verifyRequest(ShuffleHandler.java:436)
   at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:383)
   at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
   at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
   at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
   at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:148)
   at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
   at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
   at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:116)
   at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
   at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
   at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
   at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
   at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndfireMessageReceived(ReplayingDecoder.java:522)
   at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:506)
   at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:443)
   at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
   at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
   at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:540)
   at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
   at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
   at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
   at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 

[jira] [Commented] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And

2013-07-22 Thread rohithsharma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715145#comment-13715145
 ] 

rohithsharma commented on YARN-933:
---

If the container expiry happens before the app master is launched (or fails to launch) 
at the NodeManager (because the IPC connection retry time is greater than the container 
expiry interval), then the RM app attempt transitions to the FAILED state. In the RM 
app attempt's FAILED state, the LAUNCHED and LAUNCH_FAILED events are not defined, 
which in turn causes an InvalidStateTransitonException.
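
Below is a minimal, generic sketch of the failure mode, using plain Java enums rather than Hadoop's StateMachineFactory; it is not the attached patch, and the state and event names are simplified. It shows why an event with no registered transition at the current state throws, and where registering LAUNCHED/LAUNCH_FAILED at FAILED as ignorable no-op transitions would avoid the exception.

{code:java}
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class AttemptStateSketch {

  enum State { ALLOCATED, LAUNCHED, FAILED }
  enum Event { LAUNCHED, LAUNCH_FAILED, CONTAINER_EXPIRED }

  // Events each state has a registered transition for (simplified).
  static final Map<State, EnumSet<Event>> TRANSITIONS = new EnumMap<>(State.class);
  static {
    TRANSITIONS.put(State.ALLOCATED,
        EnumSet.of(Event.LAUNCHED, Event.LAUNCH_FAILED, Event.CONTAINER_EXPIRED));
    // Before a fix, FAILED has no transition registered for launch events:
    TRANSITIONS.put(State.FAILED, EnumSet.noneOf(Event.class));
    // A fix in the spirit of this issue would register them as ignorable no-ops:
    // TRANSITIONS.put(State.FAILED, EnumSet.of(Event.LAUNCHED, Event.LAUNCH_FAILED));
  }

  static State handle(State current, Event event) {
    if (!TRANSITIONS.getOrDefault(current, EnumSet.noneOf(Event.class)).contains(event)) {
      // Corresponds to "InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED".
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    return current; // ignorable transition: the attempt stays in its current state
  }

  public static void main(String[] args) {
    // The container expired (IPC retries outlasted the expiry interval), so the
    // attempt is already FAILED when the late AMLauncher result arrives.
    handle(State.FAILED, Event.LAUNCH_FAILED); // throws until the transition is registered
  }
}
{code}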

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina

 AM max retries is configured as 3 on both the client and the RM side.
 Step 1: Install a cluster with NMs on 2 machines.
 Step 2: Make sure a ping by IP from the RM machine to the NM1 machine succeeds, but a 
 ping by hostname fails.
 Step 3: Execute a job.
 Step 4: After the AM [ AppAttempt_1 ] is allocated to the NM1 machine, a connection 
 loss happens.
 Observation :
 ==
 After AppAttempt_1 has moved to the FAILED state, the release of AppAttempt_1's 
 container and the application removal are successful. A new AppAttempt_2 is spawned.
 1. Then a retry for AppAttempt_1 happens again.
 2. On the RM side it again tries to launch AppAttempt_1, hence it fails with an 
 InvalidStateTransitonException.
 3. The client exited after AppAttempt_1 finished [but the job is actually still 
 running], even though the configured number of app attempts is 3 and the remaining 
 app attempts are all spawned and running.
 RMLogs:
 ==
 2013-07-17 16:22:51,013 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED
 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); 
 maxRetries=45
 2013-07-17 16:36:07,091 INFO 
 org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
 Expired:container_1373952096466_0056_01_01 Timed out after 600 secs
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED 
 to EXPIRED
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Registering appattempt_1373952096466_0056_02
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application appattempt_1373952096466_0056_01 is done. finalState=FAILED
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Application removed - appId: application_1373952096466_0056 user: Rex 
 leaf-queue of parent: root #applications: 35
 2013-07-17 16:36:07,132 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application Submission: appattempt_1373952096466_0056_02, 
 2013-07-17 16:36:07,138 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED
 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); 
 maxRetries=45
 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); 
 maxRetries=45
 2013-07-17 16:38:56,207 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
 launching appattempt_1373952096466_0056_01. Got exception: 
 java.lang.reflect.UndeclaredThrowableException
 2013-07-17 16:38:56,207 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 LAUNCH_FAILED at FAILED
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
  at 
 

[jira] [Updated] (YARN-933) After an AppAttempt_1 got failed [ removal and releasing of container is done , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws Exception at RM .And cl

2013-07-22 Thread rohithsharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rohithsharma updated YARN-933:
--

Attachment: YARN-933.patch

Attached the patch to fix this issue. Please review this.

 After an AppAttempt_1 got failed [ removal and releasing of container is done 
 , AppAttempt_2 is scheduled ] again relaunching of AppAttempt_1 throws 
 Exception at RM .And client exited before appattempt retries got over
 --

 Key: YARN-933
 URL: https://issues.apache.org/jira/browse/YARN-933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: J.Andreina
 Attachments: YARN-933.patch


 AM max retries is configured as 3 on both the client and the RM side.
 Step 1: Install a cluster with NMs on 2 machines.
 Step 2: Make sure a ping by IP from the RM machine to the NM1 machine succeeds, but a 
 ping by hostname fails.
 Step 3: Execute a job.
 Step 4: After the AM [ AppAttempt_1 ] is allocated to the NM1 machine, a connection 
 loss happens.
 Observation :
 ==
 After AppAttempt_1 has moved to the FAILED state, the release of AppAttempt_1's 
 container and the application removal are successful. A new AppAttempt_2 is spawned.
 1. Then a retry for AppAttempt_1 happens again.
 2. On the RM side it again tries to launch AppAttempt_1, hence it fails with an 
 InvalidStateTransitonException.
 3. The client exited after AppAttempt_1 finished [but the job is actually still 
 running], even though the configured number of app attempts is 3 and the remaining 
 app attempts are all spawned and running.
 RMLogs:
 ==
 2013-07-17 16:22:51,013 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_01 State change from SCHEDULED to ALLOCATED
 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); 
 maxRetries=45
 2013-07-17 16:36:07,091 INFO 
 org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
 Expired:container_1373952096466_0056_01_01 Timed out after 600 secs
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_1373952096466_0056_01_01 Container Transitioned from ACQUIRED 
 to EXPIRED
 2013-07-17 16:36:07,093 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Registering appattempt_1373952096466_0056_02
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application appattempt_1373952096466_0056_01 is done. finalState=FAILED
 2013-07-17 16:36:07,131 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 Application removed - appId: application_1373952096466_0056 user: Rex 
 leaf-queue of parent: root #applications: 35
 2013-07-17 16:36:07,132 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
  Application Submission: appattempt_1373952096466_0056_02, 
 2013-07-17 16:36:07,138 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1373952096466_0056_02 State change from SUBMITTED to SCHEDULED
 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); 
 maxRetries=45
 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); 
 maxRetries=45
 2013-07-17 16:38:56,207 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
 launching appattempt_1373952096466_0056_01. Got exception: 
 java.lang.reflect.UndeclaredThrowableException
 2013-07-17 16:38:56,207 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Can't handle this event at current state
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 LAUNCH_FAILED at FAILED
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
  at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
  at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
  at 
 

[jira] [Created] (YARN-929) 2 MRAppMaster spawned for same Job Id

2013-07-16 Thread rohithsharma (JIRA)
rohithsharma created YARN-929:
-

 Summary: 2 MRAppMaster spawned for same Job Id
 Key: YARN-929
 URL: https://issues.apache.org/jira/browse/YARN-929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: rohithsharma


Configuration: 
yarn.resourcemanager.am.max-retries = 3

Scenario: 
The NodeManager is killed forcefully, i.e. using kill -9 NM_PID.
After node expiry, the RM kills all the containers running on this NodeManager.
But the MRAppMaster JVM is still running.
The RM spawns the 2nd attempt MRAppMaster, since AM retries are configured as 3.

The problem with 2 MRAppMasters running is that the 1st attempt app master deletes the 
job information from HDFS, which causes a FileNotFoundException for the 2nd attempt 
MRAppMaster.
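
The sketch below is a hedged illustration of the guard that avoids this race; the helper names, paths and signatures are hypothetical, not the actual MRAppMaster cleanup code. The idea is that job/staging data may only be deleted when the attempt knows no further AM retry can be spawned, otherwise a still-running first attempt deletes files the second attempt needs.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StagingCleanupSketch {

  // Hypothetical cleanup hook: delete the job's staging data only when no
  // further AM attempt can be spawned for this job.
  static void cleanupStagingDir(Path stagingDir, int attemptId, int maxAttempts)
      throws IOException {
    boolean isLastRetry = attemptId >= maxAttempts;
    if (!isLastRetry) {
      // A later attempt (e.g. attempt 2 after the NM was killed) still needs these
      // files; deleting them here is what leads to the FileNotFoundException.
      return;
    }
    Files.deleteIfExists(stagingDir.resolve("job.xml"));
    Files.deleteIfExists(stagingDir);
  }

  public static void main(String[] args) throws IOException {
    // yarn.resourcemanager.am.max-retries = 3 in the reported configuration.
    Path staging = Paths.get("/tmp/staging/job_0001");
    cleanupStagingDir(staging, 1, 3); // attempt 1 must not delete the staging dir
  }
}
{code}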
 



[jira] [Updated] (YARN-929) 2 MRAppMaster running parallely for same Job Id

2013-07-16 Thread rohithsharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rohithsharma updated YARN-929:
--

Summary: 2 MRAppMaster running parallely for same Job Id  (was: 2 
MRAppMaster spawned for same Job Id)

 2 MRAppMaster running parallely for same Job Id
 ---

 Key: YARN-929
 URL: https://issues.apache.org/jira/browse/YARN-929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: rohithsharma

 Configuration: 
 yarn.resourcemanager.am.max-retries = 3
 Scenario: 
 The NodeManager is killed forcefully, i.e. using kill -9 NM_PID.
 After node expiry, the RM kills all the containers running on this NodeManager.
 But the MRAppMaster JVM is still running.
 The RM spawns the 2nd attempt MRAppMaster, since AM retries are configured as 3. 
 At this point, 2 MRAppMasters are running in parallel for the same job id.
 The problem with 2 MRAppMasters running is that the 1st attempt app master deletes 
 the job information from HDFS, which causes a FileNotFoundException for the 2nd 
 attempt MRAppMaster.  
  



[jira] [Created] (YARN-907) MRAppMaster failed to initialize second attempt when first attempt is FAILED.

2013-07-09 Thread rohithsharma (JIRA)
rohithsharma created YARN-907:
-

 Summary: MRAppMaster failed to initialize second attempt when 
first attempt is FAILED.
 Key: YARN-907
 URL: https://issues.apache.org/jira/browse/YARN-907
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.0.5-alpha
 Environment: SuSeLinux , HDFS HA cluster.
Reporter: rohithsharma


Configuration: 
yarn.resourcemanager.am.max-retries = 3

It is observed that:
1. The MRAppMaster failed to start its services and exited, with the shutdown hook 
running. As part of the shutdown hook execution, the staging directory is deleted 
from HDFS.

2. A new attempt is spawned by the ResourceManager. But the second attempt fails with 
a FileNotFoundException because the staging directory was deleted by the first 
attempt's shutdown hook.
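
A small hedged sketch of the failure mode follows; it uses a generic JVM shutdown hook, not the actual MRAppMaster shutdown hook, and the messages are illustrative. It shows that cleanup work placed in a shutdown hook runs even when the process exits because service start-up failed, so the staging directory is gone before attempt 2 can read it.

{code:java}
public class ShutdownHookSketch {

  public static void main(String[] args) {
    // Hook registered early, before services are started.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      // In the reported scenario this is where the staging directory gets deleted
      // from HDFS, even though the attempt failed during start-up and further AM
      // retries are still allowed.
      System.out.println("shutdown hook: deleting staging directory ...");
    }));

    try {
      throw new IllegalStateException("simulated failure while starting services");
    } catch (IllegalStateException e) {
      // The attempt exits; the hook above still runs and removes the files that
      // attempt 2 will later look for, causing its FileNotFoundException.
      System.exit(1);
    }
  }
}
{code}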




[jira] [Commented] (YARN-901) Active users field in Resourcemanager scheduler UI gives negative values

2013-07-04 Thread rohithsharma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700445#comment-13700445
 ] 

rohithsharma commented on YARN-901:
---

Active users shows a negative value after a restart of the RM. On the APP_ADDED event 
the active-users value is calculated, and the same value is recalculated on the 
APP_REMOVED event.
After submitting a job, if we restart the RM, the calculation leads to a negative 
value. The problem is the in-memory storage of user info at each queue, which is 
reset during RM start-up.
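
The sketch below is a hedged, simplified illustration, not the CapacityScheduler's actual user-tracking code; class and method names are made up. It shows how the UI value can go negative: the per-queue counters live only in memory, so after an RM restart the APP_REMOVED bookkeeping for a job submitted before the restart decrements a counter that was never re-incremented.

{code:java}
public class ActiveUsersSketch {

  // In-memory per-queue counter; it is not persisted across RM restarts.
  private int activeUsers = 0;

  void onAppAdded()   { activeUsers++; }  // APP_ADDED event
  void onAppRemoved() { activeUsers--; }  // APP_REMOVED event, no floor at zero

  int getActiveUsers() { return activeUsers; }

  public static void main(String[] args) {
    ActiveUsersSketch queue = new ActiveUsersSketch();
    queue.onAppAdded();                  // job submitted: activeUsers = 1

    queue = new ActiveUsersSketch();     // RM restart: in-memory state reset to 0

    queue.onAppRemoved();                // removal of the pre-restart job
    System.out.println(queue.getActiveUsers()); // prints -1, as seen in the scheduler UI
  }
}
{code}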

 Active users field in Resourcemanager scheduler UI gives negative values
 --

 Key: YARN-901
 URL: https://issues.apache.org/jira/browse/YARN-901
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Nishan Shetty
Priority: Minor

 Active users field in Resourcemanager scheduler UI gives negative values on 
 Resourcemanager restart when job is in progress
