[ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814496#comment-13814496 ]
Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------

Thanks [~jianhe] for reviewing it.

{code}
Instead of passing running containers as a parameter in RegisterNodeManagerRequest, is it possible to just call heartBeat immediately after registerCall and then unBlockNewContainerRequests? That way we can take advantage of the existing heartbeat logic and cover other things, like keeping the app alive for log aggregation after the AM container completes. Or at least we can send the list of ContainerStatus (including diagnostics) instead of just container IDs, and also the list of keep-alive apps (separate JIRA)?
{code}

It makes sense to replace finishedContainers with containerStatuses.

bq. Unnecessary import changes in DefaultContainerExecutor.java and LinuxContainerExecutor, ContainerLaunch, ContainersLauncher

I actually wanted those earlier because I had created a new ExitCode.java and wanted to access it from ResourceTrackerService. Now that we are sending the container status from the node manager itself, that is no longer needed; fixed it.

bq. Finished containers may not necessarily be killed. The containers can also finish normally and remain in the NM cache before NM resync.

Updated the logic for cleanupContainers on the node manager side. Now we should have all the finished container statuses as-is.

bq. Wrong LOG class name. :)

Fixed it.

bq. LogFactory.getLog(RMAppImpl.class);

Removed.

bq. Isn't it always the case that after this patch only the last attempt can be running? A new attempt will not be launched until the previous attempt reports back that it really exited. If this is the case, it can be a bug. We may only need to check whether the last attempt is finished or not.

It is actually checking every attempt, not just the last one, for being in a non-running state. Do you want me to check only the last attempt (by comparing application attempt IDs)?

bq. Should we return RUNNING or ACCEPTED for apps that are not in a final state?
It's OK to return RUNNING in the scope of this patch because we are launching a new attempt anyway. Later on, in work-preserving restart, the RM can crash before the attempt registers, and the attempt can register with the RM after it comes back, in which case we can then move the app from ACCEPTED to RUNNING. Yes, for now I will keep it as RUNNING only. Today we don't have any information about whether the previous application master started and registered or not. Once we have that information, we can probably do this.

> During RM restart, RM should start a new attempt only when previous attempt
> exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Omkar Vinit Joshi
>         Attachments: YARN-1210.1.patch, YARN-1210.2.patch
>
> When RM recovers, it can wait for existing AMs to contact RM back and then
> kill them forcefully before even starting a new AM. Worst case, RM will start
> a new AppAttempt after waiting for 10 mins (the expiry interval). This way
> we'll minimize multiple AMs racing with each other. This can help issues with
> downstream components like Pig, Hive and Oozie during RM restart.
> In the meantime, new apps will proceed as usual as existing apps wait for
> recovery.
> This can continue to be useful after work-preserving restart, so that AMs
> which can properly sync back up with RM can continue to run and those that
> don't are guaranteed to be killed before starting a new attempt.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
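The attempt-state check debated in the comment above (check all attempts vs. only the last one) can be sketched roughly as follows. This is a hypothetical, simplified illustration: the `AttemptState` enum and both helper methods are stand-ins for illustration only, not the real RMAppImpl/RMAppAttempt code.

```java
import java.util.EnumSet;
import java.util.List;

// Hypothetical sketch of the two variants discussed in the review:
// the RM should start a new attempt only when every previous attempt
// has really exited. AttemptState loosely mirrors RMAppAttemptState
// but is purely illustrative.
public class AttemptCheckSketch {

    enum AttemptState { NEW, RUNNING, FINISHED, FAILED, KILLED }

    static final EnumSet<AttemptState> FINAL_STATES =
            EnumSet.of(AttemptState.FINISHED, AttemptState.FAILED, AttemptState.KILLED);

    // Variant in the current patch: scan every attempt; any attempt
    // still outside a final state blocks launching a new one.
    static boolean anyAttemptStillRunning(List<AttemptState> attempts) {
        for (AttemptState s : attempts) {
            if (!FINAL_STATES.contains(s)) {
                return true;
            }
        }
        return false;
    }

    // Reviewer's suggestion: if only the last attempt can ever be
    // running, checking just the last attempt would suffice.
    static boolean lastAttemptStillRunning(List<AttemptState> attempts) {
        return !attempts.isEmpty()
                && !FINAL_STATES.contains(attempts.get(attempts.size() - 1));
    }
}
```

Both variants agree whenever the invariant "only the last attempt can be running" holds; the full scan is simply the more defensive choice while that invariant is still in question.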