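[jira] [Commented] (MAPREDUCE-3921) MR AM should act on the nodes liveliness information when nodes go up/down/unhealthy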

Robert Joseph Evans (Commented) (JIRA) Fri, 06 Apr 2012 11:41:49 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248585#comment-13248585
 ]


Robert Joseph Evans commented on MAPREDUCE-3921:
------------------------------------------------

For the testing I just want to be sure that nothing catastrophically bad 
happens in these cases.  If a failed task is not detected until the reducer 
fails to fetch data from it, that is fine with me.  What I really want to avoid 
is the AM dying or hanging, or any kind of data corruption.
 
By AM Recovery I mean that when the AM dies, e.g. because it was on a bad node, 
the RM will restart it.  The AM then looks through the JobHistory logs to find 
out which tasks finished successfully before it died, and which ones need to be 
restarted.  I just want to be sure that if a map task is restarted because a 
node is unhealthy, and the AM is also restarted, the recovery code will handle 
that case correctly.
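
To make the scenario concrete, here is a minimal sketch of the case I have in 
mind.  It is not the actual MR AM recovery code: RecoverySketch, HistoryEvent, 
Status and tasksToRerun are hypothetical names, and the assumption that the 
latest history event per task decides its fate is mine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of replaying job history after an AM restart.
public class RecoverySketch {
    // Hypothetical terminal states a task can have in the history log.
    enum Status { SUCCEEDED, KILLED, FAILED }

    // Hypothetical stand-in for one task record read from the history log.
    static final class HistoryEvent {
        final String taskId;
        final Status status;

        HistoryEvent(String taskId, Status status) {
            this.taskId = taskId;
            this.status = status;
        }
    }

    // Returns the tasks the restarted AM must run again, in job order.
    // Assumption: events are in log order, so the last event per task wins.
    static List<String> tasksToRerun(List<String> allTasks,
                                     List<HistoryEvent> history) {
        Map<String, Status> latest = new HashMap<>();
        for (HistoryEvent e : history) {
            latest.put(e.taskId, e.status); // later events overwrite earlier ones
        }
        List<String> rerun = new ArrayList<>();
        for (String task : allTasks) {
            // No record at all, or a final record other than SUCCEEDED,
            // means the task has no usable output.
            if (latest.get(task) != Status.SUCCEEDED) {
                rerun.add(task);
            }
        }
        return rerun;
    }
}
```

The interesting case is a map that succeeded and was then killed because its 
node went unhealthy: in this model it logs SUCCEEDED followed by KILLED, and 
recovery has to treat it as needing a rerun rather than as finished.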

bq. Are reducers informed about new locations for map outputs on the fly?
That is my understanding.  Otherwise no reducer could be launched until all 
mappers had finished, and all reducers would have to be relaunched if a map 
task disappeared on a bad node.

bq. I dont think the assert is necessary given the current code but do you 
usually put in asserts?
I don't usually put in asserts, but I don't really like dangling TODOs lying 
around.  If it is something that needs to be done, I feel we should either do 
it or file a JIRA to track it so it gets done.  If it is not something that 
needs to be done, then we don't need a TODO for it.  If this is a copy-and-paste 
TODO I am OK with leaving it.  That is the reason I did not comment on the other 
TODOs added to the code: I could see where they were copied from.

bq. The log means that this task was not started and hence further history 
events are not being added. This is similar to other places in the code
Yes, I can see the place where it was copied from.  What I am referring to is 
that the KilledTransition, where this looks like it came from, handles the kill 
event coming in from many different states.  In some of these states it is 
reasonable to have a launch time of 0.  In KilledAfterSuccessTransition, as the 
name implies, it seems very difficult to have a task attempt in the "SUCCESS" 
state that had no launch time.  A task that finished successfully but was never 
run seems odd to me.  If you want to leave the check in for defensive 
programming I am happy with that, but I would prefer the log message to be 
above debug level so someone looking can see that something odd happened here.
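
A rough sketch of what I mean by keeping the defensive check but logging it 
more loudly.  This is not the code in the patch: shouldRecordHistory is a 
hypothetical helper, and java.util.logging stands in for whatever logger the 
real class uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only: a defensive launch-time check that logs at WARNING.
public class KilledAfterSuccessSketch {
    private static final Logger LOG =
        Logger.getLogger(KilledAfterSuccessSketch.class.getName());

    // Hypothetical helper: returns true if a history event should be written
    // for an attempt that is being killed after it already succeeded.
    static boolean shouldRecordHistory(String attemptId, long launchTime) {
        if (launchTime == 0) {
            // A succeeded attempt with no launch time should not happen,
            // so make it visible instead of hiding it at debug level.
            LOG.log(Level.WARNING,
                "Attempt {0} succeeded but has no launch time; "
                    + "not generating a history event", attemptId);
            return false;
        }
        return true;
    }
}
```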

bq. The comments are for reviewers to clarify those points. eg. Some of the 
code was taken from similar actions elsewhere. They set the finish time and I 
was not sure if that was the correct thing to do.
It seems logical that if we are killing a task we want to be sure the finish 
time is set, so just set it.  That said, it should already have been set in the 
SUCCESS case, so I would just leave it off, but I really don't know for sure.
                
> MR AM should act on the nodes liveliness information when nodes go 
> up/down/unhealthy
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3921
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3921
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3921-branch-0.23.patch, 
> MAPREDUCE-3921-branch-0.23.patch, MAPREDUCE-3921-branch-0.23.patch, 
> MAPREDUCE-3921.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
