[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16306803#comment-16306803
 ] 

Jason Lowe commented on MAPREDUCE-7028:
---------------------------------------

bq. This pattern does not ensure that the updates keep an order, meaning that a 
later update with progress 100% can be succeeded by an update with progress 50%.

This should not be possible.  Status updates are not arbitrarily asynchronous 
from a single task attempt.  The task attempt will not generate a new status 
update until it has received a response from the previous status update call.  
The reason we're seeing multiple simultaneous updates is that the task 
attempt is retrying the status update RPC call with the same payload because 
it never received an RPC response (e.g., RPC timeout or network cut).  
Therefore it should not matter which one we take first, since they are the same 
update.
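
To illustrate the retry behavior described above, here is a minimal sketch. It 
is not the actual Hadoop task code: the class {{RetryingStatusSender}}, the 
method {{sendUntilAcked}}, and the {{Predicate}} standing in for the RPC call 
are all hypothetical. The predicate returns true when a response reaches the 
sender. A lost response makes the sender resend the identical payload, so the 
receiver may see the same update more than once.

```java
import java.util.function.Predicate;

/** Hypothetical sketch only; none of these names exist in Hadoop. */
final class RetryingStatusSender {

    /**
     * Sends the same payload repeatedly until the stand-in RPC reports that a
     * response arrived. No new payload is created between retries, so every
     * copy the receiver sees is identical.
     *
     * @return the number of sends it took to get a response
     */
    static <T> int sendUntilAcked(T payload, Predicate<T> rpc, int maxSends) {
        for (int send = 1; send <= maxSends; send++) {
            // rpc.test(...) returning false models a timeout or network cut:
            // the receiver may still have processed the payload.
            if (rpc.test(payload)) {
                return send;
            }
        }
        throw new IllegalStateException("no response after " + maxSends + " sends");
    }
}
```

If the stand-in RPC drops the first response, the receiver ends up with two 
copies of the same update, and processing them in either order gives the same 
result.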

Even if we placed a fair lock here it wouldn't really solve the issue, since a 
later update can race ahead of an earlier update and enter the lock first.  
If for some reason this really needs to be solved, then there need to be 
sequence IDs in the payload.  Then the listener can tell when a status update 
is stale and ignore the progress from a stale update.
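
A rough sketch of the sequence ID idea, assuming the task attempt stamps each 
new status update with a monotonically increasing number and reuses that number 
on retries. The class {{SequencedProgressTracker}} and its methods are 
hypothetical and are not part of {{TaskAttemptListenerImpl}} or any Hadoop API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch only; none of these names exist in Hadoop. */
final class SequencedProgressTracker {

    /** The most recent update accepted for one task attempt. */
    private static final class Entry {
        final long sequenceId;
        final float progress;

        Entry(long sequenceId, float progress) {
            this.sequenceId = sequenceId;
            this.progress = progress;
        }
    }

    private final ConcurrentMap<String, Entry> latest = new ConcurrentHashMap<>();

    /**
     * Applies the update only if its sequence ID is newer than the last one
     * accepted for this attempt. A retried duplicate (same ID) or a stale
     * update (lower ID) leaves the recorded progress untouched.
     *
     * @return true if the progress was applied, false if it was ignored
     */
    boolean offer(String attemptId, long sequenceId, float progress) {
        boolean[] accepted = {false};
        // compute() runs atomically per key, so two handler threads cannot
        // interleave their compare-and-store for the same attempt.
        latest.compute(attemptId, (id, current) -> {
            if (current != null && sequenceId <= current.sequenceId) {
                return current; // stale or duplicate: keep what we have
            }
            accepted[0] = true;
            return new Entry(sequenceId, progress);
        });
        return accepted[0];
    }

    /** Returns the last accepted progress, or 0 if none was recorded. */
    float progressOf(String attemptId) {
        Entry entry = latest.get(attemptId);
        return entry == null ? 0.0f : entry.progress;
    }
}
```

With this in place, an update carrying sequence ID 1 that arrives after ID 2 
has been accepted is ignored, regardless of which handler thread wins the race. 
The listener would still send an RPC response for the ignored update so that 
the task attempt stops retrying.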

> Concurrent task progress updates causing NPE in Application Master
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7028
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7028
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
>            Reporter: Gergo Repas
>            Assignee: Gergo Repas
>         Attachments: MAPREDUCE-7028.000.patch, MAPREDUCE-7028.001.patch
>
>
> Concurrent task progress updates can cause a NullPointerException in the 
> Application Master (stack trace is with code at current trunk):
> {quote}
> 2017-12-20 06:49:42,369 INFO [IPC Server handler 9 on 39501] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1513780867907_0001_m_000002_0 is : 0.02677883
> 2017-12-20 06:49:42,369 INFO [IPC Server handler 13 on 39501] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1513780867907_0001_m_000002_0 is : 0.02677883
> 2017-12-20 06:49:42,383 FATAL [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$StatusUpdater.transition(TaskAttemptImpl.java:2450)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$StatusUpdater.transition(TaskAttemptImpl.java:2433)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1362)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:154)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1543)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1535)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>         at java.lang.Thread.run(Thread.java:748)
> 2017-12-20 06:49:42,385 INFO [IPC Server handler 13 on 39501] 
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
> attempt_1513780867907_0001_m_000002_0 is : 0.02677883
> 2017-12-20 06:49:42,386 INFO [AsyncDispatcher ShutDown handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> {quote}
> This happened naturally in several big wordcount runs, and I could reproduce 
> this reliably by artificially making task updates more frequent.


