[ https://issues.apache.org/jira/browse/MAPREDUCE-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305673#comment-16305673 ]
Miklos Szegedi commented on MAPREDUCE-7028:
-------------------------------------------

Thank you, [~grepas], for the patch. I have a few comments:
{code}
while (!done) {
  TaskAttemptStatus lastStatus = lastStatusRef.get();
  List<TaskAttemptId> fetchFailedMaps = taskAttemptStatus.fetchFailedMaps;
{code}
Since this code runs frequently, there is no need to save the list inside the loop on every iteration. You can do it once before the while. I would also give it a more specific name, such as savedFailedMaps.
{code}
taskAttemptStatus.fetchFailedMaps =
    new ArrayList<>(taskAttemptStatus.fetchFailedMaps);
taskAttemptStatus.fetchFailedMaps.addAll(
    lastStatus.fetchFailedMaps);
{code}
The ArrayList should be created with an initial capacity equal to the sum of the sizes of the two base lists. Otherwise the addAll may do unnecessary copies when the backing array has to grow. I was thinking about something like:
{code}
taskAttemptStatus.fetchFailedMaps = new ArrayList<>(
    taskAttemptStatus.fetchFailedMaps.size()
    + lastStatus.fetchFailedMaps.size());
taskAttemptStatus.fetchFailedMaps.addAll(fetchFailedMaps);
taskAttemptStatus.fetchFailedMaps.addAll(
    lastStatus.fetchFailedMaps);
{code}
We also discussed this offline with [~rkanter]. This pattern does not ensure that the updates keep their order, meaning that a later update with progress 100% can be followed by an update with progress 50%. A fair ReentrantLock would solve this, whereas compareAndSet does not.
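To make the last two suggestions concrete, here is a minimal sketch of how they might be combined. It is not the patch itself: StatusMergeSketch, Status, update and take are hypothetical names, Status is a stripped-down stand-in for TaskAttemptStatus, and plain strings stand in for TaskAttemptId. The assumption is that a fair ReentrantLock hands the lock to waiting threads roughly in the order they requested it, so updates are applied in approximately arrival order, which a compareAndSet retry loop does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical holder for the most recent, not yet consumed status update.
class StatusMergeSketch {

    // Hypothetical stand-in for TaskAttemptStatus; only the fields used here.
    static final class Status {
        float progress;
        List<String> fetchFailedMaps = new ArrayList<>();
    }

    // 'true' requests a fair lock: the longest-waiting thread acquires next.
    private final ReentrantLock lock = new ReentrantLock(true);

    // Guarded by 'lock'; null when there is no unconsumed update.
    private Status pending;

    // Called by reporter threads. Merges the failed-map list of an
    // unconsumed earlier update into the incoming one, then stores it.
    void update(Status incoming) {
        lock.lock();
        try {
            if (pending != null) {
                // Presize to the combined length so addAll never regrows.
                List<String> merged = new ArrayList<>(
                    incoming.fetchFailedMaps.size()
                    + pending.fetchFailedMaps.size());
                merged.addAll(incoming.fetchFailedMaps);
                merged.addAll(pending.fetchFailedMaps);
                incoming.fetchFailedMaps = merged;
            }
            pending = incoming;
        } finally {
            lock.unlock();
        }
    }

    // Called by the consumer. Returns the pending update, or null if none.
    Status take() {
        lock.lock();
        try {
            Status s = pending;
            pending = null;
            return s;
        } finally {
            lock.unlock();
        }
    }
}
```

The trade-off is that fair locks usually have lower throughput than non-fair locks or lock-free retries, so whether this is acceptable depends on how often status updates arrive.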
> Concurrent task progress updates causing NPE in Application Master
> ------------------------------------------------------------------
>
> Key: MAPREDUCE-7028
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7028
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am
> Affects Versions: 3.1.0, 3.0.1, 2.10.0, 2.9.1, 2.8.4, 2.7.6
> Reporter: Gergo Repas
> Assignee: Gergo Repas
> Attachments: MAPREDUCE-7028.000.patch, MAPREDUCE-7028.001.patch
>
>
> Concurrent task progress updates can cause a NullPointerException in the
> Application Master (stack trace is with code at current trunk):
> {quote}
> 2017-12-20 06:49:42,369 INFO [IPC Server handler 9 on 39501] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1513780867907_0001_m_000002_0 is : 0.02677883
> 2017-12-20 06:49:42,369 INFO [IPC Server handler 13 on 39501] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1513780867907_0001_m_000002_0 is : 0.02677883
> 2017-12-20 06:49:42,383 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$StatusUpdater.transition(TaskAttemptImpl.java:2450)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$StatusUpdater.transition(TaskAttemptImpl.java:2433)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1362)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:154)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1543)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1535)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:748)
> 2017-12-20 06:49:42,385 INFO [IPC Server handler 13 on 39501] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1513780867907_0001_m_000002_0 is : 0.02677883
> 2017-12-20 06:49:42,386 INFO [AsyncDispatcher ShutDown handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> {quote}
> This happened naturally in several big wordcount runs, and I could reproduce
> this reliably by artificially making task updates more frequent.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org