[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3355:
-----------------------------------------------
      Resolution: Fixed
    Release Note: Fixed MR AM's ContainerLauncher to handle node-command timeouts correctly.
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

I just committed this to trunk and branch-0.23.

> AM scheduling hangs frequently with sort job on 350 nodes
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3355
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3355
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MAPREDUCE-3355-2009.1.txt, MAPREDUCE-3355-2009.txt, MAPREDUCE-3355-2015.txt, MR3355.txt
>
>
> Another collaboration with [~karams]. Sort job hangs not so rarely on a 350 node cluster. Found this in AM logs:
> {code}
> Exception in thread "ContainerLauncher #60" org.apache.hadoop.yarn.YarnException: java.lang.InterruptedException
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:170)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:379)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:168)
>     ... 4 more
> Exception in thread "ContainerLauncher #53" org.apache.hadoop.yarn.YarnException: java.lang.InterruptedException
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:170)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.sendContainerLaunchFailedMsg(ContainerLauncherImpl.java:405)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:330)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:168)
>     ... 5 more
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
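The traces in the issue description show `InterruptedException` surfacing from `LinkedBlockingQueue.put` while a launcher thread hands an event to the dispatcher. One way such a trace can arise (an assumption here, not something the thread confirms) is that the thread's interrupt flag was already set when it reached the queue, for example by a timeout that fired late. The sketch below only demonstrates the JDK behaviour; `StrayInterruptDemo` and its method name are hypothetical and no Hadoop classes are involved.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not part of Hadoop.
final class StrayInterruptDemo {

    /**
     * Returns true if put() rejects the element because the calling
     * thread already carries a pending interrupt.
     */
    static boolean putFailsWhenInterrupted() {
        BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();

        // Simulates an interrupt that arrived after the real work finished.
        Thread.currentThread().interrupt();

        try {
            // put() acquires its lock interruptibly, so a pending interrupt
            // makes it throw even though the queue has free capacity.
            eventQueue.put("launch-failed-event");
            return false;
        } catch (InterruptedException e) {
            // Throwing the exception clears the interrupt flag; the event is lost.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("put failed: " + putFailsWhenInterrupted());
    }
}
```

The point of the sketch is that the event never reaches the queue, so whatever was waiting for it would wait forever, which is consistent with a hang.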
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Siddharth Seth updated MAPREDUCE-3355:
--------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Siddharth Seth updated MAPREDUCE-3355:
--------------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Siddharth Seth updated MAPREDUCE-3355:
--------------------------------------
    Attachment: MR3355.txt
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Vinod Kumar Vavilapalli updated MAPREDUCE-3355:
-----------------------------------------------
    Attachment: MAPREDUCE-3355-2015.txt

Updated patch that should avoid the race condition reported by Sid.
 - Removed the one-Timer-per-launch approach; that was one more thread for every event!
 - Each TimerTask is now cancelled explicitly.
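The two changes listed in the message above (no Timer per launch, explicit cancellation of each TimerTask) can be sketched roughly as follows. This is not the committed patch: `TimedCommandRunner`, `InterruptTask`, and the `finished`/`fired` flags are hypothetical names, and the synchronized hand-off is just one possible way to keep a late timeout from interrupting the thread after the command has returned.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.Callable;

// Hypothetical sketch, not the code from MAPREDUCE-3355.
final class TimedCommandRunner {

    // A single daemon Timer (one thread) shared by every command,
    // instead of creating a new Timer per launch.
    private final Timer timer = new Timer("command-timeout", true);

    /** Interrupts the target thread on timeout unless finish() ran first. */
    private static final class InterruptTask extends TimerTask {
        private final Thread target;
        private boolean finished; // guarded by this
        private boolean fired;    // guarded by this

        InterruptTask(Thread target) {
            this.target = target;
        }

        @Override
        public synchronized void run() {
            if (!finished) {
                fired = true;
                target.interrupt();
            }
        }

        /** Cancels the task; returns true if the timeout had already fired. */
        synchronized boolean finish() {
            finished = true;
            cancel();
            return fired;
        }
    }

    /** Runs the command on the calling thread, interrupting it after timeoutMs. */
    <T> T call(Callable<T> command, long timeoutMs) throws Exception {
        InterruptTask task = new InterruptTask(Thread.currentThread());
        timer.schedule(task, timeoutMs);
        try {
            return command.call();
        } finally {
            // Explicit cancel. If the timeout fired, clear any interrupt flag
            // it left behind so it cannot leak into later blocking calls.
            if (task.finish()) {
                Thread.interrupted();
            }
        }
    }

    void shutdown() {
        timer.cancel();
    }
}
```

Because `run()` and `finish()` synchronize on the same task object, the timeout either fires before `finish()` (and is cleaned up there) or not at all, so the caller can safely enqueue follow-up events afterwards.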
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Vinod Kumar Vavilapalli updated MAPREDUCE-3355:
-----------------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Vinod Kumar Vavilapalli updated MAPREDUCE-3355:
-----------------------------------------------
    Status: Open  (was: Patch Available)

bq. There's another extremely unlikely situation which could cause this.

However unlikely it is, it'll be good if we can fix it. I'll see what I can do.
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Vinod Kumar Vavilapalli updated MAPREDUCE-3355:
-----------------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Vinod Kumar Vavilapalli updated MAPREDUCE-3355:
-----------------------------------------------
    Attachment: MAPREDUCE-3355-2009.1.txt

A few imports clashed and made the patch invalid. This is the same patch that Karam tested, but with fixed imports.
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
Vinod Kumar Vavilapalli updated MAPREDUCE-3355:
-----------------------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3355: --- Status: Patch Available (was: Open) bq. After patch over MAPREDUCE-, Ran Sort twice and did not observe this issue anymore Thanks Karam! Submitting this to Jenkins.
[jira] [Updated] (MAPREDUCE-3355) AM scheduling hangs frequently with sort job on 350 nodes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-3355: --- Attachment: MAPREDUCE-3355-2009.txt This should fix it. It works by clearing the interrupt status so that event handling can continue to work and report failures etc. Will make it PA once MAPREDUCE- is in.
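The comment on this update describes the fix as clearing the interrupt status so that event handling can keep working. The attached patch itself is not reproduced in this thread, so the following is only a minimal sketch of that idea, not the committed change. The class name InterruptClearSketch, the methods report and clearThenReport, and the "LAUNCH_FAILED" event string are all hypothetical. The sketch relies on two standard JDK behaviours: LinkedBlockingQueue.put() throws InterruptedException immediately when the calling thread's interrupt flag is already set (which matches the lockInterruptibly frames in the traces), and the static Thread.interrupted() returns the flag and resets it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InterruptClearSketch {

    /** Tries to enqueue an event; returns false if put() was interrupted. */
    static boolean report(BlockingQueue<String> queue, String event) {
        try {
            queue.put(event); // throws at once if the interrupt flag is set
            return true;
        } catch (InterruptedException e) {
            return false; // the event never reaches the queue
        }
    }

    /** Clears a pending interrupt first, so the report can go through. */
    static boolean clearThenReport(BlockingQueue<String> queue, String event) {
        Thread.interrupted(); // static call: returns the flag and resets it
        return report(queue, event);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the dispatcher's event queue.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Stand-in for whatever interrupted the launcher thread.
        Thread.currentThread().interrupt();
        System.out.println(report(queue, "LAUNCH_FAILED"));          // false

        // Interrupt again, then clear the flag before reporting.
        Thread.currentThread().interrupt();
        System.out.println(clearThenReport(queue, "LAUNCH_FAILED")); // true
    }
}
```

Without the clearing step, the failure report itself is lost, so nothing downstream learns that the launch failed. That is one plausible reading of why the job appeared to hang rather than fail.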