[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-21 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836484#action_12836484
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

Okay, in that case, I think everything is fine. I'm going to mark this JIRA as 
invalid. However, anyone who tries to backport the trunk fair scheduler to 0.20 
might run into it, so hopefully they'll find this result. I'm assuming you guys 
included MAPREDUCE-870 or something similar in CDH?

> Deadlock in preemption code in fair scheduler
> -
>
> Key: MAPREDUCE-1436
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1436
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/fair-share
>Affects Versions: 0.21.0, 0.22.0
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>Priority: Blocker
> Attachments: deadlock.png, mapreduce-1436-v2.patch, 
> mapreduce-1436.patch
>
>
> In testing the fair scheduler with preemption, I found a deadlock between 
> updatePreemptionVariables and some code in the JobTracker. This was found 
> while testing a backport of the fair scheduler to Hadoop 0.20, but it looks 
> like it could also happen in trunk and 0.21. Details are in a comment below.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-21 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836441#action_12836441
 ] 

Todd Lipcon commented on MAPREDUCE-1436:


Yep, I think your last bullet point is sufficient: there's an ordering of locks 
JT -> Scheduler -> JIP. A lock order is correct iff it is a subsequence of that 
ordering.
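
A minimal sketch of that rule, for illustration only: the class name and the
lock labels below are hypothetical and not part of Hadoop. It checks whether a
sequence of first-time lock acquisitions is a subsequence of JT -> Scheduler ->
JIP, and it deliberately ignores re-entrant acquisition of a lock already held.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Hadoop: validates a lock acquisition order. */
final class LockOrderCheck {
    /** Canonical order from the comment above: JT, then Scheduler, then JIP. */
    static final List<String> CANONICAL = Arrays.asList("JT", "Scheduler", "JIP");

    /** Returns true iff 'acquired' is a subsequence of CANONICAL. */
    static boolean isValidOrder(List<String> acquired) {
        int next = 0; // index in CANONICAL from which the next lock must be found
        for (String lock : acquired) {
            int pos = CANONICAL.subList(next, CANONICAL.size()).indexOf(lock);
            if (pos < 0) {
                return false; // unknown lock, or a lock taken out of order
            }
            next += pos + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        // JT then JIP skips the scheduler, which is still a subsequence.
        System.out.println(isValidOrder(Arrays.asList("JT", "JIP")));
        // JT, JIP, then Scheduler is the inverted order seen in the jstack dump.
        System.out.println(isValidOrder(Arrays.asList("JT", "JIP", "Scheduler")));
    }
}
```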




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-21 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836439#action_12836439
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

I'm still a little concerned about the update thread not locking the JT all the 
time, but maybe I don't need to be. Just to clarify, the convention for locking 
is the following:

* If both the JT and the FairScheduler must be locked, the JT is locked first.
* If both the FairScheduler and a JIP must be locked, the FairScheduler is 
locked first.
* If both the JT and a JIP must be locked, the JT is locked first.
* If the JT, FS and JIP must all be locked, the order is JT -> FS -> JIP.

If this is it, then I think we're fine with the current usage.
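
To make the convention concrete, here is a rough sketch using plain objects as
stand-ins for the three monitors. OrderedLocks and its methods are hypothetical
names, not Hadoop code; the guard in withSchedulerLock is just one possible way
to catch an inversion at runtime, assuming the JIP in question is known.

```java
/** Hypothetical stand-ins for the three monitors; not Hadoop classes. */
final class OrderedLocks {
    final Object jobTracker = new Object();
    final Object fairScheduler = new Object();
    final Object jobInProgress = new Object();

    /** Runs the action while holding all three monitors, taken as JT -> FS -> JIP. */
    void withAllLocks(Runnable action) {
        synchronized (jobTracker) {
            synchronized (fairScheduler) {
                synchronized (jobInProgress) {
                    action.run();
                }
            }
        }
    }

    /**
     * Takes the scheduler monitor, but refuses if the caller holds the JIP
     * without already holding the scheduler, since that inverts the order.
     */
    void withSchedulerLock(Runnable action) {
        if (Thread.holdsLock(jobInProgress) && !Thread.holdsLock(fairScheduler)) {
            throw new IllegalStateException("JIP held before FairScheduler");
        }
        synchronized (fairScheduler) {
            action.run();
        }
    }
}
```

A caller that already holds jobInProgress and then calls withSchedulerLock gets
an IllegalStateException instead of risking the deadlock.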




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-21 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836438#action_12836438
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

You're right Todd, it looks like with the new job retiring process, the JT lock 
is always grabbed before the fair scheduler lock, and neither of the two 
deadlocks I pointed out could occur. MAPREDUCE-870 seems to be in both trunk 
and 0.21, so everything should be fine. However, the issue in MAPREDUCE-1499 
does affect Hadoop 0.20 and should be fixed.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834665#action_12834665
 ] 

Todd Lipcon commented on MAPREDUCE-1436:


Hi Matei,

I don't think this issue actually occurs in trunk. I'm referring to the first 
deadlock reported here, the one involving finalizeJob. MAPREDUCE-870 refactored 
this code, so finalizeJob no longer has to go "upwards" to the scheduler 
lock.

I turned on preemption and ran a couple of sleep jobs under jcarder, and it 
didn't identify this issue either. Just to sanity check myself, I added a 
synchronized (taskTrackerScheduler) { System.err.println("hi"); } in the 
finalizeJob method, and reran the test. jcarder spit out the expected potential 
deadlock just like you saw it.

I think, really, the JT was at fault here for inverting the lock hierarchy. It 
no longer does this, so I think we're safe without this patch.

What do you think?




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833700#action_12833700
 ] 

Hadoop QA commented on MAPREDUCE-1436:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12435844/mapreduce-1436-v2.patch
  against trunk revision 909993.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/453/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/453/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/453/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/453/console

This message is automatically generated.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-14 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833670#action_12833670
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

I'm also reposting the second deadlock with better formatting in case my 
previous post was hard to read:

{code}
Found one Java-level deadlock:
=
"IPC Server handler 24 on 9001":
  waiting to lock monitor 0x40c91750 (object 0x7fc0243e2c20, a 
org.apache.hadoop.mapred.JobTracker),
  which is held by "IPC Server handler 0 on 9001"
"IPC Server handler 0 on 9001":
  waiting to lock monitor 0x40bc0770 (object 0x7fc0243e3080, a 
org.apache.hadoop.mapred.FairScheduler),
  which is held by "FairScheduler update thread"
"FairScheduler update thread":
  waiting to lock monitor 0x4095dd98 (object 0x7fc0258bc0d0, a 
org.apache.hadoop.mapred.JobInProgress),
  which is held by "IPC Server handler 0 on 9001"

Java stack information for the threads listed above:
===
"IPC Server handler 24 on 9001":
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2487)
- waiting to lock <0x7fc0243e2c20> (a 
org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"IPC Server handler 0 on 9001":
at org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:2115)
- waiting to lock <0x7fc0243e3080> (a 
org.apache.hadoop.mapred.FairScheduler)
- locked <0x7fc0243e3420> (a java.util.TreeMap)
- locked <0x7fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
at 
org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:2510)
- locked <0x7fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2146)
at 
org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2084)
- locked <0x7fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:883)
- locked <0x7fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3564)
at 
org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2758)
- locked <0x7fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2553)
- locked <0x7fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"FairScheduler update thread":
at 
org.apache.hadoop.mapred.JobInProgress.scheduleReduces(JobInProgress.java:1203)
- waiting to lock <0x7fc0258bc0d0> (a 
org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobSchedulable.updateDemand(JobSchedulable.java:53)
at 
org.apache.hadoop.mapred.PoolSchedulable.updateDemand(PoolSchedulable.java:81)
at org.apache.hadoop.mapred.FairScheduler.update(FairScheduler.java:577)
- locked <0x7fc0243e3080> (a org.apache.hadoop.mapred.FairScheduler)
at 
org.apache.hadoop.mapred.FairScheduler$UpdateThread.run(FairScheduler.java:277)

Found 1 deadlock.
{code}
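
For anyone who wants to see the same kind of cycle in miniature, here is a small
sketch. It is not the Hadoop code: DeadlockDemo and the two plain monitors are
hypothetical stand-ins for the FairScheduler and a JobInProgress. Two daemon
threads take the monitors in opposite order, and the standard ThreadMXBean API
is then polled for the resulting monitor deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo, not Hadoop code: two threads lock two monitors in opposite order. */
final class DeadlockDemo {
    /** Provokes a two-thread deadlock and returns how many deadlocked threads the JVM reports. */
    static int provokeAndDetect() {
        final Object scheduler = new Object(); // stand-in for the FairScheduler monitor
        final Object job = new Object();       // stand-in for a JobInProgress monitor
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        // "handler" takes job then scheduler; "update" takes scheduler then job.
        Thread handler = new Thread(() -> lockInOrder(job, scheduler, bothHoldFirst), "handler");
        Thread updater = new Thread(() -> lockInOrder(scheduler, job, bothHoldFirst), "update");
        handler.setDaemon(true); // daemon threads, so the JVM can still exit
        updater.setDaemon(true);
        handler.start();
        updater.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 200; i++) { // poll for up to roughly 10 seconds
                long[] ids = mx.findMonitorDeadlockedThreads();
                if (ids != null) {
                    return ids.length;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 0;
    }

    /** Holds 'first', waits until both threads hold their first monitor, then blocks on 'second'. */
    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: the other thread holds 'second' and waits for 'first'.
            }
        }
    }
}
```

In this sketch provokeAndDetect() is expected to return 2, one entry per thread
in the cycle, which mirrors the handler/update-thread pair in the dump above.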

A review of the new patch would be appreciated!


[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-11 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832623#action_12832623
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

Are you suggesting that I add a JobTracker lock in update() or in the 
JobListener methods? I think it's best to add it in update() because it also 
gets called from a separate thread. This actually happens quite rarely now (it 
used to be every few seconds, but it's every 15 seconds after MAPREDUCE-706, 
and can be set higher pretty safely).

BTW, I found another deadlock that seems to be much rarer (it happened when I 
was submitting about 50 jobs simultaneously) but is not related to preemption:



Found one Java-level deadlock:
=
"IPC Server handler 24 on 9001":
  waiting to lock monitor 0x40c91750 (object 0x7fc0243e2c20, a 
org.apache.hadoop.mapred.JobTracker),
  which is held by "IPC Server handler 0 on 9001"
"IPC Server handler 0 on 9001":
  waiting to lock monitor 0x40bc0770 (object 0x7fc0243e3080, a 
org.apache.hadoop.mapred.FairScheduler),
  which is held by "FairScheduler update thread"
"FairScheduler update thread":
  waiting to lock monitor 0x4095dd98 (object 0x7fc0258bc0d0, a 
org.apache.hadoop.mapred.JobInProgress),
  which is held by "IPC Server handler 0 on 9001"

Java stack information for the threads listed above:
===
"IPC Server handler 24 on 9001":
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2487)
- waiting to lock <0x7fc0243e2c20> (a 
org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"IPC Server handler 0 on 9001":
at org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:2115)
- waiting to lock <0x7fc0243e3080> (a 
org.apache.hadoop.mapred.FairScheduler)
- locked <0x7fc0243e3420> (a java.util.TreeMap)
- locked <0x7fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
at 
org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:2510)
- locked <0x7fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2146)
at 
org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2084)
- locked <0x7fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:883)
- locked <0x7fc0258bc0d0> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3564)
at 
org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2758)
- locked <0x7fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2553)
- locked <0x7fc0243e2c20> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"FairScheduler update thread":
at 
org.apache.hadoop.mapred.JobInProgress.scheduleReduces(JobInProgress.java:1203)
- waiting to lock <0x7fc0258bc0d0> (a 
org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobSchedulable.updateDemand(JobSchedulable.java:53)
at 
org.apache.hadoop.mapred.PoolSchedulable.updateDemand(PoolSchedulable.java:81)
at org.apache.hadoop.mapred.FairScheduler.update(FairScheduler.java:577)
- locked <0x7fc0243e3080> (a org.apache.hadoop.mapred.FairScheduler)
at 
org.apache.hadoop.mapred.FairScheduler$UpdateThread.run(FairScheduler.ja

[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-08 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831071#action_12831071
 ] 

Todd Lipcon commented on MAPREDUCE-1436:


Hey Matei, sorry for the slow response - I forgot to watch this ticket.

bq. JobTracker only calls listener.jobAdded/jobRemoved when it is already 
holding a lock on itself (e.g. in JobTracker.addJob).

I think it's best to still synchronize on TaskTrackerManager here from within 
the fairsched. I think a synchronized block on a monitor you've already got 
locked has essentially no cost, and it will reduce the jcarder output so we can 
notice if we accidentally introduce "real" bugs later. Do you agree?
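
As a small illustration of the re-entrancy point (a sketch only; the class below
is hypothetical and the 'manager' object merely stands in for the
TaskTrackerManager monitor), a listener-style callback can synchronize on a
monitor its caller already holds without blocking:

```java
/** Hypothetical demo, not Hadoop code: Java monitors are re-entrant. */
final class ReentrantMonitorDemo {
    /** Stand-in for the TaskTrackerManager monitor. */
    private final Object manager = new Object();

    /** Callback that locks the monitor itself, whether or not the caller holds it. */
    boolean jobAdded() {
        synchronized (manager) { // re-entrant when the caller already holds 'manager'
            return Thread.holdsLock(manager);
        }
    }

    /** Caller that holds the monitor while invoking the callback, like JobTracker.addJob. */
    boolean addJob() {
        synchronized (manager) {
            return jobAdded(); // does not block: same thread, same monitor
        }
    }

    /** Reports whether the calling thread currently holds the monitor. */
    boolean holdsManager() {
        return Thread.holdsLock(manager);
    }
}
```

Both addJob() and a direct jobAdded() call complete normally, and the monitor is
fully released once the outermost synchronized block exits.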

bq. In older versions of the fair scheduler, I *always* locked the JT before locking the scheduler.

I can see why the coarse locking isn't a great idea for scalability. In this 
case, though, we're just adding a lock in a place where we already assume the 
lock is taken, yea? (so it isn't any more coarse than before)




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-03 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829508#action_12829508
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

Hi Todd,

It looks like that particular problem won't happen with a real JobTracker 
because the JobTracker only calls listener.jobAdded/jobRemoved when it is 
already holding a lock on itself (e.g. in JobTracker.addJob). However, it might 
not hurt to acquire the lock in FairScheduler, in case this JobTracker behavior 
changes. Do you think it's better to do that, or to "fix" the fake 
TaskTrackerManager?

In older versions of the fair scheduler, I *always* locked the JT before 
locking the scheduler. Some of the Yahoo guys removed this because they said it 
led to scalability issues, though maybe that isn't a problem anymore.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-02 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828821#action_12828821
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

The patch contains no unit tests because it's a bug fix for a deadlock. The 
unit test failure is unrelated, as it is in streaming.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828608#action_12828608
 ] 

Hadoop QA commented on MAPREDUCE-1436:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12434430/mapreduce-1436.patch
  against trunk revision 905008.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/425/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/425/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/425/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/425/console

This message is automatically generated.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828463#action_12828463
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

Looked at the JT code in trunk again, and it does look like it could run into 
the same bug. Todd, is it okay for me to commit the patch? I'm going to commit 
it to both trunk and 0.21.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828277#action_12828277
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

I've looked at the JobTracker code and the locking seems to be similar in trunk 
to how it is in 0.20. However, I'll double-check.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-01 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828276#action_12828276
 ] 

Todd Lipcon commented on MAPREDUCE-1436:


Patch looks good to me. Confused why jcarder didn't pick this up when I ran it 
recently, but I'll investigate that separately.




[jira] Commented: (MAPREDUCE-1436) Deadlock in preemption code in fair scheduler

2010-02-01 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828249#action_12828249
 ] 

Matei Zaharia commented on MAPREDUCE-1436:
--

Here is jstack output showing the problem:

{code}
Found one Java-level deadlock:
=
"72353...@qtp0-7":
  waiting to lock monitor 0x423f6370 (object 0x7f22039e2b48, a 
org.apache.hadoop.mapred.JobTracker),
  which is held by "IPC Server handler 14 on 9001"
"IPC Server handler 14 on 9001":
  waiting to lock monitor 0x41cdc130 (object 0x7f22039e2fa8, a 
org.apache.hadoop.mapred.FairScheduler),
  which is held by "FairScheduler update thread"
"FairScheduler update thread":
  waiting to lock monitor 0x41c29fa8 (object 0x7f2260640948, a 
org.apache.hadoop.mapred.JobInProgress),
  which is held by "IPC Server handler 14 on 9001"

Java stack information for the threads listed above:
===
"72353...@qtp0-7":
at 
org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:3071)
- waiting to lock <0x7f22039e2b48> (a 
org.apache.hadoop.mapred.JobTracker)
at 
org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:91)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
"IPC Server handler 14 on 9001":
at org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:2115)
- waiting to lock <0x7f22039e2fa8> (a 
org.apache.hadoop.mapred.FairScheduler)
- locked <0x7f22039e33e0> (a java.util.TreeMap)
- locked <0x7f22039e2b48> (a org.apache.hadoop.mapred.JobTracker)
at 
org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:2510)
- locked <0x7f2260640948> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2146)
at 
org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2084)
- locked <0x7f2260640948> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:883)
- locked <0x7f2260640948> (a org.apache.hadoop.mapred.JobInProgress)
at 
org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3564)
at 
org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2758)
- locked <0x7f22039e2b48> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2553)
- locked <0x7f22039e2b48> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
"FairScheduler update thread":
at 
org.apache.hadoop.mapred.JobInProgress.runningMaps(JobInProgress.java:549)
- waiting to lock <0x7f2260640948> (