Yes, you are right. There is one likely place in the path that pgbouncer
takes which will wait for a buffer to finish being written to disk. And
the jbd2 task is waiting for a range of pages to be written out. These
may be related, but I cannot see a reason why this should deadlock. The
same is true for run-parts.

This leads to the question of whether we are actually seeing an ext4
issue here. Unfortunately we have no certain knowledge of what is
running on the other side (dom0). In the past I have seen KVM users with
very similar issues on 2.6.32 hosts. There was a lot of fiddling with
the generic writeback interface around then. And even on bare metal we
have seen very poor performance when multiple people ran IO-bound tasks
(like kernel compiles, where one user could massively starve the
others). This is why the Ubuntu 10.04 kernel carries a large pile of
patches backported from 2.6.35.

Another lead could be some patches in recent 2.6.35 kernels that fix a
problem in Xen with lost interrupts. If we are waiting for pages to be
written to disk and the completion interrupt gets lost, it would show up
exactly like it does here.

commit a29059dc766af0bd2783614399972950fc99a99d
    xen: handle events as edge-triggered
...
    The most noticable symptom of these lost events is occasional lockups
    of blkfront.

So if it were the writeback issue on an older dom0, I would expect the
messages to go away eventually (though it could take a really long
time, potentially more than 10 minutes). This might be completely
coincidental, but for some reason run-parts seems to vanish in the 4th
batch of messages:

2040s: jbd2 and pgbouncer
2160s: jbd2, pgbouncer and run-parts
2280s: jbd2, pgbouncer and run-parts
2400s: jbd2 and pgbouncer

If that were the cause and the messages did go away, then this would
need to be addressed in dom0.

For the lost interrupt case: that patch only changes the interrupt
handler, so I would guess that changing only the domU should be
effective. As Maverick instances use pv-grub, this is simple to try.
After booting your instance and installing the other software, you can
also run:

wget https://launchpad.net/ubuntu/+source/linux/2.6.35-23.37/+build/2033771/+files/linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb

to download a kernel that includes that fix (amongst other things).
Then you can reboot the instance into that kernel and see whether the
issue still shows up or appears to be solved.
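Assuming standard Ubuntu tooling inside the instance, the install and
reboot step could look roughly like the following sketch (the .deb
filename is the one from the wget above; the menu.lst path is the usual
one that pv-grub reads inside the domU, adjust if your image differs):

```shell
# Install the downloaded test kernel (filename from the wget above)
sudo dpkg -i linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb

# pv-grub boots whatever the domU's /boot/grub/menu.lst points at;
# check that the new kernel became the default entry before rebooting
grep -B1 -A3 '2.6.35-23-virtual' /boot/grub/menu.lst

sudo reboot
# After the reboot, "uname -r" should report the 2.6.35-23-virtual kernel
```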

-- 
maverick on ec2 64bit ext4 deadlock
https://bugs.launchpad.net/bugs/666211
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
