[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-04-12 Thread Mike Malone
Since no one developed a reproducible test case it is, unfortunately,
difficult to say whether this bug is resolved. We moved to a 2.6.35
series kernel and stopped seeing this particular problem (although we
have seen other problems that are eerily similar). All of the
information to reproduce the behavior is in the comments above, but it
does take a lot of time and patience to trigger it. Either that, or
hundreds of instances running with production load.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is a direct subscriber.
https://bugs.launchpad.net/bugs/708920

Title:
  Strange 'fork/clone' blocking behavior under high cpu usage on EC2

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-02-01 Thread Mike Malone
Matt,

That's definitely in line with what we're seeing. Not to beat a dead
horse, and I am not a kernel hacker (so I apologize for any naivety),
etc... but that's basically what led me to believe the
CLOCK_PROCESS_CPUTIME_ID timers, top issues, and other CPU time
monitoring may be relevant here.

Based on my rather high level understanding of CFS -- the big idea being
that processes are scheduled based on their run time -- it seems
plausible that a process that appears to be using no CPU time would
continue to be prioritized by the scheduler. I'll leave any further
analysis to someone who knows how these systems work.
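
To make that intuition concrete, here is a toy model, not the kernel's
actual CFS code. It assumes a scheduler that always runs whichever task
has the least recorded runtime, and the function name `simulate` and
its `unaccounted` parameter are made up for illustration. Real CFS
tracks a weighted virtual runtime in a red-black tree; this sketch only
shows what could happen if one task's runtime were never charged.

```python
import heapq

def simulate(ticks, tasks, unaccounted=()):
    """Toy fair scheduler: always run the task with the least recorded
    runtime. Tasks named in `unaccounted` are never charged (assumption
    standing in for broken CPU time accounting)."""
    # Heap of (recorded_runtime, name); every task starts at zero.
    queue = [(0, name) for name in tasks]
    heapq.heapify(queue)
    slices = {name: 0 for name in tasks}
    for _ in range(ticks):
        runtime, name = heapq.heappop(queue)
        slices[name] += 1
        # Charge one tick unless accounting is broken for this task.
        if name not in unaccounted:
            runtime += 1
        heapq.heappush(queue, (runtime, name))
    return slices
```

With correct accounting the three tasks in `simulate(90, ["a", "b", "c"])`
share the ticks evenly. With `unaccounted={"a"}` task "a" keeps winning
the comparison and the others starve, which is roughly the behavior
speculated about above.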

We have done a bit more testing on the 2.6.35 series kernels, with
Ubuntu 10.10 (Maverick). So far we have not been able to reproduce this
issue there. If we do I'll report back, but for now we are moving
forward with migrating affected systems to Maverick. We're happy to
assist with efforts to track this further, but I think we've documented
most of what we've experienced.



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-02-01 Thread Mike Malone
Hey Matt,

I ran it once while the system was idle:
https://gist.github.com/fb35566354afc442bf2d

And then again in a tight loop with a 50ms sleep while I locked the
system up, in the hopes of catching something interesting just before or
during the period when the system was locked:
https://gist.github.com/ce0887dcdc125afcca94



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-31 Thread Mike Malone
Starting irqbalance (without rebooting) on a node in a state where
fork() will hang does not help.
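
A rough way to check whether a node is in that state is to time fork()
directly. The sketch below is only an illustration under assumptions:
the helper name `fork_latency` is hypothetical, it is Unix-only, and it
measures wall-clock time on the parent side of each call.

```python
import os
import time

def fork_latency(samples=20):
    """Return wall-clock seconds spent in each fork() call, parent side."""
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child exits immediately without cleanup
        latencies.append(time.monotonic() - start)
        os.waitpid(pid, 0)  # reap the child so no zombies accumulate
    return latencies
```

On a healthy node each sample should be small. On a node where fork()
hangs, the call would simply not return, so it is probably best run
from a shell that can be abandoned.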



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-30 Thread Mike Malone
Matt,

Still no way to reproduce deterministically. I've just been running
variations of the test I posted above, writing to /dev/null and setting
timer signals. At some point the tests/system start hanging.

irqbalance is not running on any of the test instances that are hanging
(it appears to be started automatically on the Maverick ami, but that
node is happy). I believe we've run it in the past on some nodes. I'll
check if there are any that are hanging that also have irqbalance
started at boot. I know that I personally started irqbalance on some
nodes after boot to see if that would fix things (to no avail), but it
wasn't started at boot.

If we don't have any running nodes with irqbalance I'll have to reboot a
node and start over, which may take some time.



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-29 Thread Mike Malone
Also, potentially related, the 2.6.32 kernels seem to sometimes drop
timer signals for CLOCK_PROCESS_CPUTIME_ID and continue to report
unusual process CPU times in system monitoring tools like top and via
the proc filesystem. I can reproduce the signal behavior with this
program: https://gist.github.com/14585329e013d1bf5134 -- some runs are
uneventful and proceed as expected, others hang forever without ever
receiving a signal.
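
For readers without access to the gist, the general shape of such a
check can be sketched as follows. This is not the program linked above.
It swaps in `setitimer(ITIMER_PROF)`, which also counts process CPU
time, instead of a POSIX timer on CLOCK_PROCESS_CPUTIME_ID, and the
function name `cpu_timer_fires` and its parameters are assumptions. It
is Unix-only and must run in the main thread.

```python
import signal
import time

def cpu_timer_fires(cpu_seconds=0.05, wall_deadline=5.0):
    """Arm a CPU-time interval timer, burn CPU, and report whether the
    signal arrived before the wall-clock deadline expired."""
    fired = []
    previous = signal.signal(
        signal.SIGPROF, lambda signum, frame: fired.append(signum))
    try:
        # ITIMER_PROF counts CPU time consumed by this process
        # (user plus system) and delivers SIGPROF on expiry.
        signal.setitimer(signal.ITIMER_PROF, cpu_seconds)
        deadline = time.monotonic() + wall_deadline
        while not fired and time.monotonic() < deadline:
            pass  # busy loop so the process accumulates CPU time
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0)  # disarm the timer
        signal.signal(signal.SIGPROF, previous)  # restore old handler
    return bool(fired)
```

A return value of False would correspond to the "hang without ever
receiving a signal" case, except that the wall-clock deadline lets the
test give up instead of spinning forever.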



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-29 Thread Mike Malone
Fat fingered that kernel version. Should be 2.6.32-311-ec2.



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-29 Thread Mike Malone
I am able to reproduce this behavior on an instance running kernel
version 2.3.32-311 using ami-f8f40591. As with the older kernel, new
instances don't immediately exhibit symptoms.



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-28 Thread Mike Malone
We're still working on a way to repro this on a new instance. Meanwhile,
we're moving forward with testing on the newer 2.6.32 kernel and on
Maverick's 2.6.35 kernel.

One observation we have made is that if we run the libctest in a loop
(`while :; do ./libctest; done`) on 2.6.32 it will eventually hang the
process (apparently forever, I've left it as long as 4 hours). This
behavior is reproducible on a fresh instance and happens on both
2.6.32-305 and 2.6.32-311. We are NOT able to repro on 2.6.35. Could be
unrelated, or expected behavior, but from my review of the libctest code
what we're seeing does appear pathological.
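
The shell loop above never notices a hang on its own. A variant with a
per-run timeout could look like the sketch below. The helper name
`run_until_hang`, the 60 second timeout, and the run limit are
assumptions chosen for illustration, not values from our testing.

```python
import subprocess

def run_until_hang(cmd, timeout_s=60.0, max_runs=1000):
    """Run `cmd` repeatedly; return the 1-based index of the first run
    that exceeds `timeout_s`, or None if every run finishes in time."""
    for run in range(1, max_runs + 1):
        try:
            # Output is discarded; only the elapsed time matters here.
            subprocess.run(cmd, timeout=timeout_s,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            return run  # the child was still running at the timeout
    return None
```

It would be invoked as `run_until_hang(["./libctest"])`. One caveat:
on timeout the child is sent SIGKILL and then waited for, so if the
child is stuck in uninterruptible sleep the call itself may block.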



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-28 Thread Mike Malone
We are running ami-fd4aa494 with 2.6.32-305-ec2 in us-east. I'll see
what I can do about setting up a couple nodes with the more recent
2.6.32 kernel build and report back.

We've already started running a few Maverick instances with
2.6.35-24-virtual, and so far they appear to be more stable.
Unfortunately, the issue is not easy to reproduce initially (as
evidenced by the efforts in this thread). Recently restarted instances
appear to be more stable than those that have been running for a while
under load. It looks like some heisenbug gets the system into a sideways
state, and once that happens you can lock things up pretty
deterministically with something as trivial as a tight loop. So it's
possible that Maverick will go sideways too at some point and we simply
haven't seen it yet. Hard to say for sure without knowing what the
trigger is.



[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2

2011-01-27 Thread Mike Malone
The node we were working on this morning was:

vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Xeon(R) CPU   X5550  @ 2.67GHz
stepping: 5
cpu MHz : 2666.760
cache size  : 8192 KB
