[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
Since no one developed a reproducible test case it is, unfortunately, difficult to say whether this bug is resolved. We moved to a 2.6.35 series kernel and stopped seeing this particular problem (although we have seen other problems that are eerily similar). All of the information to reproduce the behavior is in the comments above, but it does take a lot of time and patience to trigger it. Either that, or hundreds of instances running with production load. -- You received this bug notification because you are a member of Ubuntu Bugs, which is a direct subscriber. https://bugs.launchpad.net/bugs/708920 Title: Strange 'fork/clone' blocking behavior under high cpu usage on EC2 -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
Matt, That's definitely in line with what we're seeing. Not to beat a dead horse, and I am not a kernel hacker (so I apologize for any naivety), etc... but that's basically what led me to believe the CLOCK_PROCESS_CPUTIME_ID timers, top issues, and other CPU time monitoring may be relevant here. Based on my rather high level understanding of CFS -- the big idea being that processes are scheduled based on their run time -- it seems plausible that a process that appears to be using no CPU time would continue to be prioritized by the scheduler. I'll leave any further analysis to someone who knows how these systems work. We have done a bit more testing on the 2.6.35 series kernels, with Ubuntu 10.10 (Maverick). So far we have not been able to reproduce this issue there. If we do I'll report back, but for now we are moving forward with migrating affected systems to Maverick. We're happy to assist with efforts to track this further, but I think we've documented most of what we've experienced.
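The scheduling hypothesis above can be illustrated with a toy model. This is only a sketch of the "run whoever has the least accounted run time" idea, not the kernel's CFS implementation; the function name `simulate`, the task names, and the `unaccounted` parameter are all made up for illustration.

```python
def simulate(ticks, tasks, unaccounted=()):
    """Toy scheduler: each tick, run the task with the smallest accounted run time.

    Tasks listed in `unaccounted` never have their run time charged, which
    models a process that appears to use no CPU time.
    """
    accounted = {name: 0 for name in tasks}
    ran = {name: 0 for name in tasks}
    for _ in range(ticks):
        # Ties go to the task that appears first in `tasks`.
        current = min(tasks, key=lambda name: accounted[name])
        ran[current] += 1
        if current not in unaccounted:
            accounted[current] += 1
    return ran


if __name__ == "__main__":
    print(simulate(9, ["a", "b", "c"]))         # fair: every task gets 3 ticks
    print(simulate(9, ["a", "b", "c"], {"b"}))  # "b" is never charged and keeps winning
```

In this model, once a task's run time stops being charged it is selected on every tick, and the other tasks starve. Whether anything like this happens in the 2.6.32 EC2 kernels is not established in this thread.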
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
Hey Matt, I ran it once while the system was idle: https://gist.github.com/fb35566354afc442bf2d And then again in a tight loop with a 50ms sleep while I locked the system up, in the hopes of catching something interesting just before or during the period when the system was locked: https://gist.github.com/ce0887dcdc125afcca94
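For anyone wanting to collect similar samples, a loop of this kind might look like the sketch below. The thread does not say which command was run, so `command` is a placeholder, and `sample_command` and its parameters are hypothetical names.

```python
import subprocess
import time


def sample_command(command, interval=0.05, count=5, timeout=2.0):
    """Run `command` repeatedly, sleeping `interval` seconds between runs.

    Returns a list of (wall_clock_start, stdout) tuples; stdout is None if
    the command did not finish within `timeout` seconds.
    """
    samples = []
    for _ in range(count):
        started = time.time()
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
            output = result.stdout
        except subprocess.TimeoutExpired:
            output = None
        samples.append((started, output))
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    for started, output in sample_command(["echo", "ok"], count=3):
        print(started, repr(output))
```

One caveat: the timeout only covers the child after it has been spawned. If process creation itself blocks, as described in this bug, the sampler will block with it, and the gap between timestamps is then the useful signal.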
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
Starting irqbalance (without rebooting) on a node in a state where fork() will hang does not help.
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
Matt, Still no way to reproduce deterministically. I've just been running variations of the test I posted above, writing to /dev/null and setting timer signals. At some point the tests/system start hanging. irqbalance is not running on any of the test instances that are hanging (it appears to be started automatically on the Maverick ami, but that node is happy). I believe we've run it in the past on some nodes. I'll check if there are any that are hanging that also have irqbalance started at boot. I know that I personally started irqbalance on some nodes after boot to see if that would fix things (to no avail), but it wasn't started at boot. If we don't have any running nodes with irqbalance I'll have to reboot a node and start over, which may take some time.
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
Also, potentially related, the 2.6.32 kernels seem to sometimes drop timer signals for CLOCK_PROCESS_CPUTIME_ID and continue to report unusual process CPU times in system monitoring tools like top and via the proc filesystem. I can reproduce the signal behavior with this program: https://gist.github.com/14585329e013d1bf5134 -- some runs are uneventful and proceed as expected, others hang forever without ever receiving a signal.
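A rough way to check for the dropped-signal symptom is sketched below. It is not the program from the gist: it swaps in `ITIMER_PROF` (an interval timer driven by the process's user plus system CPU time, delivering `SIGPROF`) as a stand-in for a `CLOCK_PROCESS_CPUTIME_ID` timer, since Python's standard library has no `timer_create` wrapper. The function name and parameters are illustrative, and it adds a wall-clock deadline so a lost signal shows up as `False` rather than a hang.

```python
import signal
import time


def cpu_timer_fires(cpu_seconds=0.05, wall_timeout=5.0):
    """Arm a CPU-time timer, burn CPU, and report whether the signal arrived.

    Must be called from the main thread (a requirement of signal.signal).
    """
    fired = []

    def on_sigprof(signum, frame):
        # Record the process CPU time at which the signal was handled.
        fired.append(time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID))

    previous = signal.signal(signal.SIGPROF, on_sigprof)
    signal.setitimer(signal.ITIMER_PROF, cpu_seconds)
    deadline = time.monotonic() + wall_timeout
    try:
        while not fired and time.monotonic() < deadline:
            pass  # busy loop so that process CPU time actually advances
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0)  # disarm the timer
        signal.signal(
            signal.SIGPROF, previous if previous is not None else signal.SIG_DFL
        )
    return bool(fired)


if __name__ == "__main__":
    print("signal delivered:", cpu_timer_fires())
```

On a healthy system this should report that the signal was delivered shortly after the requested CPU time has been consumed. A `False` result would match the "never receives a signal" runs described above, with the caveat that this exercises a different timer interface than the original test.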
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
Fat fingered that kernel version. Should be 2.6.32-311-ec2.
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
I am able to reproduce this behavior on an instance running kernel version 2.3.32-311 using ami-f8f40591. As with the older kernel, new instances don't immediately exhibit symptoms.
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
We're still working on a way to repro this on a new instance. Meanwhile, we're moving forward with testing on the newer 2.6.32 kernel and on Maverick's 2.6.35 kernel. One observation we have made is that if we run the libctest in a loop (`while :; do ./libctest; done`) on 2.6.32 it will eventually hang the process (apparently forever, I've left it as long as 4 hours). This behavior is reproducible on a fresh instance and happens on both 2.6.32-305 and 2.6.32-311. We are NOT able to repro on 2.6.35. Could be unrelated, or expected behavior, but from my review of the libctest code what we're seeing does appear pathological.
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
We are running ami-fd4aa494 with 2.6.32-305-ec2 in us-east. I'll see what I can do about setting up a couple nodes with the more recent 2.6.32 kernel build and report back. We've already started running a few Maverick instances with 2.6.35-24-virtual, and so far they appear to be more stable. Unfortunately, the issue is not easy to reproduce initially (as evidenced by the efforts in this thread). Recently restarted instances appear to be more stable than those that have been running for a while under load. It looks like some heisenbug gets the system into a sideways state, and once that happens you can lock things up pretty deterministically with something as trivial as a tight loop. So it's possible that Maverick will go sideways too at some point and we simply haven't seen it yet. Hard to say for sure without knowing what the trigger is.
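One way to tell whether a node has entered the bad state is to time fork() directly while something else keeps the CPU busy. The sketch below is an assumption about how such a probe could look, not a test used in this thread; `fork_latencies` is a hypothetical name.

```python
import os
import time


def fork_latencies(count=20):
    """Fork `count` short-lived children and return each fork() call's duration in seconds."""
    latencies = []
    for _ in range(count):
        start = time.monotonic()
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately, skipping cleanup handlers
        latencies.append(time.monotonic() - start)
        os.waitpid(pid, 0)  # reap the child so no zombies accumulate
    return latencies


if __name__ == "__main__":
    samples = fork_latencies()
    print("max fork latency: %.6f s" % max(samples))
```

On a healthy instance each call should return quickly. On a node in the state described above, the expectation is that a call would stall or never return, so it makes sense to run the probe under an external watchdog.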
[Bug 708920] Re: Strange 'fork/clone' blocking behavior under high cpu usage on EC2
The node we were working on this morning was:

vendor_id  : GenuineIntel
cpu family : 6
model      : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping   : 5
cpu MHz    : 2666.760
cache size : 8192 KB