We recently managed to crash 10 of our test machines at the same time. Half of the machines were running a 3.1.9 kernel and half were running 3.4.9. I realize that these are both fairly old kernels, but I've skimmed the list of fixes in the 3.4.* stable series and didn't see anything that appeared relevant to this issue.
All we managed to get was some screenshots of the stacks from the consoles. On one of the 3.1.9 machines you can see we hit the BUG_ON(want) statement in __disable_runtime() at kernel/sched_rt.c:493, and all of the machines had essentially the same stack:

  rq_offline_rt
  rq_attach_root
  cpu_attach_domain
  partition_sched_domains
  do_rebuild_sched_domains

Here is one of the screenshots from a 3.1.9 machine:

https://dl.dropbox.com/u/84066079/berbox38.png

And here is one from a 3.4.9 machine:

https://dl.dropbox.com/u/84066079/berbox18.png

Three of the five 3.4.9 machines also managed to print "[sched_delayed] sched: RT throttling activated" ~7 minutes before they locked up.

I've tried reproducing the issue, but so far I've been unsuccessful. I believe that is because my RT tasks aren't using enough CPU to cause borrowing from the other runqueues. Normally our RT tasks use very little CPU, so I'm not entirely sure what conditions caused them to run into throttling on the day this happened.

The details I do know about the workload that caused this are as follows:

1) These are all dual-socket, 4-core X5460 systems with no hyperthreading, so there are 8 cores total in each system.

2) We use the cpuset cgroup to apply CPU affinity to various types of processes. Initially everything starts out in a single cpuset, and the top-level cpuset has cpuset.sched_load_balance=1, so there is only a single scheduling domain.

3) Tasks were then placed into four non-overlapping cpusets: one containing a single core and a single SCHED_FIFO task, two containing two cores each and multiple SCHED_FIFO tasks, and one containing three cores and everything else on the system running as SCHED_OTHER.

4) In the cpusets that contain SCHED_FIFO tasks, the tasks start out as SCHED_OTHER, are placed into the cpuset, and then change their policy to SCHED_FIFO.
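For reference, the cpuset manipulation described in the steps above (and the load-balance flip described below) looks roughly like the following sketch. The mount point, group names, CPU numbers, and the RT priority are all assumptions for illustration; our real setup is driven by internal tooling, and actually applying this requires root and a cgroup-v1 cpuset mount:

```shell
#!/bin/sh
# Sketch of the cpuset setup described in the steps above (cgroup-v1
# cpuset interface). Paths, names, and numbers are illustrative only.
CPUSET=/sys/fs/cgroup/cpuset   # mount point is distro-dependent

setup_domains() {
    : "${RT_PID:?set RT_PID to the pid of a task to make SCHED_FIFO}"

    # Step 3: four non-overlapping cpusets (1 + 2 + 2 + 3 cores = 8).
    for g in rt0 rt1 rt2 other; do mkdir "$CPUSET/$g"; done
    echo 0   > "$CPUSET/rt0/cpuset.cpus"
    echo 1-2 > "$CPUSET/rt1/cpuset.cpus"
    echo 3-4 > "$CPUSET/rt2/cpuset.cpus"
    echo 5-7 > "$CPUSET/other/cpuset.cpus"
    for g in rt0 rt1 rt2 other; do
        echo 0 > "$CPUSET/$g/cpuset.mems"   # single memory node assumed
    done

    # Step 4: move a still-SCHED_OTHER task in, then switch it to FIFO.
    echo "$RT_PID" > "$CPUSET/rt0/tasks"
    chrt -f -p 50 "$RT_PID"                 # priority 50 is arbitrary

    # Step 5: split the system into four scheduling domains.
    echo 0 > "$CPUSET/cpuset.sched_load_balance"
}

# Step 7: re-merge into one scheduling domain (where the lockup happens).
merge_domains() {
    echo 1 > "$CPUSET/cpuset.sched_load_balance"
}

# Default to a dry run, since this modifies the live system.
if [ "${DO_IT:-0}" = 1 ]; then
    setup_domains
else
    echo "dry run: set DO_IT=1 (as root, with $CPUSET mounted) to apply"
fi
```
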
5) Once all tasks are placed into non-overlapping cpusets, the top-level cpuset.sched_load_balance is set to 0 to split the system into four scheduling domains.

6) The system ran like this for some unknown amount of time.

7) All the processes were then sent a signal to exit, and at the same time the top-level cpuset.sched_load_balance was set back to 1. This is when the systems locked up.

Hopefully that is enough information to give someone more familiar with the scheduler code an idea of where the bug is. I will point out that in step #5 above there is a small window where the RT tasks could encounter runtime limits while still in a single big scheduling domain. I don't know if that is what happened, or if it is simply sufficient to hit the runtime limits while the system is split into four domains.

For the curious, we are using the default RT runtime limits, i.e. RT tasks may consume at most 950 ms of CPU out of every 1 s period before being throttled:

# grep . /proc/sys/kernel/sched_rt_*
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000

Let me know if anyone needs any more information about this issue.

Thanks,
Shawn