We have a 16-core x86 system that, at seemingly random intervals, will completely stop responding for several seconds. Running an mpstat 1 showed something horrifiying:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 0 0 0 1004691 397 170 0 0 0 5 0 0 0 100 0 0 (rest of CPUs omitted) That's over a million cross-calls a second. Seeing them on CPU0 made me nervous that they were kernel-related, so I wrote a dtrace to print out xcalls per second aggregated by PID to see if a specific process was the culprit. Here's the output during another random system outage: 2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39 xcalls: 637511 PID XCALLCOUNT 6164 15 6165 15 28339 26 0 637455 PID 0 is "sched" (aka the kernel). At this point I'm completely stumped as to what could be causing this. Any hints or ideas? -- This message posted from opensolaris.org
