Please consider this SMP i386 patch against 2.2.10 (or an analogous
patch against 2.3.11) for spinlock "metering".
Spinlock metering is the runtime recording of data about spinlock usage
-- how often each spinlock is acquired by each locker, how often
an acquisition faced contention and had to wait because someone else
owned that spinlock, and how much time passed before the contention
went away.
The patch also contains a new command, "lockstat", which turns the
functional act of metering on or off in the kernel, and which
retrieves the kernel's data and displays it to the user in a useful
human-readable tabular format. Spinlock names and their callers are
displayed, whenever possible, using their symbolic names from the
System.map, not their virtual addresses, and mean wait-times are
displayed in microseconds.
After the kernel patch is applied to the kernel, a new config variable
(in the "Kernel hacking" subsection) controls whether or not
lockmetering gets compiled into the kernel. A lockmeter-capable kernel
has essentially the same size as a non-lockmeter-capable kernel
-- even smaller, in fact -- because the non-metered kernel's inline
locking code gets replaced by procedure calls, and the
multiple-reader-single-writer locks gets significantly smaller. A
lockmeter-capable kernel is negligibly slower than a normal
non-lockmeter-capable kernel when metering is turned off, and is 1-2%
slower when metering (the act of collecting the data itself) is
turned on.
Care has been taken to minimize runtime performance impact of
lockmetering. For example, the data structures that record the counts
and
times are separated per-CPU, which means there is no cache coherency
overhead when different CPUs update counts for the same spinlock
being called by the same caller.
As an example of the usefulness of this lockmetering code, I exercised
2x, 3x, and 4xCPU (4x400 MHz Xeon) configurations with an AIM7
workload that had been modified to remove three synchronous disk
subtests that otherwise would contend on a single disk spindle and
produce substantial idle time. My test workload ran with effectively
zero idle time, about 75% user and 25% system time.
My test results:
The 4xCPU 2.3.11 kernel performs about 6-8% faster than the 2.2.10
kernel at the highest loads. The 2.3.11 kernel did almost 3x the
number of spinlocks-per-second vs. the 2.2.10 kernel, due to the finer
granularity of 2.3's locking scheme, but 2.3 exhibits lock
contention on 2% of these calls vs. 18% in 2.2. When 2.3 does contend,
the mean wait-times are almost 2x those in 2.2. One likely
hypothesis for the longer mean wait-times is that 2.3 has eliminated the
quick, trivial contentions, leaving the longer contentions to
raise the mean.
With this workload on this 4xCPU Xeon hardware, spinlock contention in
2.2.10 consumed about 8% of theoretically available CPU cycles
(340 milliseconds/second of waiting per 4,000 milliseconds/second
theoretically available) vs. about 2% in 2.3.11 (95 milliseconds of
waiting per 4,000 milliseconds).
The kernel_flag usage is still significant in 2.3, but its contention is
greatly reduced. In 2.2 I saw contention on 40% of the
41K-per-second acquisitions, vs. 10% of the 19K-per-second acquisitions
done in the 2.3 kernel. Despite the almost 2x increase in mean
wait-time on the kernel_flag in 2.3, the overall number of kernel_flag
contention cycles is about 3x higher in 2.2.10 vs. 2.3.11. That
is, the 2.3 kernel goes after kernel_flag much less frequently and sees
much less contention on it; but when there is contention, that
contention results in a mean wait that is twice as long as in 2.2.
Further investigation (using a hacked, unreleased version of the
lockmetering code) shows that the biggest kernel_flag pig in 2.3.11 is
sys_close(), which holds the kernel_flag for as long as 10-13
milliseconds (on a 400 MHz Xeon CPU) on a regular basis.
A broader discussion of spinlock metering is available at the SGI Open
Source website:
http://oss.sgi.com/projects/lockmeter
John Hawkes
([EMAIL PROTECTED])
lockmeter1.0-2.2.10.gz
lockmeter1.0-2.3.11.gz