On Mon, Aug 24, 2015 at 22:30:03 -0400, Emilio G. Cota wrote:
> On Sun, Aug 23, 2015 at 18:04:46 -0700, Paolo Bonzini wrote:
> > On 23/08/2015 17:23, Emilio G. Cota wrote:
> (snip)
> > Applied, but in the end the spinlock will probably simply use a simple
> > test-and-test-and-set lock, or an MCS lock. There is no need to use
> > pthreads for this.

(snip)

> Note that fair locks (such as MCS) for user-space programs are not
> necessarily a good idea when preemption is considered--and for usermode
> we'd be forced (if we allowed MCS's to nest) to use per-lock stack
> variables given that the number of threads is unbounded, which is pretty
> ugly.
>
> If contention is a problem, a simple, fast spinlock combined with
> exponential backoff is already pretty good. Fairness is not a requirement
> (the cache substrate of a NUMA machine isn't necessarily fair, is it?);
> scalability is. If the algorithm in the guest requires fairness, the
> guest must use a fair lock (e.g. MCS), and that works as intended when
> run natively or under qemu.
>
> I just tested a fetch-and-swap+exp.backoff spinlock with usermode on a
> program that spawns N threads and each thread performs 2**M atomic
> increments on the same variable. That is, a degenerate worst-case kind
> of contention. N varies from 1 to 64, and M=15 on all runs, 5 runs per
> experiment:
>
>   http://imgur.com/XpYctyT
>
> With backoff, the per-access latency grows roughly linearly with the
> number of cores, i.e. this is scalable. The other two are clearly
> superlinear.
Just tried MCS, CLH and ticket spinlocks (with and without backoff). They
take essentially forever for this (admittedly worst-case) test; this is
not surprising when we realise that we're essentially trying to do in
software what the coherence protocol in the cache does (in hardware) for
us when using greedy spinlocks. I'm not even showing numbers because
around N=9 each experiment starts taking way too long for me to wait.

> In light of these results I see very little against going for a solution
> with exponential backoff.

The only reason I can think of against this option is that we'd be
altering the dynamic behaviour of the emulated code. For example, code
that is clearly not scalable (such as the example above) would be made to
"scale" by the use of the backoffs (as a result the CPU that gets a lock
is very likely to grab it again right after unlocking). This makes me
uneasy; it makes intuitive sense to me that unscalable code should not
scale under QEMU, instead of us playing tricks that would confuse users.

		Emilio