On Thu, 29 Mar 2007, Nick Piggin wrote:

> On Thu, Mar 29, 2007 at 03:36:52AM +0200, Nick Piggin wrote:
> > In most cases, no. For the uncontended case they should be about the
> > same. They have the same spinning behaviour. However there is a little
> > window where they might be a bit slower I think... actually perhaps I'm
> > wrong!
> >
> > Currently if you have 4 CPUs spinning and the lock is released, all 4
> > CPU cachelines will be invalidated, then they will be loaded again, and
> > found to be 0, so they all try to atomic_dec_return the counter, each
> > one invalidating others' cachelines. 1 gets through.
> >
> > With my queued locks, all 4 cachelines are invalidated and loaded, but
> > only one will be allowed to proceed, and there are 0 atomic operations
> > or stores of any kind.
> >
> > So I take that back: our current spinlocks have a worse thundering herd
> > behaviour under contention than my queued ones. So I'll definitely
> > push the patch through.
>
> OK, it isn't a big difference, but a user-space test is showing slightly
> (~2%) improvement in the contended case on a 16 core Opteron.
>
> There is a case where the present spinlocks are almost twice as fast on
> this machine (in terms of aggregate throughput), and that is when a lock
> is taken right after it is released. This is because the same CPU will
> often be able to retake the lock without transitioning the cache. This is
> going to be a rare case for us, and would suggest suboptimal code anyway
> (ie. the lock should just be kept rather than dropped and retaken).
>
> Actually, one situation where it comes up is when we drop and retake a
> lock that needs_lockbreak. Of course, the queued lock behaviour is
> desired in that case anyway.
>
> However single-thread performance is presently a bit down. OTOH, the
> assembly generated by gcc looks like it could be improved upon (even by
> me :P).
>
> This is what I've got so far. Should work for i386 and x86_64. Any
> enhancements or results from other CPUs would be interesting.
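To make sure we are talking about the same thing, here is a rough
userspace sketch of the ticket-lock idea being discussed. This is not
Nick's actual patch (that one is arch-specific assembly); it just uses
GCC atomic builtins and an x86 pause, so take it as an illustration
only:

	/*
	 * Rough sketch of a ticket (queued) spinlock, not the real patch.
	 * Each locker takes a ticket with one atomic op; the lock is granted
	 * in ticket order, so waiters spin on plain loads and do no stores
	 * or atomics while waiting. x86-only as written.
	 */
	#include <stdint.h>

	struct ticket_lock {
		volatile uint16_t head;	/* ticket currently being served */
		volatile uint16_t tail;	/* next ticket to hand out */
	};

	static inline void ticket_lock(struct ticket_lock *lock)
	{
		/* one atomic op to grab a ticket... */
		uint16_t ticket = __sync_fetch_and_add(&lock->tail, 1);

		/* ...then spin on plain loads until our turn comes up */
		while (lock->head != ticket)
			__asm__ __volatile__("rep; nop" ::: "memory"); /* cpu_relax() */
	}

	static inline void ticket_unlock(struct ticket_lock *lock)
	{
		/* compiler barrier, then a single store hands over the lock;
		 * x86 store ordering makes this a sufficient release here */
		__asm__ __volatile__("" ::: "memory");
		lock->head++;
	}

When the holder bumps head, only the waiter whose ticket matches
proceeds, which is where the thundering-herd difference against the
atomic_dec_return-based lock comes from.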
I slightly modified it to use cycles: http://www.xmailserver.org/qspins.c

Here (Dual Opteron 252) queued locks (ticket locks) are about 10% slower
in both cases. This is really a microbenchmark, and the assembly matters
a lot. I have not had time to look at the generated code yet, but
optimizing the branches could help in those cases.

- Davide -