David Jacobson wrote:

> I'm only familiar with Solaris. In that system the real stuff
> in a mutex is a byte about 12 bytes into the lock structure.
> On SPARC the mutex_lock function accesses it with an LDSTUB
> instruction, which is a special atomic instruction that loads
> the old value into a register, and stores 0xff into it.
That's pretty typical of what happens on most platforms. An uncontended lock/release will typically result in only a single atomic operation and the possible cache miss on the actual lock guts. Last I measured on x86 (this was P4/Netburst hardware), the cost of the atomic operation was the same whether or not it resulted in a cache miss, probably because the pipelines were so deep. I would expect both numbers to be much lower on a more modern Core 2, though perhaps the gap between them has grown. (Fewer cycles spent performing the atomic operation means less time to hide the cache miss.) Most of the cost is believed to be due to the pipeline flush.

> If that byte is in the same cache line as some other variable
> that gets written, and if another processor writes that other stuff,
> the cache line will be owned by that other processor,

For just this reason, it is common to put the lock-guts byte on its own cache line, or to get the equivalent effect some other way. On modern x86 systems, the cost of the atomic operation generally swamps the cost of the cache miss, which is mitigated anyway by prefetching.

> and an
> uncontended mutex_lock can be terribly expensive.

To some extent, that depends on what you're comparing it to. I think it's unreasonable to describe a single atomic operation, even one guaranteed to miss the cache, as "terribly expensive". You'd have to work pretty hard to make that matter, and it certainly shrinks to nothing in comparison to anything OpenSSL might do while it holds the lock. In the vast majority of ordinary cases, the cost of an uncontended lock is effectively zero; you have to spin in tight loops just to measure it.

In any event, the OP won't have the CPU ping-pong issue anyway. He never has a case where two threads manipulate the same object, so they won't be ping-ponging the same lock either. It doesn't seem possible for him to get the kind of tight loop that would make the cost measurable.
It would make sense for him to measure in his application, though. There's always the off chance that, for some odd reason that isn't easily foreseeable, locking is expensive in his particular situation. Engineering intuition should guide the original design, but you should always measure to make sure your intuition was right. Besides, measuring is how you develop that intuition in the first place.

My point is that the OP is proceeding on a path that conflicts with sensible engineering. He would need a very unusual problem, or some very unusual measurements of a more obvious solution, to justify the risk his path entails. Or, of course, this could be a research or experimental effort rather than a real-world application; in that case, it's perfectly sensible to intentionally try atypical things to see what happens. Even so, he should first establish a baseline showing that the overhead he's trying to minimize is at least measurable. Otherwise, he can never show a reduction.

DS
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           [EMAIL PROTECTED]