David Jacobson wrote:

> I'm only familiar with Solaris.  In that system the real stuff
> in a mutex is a byte about 12 bytes into the lock structure.
> On SPARC the mutex_lock function accesses it with an LDSTUB
> instruction, which is a special atomic instruction that loads
> the old value into a register, and stores 0xff into it.

That's pretty typical of what happens on most platforms. An uncontended
lock/release usually costs only a single atomic operation, plus possibly a
cache miss on the actual lock guts.
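
Roughly, the fast path looks like this test-and-set sketch in C11 (my own
toy code, not Solaris or OpenSSL internals; the ts_lock/ts_unlock names are
made up):

#include <stdatomic.h>

/* One byte of "lock guts", like the Solaris mutex byte described
 * above. */
typedef struct { atomic_uchar locked; } ts_lock_t;

static void ts_lock(ts_lock_t *l)
{
    /* atomic_exchange is the C11 analogue of LDSTUB: store a
     * nonzero value and get the old value back in one atomic
     * operation. Uncontended, this single exchange is the whole
     * cost. */
    while (atomic_exchange_explicit(&l->locked, 0xff,
                                    memory_order_acquire) != 0)
        ;  /* lock was held; spin until the old value is 0 */
}

static void ts_unlock(ts_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}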

Last I measured on x86 (this was P4/NetBurst hardware), the cost of the
atomic operation was the same whether or not it resulted in a cache miss,
probably because the pipelines were so deep. I would expect both numbers to
be much lower on a more modern Core 2, but perhaps the gap between them has
grown. (Fewer cycles spent performing the atomic operation means less time
to hide the cache miss.)
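
If anyone wants to repeat that kind of measurement, something like this
crude microbenchmark gives a rough per-operation number (the iteration
count and rdtsc-based timing are arbitrary choices for illustration, not
the original methodology):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>  /* __rdtsc(), GCC/Clang on x86 */

int main(void)
{
    atomic_int lock = 0;
    enum { ITERS = 1000000 };

    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        /* One locked exchange per iteration. The line stays
         * cache-hot here; to time the miss case, another core
         * would have to dirty it between iterations. */
        atomic_exchange(&lock, 1);
        atomic_store(&lock, 0);
    }
    uint64_t cycles = __rdtsc() - start;

    printf("~%llu cycles per exchange/store pair\n",
           (unsigned long long)(cycles / ITERS));
    return 0;
}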

Most of the cost is believed to be due to pipeline flush.

> If that byte is in the same cache line as some other variable
> that gets written, and if another processor writes that other stuff,
> the cache line will be owned by that other processor,

For just this reason, it is common to put the lock guts byte on its own
cache line or to get the equivalent effect another way. On modern x86
systems, the cost of the atomic operation generally swamps the cost of the
cache miss, which is mitigated anyway by prefetching.
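
For example, assuming 64-byte cache lines (typical on x86, but not
guaranteed) and using the GCC aligned attribute, the padding trick looks
like:

#include <pthread.h>

/* Assume 64-byte cache lines; the real size varies by CPU. */
#define CACHE_LINE 64

/* Aligning the struct to a line boundary (which also rounds its
 * size up to a full line) keeps the lock from sharing a line with
 * unrelated data that other CPUs write. */
struct padded_lock {
    pthread_mutex_t mtx;
} __attribute__((aligned(CACHE_LINE)));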

> and an
> uncontended mutex_lock can be terribly expensive.

To some extent, it depends on what you're comparing it to. I think it's
unreasonable to describe a single atomic operation, even if a cache miss is
guaranteed, as "terribly expensive". You'd have to work pretty hard to make
that matter, and certainly it shrinks to nothing in comparison to anything
OpenSSL might do while it holds the lock.

In the vast majority of ordinary cases, the cost of an uncontended lock is
effectively zero. You have to spin in tight loops to measure it.
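
For instance, about the only way to see the cost at all is a tight
single-threaded loop like this (POSIX threads assumed; the numbers will
vary wildly by platform):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    enum { ITERS = 10000000 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        /* Single thread, so every acquisition is uncontended. */
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per uncontended lock/unlock\n", ns / ITERS);
    return 0;
}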

In any event, the OP won't have the CPU ping-pong issue at all. He never
has a case where two threads manipulate the same object, so they won't be
ping-ponging the same lock either. It doesn't seem possible for him to get
the kind of tight loop that would make the cost measurable.

It would make sense for him to measure in his application, though. There's
always the off chance that, for some reason that isn't easily foreseeable,
this is expensive in his particular situation. Engineering intuition should
guide the original design, but you should always measure to make sure your
intuition was right. Plus, that's needed to develop intuition in the first
place.

My point is that the OP is proceeding on a path that conflicts with sensible
engineering. He would need a very unusual problem or some very unusual
measurements of a more obvious solution to justify the risk his path
entails.

Or, of course, this might be a research or experimental approach rather
than a real-world application. In that case, it's perfectly sensible to
intentionally try things that are atypical to see what happens. However, he
should first have a baseline showing that the overhead he's trying to
minimize is at least measurable; otherwise, he can never show a reduction.

DS

