I was wrong about the Intel architecture: it has fairly strong memory-ordering guarantees. But other architectures, like PowerPC and ARM, have weak memory-ordering guarantees. Here is a good summary: https://en.wikipedia.org/wiki/Memory_ordering. The document in footnote 3 there is also worth reading.
I'm not sure we should include memory barriers inside the apr_atomic_read32 and apr_atomic_set32 functions, because you don't necessarily need a full memory barrier on each read or set. Instead, perhaps we should add three functions to APR:

1. apr_atomic_rmb(): read memory barrier.
2. apr_atomic_wmb(): write memory barrier.
3. apr_atomic_mb(): full memory barrier.

As a use case, the new util_time.c algorithm could be changed as follows. The read-from-cache portion would become:

    const apr_uint32_t key = apr_atomic_read32(&cache_element->key);
    /* The read above is done speculatively; no memory barrier is used. */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean cache is uninitialized, so don't use cache */
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked invalid
         * by another thread beginning an update, and that cache_element
         * really contained data for our second.
         * Requires a memory barrier.
         */
        apr_atomic_rmb();
        if (apr_atomic_read32(&cache_element->key) == seconds) {
            xt->tm_usec = (int)apr_time_usec(t);
            return APR_SUCCESS;
        }
    }

And the write-to-cache portion would be:

    if (key == apr_atomic_cas32(&cache_element->key, ~seconds, key)) {
        /* We won the race to update this cache_element.
         * The CAS above marks cache_element as invalid by using ~seconds,
         * because we are starting an update: it's the start of a
         * transaction.
         */
        cache_element->xt = *xt;
        /* Finished copying; now update key with our key,
         * ending the transaction. This needs a
         * memory barrier.
         */
        apr_atomic_wmb();
        apr_atomic_set32(&cache_element->key, seconds);
    }

On Sat, Jan 5, 2013 at 8:40 AM, Daniel Lescohier <daniel.lescoh...@cbsi.com> wrote:

> I'd have to go back and review the Intel documentation to be sure, but for
> this particular algorithm, an explicit memory barrier may be required on
> the Intel architecture as well. If I remember correctly, without a memory
> barrier, the Intel architecture only guarantees total memory ordering
> within one cache line.
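To make the barrier placement concrete, here is a minimal, single-threaded sketch of the same seqlock-style protocol. It uses GCC/Clang __atomic builtins as stand-ins for the proposed apr_atomic_rmb()/apr_atomic_wmb() and the existing APR atomics; the cache_write/cache_read names and the payload layout are hypothetical, not actual APR or httpd code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the proposed APR barriers, sketched with
 * GCC/Clang builtins; real APR would select a per-arch implementation. */
#define my_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define my_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)

typedef struct {
    uint32_t key;     /* seconds value; ~seconds while an update is in progress */
    char xt[40];      /* stand-in for the exploded-time payload */
} cache_element_t;

static cache_element_t cache_element;

/* Writer: CAS the key to ~seconds to mark the element invalid, copy the
 * payload, then publish the real key after a write barrier. */
static void cache_write(uint32_t seconds, const char *payload)
{
    uint32_t expected = __atomic_load_n(&cache_element.key, __ATOMIC_RELAXED);
    if (__atomic_compare_exchange_n(&cache_element.key, &expected,
                                    ~seconds, 0,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
        memcpy(cache_element.xt, payload, sizeof cache_element.xt);
        my_wmb();  /* order the payload copy before publishing the key */
        __atomic_store_n(&cache_element.key, seconds, __ATOMIC_RELAXED);
    }
}

/* Reader: copy speculatively, then re-check the key after a read barrier. */
static int cache_read(uint32_t seconds, char *out)
{
    uint32_t key = __atomic_load_n(&cache_element.key, __ATOMIC_RELAXED);
    if (seconds == key && seconds != 0) {
        memcpy(out, cache_element.xt, sizeof cache_element.xt);
        my_rmb();  /* order the payload copy before the key re-check */
        if (__atomic_load_n(&cache_element.key, __ATOMIC_RELAXED) == seconds)
            return 1;  /* copy is valid */
    }
    return 0;  /* cache miss or concurrent update in progress */
}
```

The key point the sketch illustrates: the barriers live at the two transaction boundaries (before publishing the key, and before re-checking it), not inside every individual atomic read or set.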
> For this algorithm, we have an array of 16 cache_elements of 48
> bytes each, so half of the cache_elements cross 64-byte cache lines. When
> reading the cache_element->key after the copy of the cache_element value,
> we need to make sure that the read of the cache_element value is ordered
> before the read of the cache_element->key, so one needs a memory barrier
> just before the read of the cache_element->key to guarantee the ordering.
>
>
> On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić <i.ga...@brainsware.org> wrote:
>
>> > > Sigh. I was too much focused on x86. There the compiler barrier
>> > > caused by the function call is enough. But you are right, on other
>> > > architectures these functions may also require CPU memory barriers.
>> >
>> > "on x86 … is enough" - would it harm x86 to add CPU barriers, or
>> > do we want to have #define distinctions per arch?
>>
>> Ignore me, I just realized it's going to be different
>> calls per arch anyway!
>>
>> --
>> Igor Galić
>>
>> Tel: +43 (0) 664 886 22 883
>> Mail: i.ga...@brainsware.org
>> URL: http://brainsware.org/
>> GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
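A quick way to check the "half of the cache_elements cross 64-byte cache lines" claim quoted above: with 48-byte elements packed contiguously, element i occupies bytes [48*i, 48*i + 47], and it straddles a line boundary whenever the first and last byte fall in different 64-byte lines. This helper (the function name, the zero base offset, and tight packing are my assumptions for illustration) makes the arithmetic explicit:

```c
#include <assert.h>

/* Does the i-th 48-byte element of a tightly packed, line-aligned array
 * span two 64-byte cache lines? */
int crosses_cache_line(unsigned i)
{
    unsigned start = 48u * i;          /* offset of first byte */
    unsigned end   = start + 48u - 1u; /* offset of last byte  */
    return (start / 64u) != (end / 64u);
}
```

Over the 16-element array the pattern repeats every 4 elements (lcm(48, 64) = 192 bytes), and exactly 8 of the 16 elements straddle a line, matching the "half" figure in the quoted message.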