Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
It looks like apr_atomic is not in good-enough shape to be used for lock-free algorithms. It's probably good enough for the use-cases the event mpm uses it for, because the event mpm is not using it to protect memory beyond the counters it's atomically changing. However, the time caches would use the cache key to protect other memory (the cache value), so apr_atomic is not good enough: it's inconsistent across platforms with respect to memory barriers. For example, on x86 the add32 and cas32 functions act as a full memory barrier on all memory, but on powerpc add32 and cas32 act as a memory barrier only on the individual item being changed, not on all memory. One would have to add lightweight sync assembler instructions to the powerpc add32 and cas32 implementations to get the same semantics as x86. The underlying problem is that the apr_atomic API is not explicit about memory-ordering semantics.

I agree with you that in the future the apr_atomic API should be changed to be more like the C11 atomic API, which allows an explicit choice of memory ordering. But that leaves us with the problem of the current time-caching code, which is not thread-safe. We should fix it using the existing APR APIs, and not use the apr_atomic API, which is broken for this use-case.

I found that there is a lock-free method available to us with the current APR API: apr_thread_mutex_trylock. In glibc, trylock just attempts to cmpxchg the userspace futex; if that fails, it does not make a futex system call and wait, so the caller never blocks and the algorithm stays lock-free. On Windows, TryEnterCriticalSection probably works the same way.
So, the pseudo-code would be something like this. For reading from cache:

    apr_uint32_t key = cache_element->key;
    /* Above is done speculatively */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean cache is uninitialized, so don't use cache */
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked invalid
         * by another thread beginning an update, and that cache_element
         * really contained data for our second. */
        if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
            key = cache_element->key;
            apr_thread_mutex_unlock(cache_element->mutex);
            if (key == seconds) {
                xt->tm_usec = (int)apr_time_usec(t);
                return APR_SUCCESS;
            }
        }
    }
    /* calculate the value if we reach here */

Write-to-cache code:

    if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
        /* We won the race to update this cache_element. */
        cache_element->key = ~seconds;
        /* Above marks cache_element as invalid by using ~seconds,
         * for the speculative reads not using the mutex. */
        cache_element->xt = *xt;
        /* Finished copying, now update key with our key */
        cache_element->key = seconds;
        apr_thread_mutex_unlock(cache_element->mutex);
    }

The above is lock-free because process-scheduling decisions by the OS won't cause the current thread to block; it keeps doing work without blocking as long as this thread is scheduled. It would be fast, though perhaps not quite as fast as if we had an atomic API we could use effectively. With an atomic API, the read-from-time-cache code would only be reading cache lines, not writing to them with the futex's cmpxchg instruction, and you might also be able to use a lighter-weight memory barrier. But the mutex approach is lock-free and fast, it works with today's APIs, and the existing code in production is not correct.
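For reference, here is a compilable single-file sketch of the same trylock-validated cache, using plain pthreads instead of the APR wrappers so it stands alone. The struct layout and function names (time_cache_t, cache_read, cache_write) are illustrative, not from the httpd source.

```c
#include <pthread.h>
#include <string.h>

typedef struct {
    volatile unsigned key;   /* seconds value; 0 = uninitialized */
    char xt[40];             /* stand-in for apr_time_exp_t */
    pthread_mutex_t mutex;
} time_cache_t;

/* Returns 1 on a validated cache hit, copying the value into out[40];
 * returns 0 on a miss, in which case the caller computes the value. */
static int cache_read(time_cache_t *c, unsigned seconds, char out[40])
{
    unsigned key = c->key;                 /* speculative read */
    if (seconds == key && seconds != 0) {
        memcpy(out, c->xt, sizeof c->xt);  /* speculative copy */
        /* Validate: if trylock succeeds, no writer currently holds the
         * element; re-reading key under the lock confirms our copy was
         * for this second and was not torn by a concurrent update. */
        if (pthread_mutex_trylock(&c->mutex) == 0) {
            key = c->key;
            pthread_mutex_unlock(&c->mutex);
            if (key == seconds)
                return 1;
        }
    }
    return 0;
}

static void cache_write(time_cache_t *c, unsigned seconds,
                        const char val[40])
{
    if (pthread_mutex_trylock(&c->mutex) == 0) {
        c->key = ~seconds;                 /* mark invalid for readers */
        memcpy(c->xt, val, sizeof c->xt);
        c->key = seconds;                  /* publish the new key */
        pthread_mutex_unlock(&c->mutex);
    }
    /* else: another thread won the update race; just skip caching */
}
```

A failed trylock on either path never blocks the caller, which is what makes the scheme lock-free in the sense described above.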
BTW, I looked at rwlock_trylock, and that isn't lock-free in glibc, because the readers/writers count variables are protected by a low-level futex lock that is acquired with lll_lock, not lll_trylock. So if a thread acquires the low-level lock but is descheduled by the OS before it finishes updating the readers/writers counts and releasing the low-level lock, other threads block before they can access those count variables. If it had used lll_trylock instead, I think it would have been lock-free. However, I don't think an rwlock would give us any particular advantage over a regular mutex for the above algorithm, so using the simpler mutex is better, and it is lock-free. I think the important part is to have a lock-free algorithm: if we lose a little performance by not being able to use the lightest-weight lock-free technique, I think that is OK.

On Mon, Jan 7, 2013 at 5:50 PM, Stefan Fritsch s...@sfritsch.de wrote:
> On Sunday 06 January 2013, Daniel Lescohier wrote:
> > I'm not sure we should include memory barriers inside the
> > apr_atomic_read32 and apr_atomic_set32 functions, because
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
Some timing tests:

    gcc -I/usr/include/apr-1 -lrt -lapr-1 -lpthread test.c
    ./a.out
    apr_time_exp_lt: 108 nanosecs per call
    localtime_r: 85 nanosecs per call
    apr_time_exp_gmt: 59 nanosecs per call
    gmtime_r: 35 nanosecs per call
    pthread_mutex_trylock/unlock/read cache: 19 nanosecs per call
    lfence/read cache: 8 nanosecs per call
    read cache: 4 nanosecs per call

    cat test.c
    #include <time.h>
    #include <stdio.h>
    #include <assert.h>
    #include <apr_time.h>
    #include <pthread.h>

    int main(int argc, char *argv[])
    {
        struct timespec old, new;
        struct tm tmv;
        int i, iterations = 1000;
        long nanosecs;
    #define CALC_NS(new, old) \
        ((new.tv_sec - old.tv_sec)*1000000000L + new.tv_nsec - old.tv_nsec)
        apr_time_t t = apr_time_now();
        apr_time_exp_t xt;
        pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        unsigned int seconds = 1;
        typedef struct { char s[44]; unsigned int key; } data_t;
        static volatile data_t cache_data;
        data_t my_copy;

        cache_data.key = 1;

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) { apr_time_exp_lt(&xt, t); }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("apr_time_exp_lt: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) { localtime_r(&old.tv_sec, &tmv); }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("localtime_r: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) { apr_time_exp_gmt(&xt, t); }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("apr_time_exp_gmt: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) { gmtime_r(&old.tv_sec, &tmv); }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("gmtime_r: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            unsigned int key = cache_data.key;
            if (seconds == key && key != 0) {
                my_copy = cache_data;
                if (pthread_mutex_trylock(&mutex) == 0) {
                    key = cache_data.key;
                    pthread_mutex_unlock(&mutex);
                }
            }
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("pthread_mutex_trylock/unlock/read cache: %ld nanosecs per call\n",
               nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            unsigned int key = cache_data.key;
            if (seconds == key && key != 0) {
                my_copy = cache_data;
                asm volatile("lfence" ::: "memory");
                key = cache_data.key;
            }
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("lfence/read cache: %ld nanosecs per call\n", nanosecs/iterations);

        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
        for (i = 0; i < iterations; ++i) {
            unsigned int key = cache_data.key;
            if (seconds == key && key != 0) {
                my_copy = cache_data;
                key = cache_data.key;
            }
        }
        assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
        nanosecs = CALC_NS(new, old);
        printf("read cache: %ld nanosecs per call\n", nanosecs/iterations);
    }

On Tue, Jan 8, 2013 at 1:50 PM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote:
> It looks like apr_atomic is not in good-enough shape to be used for
> lock-free algorithms.
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
On Sunday 06 January 2013, Daniel Lescohier wrote:
> I'm not sure we should include memory barriers inside the
> apr_atomic_read32 and apr_atomic_set32 functions, because you don't
> necessarily need a full memory barrier on each read or set. Instead,
> perhaps we should add three functions to APR:
> 1. apr_atomic_rmb(): read memory barrier.
> 2. apr_atomic_wmb(): write memory barrier.
> 3. apr_atomic_mb(): full memory barrier.

Introducing new functions into APR may not be the perfect solution for this problem, because there will be some reluctance to bump the minimum apr version required for httpd, especially for 2.2.x. Also, writing such functions for all supported architectures and compilers is quite some effort. In any case, if we introduce new atomic and/or barrier functions, I would be in favor of something that aligns with a subset of the C11 atomic API.

An alternative that would work without changes to apr would be per-thread caches for cached_explode(). This would at least not be less efficient than the current code is with mpm-prefork. Or one could do that only if compiled with an apr that does not have the new barrier functions...
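As a sketch of what the three proposed barrier functions might look like on top of the GCC/Clang __atomic builtins: the names apr_atomic_rmb/wmb/mb come from this thread's proposal and do not exist in released APR, and the acquire/release/seq_cst fences are approximate (conservative) mappings for read/write/full barriers.

```c
/* Hypothetical barrier functions from the proposal in this thread,
 * mapped onto compiler fence builtins.  Not actual APR source. */
static inline void apr_atomic_rmb(void)  /* read memory barrier */
{
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
}

static inline void apr_atomic_wmb(void)  /* write memory barrier */
{
    __atomic_thread_fence(__ATOMIC_RELEASE);
}

static inline void apr_atomic_mb(void)   /* full memory barrier */
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Typical usage, the publish/consume pattern the time cache needs:
 * writer fills in data, issues a write barrier, then sets the flag;
 * reader checks the flag, issues a read barrier, then reads the data. */
static int data, flag;

static void publish(int value)
{
    data = value;
    apr_atomic_wmb();   /* data must be visible before flag */
    flag = 1;
}

static int consume(int *out)
{
    if (flag) {
        apr_atomic_rmb();   /* flag must be read before data */
        *out = data;
        return 1;
    }
    return 0;
}
```

On x86 the fences compile to little or nothing for rmb/wmb (the hardware is strongly ordered), while on powerpc/ARM they emit lwsync/dmb-style instructions, which is exactly the per-architecture difference discussed upthread.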
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
On Saturday 05 January 2013, Daniel Lescohier wrote:
> apr_atomic_read32 is not implemented with a memory barrier. The
> implementation of apr_atomic_read32 in the APR code base is just a read
> from a pointer marked volatile. E.g., here is the atomic/unix/builtins.c
> implementation:
>
>     APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
>     {
>         return *mem;
>     }

Sigh. I was too much focused on x86. There the compiler barrier caused by the function call is enough. But you are right, on other architectures these functions may also require cpu memory barriers. The functions are meant to include the barriers, according to the documentation in apr_atomic.h. So they should be fixed in apr.
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
----- Original Message -----
> On Saturday 05 January 2013, Daniel Lescohier wrote:
> > apr_atomic_read32 is not implemented with a memory barrier. The
> > implementation of apr_atomic_read32 in the APR code base is just a
> > read from a pointer marked volatile. E.g., here is the
> > atomic/unix/builtins.c implementation:
> >
> >     APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
> >     {
> >         return *mem;
> >     }
>
> Sigh. I was too much focused on x86. There the compiler barrier caused
> by the function call is enough. But you are right, on other
> architectures these functions may also require cpu memory barriers.

"on x86 … is enough" - would it harm x86 to add CPU barriers, or do we want to have #define distinctions per arch?

> The functions are meant to include the barriers, according to the
> documentation in apr_atomic.h. So they should be fixed in apr.

--
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.ga...@brainsware.org
URL: http://brainsware.org/
GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
> > Sigh. I was too much focused on x86. There the compiler barrier
> > caused by the function call is enough. But you are right, on other
> > architectures these functions may also require cpu memory barriers.
>
> "on x86 … is enough" - would it harm x86 to add CPU barriers, or do we
> want to have #define distinctions per arch?

ignore me, I just realized it's going to be different calls per arch anyway!

--
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.ga...@brainsware.org
URL: http://brainsware.org/
GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
I'd have to go back and review the Intel documentation to be sure, but for this particular algorithm an explicit memory barrier may be required on Intel architecture, also. If I remember correctly, without a memory barrier Intel arch only guarantees total memory ordering within one cache line. For this algorithm, we have an array of 16 cache_elements of 48 bytes each, so half of the cache_elements cross 64-byte cache lines. When reading cache_element->key after the copy of the cache_element value, we need to make sure that the read of the value is ordered before the read of cache_element->key, so one needs a memory barrier just before the read of cache_element->key to guarantee the ordering.

On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić i.ga...@brainsware.org wrote:
> > Sigh. I was too much focused on x86. There the compiler barrier
> > caused by the function call is enough. But you are right, on other
> > architectures these functions may also require cpu memory barriers.
>
> "on x86 … is enough" - would it harm x86 to add CPU barriers, or do we
> want to have #define distinctions per arch?
>
> ignore me, I just realized it's going to be different calls per arch
> anyway!
>
> --
> Igor Galić
>
> Tel: +43 (0) 664 886 22 883
> Mail: i.ga...@brainsware.org
> URL: http://brainsware.org/
> GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
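The half-of-the-elements claim above can be sanity-checked with a few lines of arithmetic. This assumes the array of elements is packed contiguously and itself starts on a 64-byte line boundary; the helper name is illustrative.

```c
/* Count how many elements in a packed array straddle a cache-line
 * boundary, assuming the array starts at offset 0 of a line. */
static int count_line_crossers(int nelem, int elem_size, int line_size)
{
    int i, crossers = 0;
    for (i = 0; i < nelem; i++) {
        int start = i * elem_size;
        int end   = start + elem_size - 1;
        if (start / line_size != end / line_size)
            crossers++;    /* first and last byte on different lines */
    }
    return crossers;
}
```

With 16 elements of 48 bytes and 64-byte lines, the element offsets modulo 64 cycle through 0, 48, 32, 16, so two of every four elements cross a line: 8 of the 16, i.e. half, matching the statement above.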
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
I was wrong about the Intel architecture: it has a pretty strong memory-ordering guarantee. But other architectures, like PowerPC and ARM, have weak memory-ordering guarantees. Here is a good summary: https://en.wikipedia.org/wiki/Memory_ordering. The doc in footnote 3 is a good document to read.

I'm not sure we should include memory barriers inside the apr_atomic_read32 and apr_atomic_set32 functions, because you don't necessarily need a full memory barrier on each read or set. Instead, perhaps we should add three functions to APR:

1. apr_atomic_rmb(): read memory barrier.
2. apr_atomic_wmb(): write memory barrier.
3. apr_atomic_mb(): full memory barrier.

As a use-case, the new util_time.c algorithm could be changed as follows. For the read-from-cache portion:

    const apr_uint32_t key = apr_atomic_read32(&cache_element->key);
    /* Above is done speculatively, no memory barrier used. */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean cache is uninitialized, so don't use cache */
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked invalid
         * by another thread beginning an update, and that cache_element
         * really contained data for our second.
         * Requires memory barrier. */
        apr_atomic_rmb();
        if (apr_atomic_read32(&cache_element->key) == seconds) {
            xt->tm_usec = (int)apr_time_usec(t);
            return APR_SUCCESS;
        }
    }

And the write-to-cache portion would be:

    if (key == apr_atomic_cas32(&cache_element->key, ~seconds, key)) {
        /* We won the race to update this cache_element.
         * Above marks cache_element as invalid by using ~seconds,
         * because we are starting an update: it's the start of a
         * transaction. */
        cache_element->xt = *xt;
        /* Finished copying, now update key with our key,
         * ending the transaction. Need to use a memory barrier. */
        apr_atomic_wmb();
        apr_atomic_set32(&cache_element->key, seconds);
    }

On Sat, Jan 5, 2013 at 8:40 AM, Daniel Lescohier daniel.lescoh...@cbsi.com wrote:
> I'd have to go back and review the Intel documentation to be sure, but
> for this particular algorithm an explicit memory barrier may be
> required on Intel architecture, also.
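For comparison, the same read/write protocol can be written with C11 <stdatomic.h>, which makes the memory ordering explicit in the way this thread suggests a future apr_atomic API should. This is a standalone sketch with illustrative type and function names, not proposed httpd code.

```c
#include <stdatomic.h>
#include <string.h>

typedef struct {
    atomic_uint key;   /* seconds value; 0 = uninitialized */
    char xt[40];       /* stand-in for apr_time_exp_t */
} cache_element_t;

/* Returns 1 on a validated hit, 0 if the caller must compute the value. */
static int cache_read_c11(cache_element_t *e, unsigned seconds, char out[40])
{
    unsigned key = atomic_load_explicit(&e->key, memory_order_relaxed);
    if (seconds == key && seconds != 0) {
        memcpy(out, e->xt, sizeof e->xt);     /* speculative copy */
        /* Acquire fence: the loads of xt above cannot be reordered
         * after the validating re-read of key below. */
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load_explicit(&e->key, memory_order_relaxed) == seconds)
            return 1;
    }
    return 0;
}

static void cache_write_c11(cache_element_t *e, unsigned seconds,
                            const char val[40])
{
    unsigned expected = atomic_load_explicit(&e->key, memory_order_relaxed);
    /* CAS to ~seconds marks the element invalid if we win the race,
     * beginning the "transaction". */
    if (atomic_compare_exchange_strong(&e->key, &expected, ~seconds)) {
        memcpy(e->xt, val, sizeof e->xt);
        /* Release store publishes the copied value under the real key,
         * ending the transaction. */
        atomic_store_explicit(&e->key, seconds, memory_order_release);
    }
}
```

The relaxed load plus acquire fence on the read side, and the release store on the write side, are the explicit equivalents of the apr_atomic_rmb()/apr_atomic_wmb() calls in the pseudo-code above.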
thread-safety of time conversion caches in util_time.c and mod_log_config.c
I entered a bug on the thread-safety of the time conversion caches in server/util_time.c and modules/loggers/mod_log_config.c. In brief, they're not thread-safe because:

1. The algorithm depends on total memory ordering, and both the volatile qualifier and memory barriers are not used.
2. The algorithm is subject to the ABA problem.

The details of the problem are here: https://issues.apache.org/bugzilla/show_bug.cgi?id=54363

I included in the bug not-yet-tested patches with a different algorithm. Do other people agree that the existing algorithm has the problems explained in detail in the bug?

Thanks,

Daniel Lescohier
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
On Friday 04 January 2013, Daniel Lescohier wrote:
> I entered a bug on the thread-safety of the time conversion caches in
> server/util_time.c and modules/loggers/mod_log_config.c. In brief,
> they're not thread-safe because:
> 1. The algorithm depends on total memory ordering, and both the
> volatile qualifier and memory barriers are not used.
> 2. The algorithm is subject to the ABA problem.

I agree that the lack of memory barriers is a problem. And the ABA problem would not exist if callers of ap_recent_* would actually check that the time *is* recent (as written in the docs in util_time.h). This is not a problem for the error log, because it always uses apr_time_now(). But the request time in mod_log_config may be further in the past. Out of interest, have you seen the function give wrong results in practice?

> The details of the problem are here:
> https://issues.apache.org/bugzilla/show_bug.cgi?id=54363
> I included in the bug not-yet-tested patches with a different
> algorithm. Do other people agree that the existing algorithm has the
> problems explained in detail in the bug?

I haven't reviewed your patches in detail, yet. One thing: apr_atomic_read should be enough, you don't need apr_atomic_cas with the same value.

Cheers, Stefan
Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
apr_atomic_read32 is not implemented with a memory barrier. The implementation of apr_atomic_read32 in the APR code base is just a read from a pointer marked volatile. E.g., here is the atomic/unix/builtins.c implementation:

    APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
    {
        return *mem;
    }

The atomic/unix/builtins.c implementation would have to change to this to have a memory barrier:

    APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
    {
        __sync_synchronize();
        return *mem;
    }

All the other apr atomic implementations would have to change if apr_atomic_read32 is supposed to mean with-a-memory-barrier. None of the current implementations includes a memory barrier (they all use the C definition shown at the beginning of this email), so I assume apr_atomic_read32 (and apr_atomic_set32) means without-a-memory-barrier.

The Apache Portable Runtime API has no memory barrier function (like gcc's builtin __sync_synchronize()), so I had to choose among the other functions in the API. I decided to use Compare-And-Swap for the memory barrier, because the generic Compare-And-Swap concept implies the use of memory barriers, and the APR implementation actually does implement a memory barrier in apr_atomic_cas32. I also could've used apr_atomic_add32(&cache_element->key, 0) == seconds, but Compare-And-Swap is the more generic concept, so I used that instead. I agree, though, that for the read-the-cache_element portion of the function it'd be better to do just an atomic read instead of a compare-and-swap, if only the APR atomic API had a memory barrier function I could call before the read.

After copying from the time cache_element to this thread's memory allocated from the request pool, the thread needs a memory barrier when checking cache_element->key, in order to make sure another cpu thread/core didn't start modifying the cache_element while this thread was copying it.
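For illustration, the newer GCC __atomic builtins offer another way to express a barrier-carrying read, with the ordering named explicitly. This is a sketch of alternatives under the assumption of a GCC/Clang toolchain, not APR source; the function names are hypothetical.

```c
/* Full-barrier variant: matches the __sync_synchronize() change
 * discussed in this message. */
static unsigned int read32_with_barrier(volatile unsigned int *mem)
{
    __sync_synchronize();
    return *mem;
}

/* Explicit-ordering variant in the C11 style: an acquire load is
 * usually cheaper than a full barrier on weakly-ordered hardware,
 * and it documents exactly what ordering the caller relies on. */
static unsigned int read32_acquire(volatile unsigned int *mem)
{
    return __atomic_load_n(mem, __ATOMIC_ACQUIRE);
}
```

This is the same trade-off raised elsewhere in the thread: a barrier baked into read32 is safe but pays the full-fence cost everywhere, whereas explicit orderings let each call site pay only for what it needs.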
But note that before doing the copy to this thread's memory, the function checks cache_element->key without a memory barrier, because it's doing the copy speculatively (almost all the time there won't be a conflict) and then validating with a memory barrier after the copy is done; in the rare case that the validation fails, the thread just calculates the time value itself.

For the old code, I haven't seen bad results in practice (not that I necessarily would've noticed). I only noticed that the code is not portably thread-safe while working on a project to upgrade our web servers to 2.4: I was updating our own internal custom Apache modules and saw this when looking at the source of mod_log_config.c. So I brought it up because the algorithm does not look correct. The problem is that the old code was supposedly a lock-free and wait-free algorithm, but it didn't use the atomic processor instructions you need in order to implement lock-free and wait-free algorithms.

On Fri, Jan 4, 2013 at 7:58 PM, Stefan Fritsch s...@sfritsch.de wrote:
> On Friday 04 January 2013, Daniel Lescohier wrote:
> > I entered a bug on the thread-safety of the time conversion caches in
> > server/util_time.c and modules/loggers/mod_log_config.c. In brief,
> > they're not thread-safe because:
> > 1. The algorithm depends on total memory ordering, and both the
> > volatile qualifier and memory barriers are not used.
> > 2. The algorithm is subject to the ABA problem.
>
> I agree that the lack of memory barriers is a problem. And the ABA
> problem would not exist if callers of ap_recent_* would actually check
> that the time *is* recent (as written in the docs in util_time.h).
> This is not a problem for the error log, because it always uses
> apr_time_now(). But the request time in mod_log_config may be further
> in the past. Out of interest, have you seen the function give wrong
> results in practice?
> > The details of the problem are here:
> > https://issues.apache.org/bugzilla/show_bug.cgi?id=54363
> > I included in the bug not-yet-tested patches with a different
> > algorithm. Do other people agree that the existing algorithm has the
> > problems explained in detail in the bug?
>
> I haven't reviewed your patches in detail, yet. One thing:
> apr_atomic_read should be enough, you don't need apr_atomic_cas with
> the same value.
>
> Cheers, Stefan