Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-08 Thread Daniel Lescohier
It looks like apr_atomic is not in good-enough shape to be used for
lock-free algorithms.  It's probably good enough for the use-cases that the
event mpm uses it for, because the event mpm is not using it for protecting
memory besides the counters it's atomically changing.  However, the time
caches would use the cache key to protect other memory (the cache value),
so apr_atomic is not good enough, because it's inconsistent on the
different platforms with respect to memory barriers.

For example, on x86 the add32 and cas32 functions act as a full memory
barrier on all memory.  On PowerPC, however, add32 and cas32 only act as
a memory barrier on the individual item being changed; they don't act as
a barrier on all memory.  One would have to add lightweight sync (lwsync)
assembler instructions to the PowerPC add32 and cas32 implementations to
get the same semantics as x86.

The problem is that the apr_atomic API is not explicit with the memory
ordering semantics.  I agree with you that in the future the apr_atomic API
should be changed to be more like the C11 atomic API, which allows explicit
choice of memory ordering.
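
For comparison, here is roughly what explicit ordering choices look like
in C11 via <stdatomic.h> (just an illustration of the style, not APR code):

#include <stdatomic.h>

static atomic_uint key;

static unsigned read_key(void)
{
    /* acquire: reads that follow cannot be reordered before this load */
    return atomic_load_explicit(&key, memory_order_acquire);
}

static void publish_key(unsigned seconds)
{
    /* release: writes that precede cannot be reordered after this store */
    atomic_store_explicit(&key, seconds, memory_order_release);
}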

But that leaves us with the problem of the current time-caching code,
which is not thread-safe.  We should fix it using the existing APR APIs,
and not use the apr_atomic API, which is broken for this use-case.

I found that there is a lock-free method available to us with the current
APR API: apr_thread_mutex_trylock.  In glibc, trylock just attempts a
cmpxchg on the userspace futex word; if that fails, it returns immediately
to the caller instead of making a futex system call and waiting, so it's
lock-free.  On Windows, TryEnterCriticalSection probably works the same
way.

So, the pseudo-code would be something like:

For reading from the cache:

apr_uint32_t key = cache_element->key;
/* Above is done speculatively */
if (seconds == key && seconds != 0) {
    /* seconds == 0 may mean cache is uninitialized, so don't use cache */
    *xt = cache_element->xt;
    /* After copying xt, make sure cache_element was not marked invalid
     * by another thread beginning an update, and that cache_element
     * really contained data for our second. */
    if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
        key = cache_element->key;
        apr_thread_mutex_unlock(cache_element->mutex);
        if (key == seconds) {
            xt->tm_usec = (int)apr_time_usec(t);
            return APR_SUCCESS;
        }
    }
}
/* calculate the value if reached here */

Write-to-cache code:

if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS)
{   /* We won the race to update this cache_element. */
    cache_element->key = ~seconds;
    /* Above marks cache_element as invalid by using ~seconds,
     * for the speculative reads not using the mutex. */
    cache_element->xt = *xt;
    /* Finished copying, now update key with our key */
    cache_element->key = seconds;
    apr_thread_mutex_unlock(cache_element->mutex);
}

The above is lock-free because process-scheduling decisions by the OS
won't cause the current thread to block: it keeps doing work for as long
as it is scheduled.  It would be fast, though perhaps not quite as fast
as if we had an atomic API we could use effectively.  With an atomic API,
the read-from-time-cache code would only be reading from cache lines, not
writing to them with the futex's cmpxchg instruction, and you might also
be able to use a lighter-weight memory barrier.  But the mutex approach
is lock-free and fast, it works with today's APIs, and the existing code
in production is not correct.

BTW, I looked at rwlock_trylock, and that isn't lock-free in glibc,
because the readers/writers count variables are protected by a low-level
lock futex that is acquired with lll_lock, not lll_trylock.  So if a
thread acquires the low-level lock but is descheduled by the OS before it
finishes updating the readers/writers counts and releases the low-level
lock, other threads block before they can access those counts.  If it had
used lll_trylock instead, I think it would've been lock-free.  However, I
don't think an rwlock would give us any particular advantage over a
regular mutex for the above algorithm.

So using the simpler mutex is better and is lock-free.  I think the
important part is to have a lock-free algorithm: if we lose a little bit of
performance by not being able to use the lightest lock-free algorithm, I
think that is OK.
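
To make the shape concrete, here is a minimal, untested sketch of how the
whole cached-explode path could look with this scheme.  The element layout
and names are illustrative, not the actual util_time.c code, and each
element's mutex would be created once at startup with
apr_thread_mutex_create():

#include "apr_time.h"
#include "apr_thread_mutex.h"

#define TIME_CACHE_SIZE 16
#define TIME_CACHE_MASK (TIME_CACHE_SIZE - 1)

typedef struct {
    apr_uint32_t key;           /* seconds value cached; 0 = uninitialized */
    apr_time_exp_t xt;          /* cached exploded time for that second */
    apr_thread_mutex_t *mutex;  /* created at startup; only ever trylock'd */
} cache_element_t;

static cache_element_t cache[TIME_CACHE_SIZE];

static apr_status_t cached_explode(apr_time_exp_t *xt, apr_time_t t)
{
    const apr_uint32_t seconds = (apr_uint32_t)apr_time_sec(t);
    cache_element_t *cache_element = &cache[seconds & TIME_CACHE_MASK];
    apr_uint32_t key = cache_element->key;  /* speculative read */
    apr_status_t rv;

    if (seconds == key && seconds != 0) {
        *xt = cache_element->xt;
        /* revalidate: the trylock doubles as the memory barrier */
        if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
            key = cache_element->key;
            apr_thread_mutex_unlock(cache_element->mutex);
            if (key == seconds) {
                xt->tm_usec = (int)apr_time_usec(t);
                return APR_SUCCESS;
            }
        }
    }

    /* cache miss: calculate the value, then try to publish it */
    rv = apr_time_exp_gmt(xt, t);
    if (rv != APR_SUCCESS) {
        return rv;
    }
    if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
        cache_element->key = ~seconds;  /* mark invalid for speculative readers */
        cache_element->xt = *xt;
        cache_element->key = seconds;   /* commit */
        apr_thread_mutex_unlock(cache_element->mutex);
    }
    return APR_SUCCESS;
}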


Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-08 Thread Daniel Lescohier
Some timing tests:

$ gcc -I/usr/include/apr-1 -lrt -lapr-1 -lpthread test.c
$ ./a.out
apr_time_exp_lt: 108 nanosecs per call
localtime_r: 85 nanosecs per call
apr_time_exp_gmt: 59 nanosecs per call
gmtime_r: 35 nanosecs per call
pthread_mutex_trylock/unlock/read cache: 19 nanosecs per call
lfence/read cache: 8 nanosecs per call
read cache: 4 nanosecs per call
$ cat test.c
#include <time.h>
#include <stdio.h>
#include <assert.h>
#include <apr_time.h>
#include <pthread.h>


int main (int argc, char *argv[]) {
    struct timespec old, new;
    struct tm tmv;
    int i, iterations = 1000;
    long nanosecs;
#define CALC_NS(new, old) ((new.tv_sec - old.tv_sec)*1000000000L + \
                           new.tv_nsec - old.tv_nsec)
    apr_time_t t = apr_time_now();
    apr_time_exp_t xt;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    unsigned int seconds = 1;
    typedef struct { char s[44]; unsigned int key; } data_t;
    static volatile data_t cache_data;
    data_t my_copy;

    cache_data.key = 1;

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        apr_time_exp_lt(&xt, t);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("apr_time_exp_lt: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        localtime_r(&old.tv_sec, &tmv);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("localtime_r: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        apr_time_exp_gmt(&xt, t);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("apr_time_exp_gmt: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        gmtime_r(&old.tv_sec, &tmv);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("gmtime_r: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds == key && key != 0) {
            my_copy = cache_data;
            if (pthread_mutex_trylock(&mutex) == 0) {
                key = cache_data.key;
                pthread_mutex_unlock(&mutex);
            }
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("pthread_mutex_trylock/unlock/read cache: %ld nanosecs per call\n",
           nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds == key && key != 0) {
            my_copy = cache_data;
            asm volatile("lfence" ::: "memory");
            key = cache_data.key;
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("lfence/read cache: %ld nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old) == 0);
    for (i = 0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds == key && key != 0) {
            my_copy = cache_data;
            key = cache_data.key;
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new) == 0);
    nanosecs = CALC_NS(new, old);
    printf("read cache: %ld nanosecs per call\n", nanosecs/iterations);
    return 0;
}



Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-07 Thread Stefan Fritsch
On Sunday 06 January 2013, Daniel Lescohier wrote:
 I'm not sure we should include memory barriers inside the
 apr_atomic_read32 and apr_atomic_set32 functions, because you
 don't necessarily need a full memory barrier on each read or set. 
 Instead, perhaps we should add three functions to APR:
 
1. apr_atomic_rmb(): read memory barrier.
2. apr_atomic_wmb(): write memory barrier.
3. apr_atomic_mb(): full memory barrier.

Introducing new functions into APR may not be the perfect solution for 
this problem, because there will be some reluctance to bump the minimum 
apr version required for httpd, especially for 2.2.x. Also, writing such 
functions for all supported architectures and compilers is quite some 
effort.

In any case, if we introduce new atomic and/or barrier functions, I 
would be in favor of something that aligns with a subset of the C11 
atomic API.

An alternative that would work without changes to apr would be per-
thread caches for cached_explode(). This would at least not be less 
efficient than the current code is with mpm-prefork. Or one could do 
that only if compiled with an apr that does not have the new barrier 
functions...
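
A rough sketch of that per-thread variant, using APR's existing
thread-private storage (untested; apr_threadkey_private_* is the existing
APR API, the rest of the names are illustrative):

#include <stdlib.h>
#include "apr_time.h"
#include "apr_thread_proc.h"

typedef struct {
    apr_uint32_t key;      /* seconds value cached; 0 = empty */
    apr_time_exp_t xt;
} exp_cache_t;

static apr_threadkey_t *exp_cache_key;

/* one-time setup, e.g. from a child_init hook */
static apr_status_t exp_cache_init(apr_pool_t *pool)
{
    return apr_threadkey_private_create(&exp_cache_key, free, pool);
}

static apr_status_t cached_explode(apr_time_exp_t *xt, apr_time_t t)
{
    void *mem = NULL;
    exp_cache_t *cache;
    apr_uint32_t seconds = (apr_uint32_t)apr_time_sec(t);
    apr_status_t rv;

    apr_threadkey_private_get(&mem, exp_cache_key);
    if (mem == NULL) {
        mem = calloc(1, sizeof(exp_cache_t));
        if (mem == NULL) {
            return apr_time_exp_gmt(xt, t);  /* degrade gracefully */
        }
        apr_threadkey_private_set(mem, exp_cache_key);
    }
    cache = mem;

    /* no locks or barriers needed: the cache is private to this thread */
    if (cache->key == seconds && seconds != 0) {
        *xt = cache->xt;
        xt->tm_usec = (int)apr_time_usec(t);
        return APR_SUCCESS;
    }
    rv = apr_time_exp_gmt(xt, t);
    if (rv == APR_SUCCESS) {
        cache->key = seconds;
        cache->xt = *xt;
    }
    return rv;
}

Each thread then warms its own copy, at the cost of one threadkey lookup
per call.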


Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-05 Thread Stefan Fritsch
On Saturday 05 January 2013, Daniel Lescohier wrote:
 apr_atomic_read32 is not implemented with a memory barrier.  The
 implementation of apr_atomic_read32 in the APR code base is just a
 read from a pointer marked volatile.  E.g., here is the
 atomic/unix/builtins.c implementation:
 
 APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t
 *mem)
 {
 return *mem;
 }

Sigh. I was too much focused on x86. There the compiler barrier caused 
by the function call is enough. But you are right, on other 
architectures these functions may also require cpu memory barriers.

The functions are meant to include the barriers, according to the 
documentation in apr_atomic.h. So they should be fixed in apr.


Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-05 Thread Igor Galić


- Original Message -
 On Saturday 05 January 2013, Daniel Lescohier wrote:
  apr_atomic_read32 is not implemented with a memory barrier.  The
  implementation of apr_atomic_read32 in the APR code base is just a
  read from a pointer marked volatile.  E.g., here is the
  atomic/unix/builtins.c implementation:
  
  APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t
  *mem)
  {
  return *mem;
  }
 
 Sigh. I was too much focused on x86. There the compiler barrier
 caused
 by the function call is enough. But you are right, on other
 architectures these functions may also require cpu memory barriers.

"on x86 … is enough" - would it harm x86 to add CPU barriers, or
do we want to have #define distinctions per arch?
 
 The functions are meant to include the barriers, according to the
 documentation in apr_atomic.h. So they should be fixed in apr.
 

-- 
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.ga...@brainsware.org
URL: http://brainsware.org/
GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE



Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-05 Thread Igor Galić

  Sigh. I was too much focused on x86. There the compiler barrier
  caused
  by the function call is enough. But you are right, on other
  architectures these functions may also require cpu memory barriers.
 
  "on x86 … is enough" - would it harm x86 to add CPU barriers, or
 do we want to have #define distinctions per arch?

ignore me, I just realized it's going to be different
calls per arch anyway!

-- 
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.ga...@brainsware.org
URL: http://brainsware.org/
GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE



Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-05 Thread Daniel Lescohier
I'd have to go back and review the Intel documentation to be sure, but for
this particular algorithm, an explicit memory barrier may be required on
the Intel architecture as well.  If I remember correctly, without a memory
barrier, the Intel arch only guarantees total memory ordering within one
cache line.  For this algorithm, we have an array of 16 cache_elements of
48 bytes each, so half of the cache_elements cross 64-byte cache lines.
When reading the cache_element->key after the copy of the cache_element
value, we need to make sure that the read of the cache_element value is
ordered before the read of the cache_element->key, so one needs a memory
barrier just before the read of the cache_element->key to guarantee the
ordering.



Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-05 Thread Daniel Lescohier
I was wrong about the Intel architecture: it has a pretty strong
memory-ordering guarantee.  But other architectures, like PowerPC and ARM,
have a weak memory-ordering guarantee.  Here is a good summary:
https://en.wikipedia.org/wiki/Memory_ordering.  The doc in footnote 3 is a
good document to read.

I'm not sure we should include memory barriers inside the apr_atomic_read32
and apr_atomic_set32 functions, because you don't necessarily need a full
memory barrier on each read or set.  Instead, perhaps we should add three
functions to APR:

   1. apr_atomic_rmb(): read memory barrier.
   2. apr_atomic_wmb(): write memory barrier.
   3. apr_atomic_mb(): full memory barrier.
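
On gcc, a minimal sketch of what those three functions could look like
(my assumption, using the __atomic fences available since gcc 4.7; older
gcc only offers the full-barrier __sync_synchronize(), and each
platform/compiler pair would need its own version):

static inline void apr_atomic_rmb(void)
{
    /* acquire fence: prior loads can't be reordered past it */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
}

static inline void apr_atomic_wmb(void)
{
    /* release fence: prior stores complete before later stores */
    __atomic_thread_fence(__ATOMIC_RELEASE);
}

static inline void apr_atomic_mb(void)
{
    /* full barrier */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}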

As a use-case, the new util_time.c algorithm could be changed to be:

For the read-from-cache portion:

const apr_uint32_t key = apr_atomic_read32(&cache_element->key);
/* Above is done speculatively, no memory barrier used. */
if (seconds == key && seconds != 0) {
    /* seconds == 0 may mean cache is uninitialized, so don't use cache */
    *xt = cache_element->xt;
    /* After copying xt, make sure cache_element was not marked invalid
     * by another thread beginning an update, and that cache_element
     * really contained data for our second.
     * Requires memory barrier. */
    apr_atomic_rmb();
    if (apr_atomic_read32(&cache_element->key) == seconds) {
        xt->tm_usec = (int)apr_time_usec(t);
        return APR_SUCCESS;
    }
}

And the write-to-cache portion would be:

if (key == apr_atomic_cas32(&cache_element->key, ~seconds, key))
{   /* We won the race to update this cache_element.
     * Above marks cache_element as invalid by using ~seconds,
     * because we are starting an update: it's the start of a
     * transaction. */
    cache_element->xt = *xt;
    /* Finished copying, now update key with our key,
     * ending the transaction.  Need to use a
     * memory barrier. */
    apr_atomic_wmb();
    apr_atomic_set32(&cache_element->key, seconds);
}


thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-04 Thread Daniel Lescohier
I entered a bug on the thread-safety of the time conversion caches in
server/util_time.c and modules/loggers/mod_log_config.c.

In brief, they're not thread-safe because:

   1. The algorithm depends on total memory ordering, and neither the
   volatile qualifier nor memory barriers are used.
   2. The algorithm is subject to the ABA problem.

The details of the problem are here:

https://issues.apache.org/bugzilla/show_bug.cgi?id=54363

I included in the bug not-yet-tested patches with a different algorithm.

Do other people agree that the existing algorithm has the problems
explained in detail in the bug?

Thanks,

Daniel Lescohier


Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-04 Thread Stefan Fritsch
On Friday 04 January 2013, Daniel Lescohier wrote:
 I entered a bug on the thread-safety of the time conversion caches
 in server/util_time.c and modules/loggers/mod_log_config.c.
 
 In brief, they're not thread-safe because:
 
1. The algorithm depends on total memory ordering, and neither the
 volatile qualifier nor memory barriers are used.
2. The algorithm is subject to the ABA problem.

I agree that the lack of memory barriers is a problem.

And the ABA problem would not exist if callers of ap_recent_* actually 
checked that the time *is* recent (as written in the docs in 
util_time.h). This is not a problem for the error log, because it 
always uses apr_time_now(). But the request time in mod_log_config may 
be further in the past.
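
For instance, a caller-side guard might look like this (a sketch; the
60-second threshold is made up, the real bound would have to match the
cache's actual guarantees):

apr_time_t now = apr_time_now();
if (t > now || now - t > apr_time_from_sec(60)) {
    return apr_time_exp_lt(xt, t);  /* not recent: convert directly */
}
/* otherwise it is safe to consult the recent-time cache */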

Out of interest, have you seen the function give wrong results in 
practice?

 The details of the problem are here:
 
 https://issues.apache.org/bugzilla/show_bug.cgi?id=54363
 
 I included in the bug not-yet-tested patches with a different
 algorithm.
 
 Do other people agree that the existing algorithm has the problems
 explained in detail in the bug?

I haven't reviewed your patches in detail, yet. One thing: 
apr_atomic_read should be enough, you don't need apr_atomic_cas with 
the same value.

Cheers,
Stefan


Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c

2013-01-04 Thread Daniel Lescohier
apr_atomic_read32 is not implemented with a memory barrier.  The
implementation of apr_atomic_read32 in the APR code base is just a read
from a pointer marked volatile.  E.g., here is the atomic/unix/builtins.c
implementation:

APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
{
    return *mem;
}

The atomic/unix/builtins.c implementation would have to change to this to
have a memory barrier:

APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
{
    __sync_synchronize();
    return *mem;
}
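
(On gcc 4.7 or later, one could presumably also use the newer __atomic
builtins to get a cheaper acquire load instead of a full barrier:

APR_DECLARE(apr_uint32_t) apr_atomic_read32(volatile apr_uint32_t *mem)
{
    /* acquire load: later reads can't be reordered before this one */
    return __atomic_load_n(mem, __ATOMIC_ACQUIRE);
}

but that's a separate discussion.)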

All the other apr atomic implementations would have to change if
apr_atomic_read32 is supposed to mean with-a-memory-barrier.  None of the
current implementations include a memory barrier (they all use the C
definition shown at the beginning of this email), so I assume
apr_atomic_read32 (and apr_atomic_set32) means without-a-memory-barrier.

The Apache Portable Runtime API has no memory barrier function in the API
(like gcc's builtin __sync_synchronize()), so I had to choose among the
other functions in the API.  I decided to use Compare-And-Swap for the
memory barrier, because the generic Compare-And-Swap concept implies the
use of memory barriers, and also the APR implementation actually implements
a memory barrier in apr_atomic_cas32.  I also could've used
apr_atomic_add32(&cache_element->key, 0) == seconds, but Compare-And-Swap is
the more generic concept, so I used that instead.  I agree, though, for the
read-the-cache_element portion of the function, it'd be better to do just
an atomic_read, instead of a compare-and-swap, if only the APR Atomic API
had a memory barrier function that I could call before the read.
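
Concretely, the idiom is a read expressed as a compare-and-swap that can
never change the value, something like:

/* read with cas32's barrier semantics: comparing and swapping with the
 * same value leaves memory unchanged and returns the old value */
if (apr_atomic_cas32(&cache_element->key, seconds, seconds) == seconds) {
    /* cache_element still holds our second */
}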

After copying from the time cache_element to this thread's memory
allocated from the request pool, the thread needs a memory barrier when
checking cache_element->key, in order to make sure another cpu
thread/core didn't start modifying the cache_element while this thread
was copying it.  Note, though, that before doing the copy to this
thread's memory, the function checks cache_element->key without a memory
barrier: it is doing the copy speculatively (almost all the time there
won't be a conflict), then doing the validation with a memory barrier
after the copy is done.  In the rare case that the validation fails, the
thread just calculates the time value itself.

For the old code, I haven't seen bad results in practice (not that I
necessarily would've noticed); I only noticed that the code is not
portably thread-safe while working on a project to upgrade our web
servers to 2.4.  I was updating our own internal custom Apache modules
and saw this when looking at the source of mod_log_config.c.  So I
brought it up because the algorithm does not look correct: the old code
was supposedly a lock-free and wait-free algorithm, but it didn't use the
atomic processor instructions you need in order to implement lock-free
and wait-free algorithms.
