Hi Ivan,

On Tue, Oct 15, 2019 at 11:13 AM Ivan Zhakov <i...@apache.org> wrote:
>
> On Tue, 15 Oct 2019 at 01:16, Yann Ylavic <ylavic....@gmail.com> wrote:
> >
> > Then I don't know if we care enough about WIN32, but if so the
> > InterlockedCompareExchange64() call here is maybe a bit of a hammer
> > for usual apr_atomic_read64() usage. For things like lightly check for
> > the current 64bit shared value and do the heavy CAS work for some
> > special/unlikely condition only, this change is going to kill
> > performances. Since we provide _cas64() already, I fell that
> > read64/set64() should always be lightweight (when/if possible).
> >
> I considered this optimization, but decided not to implement it to
> avoid additional #ifdef.
>
> Do you have any evidence that CAS would be significantly slower than
> just read on x64?

Consider a code like this:

    apr_uint64_t old;
    do {
        old = apr_atomic_read64(shared);
    } while (apr_atomic_cas64(shared, old + 1, old) != old);

which is a common lock-free pattern (here it's a re-implementation of
apr_atomic_inc64() with cas64, but it's just for the example).

A "lock cmpxchg" for apr_atomic_read64() is going to "cost" one more
full barrier (i.e. serialization of outstanding loads and stores on
all cores) per loop, while the loop is intended to spin when the race
is lost.
The minimum garantee needed for read64() here is a single/atomic
memory access, which is clearly not the case on WIN32 according to
your results (see my comments below), but which should be the case on
64bit CPUs Windows cares about (and if we care about arm64 too, we
should use ReadAcquire64() here to keep the same acquire/release
ordering garantees than on x86).
So, IMHO, the full barrier for apr_atomic_read64() is overkill on
64bit CPUs with strong ordering garantees (like x86_64), and if one
wants that s/he can still use apr_atomic_cas64(mem, 0, 0) explicitely.

I wrote the attached test and ran it on my linux x86_64 quad-core-CPU
(8 w/ hyperthreading) laptop.
The test is essentially starting T threads which run the above loop N
times to update C counters, and finally prints the average time and
number of (re)tries (per loop).
The implementation of apr_atomic_read64 (volatile or cas64), T, N and
C can be specified by the command line.

Some results:
$ (T=10000000; N=8; C=4; \
   for i in {1..2}; do
      for impl in cas64 volatile; do
         echo "# $impl; T=$T; N=$N; C=$C";
         ./volatile_vs_cas64 $impl $T $N $C;
      done;
   done)
# cas64; T=10000000; N=8; C=4
tries = 1.629 (per loop)
usecs = 0.486 (per loop)
# volatile; T=10000000; N=8; C=4
tries = 1.713 (per loop)
usecs = 0.296 (per loop)
# cas64; T=10000000; N=8; C=4
tries = 1.646 (per loop)
usecs = 0.527 (per loop)
# volatile; T=10000000; N=8; C=4
tries = 1.722 (per loop)
usecs = 0.371 (per loop)


Multiple runs like this, including with other parameters, seem to
indicate that cas64 causes less retries but takes overall more time to
complete.
My interpretation is that the serialization in read64 gives the
following cas64 more chances to use an up-to-date old value, but it
also kills CPU caching.
On the other end, volatile access needs no cache consistency thus the
old value is less accurate, but spinning is cheaper (faster).

This kind of micro-benching is of course to be taken with a pinch of
salt, and can't replace a real workload..

>
> But I can add special code for x64 if you think it's worth it, along the lines
> of the attached patch.

I think it's worth it, though I have no real proof...

> >
> > So possibly r1868129 is not necessary, or if
> > ReadAcquire64/WriteRelease64() can be used easily could we do that
> > instead? (having a look at the WIN32 assembly code for
> > apr_atomic_read64/set64() using ReadAcquire64/WriteRelease64() could
> > be instructive, though)
> >
> The r1868129 is necessary: there is test in r1868129 that fails on x86,
> without fix.

OK, I was hoping that "volatile" could be special enough for the
compiler to issue a single load, but the limited number of 64bit
registers on x86 (the x87 one mainly) make this unrealistic for the
general case. So unless we use assembly language here (or C11), we
can't have this garantee.

>
> While ReadAcquire64()/WriteRelease64() look promising,
> 1. They doesn't provide atomic read: test added in r1868129 fails if
>    I replace InterlockedCompareExchange64() to ReadAcquire64() in
>    apr_atomic_read64()..

That's on WIN32 right? Or it also fails on WIN64??

> 2. They are undocumented.

Indeed.. Since we don't support arm64 CPUs (AFAIK), your
"apr-optimize-read64-set64-v1" patch (which does the same) works for
me.


Regards,
Yann.
#include "apr.h"
#include "apr_time.h"
#include "apr_atomic.h"
#include "apr_thread_proc.h"

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

static apr_thread_t **threads;
static int nthreads, nloops, ncounters, use_volatile;
static apr_uint64_t *counters, tries, usecs;

void *APR_THREAD_FUNC thread_func(apr_thread_t *thd, void *nil)
{
    apr_uint64_t n = 0;
    apr_time_t start;
    int i;

    start = apr_time_now();
    for (i = 0; i < nloops; i++) {
        apr_uint64_t *counter = &counters[i % ncounters];
        apr_uint64_t old;
        do {
            n++;
            if (use_volatile) {
                old = *(apr_uint64_t volatile *)counter;
            }
            else {
                old = apr_atomic_cas64(counter, 0, 0);
            }
        } while (apr_atomic_cas64(counter, old + 1, old) != old);
    }
    apr_atomic_add64(&usecs, apr_time_now() - start);
    apr_atomic_add64(&tries, n);

    apr_thread_exit(thd, 0);
    return NULL;
}

int main(int argc, const char *argv[])
{
    apr_pool_t *p;
    apr_status_t status;
    apr_uint64_t counting = 0;
    int i;

    apr_initialize();
    atexit(apr_terminate);
    apr_pool_create(&p, NULL);

    use_volatile = (argc > 1 && strcmp(argv[1], "volatile") == 0);
    nloops = (argc > 2) ? atoi(argv[2]) : 1000000;
    nthreads = (argc > 3) ? atoi(argv[3]) : 8;
    ncounters = (argc > 4) ? atoi(argv[4]) : 1;

    threads = apr_pcalloc(p, nthreads * sizeof(apr_thread_t*));
    counters = apr_pcalloc(p, ncounters * sizeof(apr_uint64_t));

    for (i = 0; i < nthreads; ++i) {
        apr_thread_create(&threads[i], NULL, thread_func, NULL, p);
    }
    for (i = 0; i < nthreads; ++i) {
        apr_thread_join(&status, threads[i]);
    }

    counting = nthreads * nloops;
    for (i = 0; i < ncounters; ++i) {
        counting -= counters[i];
    }
    assert(counting == 0);

    printf("tries = %.3lf (per loop)\n"
           "usecs = %.3lf (per loop)\n",
           tries / (double)nthreads / (double)nloops,
           usecs / (double)nthreads / (double)nloops);

    return 0;
}

Reply via email to