On Tue, 26 Nov 2013, Davidlohr Bueso wrote:
> On Tue, 2013-11-26 at 11:25 -0800, Davidlohr Bueso wrote:
> > *sigh* I just realized I had some extra debugging options in the .config
> > I ran for the patched kernel. This probably explains why the huge
> > overhead. I'll rerun and report shortly.

So you pulled a FUTEX, i.e. F*d Up That EXperiment :)

> I'm very sorry about the false alarm -- after midnight my brain starts
> to melt. After re-running everything on my laptop (yes, with the
> correct .config file), I can see that the differences are rather minimal
> and variation also goes down, as expected. I've also included the
> results for the original atomic ops approach, which mostly measures the
> atomic_dec when we dequeue the woken task. Results are in the noise
> range and virtually the same for both approaches (at least on a smaller
> x86_64 system).
>
> +---------+-----------------------------+----------------------------+------------------------------+
> | threads | baseline time (ms) [stddev] | barrier time (ms) [stddev] | 
> atomicops time (ms) [stddev] |
> +---------+-----------------------------+----------------------------+------------------------------+
> |     512 | 2.8360 [0.5168]             | 4.4100 [1.1150]            | 3.8150 
> [1.3293]              |
> |     256 | 2.5080 [0.6375]             | 2.3070 [0.5112]            | 2.5980 
> [0.9079]              |
> |     128 | 1.0200 [0.4264]             | 1.3980 [0.3391]            | 1.5180 
> [0.4902]              |
> |      64 | 0.7890 [0.2667]             | 0.6970 [0.3374]            | 0.4020 
> [0.2447]              |
> |      32 | 0.1150 [0.0184]             | 0.1870 [0.1428]            | 0.1490 
> [0.1156]              |
> +---------+-----------------------------+----------------------------+------------------------------+

That probably wants more than 10 repeated runs to converge into stable
numbers. Thanks for providing the atomicops comparison! That's very
helpful.

It would be interesting to measure the overhead on the waiter side as
well for both approaches (mb and atomic_inc), but I'm sure that at
least for x86 it's going to be in the same ballpark.

So I discovered earlier today, that your atomic ops variant is working
because the atomic_inc() in get_futex_key_refs() is accidentally
providing the required memory barrier on the waker side (on x86 and
all other architectures which have an implict mb in atomic_inc()).

For !fshared ones it's even a full mb on all architectiures today, see
ihold().

Aside of that get_user_pages_fast() is using atomic ops as well, not
sure if it's a full mb on all architectures today, but it is on x86
and others.

Now I'm tempted to turn this accidental into a documented behaviour.
The basic requirement for the fast lockless check of the plist is:

    record_waiter(hb)   |      *uaddr = newval
    mb                  |      mb
    *uaddr == oldval ?  |      nr_waiters(hb) != 0?

So on the waker side we can conditonally (dependent on the arch
implementation) rely on the mb in get_futex_key_refs(). See below.

Now it does not matter much in terms of barrier related overhead
whether the record_waiter() is implemented via atomic_inc() or via the
early enqueue into the plist + smp_mb. In both cases we have a full
smp_mb(), whether implicit or explicit.

And versus memory/cache footprint it's probably not relevant either
whether we add the counter or not. Assumed we have no debug options
enabled then the resulting size of futex_hash_bucket will be:

 16 bytes on 32bit (12 bytes today)

 24 bytes on 64bit (24 bytes today)

In fact for 32bit despite the slightly larger memory foot print the
cache line efficiency is actually better than now as we avoid hash
buckets crossing cache line boundaries.

No change on the 64 bit side.

As a side note, it might be worthwhile to epxlore whether forcing the
hash buckets to be cache line aligned gives us an advantage over
blindly increasing the hash size:

 We could avoid access across cacheline boundaries and also avoid
 massive cache line bouncing if multiple cpus are hammering away at
 different hash buckets which happen to reside in the same cache line.

But back to the problem at hand.

I'm leaning towards your initial atomic_inc() approach for the
following reasons:

1) It avoids the queue/unqueue dance in the error and (*uaddr != uval)
   case. The latter is the one you are optimizing for.

   We dirty the cache line in any case on the waiter side.

   With the optimized check, we avoid dirtying the cache line on the
   waker side in the case that there is no waiter in flight or
   enqueued.

2) It's very simple to make it OPT-IN. That allows architectures to
   trade the mb/memory overhead with the spinlock contention overhead.

   A lot of [embedded] architectures do not care at all about the
   futex performance simply because they do not run futex sensitive
   workloads. And others might want to avoid the heavy smp_mb() for
   obvious reasons.

   Make that a CONFIG switch which can be selected by the user or by a
   select statement in the arch. That also allows archs to determine
   the costs of that optimization just by recompiling with a different
   .config.

   All it needs are conditional implementations for futex_get_mm(),
   hb_waiters_inc(x), hb_waiters_dec() and hb_waiters_pending()

   futex_get_mm(x)
   {
      atomic_inc(x); 
      #ifdef CONFIG_FUTEX_SOMESENSIBLE_OPTION
        /*
         * Reduced to a simple barrier() where the atomic_inc
         * has an implicit mb()
         */
        smp_mb__after_atomic_inc();
      #endif
   }

   futex_get_mm() must be used for incrementing the refcount of
   &key->private.mm->mm_count in get_futex_key_refs().

   hb_waiters_inc(hb)
   {
      #ifdef CONFIG_FUTEX_SOMESENSIBLE_OPTION
      atomic_inc(&hb->waiters); 
        /*
         * Reduced to a simple barrier() where the atomic_inc
         * has an implicit mb()
         */
        smp_mb__after_atomic_inc();
      #endif
   }

   hb_waiters_inc() must be used in futex_wait()

   hb_waiters_dec(hb)
   {
      #ifdef CONFIG_FUTEX_SOMESENSIBLE_OPTION
      atomic_dec(&hb->waiters); 
        /*
         * Reduced to a simple barrier() where the atomic_dec
         * has an implicit mb().
         *
         * For the other archs it's debatable whether this has
         * a hard requirement to be guarded. The optimized
         * hb_waiters_pending() check for pending wakers might
         * fail in rare cases, but just for the cost of a
         * spinlock/unlock. The consistency of hb->waiters itself
         * is always guaranteed, i.e. it can't go below 0.
         */
        smp_mb__after_atomic_dec();
      #endif }

   hb_waiters_dec() must be used for dequeueing in all cases which
   are counterparts to a queueing futex_wait().

   hb_waiters_pending(x)
   {
      #ifdef CONFIG_FUTEX_SOMESENSIBLE_OPTION
          return atomic_read(x);
      #else
          return 1;
      #endif
   }

   So the compiler can optimize the quick check in futex_wait() out:

           if (!hb_waiters_pending(&hb->waiters))
              goto out_put_keys;


Of course that wants to be documented very carefully in the code, so
we can avoid the brain melting exercise of the futex/memory ordering
combo next time.

Thanks,

      tglx, who is about to apply a full memory barrier to himself in
            order to cure all this FU[BAR]TEX induced brain damage.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to