On Thu, Dec 14, 2017 at 04:06:41PM +0100, Martin Pieuchot wrote:
> Diff below moves amd64 and i386 mutex to the common C implementation.
>
> The differences are:
>   - membar_enter_after_atomic(9) instead of membar_enter(9), and
>   - membar_exit_before_atomic(9) instead of membar_exit(9)
>
> I'd appreciate any performance test to know if the performance
> degradation is acceptable with these barriers.
>

These are only compiler barriers on amd64/i386, so there should be
effectively no difference.
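
If I read the amd64 atomic.h right, the new _atomic variants boil down
to roughly:

#define __membar(_f)    do { __asm __volatile(_f ::: "memory"); } while (0)

#define membar_enter_after_atomic()     __membar("")
#define membar_exit_before_atomic()     __membar("")

i.e. no fence instruction at all, just a clobber stopping the compiler
from reordering across the atomic op.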

However, contended behaviour is a regression compared to the asm
variant.

From what I gather this is a step towards unifying all mutex
implementations, hence the membar_* use.

> +void
> +__mtx_enter(struct mutex *mtx)
> +{
> +#ifdef MP_LOCKDEBUG
> +     int nticks = __mp_lock_spinout;
> +#endif
> +
> +     while (__mtx_enter_try(mtx) == 0) {
> +             CPU_BUSY_CYCLE();

So this is effectively __mtx_enter_try with a single pause in-between.

> +}
> +
> +int
> +__mtx_enter_try(struct mutex *mtx)
> +{
> +     struct cpu_info *owner, *ci = curcpu();
> +     int s;
> +
> +     if (mtx->mtx_wantipl != IPL_NONE)
> +             s = splraise(mtx->mtx_wantipl);
> +

This is at least one read.

> +     owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> +#ifdef DIAGNOSTIC
> +     if (__predict_false(owner == ci))
> +             panic("mtx %p: locking against myself", mtx);
> +#endif
> +     if (owner == NULL) {
> +             membar_enter_after_atomic();
> +             if (mtx->mtx_wantipl != IPL_NONE)
> +                     mtx->mtx_oldipl = s;

This repeats the read done earlier.

Since the caller loops, this is effectively a very inefficient lock of
the form:
while (atomic_cas_ptr(...) != NULL)
        CPU_BUSY_CYCLE();

+ some reads in-between

Assembly code would spin waiting for the lock to become free before
playing with the spl level and attempting to lock. I don't know how
contended your mutexes are right now, but this will have to be changed
fairly quickly to at least match the current asm behaviour as more of
the kernel gets unlocked. Going for ticket locks or backoff later will
probably work well enough for the foreseeable future and postpone the
need to invest in anything fancy.
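
For illustration, backoff in the wait loop can be as simple as something
like this (purely a sketch, the cap is made up):

        int backoff = 1, i;

        while (mtx->mtx_owner != NULL) {
                for (i = 0; i < backoff; i++)
                        CPU_BUSY_CYCLE();
                if (backoff < 256)      /* arbitrary cap */
                        backoff <<= 1;
        }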

That said, the proposed restructure is as follows (pseudo-code, no debug):

void
__mtx_enter(struct mutex *mtx)
{
        struct cpu_info *owner, *ci = curcpu();
        int want_ipl, s;

        /* mark mtx_wantipl as volatile or add a read casted through
         * one to force a read *here* and no re-reads later */
        want_ipl = mtx->mtx_wantipl;

        for (;;) {
                if (want_ipl != IPL_NONE)
                        s = splraise(want_ipl);
                owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
                if (owner == NULL) {
                        membar_enter_after_atomic();
                        if (want_ipl != IPL_NONE)
                                mtx->mtx_oldipl = s;
                        break;
                }

                if (want_ipl != IPL_NONE)
                        splx(s);

                do {
                        CPU_BUSY_CYCLE();
                } while (mtx->mtx_owner != NULL);
        }
}

I don't think partial duplication of try_enter can be avoided without
serious ugliness.

> +void
> +__mtx_leave(struct mutex *mtx)
> +{
> +     int s;
> +
> +     MUTEX_ASSERT_LOCKED(mtx);
> +
> +#ifdef DIAGNOSTIC
> +     curcpu()->ci_mutex_level--;
> +#endif
> +
> +     s = mtx->mtx_oldipl;
> +#ifdef MULTIPROCESSOR
> +     membar_exit_before_atomic();
> +#endif
> +     mtx->mtx_owner = NULL;
> +     if (mtx->mtx_wantipl != IPL_NONE)
> +             splx(s);
> +}

It should be somewhat faster to read mtx_wantipl while the lock is still
held - it is less likely that someone else dirtied the cacheline. Then
mtx_oldipl can be only conditionally read.
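
Concretely, something along these lines (sketch, DIAGNOSTIC bits
omitted):

void
__mtx_leave(struct mutex *mtx)
{
        int want_ipl, s;

        MUTEX_ASSERT_LOCKED(mtx);

        /* read the ipl fields while we still hold the lock and most
         * likely own the cacheline */
        want_ipl = mtx->mtx_wantipl;
        if (want_ipl != IPL_NONE)
                s = mtx->mtx_oldipl;

#ifdef MULTIPROCESSOR
        membar_exit_before_atomic();
#endif
        mtx->mtx_owner = NULL;

        if (want_ipl != IPL_NONE)
                splx(s);
}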

This function does not do atomic ops, so membar_exit_before_atomic does
not really fit, even though it happens to do the exact same thing the
"correctly" named primitive would - just provide a compiler barrier on
amd64/i386.

From what I gather you are trying to mimic the illumos nomenclature, but
they don't have an equivalent afaics. (Perhaps Solaris grew one in the
meantime?)

In FreeBSD an appropriate routine is named atomic_thread_fence_rel (see
amd64/include/atomic.h) and I suggest just borrowing the api.
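
With such an API the release in __mtx_leave would read something like:

        atomic_thread_fence_rel();
        mtx->mtx_owner = NULL;

where amd64/i386 can keep the pure compiler barrier and archs with
weaker ordering supply a real release fence.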

A side note: you can probably shrink the ipl vars to make the lock
smaller. It may even be doable to fit everything into the lock word, but
I don't know how much sense playing with it makes.
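
For example (purely hypothetical layout, assuming the struct currently
carries the owner pointer plus two full ints):

struct mutex {
        volatile void *mtx_owner;
        uint8_t mtx_wantipl;    /* IPL_* values fit in a byte */
        uint8_t mtx_oldipl;
};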

-- 
Mateusz Guzik <mjguzik gmail.com>
