Re: Hitting the assert in lockfree::mutex::unlock()

Nadav Har'El Thu, 24 Jan 2019 06:28:56 -0800

On Thu, Jan 24, 2019 at 5:35 AM Rick Payne <ri...@rossfell.co.uk> wrote:


>
> Anyone seen a crash like this?
>

No, I haven't... Some thoughts below:


> [backtrace]
> 0x000000000022926a <__assert_fail+26>
> 0x00000000003d3884 <lockfree::mutex::unlock()+548>
> 0x000000000041e889 <waitqueue::wait(lockfree::mutex&)+121>
> 0x00000000003e7aeb <futex(int*, int, int, timespec const*, int*, int)+907>
> 0x00000000003e8086 <__syscall+1254>
>
> The particular assert is this one:
>
>     while(true) {
>         wait_record *other = waitqueue.pop();
>         if (other) {
>             assert(other->thread() != sched::thread::current()); // this
> thread isn't waiting, we know that :(
>

This case indeed should never happen: If a thread is calling unlock(), it
cannot possibly be waiting on lock() on the same time.
One of the things that made it relatively easy to ensure that this didn't
happen was the fact we don't support timeouts on mutex waits, so a thread
cannot call lock() and put itself on the waiting list, and then suddenly
decide to stop waiting and run other code.

However, looking at the code, I'm starting to think we can have a problem
with wait morphing (mutex::send_lock), which *can* timeout.
I tried to look at the wait morphing code, and it's very complex (involving
among other things special scheduler states) and it *seems* like it does
the right things, also considering cases of races between wait morphing and
other causes of wakeup (like timeouts), but it's hard to be sure. Before I
invest too much more time into trying to think if there's a case you
missed, can you please confirm or deny that this bug happens only in the
futex-with-timeout use case?

            other->wake();
>             return;
>         }
>         // Some concurrent lock() is in progress (we know this because of
>         // count) but it hasn't yet put itself on the wait queue.
>         if (++sequence == 0U) ++sequence;  // pick a number, but not 0
>         auto ourhandoff = sequence;
>         handoff.store(ourhandoff);
>         // If the queue is empty, the concurrent lock() is before adding
>         // itself, and therefore will definitely find our handoff later.
>         if (waitqueue.empty())
>             return;
>         // A thread already appeared on the queue, let's try to take the
>         // handoff ourselves and awaken it. If somebody else already took
>         // the handoff, great, we're done - they are responsible now.
>         if (!handoff.compare_exchange_strong(ourhandoff, 0U))
>             return;
>     }
>
> Unfortunately, I don't have this trapped in gdb yet, so hard to do any
> further debugging. We're working on that - but just wondered if anyone had
> tripped over something similar?
>
> Rick
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Hitting the assert in lockfree::mutex::unlock()

Reply via email to