On Thu, Jan 24, 2019 at 5:35 AM Rick Payne <ri...@rossfell.co.uk> wrote:
> > Anyone seen a crash like this? > No, I haven't... Some thoughts below: > [backtrace] > 0x000000000022926a <__assert_fail+26> > 0x00000000003d3884 <lockfree::mutex::unlock()+548> > 0x000000000041e889 <waitqueue::wait(lockfree::mutex&)+121> > 0x00000000003e7aeb <futex(int*, int, int, timespec const*, int*, int)+907> > 0x00000000003e8086 <__syscall+1254> > > The particular assert is this one: > > while(true) { > wait_record *other = waitqueue.pop(); > if (other) { > assert(other->thread() != sched::thread::current()); // this > thread isn't waiting, we know that :( > This case indeed should never happen: If a thread is calling unlock(), it cannot possibly be waiting on lock() on the same time. One of the things that made it relatively easy to ensure that this didn't happen was the fact we don't support timeouts on mutex waits, so a thread cannot call lock() and put itself on the waiting list, and then suddenly decide to stop waiting and run other code. However, looking at the code, I'm starting to think we can have a problem with wait morphing (mutex::send_lock), which *can* timeout. I tried to look at the wait morphing code, and it's very complex (involving among other things special scheduler states) and it *seems* like it does the right things, also considering cases of races between wait morphing and other causes of wakeup (like timeouts), but it's hard to be sure. Before I invest too much more time into trying to think if there's a case you missed, can you please confirm or deny that this bug happens only in the futex-with-timeout use case? other->wake(); > return; > } > // Some concurrent lock() is in progress (we know this because of > // count) but it hasn't yet put itself on the wait queue. > if (++sequence == 0U) ++sequence; // pick a number, but not 0 > auto ourhandoff = sequence; > handoff.store(ourhandoff); > // If the queue is empty, the concurrent lock() is before adding > // itself, and therefore will definitely find our handoff later. > if (waitqueue.empty()) > return; > // A thread already appeared on the queue, let's try to take the > // handoff ourselves and awaken it. If somebody else already took > // the handoff, great, we're done - they are responsible now. > if (!handoff.compare_exchange_strong(ourhandoff, 0U)) > return; > } > > Unfortunately, I don't have this trapped in gdb yet, so hard to do any > further debugging. We're working on that - but just wondered if anyone had > tripped over something similar? > > Rick > > -- > You received this message because you are subscribed to the Google Groups > "OSv Development" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to osv-dev+unsubscr...@googlegroups.com. > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.