On Fri, Mar 26, 2021 at 6:10 AM Waldek Kozaczuk <jwkozac...@gmail.com>
wrote:

> As I have been researching a bit the SMP issue described here -
> https://github.com/cloudius-systems/osv/issues/1123 - I have noticed that
> the 't' variable in
> https://github.com/cloudius-systems/osv/blob/master/include/osv/wait_record.hh#L33
> might be related to why OSv hangs with two or more CPUs.
>
> As I am describing here -
> https://github.com/cloudius-systems/osv/issues/1123#issuecomment-804545270
> - writing to the thread pointer variable by thread T1 on cpu0, may not be
> visible to thread T2 on cpu1 in the weak-memory model scenario.
>

This is an interesting question. Indeed, the waiter::t is not marked
atomic, and it's been a very long time since I wrote this code . Note that
"t" is not public, and the only methods that access t, wake() and wait().

I *think* (if I remember correctly) my thinking was that each of these
wake() and wait() operations already has its *own* memory barriers. namely
wake_impl() and prepare_wait() both access the state of the thread with CST
ordering -
https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering).
I think (again, I hope I remember correctly), we were hoping that these
barriers also apply to the t access before of after them.

The thinking was that it is similar to how one uses a shared variable
protected by the mutex: You don't need the shared variable to be
std::atomic or care about memory order or barriers - the mutex already
takes care of that for you.

But you're right that it's hard to catch bugs in this memory ordering on
x86 and it's easier to do it wrong on ARM.

The question is what exactly is wrong:

   1. Is the ordering in wake_impl() and prepare_wait() (called by
   waiter::wake() and waiter::wait()) not strong enough on ARM to serialize
   *everything*, not just a specific variable?
   2. Do we have this serialization in the right place? E.g., maybe we have
   the ordering point before writing t but we need it after writing t, and so.
   3. Maybe the problem isn't in waiter::wake() or waiter::wait() but in
   other functions like waiter::woken(), waiter::clear()?

I suggest that you change waiter.hh to use std::atomic for t, and then
using "relaxed" ordering to read/write t for some of the waiter methods and
"cst" ordering for other methods - and see which method needs the
non-relaxed ordering. When we know if one specific method is missing the
ordering, we can try to understand why it is missing.

Unfortunately, after several years of working in Seastar more than OSv, my
understanding of memory ordering issues has rusted, so my intuition on
these issues isn't as good as it used to be :-(


> According to
> https://github.com/Broadcom/arm64-linux/blob/master/Documentation/memory-barriers.txt,
> especially "WHERE ARE MEMORY BARRIERS NEEDED?" and "INTERPROCESSOR
> INTERACTION" chapters, it seems that every time some piece of memory needs
> to be accessed (read/written) by many CPUs for example in the scheduler or
> other kind of coordination, one must use proper memory barriers to enforce
> that those memory changes are visible to between the CPUs. One way to
> achieve this is using atomic with the default memory ordering (see "One-way
> barriers" in https://developer.arm.com/documentation/100941/0100/Barriers).
> The instructions *LDAR/STLR* are used to implement strong versions of the
> atomics.
>
> My experiments seem to prove that using atomic seems to help with the hang
> problem. But I am afraid that 't' variable in the waiter object maybe not
> the ONLY place where we need to worry about memory change visibility.
>

Definitely this is not the only other place where we needed to care about
memory ordering, but we were pretty careful to do it properly, and in fact
one of the main reasons why we chose C++11 when we started the OSv project
is its nice support for std::atomic and memory ordering. But, it is indeed
not unlikely that we have bugs in this area, because OSv only ran on x86
initially, where the memory reorderings that can happen is less than can
happen on ARM.


>
> I wonder what you think about my findings and the reasoning?
>
> Regards,
> Waldek
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/osv-dev/5b16202b-8a9b-4ff3-8bc1-68bc87baa812n%40googlegroups.com
> <https://groups.google.com/d/msgid/osv-dev/5b16202b-8a9b-4ff3-8bc1-68bc87baa812n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/CANEVyjv%3Dk9mzKETfzw7WHKHJo_g7eSNCy7mR%3DdeK5drkUyr28Q%40mail.gmail.com.

Reply via email to