On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson <cris...@charter.net> wrote:
>
> On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson <cri...@charter.net> 
>> wrote:
>> >
>> > On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>> >>
>> >> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson <cri...@charter.net> 
>> >> wrote:
>> [...]
>> I can also test on a 2-CPU system with 88 hardware threads in total, but
>> the bench hardcodes 16 threads (there is std::thread::hardware_concurrency()
>> or something like that).
>> >
>> >
>> > Indeed! Will correct this issue. Make it dynamic. :^)
>> >
>> > Although, it can be beneficial to create more threads just to see how the 
>> > algorithm handles these cases of "oversubscribed" contention.
>>
>> Agree. So std::thread::hardware_concurrency(),
>> 2*std::thread::hardware_concurrency(), 4*std::thread::hardware_concurrency().
>
>
> Indeed.
>
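For illustration, sizing the bench off the machine could look roughly like this (a minimal sketch; run_benchmark is a placeholder for the existing harness, not code from the pastebin):

#include <thread>
#include <cstdio>

// Placeholder for the existing benchmark harness.
static void run_benchmark(unsigned thread_count)
{
    // ... spawn thread_count reader/writer threads and measure ...
    (void)thread_count;
}

int main()
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 4; // hardware_concurrency() may return 0 if unknown

    const unsigned muls[] = { 1, 2, 4 };
    for (unsigned mul : muls)
    {
        std::printf("running with %u threads\n", hw * mul);
        run_benchmark(hw * mul); // 1x, 2x, 4x to cover oversubscription
    }
    return 0;
}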
>
>>
>> >> I've found that it's critical to model a realistic/target ratio between
>> >> reads/writes and the amount of local work versus work inside the critical
>> >> section in such benchmarks. Otherwise, with 100% synthetic hammering, the
>> >> worst mutex usually shows the best results. Synthetic benchmarks produce
>> >> extreme contention, so a mutex that blocks threads as much as possible
>> >> and allows as few threads as possible to work shows the best result
>> >> because contention is minimized. The best would be to run all threads
>> >> sequentially one-by-one. But that's counterproductive for real life
>> >> because blocked threads don't do useful work either.
>> >
>> >
>> > Agreed. Just wondering why it performs so much better in some high 
>> > contention scenarios. Load imbalance is a factor for sure.
>> >
>> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>> >
>> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ
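Picking up the point above about modeling a realistic read/write ratio and local work: a rough sketch of what a worker loop could look like (the 90/10 split, the iteration counts, and the lock_shared()/lock() interface are placeholder assumptions, not the actual bench or the ct_rwmutex API):

#include <atomic>
#include <random>

// Works with any rwmutex exposing lock/unlock and lock_shared/unlock_shared,
// e.g. std::shared_mutex, or a thin wrapper around ct_rwmutex.
template<class RWMutex>
void worker(RWMutex& m, std::atomic<bool>& stop,
            long shared_data[], int n, long& reads, long& writes)
{
    std::minstd_rand rng(std::random_device{}());
    while (!stop.load(std::memory_order_relaxed))
    {
        if (rng() % 10 != 0) // ~90% reads, ~10% writes; tune to the target workload
        {
            m.lock_shared();
            long sum = 0;
            for (int i = 0; i < n; ++i) sum += shared_data[i]; // work inside the read CS
            m.unlock_shared();
            reads += (sum != -1); // counts the read; also keeps `sum` live
        }
        else
        {
            m.lock();
            for (int i = 0; i < n; ++i) shared_data[i] += 1; // work inside the write CS
            m.unlock();
            ++writes;
        }
        for (volatile int i = 0; i < 100; ++i) {} // local work outside the CS
    }
}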
>>
>> I don't need to tell you that performance of sync primitives can be
>> very non-linear and counter-intuitive :)
>>
>> Perhaps the other mutexes were not super dumb, so they did not perform
>> the best under very high contention.
>
>
> :^D
>
>
>> >> Also, this benchmark has a fixed amount of work per thread, so I suspect
>> >> it may be subject to load imbalance when an unlucky outlier delays the
>> >> total result. A better option would be to run all threads for X seconds,
>> >> then set a global var to tell all of them to stop, then join all of them
>> >> and measure the number of iterations.
>> >
>> >
>> > Indeed. A benchmark that just models a list that readers iterate and 
>> > writers push to and pop from. Then we can show how many operations, 
>> > reads or writes, they performed per second. So it would be the number of 
>> > reads and writes per second, per thread.
>> >
>> > This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
>> >
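And for the run-for-X-seconds-then-stop idea, the driver side could look roughly like this (a self-contained toy; the loop body is a stand-in for one locked read or write, and a real bench would pad the per-thread counters to avoid false sharing):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 4;

    std::atomic<bool> stop(false);
    std::vector<long> iters(nthreads, 0); // per-thread op counts (unpadded here)
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < nthreads; ++t)
        threads.emplace_back([&, t] {
            while (!stop.load(std::memory_order_relaxed))
                ++iters[t]; // placeholder for one locked read or write
        });

    std::this_thread::sleep_for(std::chrono::seconds(10));
    stop.store(true, std::memory_order_relaxed);
    for (auto& th : threads) th.join();

    long total = 0;
    for (long n : iters) total += n;
    std::printf("%.0f ops/sec across %u threads\n", total / 10.0, nthreads);
    return 0;
}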
>> >
>> >>
>> >>
>> >> Also for me it lacked #include <climits>.
>> >
>> >
>> > Fixed that. Thanks again:
>> >
>> > https://pastebin.com/raw/xCBHY9qd
>> > (reload the page...)
>> >
>> >
>> > Fwiw, I am wondering what you think about the algorithm itself? Is it 
>> > crap, or decent? :^)
>>
>>
>> I think it's one of the best algorithms out there.
>
>
> Thanks Dmitry. So far, wrt a general purpose centralized rwmutex, it is not 
> all that bad. :^)
>
>
>>
>> The complete fairness for both readers and writers is notable.
>
>
> Indeed. My algorithm has a strict bakery style about it. In this case, the 
> readers get the "main benefit" wrt the wait-free fast-path, assuming 
> fetch_add is a single instruction in hardware, as on x86, not a CAS or LL/SC 
> loop in software. Well, that is great because who uses a rwmutex for writer 
> performance anyway? ;^)
>
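To make the reader fast path concrete, here is a stripped-down sketch of the general shape (not the actual pastebin code; the slow paths, which block on semaphores in the real algorithm, are only stubbed out):

#include <atomic>
#include <climits>

struct rw_reader_fastpath_sketch
{
    // Stays positive while no writer is active, in the spirit of m_count.
    std::atomic<long> m_count{LONG_MAX};

    void rdlock()
    {
        // Wait-free fast path: one fetch_add, no loop. Only when a writer has
        // already pulled m_count down do we take the slow path and block.
        if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
            rdlock_slow();
    }

    void rdunlock()
    {
        // If a writer is waiting, the last reader out signals it.
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
            rdunlock_slow();
    }

    void rdlock_slow();   // omitted: wait on a reader semaphore
    void rdunlock_slow(); // omitted: wake the waiting writer once readers drain
};

On x86 the fetch_add compiles to a single lock xadd, so the read-lock path never loops even when many readers arrive at once.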
>
>>
>> And the wait-free fast-path for readers.
>
>
> Big time. Imvvho, this is the most important aspect. Almost ready with a much 
> better benchmark that should show this off quite well.
>
>
>>
>> It still suffers from high
>> cache line contention even on a read-only workload, but it's as good as
>> one can get with a centralized design (i.e. not per-thread/cpu
>> distributed stuff, which has its own problems).
>
>
> Agreed.
>
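For contrast, the kind of distributed design being alluded to spreads the read indicator over per-slot counters on separate cache lines, so read acquires stop hammering a single line. A toy sketch (it ignores the reader/writer ordering race a real algorithm must handle, and the slot count and hashing are arbitrary):

#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

struct alignas(64) reader_slot { std::atomic<long> count{0}; };

struct distributed_read_indicator
{
    static constexpr int SLOTS = 64;
    reader_slot m_slots[SLOTS]; // one cache line per slot

    reader_slot& my_slot()
    {
        std::size_t h = std::hash<std::thread::id>()(std::this_thread::get_id());
        return m_slots[h % SLOTS];
    }

    void arrive() { my_slot().count.fetch_add(1, std::memory_order_acquire); }
    void depart() { my_slot().count.fetch_sub(1, std::memory_order_release); }

    // The writer side has to scan every slot -- one of the "own problems".
    bool readers_present()
    {
        for (int i = 0; i < SLOTS; ++i)
            if (m_slots[i].count.load(std::memory_order_acquire) != 0)
                return true;
        return false;
    }
};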
>
>>
>> I would expect a
>> CAS-loop-based read lock to degrade significantly under high read
>> load.
>
>
> Agreed. An explicit CAS loop would not really be able to compete with a 
> loop-free fetch_add under periods of heavy, sustained read activity. Think of 
> a shi% load of read-only database searches hitting the system all at once.
>
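For comparison, this is the kind of CAS-loop read acquire being discussed: every reader that slips in between the load and the compare_exchange forces another retry, which is exactly what hurts under heavy, sustained read activity (a toy spin-only sketch, not any particular library's lock):

#include <atomic>

struct cas_loop_read_lock_sketch
{
    std::atomic<long> m_state{0}; // >= 0: number of readers, -1: writer holds it

    void rdlock()
    {
        long s = m_state.load(std::memory_order_relaxed);
        for (;;)
        {
            if (s < 0) // writer active; a real lock would block instead of spinning
            {
                s = m_state.load(std::memory_order_relaxed);
                continue;
            }
            // Fails and retries whenever any other reader or writer changed
            // m_state since we read it -- the source of the degradation.
            if (m_state.compare_exchange_weak(s, s + 1,
                    std::memory_order_acquire, std::memory_order_relaxed))
                return;
        }
    }

    void rdunlock() { m_state.fetch_sub(1, std::memory_order_release); }
};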
>
>>
>> Btw, your algorithm has been used as the standard Go RWMutex for the past 7+ years:
>> https://github.com/golang/go/commit/daaf29cf932
>> (I should have mentioned your authorship somewhere!)
>
>
> Oh wow. I am glad that you used it. Wrt proper attribution, well, you just 
> did that here. Thanks. Wrt the code, well, perhaps you can stick my name in 
> there somewhere, if at all possible?
>
>> As for potential improvements, I can only think of optimizing
>> uncontended writer lock/unlock to be 1 RMW each, i.e. not offloading
>> writer competition resolution to a separate mutex, but rather
>> incorporating it into the same atomic variable readers and writers use
>> for synchronization with each other. Do you think it's possible? With
>> a CAS loop? Or maybe with some smart atomic_fetch_or? Wait,
>> atomic_fetch_or is compiled to a CAS loop on x86 anyway. These new
>> C/C++ atomic interfaces sometimes make me forget that.
>
>
>
> Humm... I can artificially limit the range of m_count and gain some extra 
> state. Right now, it uses LONG_MAX. Okay, well, what about using LONG_MAX - 
> sizeof(metadata), or something like that? Then, the writers can use CAS, or 
> even fetch_add in a special way, to gain write access using 
> ct_rwmutex::m_count directly. Need to think on this. Perhaps, if I come up 
> with something, it can be added to the source code, with my name in there, 
> somewhere? Thanks again.
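As a very rough sketch of that direction (emphatically not a finished algorithm: the constants, the single "writer locked" value, and the fallback are all assumptions, and the slow paths are omitted), the uncontended writer path could become one RMW on the same word the readers use:

#include <atomic>
#include <climits>

struct packed_writer_sketch
{
    // Readers use [1, RD_MAX]; the range outside it is reserved as the
    // "metadata" Chris mentions. WR_LOCKED marks an active writer.
    static constexpr long RD_MAX    = LONG_MAX / 2;
    static constexpr long WR_LOCKED = -1;

    std::atomic<long> m_count{RD_MAX};

    bool wr_try_lock_fast()
    {
        long expected = RD_MAX; // no readers, no writer
        // One RMW on the uncontended path; on failure, fall back to the
        // existing slow path (separate mutex + semaphores), omitted here.
        return m_count.compare_exchange_strong(expected, WR_LOCKED,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }

    void wr_unlock_fast()
    {
        // One RMW; the old value would tell a real implementation whether
        // readers or writers queued up while we held the lock and need waking.
        long prev = m_count.exchange(RD_MAX, std::memory_order_release);
        (void)prev;
    }
};

Whether this can be folded in while keeping the bakery-style fairness and the wait-free reader fast path is the open part of the question.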

We can try that with some new change :)

I have never seen any names in Go sources before. And everything is usually
very consistent; it's not that everybody can do whatever they want in
a new file just because they wrote it and nobody else will care.
People actually care a lot about consistency and a common style for
everything.
But now, grepping the source, I've found an existing precedent:
https://github.com/golang/go/blob/master/src/go/printer/nodes.go#L662
So we may try that too :)
