[lock-free] Re: Experimental Read/Write Mutex..
Fwiw, I wrote a crude new benchmark that measures how many reads and writes can be performed in a given amount of time: my algorithm vs. std::shared_mutex. So, we are _primarily_ looking at how many reads can be performed in this test in 60 seconds. The number of threads is variable and is determined by std::hardware_concurrency() * THREADS, with THREADS set at 8 in the test. So on my system the setup is:

___
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___

This is going to be different for each system. Readers iterate a shared linked list with 100 nodes in this test. Writers pop and push items, and never use new or delete. Well, so far, the timings I am getting on my end using MSVC 2017 are:

Testing: Chris M. Thomasson's Experimental Read/Write Mutex
___
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___
___
Raw Reads: 18046464
Raw Writes: 347498
reads_per_tick = 300693
writes_per_tick = 5790
Ticks = 60.0161
___

Testing: std::shared_mutex
___
cpu_threads_n = 4
threads_n = 32
writers = 16
readers = 16
test duration = 60 seconds
___
___
Raw Reads: 650006
Raw Writes: 168741
reads_per_tick = 10829
writes_per_tick = 2811
Ticks = 60.0193
___

As you can see, my algorithm is performing many more reads than std::shared_mutex. Anyway, here is my code. When you get some time, can you please give it a go? I want to see timings on other systems, the good, bad and ugly. ;^) Thanks.

https://pastebin.com/raw/1QtPCGhV

___
/* Crude Read/Write Mutex Benchmark

   This tests my algorithm ct_rwmutex vs. std::shared_mutex.
   It shows how many reads and writes can be performed within
   a fixed amount of time.

   by: Chris M. Thomasson
__________________________________________*/

// (the 13 header names were stripped by the archive;
//  restored here to the ones this code needs)
#include <atomic>
#include <cassert>
#include <chrono>
#include <climits>
#include <condition_variable>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// undefine to test std::shared_mutex
#define CT_TEST_FAST_MUTEX 1

#define THREADS 8           // multiplied by std::hardware_concurrency
#define NODE_PRIME 100      // number of nodes
#define CRUDE_CACHE_PAD 256 // crude cache pad
#define TEST_DURATION_SEC 60 // number of seconds for the test

// bare bones mutex/condvar based semaphore
struct ct_slow_semaphore
{
    unsigned long m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_slow_semaphore(unsigned long state) : m_state(state) {}

    void inc()
    {
        m_mutex.lock();
        ++m_state;
        m_mutex.unlock();
        m_cond.notify_one();
    }

    void add(unsigned long addend)
    {
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_state += addend;
        }
        m_cond.notify_all();
    }

    void dec()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == 0) m_cond.wait(lock);
        --m_state;
    }
};

// bin-sema
struct ct_auto_reset_event
{
    bool m_state;
    std::mutex m_mutex;
    std::condition_variable m_cond;

    ct_auto_reset_event() : m_state(false) {}

    void signal()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_state = true;
        m_cond.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_state == false) m_cond.wait(lock);
        m_state = false; // auto-reset
    }
};

// just a layer over an auto-reset event
struct ct_fast_mutex
{
    std::atomic<unsigned> m_state;
    ct_auto_reset_event m_waitset;

    ct_fast_mutex() : m_state(0) {}

    void lock()
    {
        if (m_state.exchange(1, std::memory_order_acquire))
        {
            while (m_state.exchange(2, std::memory_order_acquire))
            {
                m_waitset.wait();
            }
        }
    }

    void unlock()
    {
        if (m_state.exchange(0, std::memory_order_release) == 2)
        {
            m_waitset.signal();
        }
    }
};

// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...
// Take a look at the lock_shared and unlock_shared functions

// The number of readers _must_ never exceed LONG_MAX!
#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
    unsigned char m_crude_cache_pad_0[CRUDE_CACHE_PAD];

    // shared state
    std::atomic<long> m_count;
    unsigned char m_crude_cache_pad_1[CRUDE_CACHE_PAD];

    std::atomic<long> m_rdwake;
    unsigned char m_crude_cache_pad
[listing truncated in this archive; the full source is at the pastebin link above]
[lock-free] Re: Experimental Read/Write Mutex..
On Monday, February 18, 2019 at 8:50:21 PM UTC-8, Chris M. Thomasson wrote:
>
> Fwiw, here is a more realistic benchmark, in the Relacy stage for
> verification purposes:
>
> https://pastebin.com/raw/j4wrg050

Fwiw, I have almost completed my new benchmark in pure C++11. The timings that I am getting measure a different thing: how many reads and writes can be completed within a given time frame. Well, here are some initial results:

Testing: Chris M. Thomasson's Experimental Read/Write Mutex
Raw Reads: 3086724
Raw Writes: 59651
reads_per_tick = 308309
writes_per_tick = 5958
Ticks = 10.0118

Testing: std::shared_mutex
Raw Reads: 110922
Raw Writes: 29502
reads_per_tick = 11071
writes_per_tick = 2944
Ticks = 10.019

For a simple test of 10 seconds, my algorithm allows for 3086724 reads and 59651 writes to be completed, while std::shared_mutex allows for only 110922 reads and 29502 writes, damn. Look at the difference in the read rate per tick! Wow. I can't wait for you, and others, to test it and tell me the results, good or bad. Heck, std::shared_mutex should beat my work, wouldn't ya think? Humm...

--
---
You received this message because you are subscribed to the Google Groups "Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lock-free/dc5d90c2-3c8f-416c-8d50-3940a50ccc29%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: [lock-free] Experimental Read/Write Mutex..
On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
>
> On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson wrote:
> >
> > On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
> >>
> >> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson wrote:
> [...]
> >> I can also test on 2 CPU system with 88 hardware threads in total, but
> >> the bench hardcodes 16 threads (there is std::hardware_concurrency()
> >> or something like that).
> >
> > Indeed! Will correct this issue. Make it dynamic. :^)
> >
> > Although, it can be beneficial to create more threads just to see how
> > the algorithm handles these cases of "oversubscribed" contention.
>
> Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(),
> 4*std::hardware_concurrency().

Indeed.

> >> I've found that it's critical to model realistic/target ratio between
> >> reads/writes and amount of local work and work inside of critical
> >> section in such benchmarks. Otherwise, in 100% synthetic hammering,
> >> usually the worst mutex shows the best results. Synthetic benchmarks
> >> produce extreme contention, so a mutex that blocks threads as much as
> >> possible and allows as few threads as possible to work shows the best
> >> result, because contention is minimized. The best would be to run all
> >> threads sequentially one-by-one. But that's counterproductive in
> >> real life, because blocked threads don't do useful work either.
> >
> > Agreed. Just wondering why it performs so much better in some high
> > contention scenarios. Load imbalance is a factor for sure.
> >
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ
>
> I don't need to tell you that performance of sync primitives can be
> very non-linear and counter-intuitive :)
>
> Perhaps the other mutexes were not super dumb, so they did not perform
> the best under very high contention.

:^D

> >> Also this benchmark has fixed amount of work per thread, so I suspect
> >> it may be subject to load imbalance when an unlucky outlier delays
> >> the total result. A better option would be to run all threads for X
> >> seconds, then set a global var for all of them to stop, then join
> >> all of them and measure the number of iterations.
> >
> > Indeed. A benchmark that just modeled a list that readers iterated, and
> > writers pushed and popped from. Then we can show how many operations,
> > reads or writes, they performed per second. So it would be the number
> > of reads and writes per second, per thread.
> >
> > This reminds me of some of Joe Seigh's tests back on
> > comp.programming.threads.
> >
> >> Also for me it lacked #include .
> >
> > Fixed that. Thanks again:
> >
> > https://pastebin.com/raw/xCBHY9qd
> > (reload the page...)
> >
> > Fwiw, I am wondering what you think about the algorithm itself? Is it
> > crap, or decent? :^)
>
> I think it's one of the best algorithms out there.

Thanks Dmitry. So far, wrt a general purpose centralized rwmutex, it is
not all that bad. :^)

> The complete fairness for both readers and writers is notable.

Indeed. My algorithm has a strict bakery style about it. In this case, the
readers get the "main benefit" wrt the wait-free fast-path, assuming
fetch_add is a single operation in hardware like x86, not a CAS or LL/SC
loop in software. Well, that is great because who uses a rwmutex for
writer performance anyway? ;^)

> And the wait-free fast-path for readers.

Big time. Imvvho, this is the most important aspect. Almost ready with a
much better benchmark that should show this off quite well.

> It still suffers from high
> cache line contention even on read-only workload, but it's as good as
> one can get with a centralized design (i.e. not per-thread/cpu
> distributed stuff, which has its own problems).

Agreed.

> I would expect a
> CAS-loop-based read lock to degrade significantly under high read
> load.

Agreed. It would not be able to really compete wrt the explicit loop vs a
loop-free fetch_add under periods of heavy, sustained read activity. Think
of a shi% load of read-only database searches hitting the system all at
once.

> Btw your algorithm is used as the standard Go RWMutex for the past 7+
> years:
>
> https://github.com/golang/go/commit/daaf29cf932
>
> (I should have mentioned your authorship somewhere!)

Oh wow. I am glad that you used it. Wrt proper attribution, well, you just
did that here. Thanks. Wrt the code, well, perhaps you can stick my name
in there somewhere, if at all possible?

> As for potential improvements I can only think of optimizing
> uncontended writer lock/unlock to be 1 RMW each, i.e. not offloading
> writer competition resolution to a separate mutex, but rather
> incorporating it into the same atomic variable readers and writers us