[lock-free] Re: Experimental Read/Write Mutex..

2019-02-19 Thread Chris M. Thomasson
Fwiw, I wrote a crude new benchmark that measures how many reads and writes 
can be performed in a given amount of time: my algorithm vs. 
std::shared_mutex. We are _primarily_ looking at how many reads can be 
performed in 60 seconds. The number of threads is variable and is determined 
by std::thread::hardware_concurrency() * THREADS, where THREADS is set to 8 
in the test. So on my system the setup is: 
___ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___ 
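
For reference, the measurement style is: launch the threads, let them hammer 
the lock until a shared stop flag is set when the test duration expires, then 
join them and sum per-thread counters. A minimal sketch of that shape, with 
hypothetical names and std::shared_mutex only (this is not the pastebin code 
below): 
___ 
#include <atomic> 
#include <chrono> 
#include <cstdio> 
#include <shared_mutex> 
#include <thread> 
#include <vector> 

// Shared stand-in for the 100-node structure: readers walk it, writers mutate it. 
static std::shared_mutex g_lock; 
static std::vector<int> g_nodes(100, 1); 
static std::atomic<bool> g_stop(false); 
static std::atomic<unsigned long long> g_reads(0); 
static std::atomic<unsigned long long> g_writes(0); 

static void reader() 
{ 
    unsigned long long local = 0; 
    while (!g_stop.load(std::memory_order_relaxed)) 
    { 
        std::shared_lock<std::shared_mutex> rl(g_lock); 
        long sum = 0; 
        for (int v : g_nodes) sum += v; // iterate the shared data 
        (void)sum; // a real benchmark should keep the result alive 
        ++local; 
    } 
    g_reads.fetch_add(local, std::memory_order_relaxed); 
} 

static void writer() 
{ 
    unsigned long long local = 0; 
    while (!g_stop.load(std::memory_order_relaxed)) 
    { 
        std::unique_lock<std::shared_mutex> wl(g_lock); 
        ++g_nodes[local % g_nodes.size()]; // mutate in place: no new/delete 
        ++local; 
    } 
    g_writes.fetch_add(local, std::memory_order_relaxed); 
} 

int main() 
{ 
    unsigned threads_n = std::thread::hardware_concurrency() * 8; // THREADS = 8 
    if (threads_n == 0) threads_n = 8; // hardware_concurrency() may report 0 

    std::vector<std::thread> threads; 
    for (unsigned i = 0; i < threads_n; ++i) 
        threads.emplace_back((i % 2) ? reader : writer); // half readers, half writers 

    std::this_thread::sleep_for(std::chrono::seconds(60)); // test duration 
    g_stop.store(true, std::memory_order_relaxed); 
    for (std::thread& t : threads) t.join(); 

    std::printf("reads: %llu writes: %llu\n", g_reads.load(), g_writes.load()); 
    return 0; 
} 
___ 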

Those thread counts will be different on each system. In this test, readers 
iterate a shared linked list with 100 nodes, while writers pop and push items 
and never use new or delete. Well, so far, the timings I am getting on my end 
using MSVC 2017 are: 


Testing: Chris M. Thomasson's Experimental Read/Write Mutex 

___ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___ 


___ 
Raw Reads: 18046464 
Raw Writes: 347498 
reads_per_tick = 300693 
writes_per_tick = 5790 
Ticks = 60.0161 
___ 



Testing: std::shared_mutex 

___ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___ 


___ 
Raw Reads: 650006 
Raw Writes: 168741 
reads_per_tick = 10829 
writes_per_tick = 2811 
Ticks = 60.0193 
___ 



As you can see, my algorithm is performing many more reads than 
std::shared_mutex: roughly 300693 / 10829, or about 28 times the read rate 
per tick, and about twice the write rate. When you get some time, can you 
please give it a go? I want to see timings on other systems: the good, the 
bad, and the ugly. ;^) Thanks. Anyway, here is my code: 

https://pastebin.com/raw/1QtPCGhV 
___ 
/* Crude Read/Write Mutex Benchmark 

   This tests my algorithm ct_rwmutex vs. std::shared_mutex. 

   It shows how many reads and writes can be performed within 
   a fixed amount of time. 

   by: Chris M. Thomasson 
__*/ 



// (header names were stripped by the archive; these are the ones the
//  surviving code clearly needs)
#include <atomic> 
#include <chrono> 
#include <climits> 
#include <condition_variable> 
#include <cstdio> 
#include <iostream> 
#include <mutex> 
#include <shared_mutex> 
#include <thread> 


// undefine to test std::shared_mutex 
#define CT_TEST_FAST_MUTEX 1 


#define THREADS 8 // multiplied by std::hardware_concurrency 
#define NODE_PRIME 100 // number of nodes 
#define CRUDE_CACHE_PAD 256 // crude cache pad 
#define TEST_DURATION_SEC 60 // number of seconds for the test 


// bare bones mutex/condvar based semaphore 
struct ct_slow_semaphore 
{ 
    unsigned long m_state; 
    std::mutex m_mutex; 
    std::condition_variable m_cond; 

    ct_slow_semaphore(unsigned long state) : m_state(state) {} 

    // release one count, wake one waiter 
    void inc() 
    { 
        m_mutex.lock(); 
        ++m_state; 
        m_mutex.unlock(); 
        m_cond.notify_one(); 
    } 

    // release a batch of counts at once, wake all waiters 
    void add(unsigned long addend) 
    { 
        { 
            std::unique_lock<std::mutex> lock(m_mutex); 
            m_state += addend; 
        } 

        m_cond.notify_all(); 
    } 

    // wait for a count, then consume it 
    void dec() 
    { 
        std::unique_lock<std::mutex> lock(m_mutex); 
        while (m_state == 0) m_cond.wait(lock); 
        --m_state; 
    } 
}; 




// bin-sema (binary semaphore, aka auto-reset event) 
struct ct_auto_reset_event 
{ 
    bool m_state; 
    std::mutex m_mutex; 
    std::condition_variable m_cond; 

    ct_auto_reset_event() : m_state(false) {} 

    void signal() 
    { 
        std::unique_lock<std::mutex> lock(m_mutex); 
        m_state = true; 
        m_cond.notify_one(); 
    } 

    void wait() 
    { 
        std::unique_lock<std::mutex> lock(m_mutex); 
        while (m_state == false) m_cond.wait(lock); 
        m_state = false; // auto-reset 
    } 
}; 




// just a layer over an auto-reset event 
// m_state: 0 = unlocked, 1 = locked, 2 = locked with possible waiters 
struct ct_fast_mutex 
{ 
    std::atomic<unsigned int> m_state; 
    ct_auto_reset_event m_waitset; 

    ct_fast_mutex() : m_state(0) {} 

    void lock() 
    { 
        if (m_state.exchange(1, std::memory_order_acquire)) 
        { 
            // contended: mark the mutex as having waiters and block 
            while (m_state.exchange(2, std::memory_order_acquire)) 
            { 
                m_waitset.wait(); 
            } 
        } 
    } 

    void unlock() 
    { 
        if (m_state.exchange(0, std::memory_order_release) == 2) 
        { 
            // someone may be blocked; hand them the event 
            m_waitset.signal(); 
        } 
    } 
}; 
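
// Example (added for clarity, not part of the original pastebin):
// ct_fast_mutex exposes plain lock()/unlock(), so it satisfies
// BasicLockable and works with the standard RAII guards.
static ct_fast_mutex g_example_mutex; // hypothetical object, illustration only

static void example_guarded_section()
{
    std::lock_guard<ct_fast_mutex> guard(g_example_mutex); // unlocks on scope exit
    // ... protected work ...
}
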



// Chris M. Thomasson's Experimental Read/Write Mutex 
// Yeah, it is pretty damn fat wrt the state, however 
// it has some interesting properties... 
// The state can be compressed a bit... 
// btw, it has no loops... 
// Take a look at the lock_shared and unlock_shared functions. 

// The number of readers _must_ never exceed LONG_MAX! 

#define RWMUTEX_COUNT_MAX LONG_MAX 

struct ct_rwmutex 
{ 
    unsigned char m_crude_cache_pad_0[CRUDE_CACHE_PAD]; 

    // shared state 
    std::atomic<long> m_count; 
    unsigned char m_crude_cache_pad_1[CRUDE_CACHE_PAD]; 

    std::atomic<long> m_rdwake; 
    unsigned char m_crude_cache_pad_2[CRUDE_CACHE_PAD]; 

[lock-free] Re: Experimental Read/Write Mutex..

2019-02-19 Thread Chris M. Thomasson
On Monday, February 18, 2019 at 8:50:21 PM UTC-8, Chris M. Thomasson wrote:
>
> Fwiw, here is a more realistic benchmark, in the Relacy stage for 
> verification purposes:
>
> https://pastebin.com/raw/j4wrg050
>
>

Fwiw, I have almost completed my new benchmark in pure C++11. The timings 
I am getting measure a different thing: how many reads and writes can be 
completed within a given time frame. Well, here are some initial results:

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

Raw Reads: 3086724
Raw Writes: 59651
reads_per_tick = 308309
writes_per_tick = 5958
Ticks = 10.0118


Testing: std::shared_mutex

Raw Reads: 110922
Raw Writes: 29502
reads_per_tick = 11071
writes_per_tick = 2944
Ticks = 10.019


For a simple test of 10 seconds, my algorithm allows 3086724 reads and 
59651 writes to be completed, while std::shared_mutex allows only 110922 
reads and 29502 writes. Damn. Look at the difference in the read rate per 
tick: roughly 308309 vs. 11071, about 28 to 1! Wow. I can't wait for you and 
others to test it and tell me the results, good or bad. Heck, 
std::shared_mutex should beat my work, wouldn't ya think? Humm... 



Re: [lock-free] Experimental Read/Write Mutex..

2019-02-19 Thread Chris M. Thomasson
On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
>
> On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson  > wrote: 
> > 
> > On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov 
> wrote: 
> >> 
> >> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
> wrote: 
> [...]
> >> I can also test on 2 CPU system with 88 hardware threads in total, but 
> >> the bench hardcodes 16 threads (there is std::hardware_concurrency() 
> >> or something like that). 
> > 
> > 
> > Indeed! Will correct this issue. Make it dynamic. :^) 
> > 
> > Although, it can be beneficial to create more threads just to see how 
> the algorithm handles these cases of "oversubscribed" contention. 
>
> Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(), 
> 4*std::hardware_concurrency(). 
>

Indeed.
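
Something like the following could drive those three runs (illustrative only; 
run_benchmark is a hypothetical hook, not part of the posted code): 
___ 
#include <initializer_list> 
#include <iostream> 
#include <thread> 

// Hypothetical hook: run the reader/writer benchmark with threads_n threads. 
static void run_benchmark(unsigned threads_n) 
{ 
    std::cout << "running with " << threads_n << " threads\n"; 
    // ... spawn threads_n readers/writers, time the run, report ... 
} 

int main() 
{ 
    unsigned hw = std::thread::hardware_concurrency(); 
    if (hw == 0) hw = 1; // hardware_concurrency() may return 0 if unknown 

    for (unsigned mult : {1u, 2u, 4u}) 
    { 
        run_benchmark(hw * mult); 
    } 
    return 0; 
} 
___ 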

 

> >> I've found that it's critical to model realistic/target ratio between 
> >> reads/writes and amount of local work and work inside of critical 
> >> section in such benchmarks. Otherwise in 100% synthetic hammering 
> >> usually the worst mutex shows the best results. Synthetic benchmarks 
> >> produce extreme contention, so a mutex that blocks threads as much as 
> >> possible and allows as few threads as possible to work, shows the best 
> >> result because contention is minimized. The best would be to run all 
> >> threads sequentially one-by-one. But that's counter productive for 
> >> real life because blocked threads don't do useful work too. 
> > 
> > 
> > Agreed. Just wondering why it performs so much better in some high 
> contention scenarios. Load imbalance is a factor for sure. 
> > 
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ 
> > 
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ 
>
> I don't need to tell you that performance of sync primitives can be 
> very non-linear and counter-intuitive :) 
>
> Perhaps the other mutexes were not super dumb, so they did not perform 
> the best under very high contention. 
>

:^D


> >> Also this benchmark has fixed amount of work per thread, so I suspect 
> >> it may be subject to load imbalance when an unlucky outliner delays 
> >> total result. A better option would be to run all threads for X 
> >> seconds and then set a global var for all of them to stop and then 
> >> join all of them and measure number of iterations. 
> > 
> > 
> > Indeed. A benchmark that just modeled a list that readers iterated, and 
> writers pushed and popped from. Then we can show how many operations, reads 
> or writes, they performed per-second. So it would be the number of reads 
> and writes per-second, per-thread. 
> > 
> > This reminds me of some of Joe Seigh's tests back on 
> comp.programming.threads. 
> > 
> > 
> >> 
> >> 
> >> Also for me it lacked #include . 
> > 
> > 
> > Fixed that. Thanks again: 
> > 
> > https://pastebin.com/raw/xCBHY9qd 
> > (reload the page...) 
> > 
> > 
> > Fwiw, I am wondering what you think about the algorithm itself? Is it 
> crap, or decent? :^) 
>
>
> I think it's one of the best algorithms out there. 
>

Thanks, Dmitry. So far, wrt a general-purpose centralized rwmutex, it is not 
all that bad. :^)

 

> The complete fairness for both readers and writers is notable. 
>

Indeed. My algorithm has a strict bakery style about it. In this case, the 
readers get the "main benefit" wrt the wait-free fast-path, assuming 
fetch_add is a single operation in hardware, as on x86, and not a CAS or 
LL/SC loop in software. Well, that is great because who uses a rwmutex for 
writer performance anyway? ;^)

 

> And the wait-free fast-path for readers. 


Big time. Imvvho, this is the most important aspect. I am almost ready with a 
much better benchmark that should show this off quite well.

 

> It still suffers from high 
> cache line contention even on read-only workload, but it's as good as 
> one can get with a centralized design (i.e. not per-thread/cpu 
> distributed stuff which has own problems).

 
Agreed.

 

> I would expect a 
> CAS-loop-based read lock to degrade significantly under high read 
> load. 
>

Agreed. An explicit CAS loop would not really be able to compete with a 
loop-free fetch_add under periods of heavy, sustained read activity. Think 
of a shi% load of read-only database searches hitting the system all at 
once.
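
To illustrate the point with a toy example (this is not the ct_rwmutex code, 
just the reader-side RMW pattern under discussion): a CAS-loop read acquire 
has to retry every time another reader gets in first, while a fetch_add 
acquire is always a single RMW no matter how many readers arrive at once. 
___ 
#include <atomic> 

// Reader-count acquisition, two styles. Writer handling is omitted. 
struct reader_counter 
{ 
    std::atomic<long> m_count{0}; 

    // CAS loop: can spin under heavy read traffic because every 
    // successful reader invalidates the expected value of the others. 
    void acquire_cas() 
    { 
        long expected = m_count.load(std::memory_order_relaxed); 
        while (!m_count.compare_exchange_weak( 
                   expected, expected + 1, 
                   std::memory_order_acquire, 
                   std::memory_order_relaxed)) 
        { 
            // expected was refreshed by compare_exchange_weak; retry 
        } 
    } 

    // fetch_add: one RMW, no loop, regardless of contention 
    // (a single instruction on x86: lock xadd). 
    void acquire_fetch_add() 
    { 
        m_count.fetch_add(1, std::memory_order_acquire); 
    } 

    void release() 
    { 
        m_count.fetch_sub(1, std::memory_order_release); 
    } 
}; 

int main() 
{ 
    reader_counter rc; 
    rc.acquire_fetch_add(); 
    rc.release(); 
    rc.acquire_cas(); 
    rc.release(); 
    return 0; 
} 
___ 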

 

> Btw your algorithm has been used as the standard Go RWMutex for the past 7+ 
> years: 
> https://github.com/golang/go/commit/daaf29cf932 
> (I should have mentioned your authorship somewhere!) 
>

Oh wow. I am glad that you used it. Wrt proper attribution, well, you just 
did that here. Thanks. Wrt the code, well, perhaps you can stick my name in 
there somewhere, if at all possible?


> As for potential improvements I can only think of optimizing 
> uncontended writer lock/unlock to be 1 RMW each, i.e. not offloading 
> writer competition resolution to a separate mutex, but rather 
> incorporating it into the same atomic variable readers and writers use.