Re: [lock-free] Experimental Read/Write Mutex..

2019-03-04 Thread Chris M. Thomasson


On Sunday, March 3, 2019 at 1:28:03 AM UTC-8, Chris M. Thomasson wrote:
>
> On Tuesday, February 26, 2019 at 12:16:24 AM UTC-8, Dmitry Vyukov wrote:
>>
>> On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson  
>> wrote: 
>> [...]
>> We can try with some new change :) 
>>
>
> Just as a first step, I changed m_count to unsigned long and "integrated" it 
> with lock and unlock; it still has no loops. This is just a first 
> exploratory experiment. Here is the code:
>
> It still only uses fetch_add. Humm, not sure what to make of it. Also, I 
> am pissed that fetch_or goes to CMPXCHG loop on x86.
>
> _
> // Chris M. Thomasson's Experimental Read/Write Mutex
> [...]
>
> // WRITE, more hefty
> void wrlock()
> {
> unsigned long wrcount = m_count.fetch_add(WRITE, 
> std::memory_order_acquire);
>
> if ((wrcount & WRITE_MASK) > 0)
> {
> m_wrmtx.dec();
> }
>
^^

Still thinking about getting rid of the above mutex, and directly 
integrating the waiting into a single semaphore without using loops. 
Humm... I would have to change the way readers can unlock writers and vice 
versa. It is a bit challenging to do without loops.

 

>
> unsigned long count = m_count.fetch_add(WRITE_BIT, 
> std::memory_order_acquire);
>

[...]



Re: [lock-free] Experimental Read/Write Mutex..

2019-03-03 Thread Chris M. Thomasson
On Tuesday, February 26, 2019 at 12:16:24 AM UTC-8, Dmitry Vyukov wrote:
>
> On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson  > wrote: 
> [...]
> We can try with some new change :) 
>

Just as a first step, I changed m_count to unsigned long and "integrated" it 
with lock and unlock; it still has no loops. This is just a first 
exploratory experiment. Here is the code:

It still only uses fetch_add. Humm, not sure what to make of it. Also, I am 
pissed that fetch_or goes to CMPXCHG loop on x86.

_
// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...

// 32-bit layout: readers counted in the low half, WRITE_BIT set while a
// writer holds the lock, writers counted in the bits above it.
#define READ        0x00000001UL
#define READ_MASK   0x0000FFFFUL
#define WRITE_BIT   0x00010000UL
#define WRITE       0x00020000UL
#define WRITE_MASK  0xFFFE0000UL
#define NEUTRAL     0x00000000UL


struct ct_rwmutex
{
    // shared state
    std::atomic<unsigned long> m_count;
    std::atomic<long> m_rdwake;

    ct_slow_semaphore m_rdwset;
    ct_slow_semaphore m_wrwset;
    ct_slow_semaphore m_wrmtx;


    ct_rwmutex() :
        m_count(NEUTRAL),
        m_rdwake(0),
        m_rdwset(0),
        m_wrwset(0),
        m_wrmtx(0) {
    }


    ~ct_rwmutex()
    {
        RL_ASSERT(m_count.load(std::memory_order_relaxed) == 0);
    }


    // READ, pretty slim...
    void rdlock()
    {
        // wait-free unless a writer has set WRITE_BIT
        if (m_count.fetch_add(READ, std::memory_order_acquire) & WRITE_BIT)
        {
            m_rdwset.dec();
        }
    }

    void rdunlock()
    {
        if (m_count.fetch_sub(READ, std::memory_order_release) & WRITE_BIT)
        {
            // a writer is waiting; if we are the last reader it counted, wake it
            if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
            {
                m_wrwset.inc();
            }
        }
    }


    // WRITE, more hefty
    void wrlock()
    {
        // register this writer; if another writer is already registered,
        // wait for it on the writer-to-writer semaphore
        unsigned long wrcount = m_count.fetch_add(WRITE, std::memory_order_acquire);

        if ((wrcount & WRITE_MASK) > 0)
        {
            m_wrmtx.dec();
        }

        // set WRITE_BIT so new readers block, and count the readers inside
        unsigned long count = m_count.fetch_add(WRITE_BIT, std::memory_order_acquire);
        unsigned long readers = count & READ_MASK;

        if (readers)
        {
            long rdwake = m_rdwake.fetch_add(readers, std::memory_order_acquire);

            // wait unless all of those readers have already unlocked
            if (rdwake + readers)
            {
                m_wrwset.dec();
            }
        }
    }

    // write unlock
    void wrunlock()
    {
        // clear WRITE_BIT and wake any readers that blocked on it
        unsigned long count = m_count.fetch_sub(WRITE_BIT, std::memory_order_release);
        unsigned long readers = count & READ_MASK;

        if (readers)
        {
            m_rdwset.add(readers);
        }

        // deregister this writer; if other writers are queued,
        // pass the writer-to-writer semaphore on to one of them
        unsigned long wrcount = m_count.fetch_sub(WRITE, std::memory_order_relaxed);

        if ((wrcount & WRITE_MASK) > WRITE)
        {
            m_wrmtx.inc();
        }
    }
};
_


Humm... Any thoughts? ;^)
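
Fwiw, a quick usage sketch of the interface above, just to show how the read 
and write sides are meant to pair up (the shared_data struct and the two 
functions are only illustrative, they are not part of the algorithm):

_
// Illustrative usage only: a shared counter guarded by the ct_rwmutex above.
// Writers bump the value, readers just observe it.
struct shared_data
{
    ct_rwmutex m_rwmutex;
    unsigned long m_value = 0;
};

void reader(shared_data& s)
{
    s.m_rwmutex.rdlock();   // loop-free fast path unless WRITE_BIT is set
    unsigned long v = s.m_value;
    (void)v;                // read-only work goes here
    s.m_rwmutex.rdunlock();
}

void writer(shared_data& s)
{
    s.m_rwmutex.wrlock();   // serializes with other writers, then waits out readers
    s.m_value += 1;
    s.m_rwmutex.wrunlock();
}
__

Readers only leave the loop-free fast path when WRITE_BIT is set; writers 
serialize among themselves on m_wrmtx first, and then wait for any readers 
that were already inside.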

 

>
> I have never seen any names in Go sources before. And everything is usually 
> very consistent, it's not that everybody can do whatever they want in 
> a new file just because they wrote it and nobody else will care. 
> People actually care a lot about consistency and common style for 
> everything. 
> But now grepping the source I've found an existing precedent: 
> https://github.com/golang/go/blob/master/src/go/printer/nodes.go#L662 
> 
>  
> So we may try too :) 
>

Very cool. Thanks Dmitry. 



Re: [lock-free] Experimental Read/Write Mutex..

2019-02-26 Thread Dmitry Vyukov
On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson  wrote:
>
> On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson  
>> wrote:
>> >
>> > On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>> >>
>> >> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
>> >> wrote:
>> [...]
>> >> I can also test on 2 CPU system with 88 hardware threads in total, but
>> >> the bench hardcodes 16 threads (there is std::hardware_concurrency()
>> >> or something like that).
>> >
>> >
>> > Indeed! Will correct this issue. Make it dynamic. :^)
>> >
>> > Although, it can be beneficial to create more threads just to see how the 
>> > algorithm handles these cases of "oversubscribed" contention.
>>
>> Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(),
>> 4*std::hardware_concurrency().
>
>
> Indeed.
>
>
>>
>> >> I've found that it's critical to model realistic/target ratio between
>> >> reads/writes and amount of local work and work inside of critical
>> >> section in such benchmarks. Otherwise in 100% synthetic hammering
>> >> usually the worst mutex shows the best results. Synthetic benchmarks
>> >> produce extreme contention, so a mutex that blocks threads as much as
>> >> possible and allows as few threads as possible to work, shows the best
>> >> result because contention is minimized. The best would be to run all
>> >> threads sequentially one-by-one. But that's counterproductive for
>> >> real life because blocked threads don't do useful work either.
>> >
>> >
>> > Agreed. Just wondering why it performs so much better in some high 
>> > contention scenarios. Load imbalance is a factor for sure.
>> >
>> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>> >
>> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ
>>
>> I don't need to tell you that performance of sync primitives can be
>> very non-linear and counter-intuitive :)
>>
>> Perhaps the other mutexes were not super dumb, so they did not perform
>> the best under very high contention.
>
>
> :^D
>
>
>> >> Also this benchmark has fixed amount of work per thread, so I suspect
>> >> it may be subject to load imbalance when an unlucky outlier delays
>> >> total result. A better option would be to run all threads for X
>> >> seconds and then set a global var for all of them to stop and then
>> >> join all of them and measure number of iterations.
>> >
>> >
>> > Indeed. A benchmark that just models a list that readers iterate over, and 
>> > writers push to and pop from. Then we can show how many operations, 
>> > reads or writes, they performed per second. So it would be the number of 
>> > reads and writes per second, per thread.
>> >
>> > This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
>> >
>> >
>> >>
>> >>
>> >> Also for me it lacked #include .
>> >
>> >
>> > Fixed that. Thanks again:
>> >
>> > https://pastebin.com/raw/xCBHY9qd
>> > (reload the page...)
>> >
>> >
>> > Fwiw, I am wondering what you think about the algorithm itself? Is it 
>> > crap, or decent? :^)
>>
>>
>> I think it's one of the best algorithms out there.
>
>
> Thanks Dmitry. So far, wrt a general purpose centralized rwmutex, it is not 
> all that bad. :^)
>
>
>>
>> The complete fairness for both readers and writers is notable.
>
>
> Indeed. My algorithm has a strict bakery style about it. In this case, the 
> readers get the "main benefit" wrt the wait-free fast-path, assuming 
> fetch_add is a single operation in hardware like x86, not a CAS or LL/SC loop 
> in software. Well, that is great because who uses a rwmutex for writer 
> performance anyway? ;^)
>
>
>>
>> And the wait-free fast-path for readers.
>
>
> Big time. Imvvho, this is the most important aspect. Almost ready with a much 
> better benchmark that should show this off quite well.
>
>
>>
>> It still suffers from high
>> cache line contention even on read-only workload, but it's as good as
>> one can get with a centralized design (i.e. not per-thread/cpu
>> distributed stuff which has its own problems).
>
>
> Agreed.
>
>
>>
>> I would expect a
>> CAS-loop-based read lock to degrade significantly under high read
>> load.
>
>
> Agreed. It would not be able to really compete wrt the explicit loop vs a 
> loop-free fetch_add under periods of heavy, sustained read activity. Think of 
> a shi% load of read-only database searches hitting the system all at once.
>
>
>>
>> Btw your algorithm is used as the standard Go RWMutex for the past 7+ years:
>> https://github.com/golang/go/commit/daaf29cf932
>> (I should have mentioned your authorship somewhere!)
>
>
> Oh wow. I am glad that you used it. Wrt proper attribution, well, you just 
> did that here. Thanks. Wrt the code, well, perhaps you can stick my name in 
> there somewhere, if at all possible?
>
>> As for potential improvements I can only think of optimizing
>> uncontended writer lock/unlock to be 1 RMW each, 

Re: [lock-free] Experimental Read/Write Mutex..

2019-02-19 Thread Chris M. Thomasson
On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
>
> On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson  > wrote: 
> > 
> > On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov 
> wrote: 
> >> 
> >> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
> wrote: 
> [...]
> >> I can also test on 2 CPU system with 88 hardware threads in total, but 
> >> the bench hardcodes 16 threads (there is std::hardware_concurrency() 
> >> or something like that). 
> > 
> > 
> > Indeed! Will correct this issue. Make it dynamic. :^) 
> > 
> > Although, it can be beneficial to create more threads just to see how 
> the algorithm handles these cases of "oversubscribed" contention. 
>
> Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(), 
> 4*std::hardware_concurrency(). 
>

Indeed.

 

> >> I've found that it's critical to model realistic/target ratio between 
> >> reads/writes and amount of local work and work inside of critical 
> >> section in such benchmarks. Otherwise in 100% synthetic hammering 
> >> usually the worst mutex shows the best results. Synthetic benchmarks 
> >> produce extreme contention, so a mutex that blocks threads as much as 
> >> possible and allows as few threads as possible to work, shows the best 
> >> result because contention is minimized. The best would be to run all 
> >> threads sequentially one-by-one. But that's counterproductive for 
> >> real life because blocked threads don't do useful work either. 
> > 
> > 
> > Agreed. Just wondering why it performs so much better in some high 
> contention scenarios. Load imbalance is a factor for sure. 
> > 
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ 
> > 
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ 
>
> I don't need to tell you that performance of sync primitives can be 
> very non-linear and counter-intuitive :) 
>
> Perhaps the other mutexes were not super dumb, so they did not perform 
> the best under very high contention. 
>

:^D


>> Also this benchmark has fixed amount of work per thread, so I suspect 
> >> it may be subject to load imbalance when an unlucky outlier delays 
> >> total result. A better option would be to run all threads for X 
> >> seconds and then set a global var for all of them to stop and then 
> >> join all of them and measure number of iterations. 
> > 
> > 
> > Indeed. A benchmark that just models a list that readers iterate over, and 
> writers push to and pop from. Then we can show how many operations, reads 
> or writes, they performed per second. So it would be the number of reads 
> and writes per second, per thread. 
> > 
> > This reminds me of some of Joe Seigh's tests back on 
> comp.programming.threads. 
> > 
> > 
> >> 
> >> 
> >> Also for me it lacked #include . 
> > 
> > 
> > Fixed that. Thanks again: 
> > 
> > https://pastebin.com/raw/xCBHY9qd 
> > (reload the page...) 
> > 
> > 
> > Fwiw, I am wondering what you think about the algorithm itself? Is it 
> crap, or decent? :^) 
>
>
> I think it's one of the best algorithms out there. 
>

Thanks Dmitry. So far, wrt a general purpose centralized rwmutex, it is not 
all that bad. :^)

 

> The complete fairness for both readers and writers is notable. 
>

Indeed. My algorithm has a strict bakery style about it. In this case, the 
readers get the "main benefit" wrt the wait-free fast-path, assuming 
fetch_add is a single operation in hardware like x86, not a CAS or LL/SC 
loop in software. Well, that is great because who uses a rwmutex for writer 
performance anyway? ;^)

 

> And the wait-free fast-path for readers. 


Big time. Imvvho, this is the most important aspect. Almost ready with a 
much better benchmark that should show this off quite well.

 

> It still suffers from high 
> cache line contention even on read-only workload, but it's as good as 
> one can get with a centralized design (i.e. not per-thread/cpu 
> distributed stuff which has own problems). 

 
Agreed.

 

> I would expect a 
> CAS-loop-based read lock to degrade significantly under high read 
> load. 
>

Agreed. It would not be able to really compete wrt the explicit loop vs a 
loop-free fetch_add under periods of heavy, sustained read activity. Think 
of a shi% load of read-only database searches hitting the system all at 
once.
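
For contrast, a rough sketch of the kind of CAS-loop read acquire we are 
talking about (purely illustrative, not taken from any code in this thread). 
Every reader has to retry the same cache line until its CAS wins, which is 
exactly what hurts under heavy, sustained read traffic:

_
#include <atomic>

// Illustrative only: a read lock that spins on compare_exchange instead of
// using a single fetch_add. Negative m_count means a writer holds the lock.
struct cas_loop_rwlock_sketch
{
    std::atomic<long> m_count{0};

    void lock_shared()
    {
        long expected = m_count.load(std::memory_order_relaxed);

        for (;;)
        {
            // only try to join as a reader when no writer is present
            if (expected < 0)
            {
                expected = m_count.load(std::memory_order_relaxed);
                continue; // a real implementation would back off or block here
            }

            // under heavy read traffic many readers race on this CAS,
            // so most of them fail and loop, unlike a single fetch_add
            if (m_count.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
            {
                return;
            }
        }
    }

    void unlock_shared()
    {
        m_count.fetch_sub(1, std::memory_order_release);
    }
};
__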

 

> Btw your algorithm is used as the standard Go RWMutex for the past 7+ 
> years: 
> https://github.com/golang/go/commit/daaf29cf932 
> (I should have mentioned your authorship somewhere!) 
>

Oh wow. I am glad that you used it. Wrt proper attribution, well, you 
just did that here. Thanks. Wrt the code, well, perhaps you can stick my 
name in there somewhere, if at all possible?


> As for potential improvements I can only think of optimizing 
> uncontended writer lock/unlock to be 1 RMW each, i.e. not offloading 
> writer competition resolution to a separate mutex, but rather 
> incorporate it into the same atomic variable readers and writers 

Re: [lock-free] Experimental Read/Write Mutex..

2019-02-18 Thread Dmitry Vyukov
On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson  wrote:
>
> On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
>> wrote:
>> >
>> > Fwiw, here is a simple benchmark for an older read/write algorithm I 
>> > invented:
>> >
>> >
>> > https://pastebin.com/raw/xCBHY9qd
>> >
>> >
>> > It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a 
>> > macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the 
>> > code will test my algorithm. If not defined then it will test against 
>> > std::shared_mutex. I am wondering if anybody can compile and run the code? 
>> > Test it out on as many operating systems as you can. Can you please show 
>> > the output?
>> >
>> >
>> > Thanks everybody. :^)
>> >
>> > Here is the code:
>> >
>> > /* Simple, crude read/write mutex test
>> >by: Chris M. Thomasson
>> > __*/
>> [...]
>
>
>>
>>
>> Hey Chris!
>>
>> Here are my results on i7-8650U (4 HT cores) on Debian 4.19.16-1 gcc 7.3.0
>>
>> std::shared_mutex
>> msec = 56118
>> msec = 49149
>> msec = 69416
>> msec = 68737
>> msec = 59174
>> msec = 65915
>>
>> ct_rwmutex
>> msec = 62772
>> msec = 71139
>> msec = 68110
>> msec = 66038
>> msec = 60436
>> msec = 74238
>
>
>
> Thank you so much! Okay, not great, but all that "radically" hardcore 
> terrible when compared to a std::shared_mutex on your end. Also, my work can 
> benefit from many sessions of padding and alignment therapy wrt 
> data-structure layout.
>
>
>>
>>
>> I can also test on 2 CPU system with 88 hardware threads in total, but
>> the bench hardcodes 16 threads (there is std::hardware_concurrency()
>> or something like that).
>
>
> Indeed! Will correct this issue. Make it dynamic. :^)
>
> Although, it can be beneficial to create more threads just to see how the 
> algorithm handles these cases of "oversubscribed" contention.

Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(),
4*std::hardware_concurrency().


>> I've found that it's critical to model realistic/target ratio between
>> reads/writes and amount of local work and work inside of critical
>> section in such benchmarks. Otherwise in 100% synthetic hammering
>> usually the worst mutex shows the best results. Synthetic benchmarks
>> produce extreme contention, so a mutex that blocks threads as much as
>> possible and allows as few threads as possible to work, shows the best
>> result because contention is minimized. The best would be to run all
>> threads sequentially one-by-one. But that's counterproductive for
>> real life because blocked threads don't do useful work either.
>
>
> Agreed. Just wondering why it performs so much better in some high contention 
> scenarios. Load imbalance is a factor for sure.
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ

I don't need to tell you that performance of sync primitives can be
very non-linear and counter-intuitive :)

Perhaps the other mutexes were not super dumb, so they did not perform
the best under very high contention.


>> Also this benchmark has fixed amount of work per thread, so I suspect
>> it may be subject to load imbalance when an unlucky outlier delays
>> total result. A better option would be to run all threads for X
>> seconds and then set a global var for all of them to stop and then
>> join all of them and measure number of iterations.
>
>
> Indeed. A benchmark that just models a list that readers iterate over, and 
> writers push to and pop from. Then we can show how many operations, reads 
> or writes, they performed per second. So it would be the number of reads and 
> writes per second, per thread.
>
> This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
>
>
>>
>>
>> Also for me it lacked #include .
>
>
> Fixed that. Thanks again:
>
> https://pastebin.com/raw/xCBHY9qd
> (reload the page...)
>
>
> Fwiw, I am wondering what you think about the algorithm itself? Is it crap, 
> or decent? :^)


I think it's one of the best algorithms out there.
The complete fairness for both readers and writers is notable.
And the wait-free fast-path for readers. It still suffers from high
cache line contention even on read-only workload, but it's as good as
one can get with a centralized design (i.e. not per-thread/cpu
distributed stuff which has its own problems). I would expect a
CAS-loop-based read lock to degrade significantly under high read
load.
Btw your algorithm is used as the standard Go RWMutex for the past 7+ years:
https://github.com/golang/go/commit/daaf29cf932
(I should have mentioned your authorship somewhere!)

As for potential improvements I can only think of optimizing
uncontended writer lock/unlock to be 1 RMW each, i.e. not offloading
writer competition resolution to a separate mutex, but rather
incorporate it into the same atomic variable readers and 

Re: [lock-free] Experimental Read/Write Mutex..

2019-02-17 Thread Chris M. Thomasson


On Sunday, February 17, 2019 at 2:51:04 PM UTC-8, Chris M. Thomasson wrote:
>
> On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
>> wrote: 
>
>  

> [...]
>>
> Thank you so much! Okay, not great, but all that "radically" hardcore 
> terrible when compared to a std::shared_mutex on your end. Also, my work 
> can benefit from many sessions of padding and alignment therapy wrt 
> data-structure layout.
> [...]
>

I meant to say it is _not_ all that "radically" hardcore terrible when 
compared to a std::shared_mutex on your end.

Well, let's wait and see what the next, more realistic benchmark shows. lol. ;^)

However, I do like the way my algorithm acquires and releases read access:

_

// READ, pretty slim...
void lock_shared()
{
    // a writer has pulled the count down; wait on the reader waitset
    if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
    {
        m_rdwset.dec();
    }
}

void unlock_shared()
{
    // a writer is waiting; if we are the last reader it counted, wake it
    if (m_count.fetch_add(1, std::memory_order_release) < 0)
    {
        if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
        {
            m_wrwset.inc();
        }
    }
}

__


Not all "that" bad? ;^)



>
> Fwiw, I am wondering what you think about the algorithm itself? Is it 
> crap, or decent? :^)
>



Re: [lock-free] Experimental Read/Write Mutex..

2019-02-17 Thread Chris M. Thomasson
On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>
> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  > wrote: 
> > 
> > Fwiw, here is a simple benchmark for an older read/write algorithm I 
> invented: 
> > 
> > 
> > https://pastebin.com/raw/xCBHY9qd 
> > 
> > 
> > It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a 
> macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the code 
> will test my algorithm. If not defined then it will test against 
> std::shared_mutex. I am wondering if anybody can compile and run the code? 
> Test it out on as many operating systems as you can. Can you please show 
> the output? 
> > 
> > 
> > Thanks everybody. :^) 
> > 
> > Here is the code: 
> > 
> > /* Simple, crude read/write mutex test 
> >by: Chris M. Thomasson 
> > __*/ 
> [...]
>
 

>
> Hey Chris! 
>
> Here are my results on i7-8650U (4 HT cores) on Debian 4.19.16-1 gcc 7.3.0 
>
> std::shared_mutex 
> msec = 56118 
> msec = 49149 
> msec = 69416 
> msec = 68737 
> msec = 59174 
> msec = 65915 
>
> ct_rwmutex 
> msec = 62772 
> msec = 71139 
> msec = 68110 
> msec = 66038 
> msec = 60436 
> msec = 74238 
>


Thank you so much! Okay, not great, but all that "radically" hardcore 
terrible when compared to a std::shared_mutex on your end. Also, my work 
can benefit from many sessions of padding and alignment therapy wrt 
data-structure layout.

 

>
> I can also test on 2 CPU system with 88 hardware threads in total, but 
> the bench hardcodes 16 threads (there is std::hardware_concurrency() 
> or something like that). 
>

Indeed! Will correct this issue. Make it dynamic. :^)

Although, it can be beneficial to create more threads just to see how the 
algorithm handles these cases of "oversubscribed" contention.
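
Something along these lines should do for making it dynamic while still 
covering the oversubscribed cases (just a sketch, the helper name is made up):

_
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: derive the benchmark thread counts from the hardware instead of the
// hardcoded THREADS macro, including the oversubscribed runs.
std::vector<std::size_t> benchmark_thread_counts()
{
    std::size_t hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1; // hardware_concurrency() may report 0 when unknown
    return { hw, 2 * hw, 4 * hw };
}
__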

 

>
> I've found that it's critical to model realistic/target ratio between 
> reads/writes and amount of local work and work inside of critical 
> section in such benchmarks. Otherwise in 100% synthetic hammering 
> usually the worst mutex shows the best results. Synthetic benchmarks 
> produce extreme contention, so a mutex that blocks threads as much as 
> possible and allows as few threads as possible to work, shows the best 
> result because contention is minimized. The best would be to run all 
> threads sequentially one-by-one. But that's counterproductive for 
> real life because blocked threads don't do useful work either. 
>

Agreed. Just wondering why it performs so much better in some high 
contention scenarios. Load imbalance is a factor for sure.

https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ


https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ


 

>
> Also this benchmark has fixed amount of work per thread, so I suspect 
> it may be subject to load imbalance when an unlucky outlier delays 
> total result. A better option would be to run all threads for X 
> seconds and then set a global var for all of them to stop and then 
> join all of them and measure number of iterations. 
>

Indeed. A benchmark that just models a list that readers iterate over, and 
writers push to and pop from. Then we can show how many operations, reads 
or writes, they performed per second. So it would be the number of reads 
and writes per second, per thread.

This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
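
A rough sketch of that kind of timed harness (global stop flag, per-thread 
iteration counts summed at the end; the names and the window are just 
placeholders):

_
#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sketch: run every worker for a fixed wall-clock window, then sum how many
// lock/unlock iterations each one managed, instead of giving each thread a
// fixed amount of work (which is what makes an unlucky outlier matter).
static std::atomic<bool> g_stop{false};

static void worker(std::atomic<unsigned long>& total_iters)
{
    unsigned long local = 0;

    while (!g_stop.load(std::memory_order_relaxed))
    {
        // rwmutex read/write work on the shared list would go here...
        ++local;
    }

    total_iters.fetch_add(local, std::memory_order_relaxed);
}

static unsigned long run_window(std::chrono::seconds window, std::size_t nthreads)
{
    std::atomic<unsigned long> total_iters{0};
    std::vector<std::thread> threads;

    g_stop.store(false, std::memory_order_relaxed);

    for (std::size_t i = 0; i < nthreads; ++i)
    {
        threads.emplace_back(worker, std::ref(total_iters));
    }

    std::this_thread::sleep_for(window);
    g_stop.store(true, std::memory_order_relaxed);

    for (std::thread& t : threads) t.join();

    // reads + writes completed across all threads during the window
    return total_iters.load(std::memory_order_relaxed);
}
__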

 

>
> Also for me it lacked #include . 
>

Fixed that. Thanks again:

https://pastebin.com/raw/xCBHY9qd
(reload the page...) 


Fwiw, I am wondering what you think about the algorithm itself? Is it crap, 
or decent? :^)



Re: [lock-free] Experimental Read/Write Mutex..

2019-02-16 Thread Dmitry Vyukov
On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  wrote:
>
> Fwiw, here is a simple benchmark for an older read/write algorithm I invented:
>
>
> https://pastebin.com/raw/xCBHY9qd
>
>
> It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a macro 
> in the code called CT_TEST_FAST_MUTEX. If it is defined then the code will 
> test my algorithm. If not defined then it will test against 
> std::shared_mutex. I am wondering if anybody can compile and run the code? 
> Test it out on as many operating systems as you can. Can you please show the 
> output?
>
>
> Thanks everybody. :^)
>
> Here is the code:
>
> /* Simple, crude read/write mutex test
>by: Chris M. Thomasson
> __*/
>
>
>
> #include <iostream>
> #include <chrono>
> #include <thread>
> #include <atomic>
> #include <mutex>
> #include <condition_variable>
> #include <shared_mutex>
> #include <climits>
> #include <cstddef>
>
>
> #define THREADS 16UL
> #define ITERS 1000UL
> #define COUNT (THREADS * ITERS)
>
>
> // undefine to test std::shared_mutex
> #define CT_TEST_FAST_MUTEX 1
>
>
> // bare bones mutex/condvar based semaphore
> struct ct_slow_semaphore
> {
> unsigned long m_state;
> std::mutex m_mutex;
> std::condition_variable m_cond;
>
> ct_slow_semaphore(unsigned long state) : m_state(state) {}
>
> void inc()
> {
> {
> std::unique_lock lock(m_mutex);
> ++m_state;
> }
>
> m_cond.notify_one();
> }
>
> void add(unsigned long addend)
> {
> {
> std::unique_lock lock(m_mutex);
> m_state += addend;
> }
>
> m_cond.notify_all();
> }
>
> void dec()
> {
> std::unique_lock lock(m_mutex);
> while (m_state == 0) m_cond.wait(lock);
> --m_state;
> }
> };
>
>
>
>
> // bin-sema
> struct ct_auto_reset_event
> {
> bool m_state;
> std::mutex m_mutex;
> std::condition_variable m_cond;
>
> ct_auto_reset_event() : m_state(false) {}
>
> void signal()
> {
> std::unique_lock lock(m_mutex);
> m_state = true;
> m_cond.notify_one();
> }
>
> void wait()
> {
> std::unique_lock lock(m_mutex);
> while (m_state == false) m_cond.wait(lock);
> m_state = false; // auto-reset
> }
> };
>
>
> // just a layer over an auto-reset event
> struct ct_fast_mutex
> {
> std::atomic<unsigned long> m_state;
> ct_auto_reset_event m_waitset;
>
> ct_fast_mutex() : m_state(0) {}
>
> void lock()
> {
> if (m_state.exchange(1, std::memory_order_acquire))
> {
> while (m_state.exchange(2, std::memory_order_acquire))
> {
> m_waitset.wait();
> }
> }
> }
>
> void unlock()
> {
> if (m_state.exchange(0, std::memory_order_release) == 2)
> {
> m_waitset.signal();
> }
> }
> };
>
>
>
> // Chris M. Thomasson's Experimental Read/Write Mutex
> // Yeah, it is pretty damn fat wrt the state, however
> // it has some interesting properties...
> // The state can be compressed a bit...
> // btw, it has no loops...
> // Take a look at the lock_shared and unlock_shared functions
>
> #define RWMUTEX_COUNT_MAX LONG_MAX
>
> struct ct_rwmutex
> {
> // shared state
> std::atomic<long> m_wrstate;
> std::atomic<long> m_count;
> std::atomic<long> m_rdwake;
>
> ct_slow_semaphore m_rdwset;
> ct_slow_semaphore m_wrwset;
> ct_fast_mutex m_wrlock;
>
>
> ct_rwmutex() :
> m_wrstate(1),
> m_count(RWMUTEX_COUNT_MAX),
> m_rdwake(0),
> m_rdwset(0),
> m_wrwset(0) {
> }
>
>
> // READ, pretty slim...
> void lock_shared()
> {
> if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
> {
> m_rdwset.dec();
> }
> }
>
> void unlock_shared()
> {
> if (m_count.fetch_add(1, std::memory_order_release) < 0)
> {
> if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
> {
> m_wrwset.inc();
> }
> }
> }
>
>
> // WRITE, more hefty
> void lock()
> {
> m_wrlock.lock();
>
> long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, 
> std::memory_order_acquire);
>
> if (count < RWMUTEX_COUNT_MAX)
> {
> long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, 
> std::memory_order_acquire);
>
> if (rdwake + RWMUTEX_COUNT_MAX - count)
> {
> m_wrwset.dec();
> }
> }
> }
>
> // write unlock
> void unlock()
> {
> long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, 
> std::memory_order_release);
>
> if (count < 0)
> {
> m_rdwset.add(-count);
> }
>
> m_wrlock.unlock();
> }
> };
>
>
> struct ct_shared
> {
> std::atomic<unsigned long> m_state;
>
> #if defined (CT_TEST_FAST_MUTEX)
> ct_rwmutex m_std_rwmutex;
> #else
> std::shared_mutex 

[lock-free] Experimental Read/Write Mutex..

2019-02-16 Thread Chris M. Thomasson
Fwiw, here is a simple benchmark for an older read/write algorithm I 
invented:


https://pastebin.com/raw/xCBHY9qd


It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a 
macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the code 
will test my algorithm. If not defined then it will test against 
std::shared_mutex. I am wondering if anybody can compile and run the code? 
Test it out on as many operating systems as you can. Can you please show 
the output?


Thanks everybody. :^)

Here is the code:

/* Simple, crude read/write mutex test
   by: Chris M. Thomasson
__*/



#include <iostream>
#include <chrono>
#include <thread>
#include <atomic>
#include <mutex>
#include <condition_variable>
#include <shared_mutex>
#include <climits>
#include <cstddef>


#define THREADS 16UL
#define ITERS 1000UL
#define COUNT (THREADS * ITERS)


// undefine to test std::shared_mutex
#define CT_TEST_FAST_MUTEX 1


// bare bones mutex/condvar based semaphore
struct ct_slow_semaphore
{
unsigned long m_state;
std::mutex m_mutex;
std::condition_variable m_cond;

ct_slow_semaphore(unsigned long state) : m_state(state) {}

void inc()
{
{
std::unique_lock lock(m_mutex);
++m_state;
}

m_cond.notify_one();
}

void add(unsigned long addend)
{
{
std::unique_lock lock(m_mutex);
m_state += addend;
}

m_cond.notify_all();
}

void dec()
{
std::unique_lock lock(m_mutex);
while (m_state == 0) m_cond.wait(lock);
--m_state;
}
};




// bin-sema
struct ct_auto_reset_event
{
bool m_state;
std::mutex m_mutex;
std::condition_variable m_cond;

ct_auto_reset_event() : m_state(false) {}

void signal()
{
std::unique_lock lock(m_mutex);
m_state = true;
m_cond.notify_one();
}

void wait()
{
std::unique_lock lock(m_mutex);
while (m_state == false) m_cond.wait(lock);
m_state = false; // auto-reset
}
};


// just a layer over an auto-reset event
struct ct_fast_mutex
{
// 0 == unlocked, 1 == locked, 2 == locked with possible waiters
std::atomic<unsigned long> m_state;
ct_auto_reset_event m_waitset;

ct_fast_mutex() : m_state(0) {}

void lock()
{
if (m_state.exchange(1, std::memory_order_acquire))
{
while (m_state.exchange(2, std::memory_order_acquire))
{
m_waitset.wait();
}
}
}

void unlock()
{
if (m_state.exchange(0, std::memory_order_release) == 2)
{
m_waitset.signal();
}
}
};



// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...
// Take a look at the lock_shared and unlock_shared functions

#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
// shared state
std::atomic<long> m_wrstate;
std::atomic<long> m_count;
std::atomic<long> m_rdwake;

ct_slow_semaphore m_rdwset;
ct_slow_semaphore m_wrwset;
ct_fast_mutex m_wrlock;


ct_rwmutex() :
m_wrstate(1),
m_count(RWMUTEX_COUNT_MAX),
m_rdwake(0),
m_rdwset(0),
m_wrwset(0) {
}


// READ, pretty slim...
void lock_shared()
{
if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
{
m_rdwset.dec();
}
}

void unlock_shared()
{
if (m_count.fetch_add(1, std::memory_order_release) < 0)
{
if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
{
m_wrwset.inc();
}
}
}


// WRITE, more hefty
void lock()
{
m_wrlock.lock();

long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, 
std::memory_order_acquire);

if (count < RWMUTEX_COUNT_MAX)
{
long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, 
std::memory_order_acquire);

if (rdwake + RWMUTEX_COUNT_MAX - count)
{
m_wrwset.dec();
}
}
}

// write unlock
void unlock()
{
long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, 
std::memory_order_release);

if (count < 0)
{
m_rdwset.add(-count);
}

m_wrlock.unlock();
}
};


struct ct_shared
{
std::atomic<unsigned long> m_state;

#if defined (CT_TEST_FAST_MUTEX)
ct_rwmutex m_std_rwmutex;
#else
std::shared_mutex m_std_rwmutex;
#endif

ct_shared() : m_state(0) {}
};


void ct_thread(ct_shared& shared, std::size_t index)
{
for (unsigned int i = 0; i < ITERS; ++i)
{

shared.m_std_rwmutex.lock();
if (i % 256 == 0) std::this_thread::yield();
shared.m_state += 1;
shared.m_std_rwmutex.unlock();


shared.m_std_rwmutex.lock_shared();
if (i % 512 == 0) std::this_thread::yield();
//shared.m_state += 1;