[lock-free] Re: Experimental Read/Write Mutex..

2019-03-15 Thread Chris M. Thomasson
Sorry for not working on this like I should be doing. Fwiw, I got 
distracted because several of my fractals are now on a very nice site:

http://paulbourke.net/fractals

My work:

http://paulbourke.net/fractals/logspiral

http://paulbourke.net/fractals/multijulia

http://paulbourke.net/fractals/cubicjulia

http://paulbourke.net/fractals/fractionalpowers

The multi julia is very interesting, and creates some strange looking 3d 
renderings, even though the algorithm is 100% 2d in nature. Interested? 

To clarify, Paul used my special IFS algorithm to create the renderings. He 
used his own coloring algorithms. 



Re: [lock-free] Re: TensorFlow scheduler

2019-03-07 Thread Chris M. Thomasson


On Thursday, March 7, 2019 at 2:59:51 PM UTC-8, Chris M. Thomasson wrote:
>
> On Wednesday, March 6, 2019 at 9:56:54 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Wed, Mar 6, 2019 at 7:11 AM Chris M. Thomasson  
>> wrote: 
>> >>> 
>> >>> Hi, 
>> >>> 
>> >>> TensorFlow CPU task scheduler I wrote some time ago: 
>> >>> 
>> >>> 
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
>>  
>> >>> 
>> >>> 
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
>>  
>> >>> 
>> >>> 
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
>>  
>> >>> 
>> >>> This and some other fixes improved model training time on CPU up to 
>> 3x. 
>> >>> There is some interesting lock-free stuff in there. 
>> >>> [...] 
>> >> 
>> >> 
>> >> The code is a work of art: Well done Dmitry! 
>>
>> Thanks! 
>>
>
> No problem. Well, when I see awesome code, I know it, in a sense. :^)
>  
>
>>
>> > Love the: 
>> > 
>> >  Work PushFront(Work w) { 
>> > unsigned front = front_.load(std::memory_order_relaxed); 
>> > Elem* e = &array_[front & kMask]; 
>> > uint8_t s = e->state.load(std::memory_order_relaxed); 
>> > if (s != kEmpty || 
>> > !e->state.compare_exchange_strong(s, kBusy, 
>> std::memory_order_acquire)) 
>> >   return w; 
>> > front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed); 
>> > e->w = std::move(w); 
>> > e->state.store(kReady, std::memory_order_release); 
>> > return Work(); 
>> > 
>> > } 
>> > 
>> > part. It reminds me of the same basic logic in your mpmc queue: 
>> > 
>> > 
>> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue 
>> > 
>> > Ahh, I still like my little tweak of your most excellent algorithm: 
>> > 
>> > https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion 
>> > 
>> > ;^) 
>>
>> IIRC once a thread executed XADD it has no way back, which implies 
>> blocking/waiting on empty/full queue. 
>> Here I needed "try" versions because this is not a producer-consumer 
>> scenario and we don't need back-pressure in cooperative task execution 
>> scenario. If we fail to pop from one queue, we will go and try to pop 
>> from another. If we fail to push onto a queue, we can push onto 
>> another or execute directly. 
>>
>
> Hard to disagree with this point. Humm... Thanks again. 
>

Perhaps there can be a mix of try and/or wait commit access. Kind of like 
my XADD tweak to your excellent CAS based queue.
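Roughly what I mean, as a sketch only: my reading of the two commit styles 
on a bounded ring in the style of your mpmc queue. The consumer side is 
omitted and the names are mine, not code from either queue.

_
#include <atomic>

template <typename T, unsigned kSize> // kSize must be a power of two
struct ct_ring
{
    struct cell { std::atomic<unsigned> m_seq; T m_value; };

    cell m_cells[kSize];
    std::atomic<unsigned> m_tail;

    ct_ring() : m_tail(0)
    {
        for (unsigned i = 0; i != kSize; ++i)
            m_cells[i].m_seq.store(i, std::memory_order_relaxed);
    }

    // "try" commit: the CAS can fail, so the caller can back out and
    // push onto another queue, or just execute the task directly.
    bool try_push(T v)
    {
        unsigned pos = m_tail.load(std::memory_order_relaxed);
        cell& c = m_cells[pos & (kSize - 1)];
        if (c.m_seq.load(std::memory_order_acquire) != pos)
            return false; // full, or another producer got here first
        if (! m_tail.compare_exchange_strong(pos, pos + 1,
                std::memory_order_relaxed))
            return false; // lost the race; no side effects to undo
        c.m_value = v;
        c.m_seq.store(pos + 1, std::memory_order_release);
        return true;
    }

    // "wait" commit: XADD claims a slot unconditionally; there is no
    // way back, so on a full ring we must wait for the slot to drain.
    void push_wait(T v)
    {
        unsigned pos = m_tail.fetch_add(1, std::memory_order_relaxed);
        cell& c = m_cells[pos & (kSize - 1)];
        while (c.m_seq.load(std::memory_order_acquire) != pos)
        {
            // back-pressure: spin, yield, or park right here...
        }
        c.m_value = v;
        c.m_seq.store(pos + 1, std::memory_order_release);
    }
};
_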



Re: [lock-free] Re: TensorFlow scheduler

2019-03-07 Thread Chris M. Thomasson
On Wednesday, March 6, 2019 at 9:56:54 PM UTC-8, Dmitry Vyukov wrote:
>
> On Wed, Mar 6, 2019 at 7:11 AM Chris M. Thomasson  > wrote: 
> >>> 
> >>> Hi, 
> >>> 
> >>> TensorFlow CPU task scheduler I wrote some time ago: 
> >>> 
> >>> 
> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
>  
> >>> 
> >>> 
> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
>  
> >>> 
> >>> 
> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
>  
> >>> 
> >>> This and some other fixes improved model training time on CPU up to 
> 3x. 
> >>> There is some interesting lock-free stuff in there. 
> >>> [...] 
> >> 
> >> 
> >> The code is a work of art: Well done Dmitry! 
>
> Thanks! 
>

No problem. Well, when I see awesome code, I know it, in a sense. :^)
 

>
> > Love the: 
> > 
> >  Work PushFront(Work w) { 
> > unsigned front = front_.load(std::memory_order_relaxed); 
> > Elem* e = &array_[front & kMask]; 
> > uint8_t s = e->state.load(std::memory_order_relaxed); 
> > if (s != kEmpty || 
> > !e->state.compare_exchange_strong(s, kBusy, 
> std::memory_order_acquire)) 
> >   return w; 
> > front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed); 
> > e->w = std::move(w); 
> > e->state.store(kReady, std::memory_order_release); 
> > return Work(); 
> > 
> > } 
> > 
> > part. It reminds me of the same basic logic in your mpmc queue: 
> > 
> > 
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue 
> > 
> > Ahh, I still like my little tweak of your most excellent algorithm: 
> > 
> > https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion 
> > 
> > ;^) 
>
> IIRC once a thread executed XADD it has no way back, which implies 
> blocking/waiting on empty/full queue. 
> Here I needed "try" versions because this is not a producer-consumer 
> scenario and we don't need back-pressure in cooperative task execution 
> scenario. If we fail to pop from one queue, we will go and try to pop 
> from another. If we fail to push onto a queue, we can push onto 
> another or execute directly. 
>

Hard to disagree with this point. Humm... Thanks again. 



[lock-free] Re: TensorFlow scheduler

2019-03-05 Thread Chris M. Thomasson


On Tuesday, March 5, 2019 at 10:04:40 PM UTC-8, Chris M. Thomasson wrote:
>
>
>
> On Wednesday, February 27, 2019 at 1:35:04 AM UTC-8, Dmitry Vyukov wrote:
>>
>> Hi, 
>>
>> TensorFlow CPU task scheduler I wrote some time ago: 
>>
>>
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
>>  
>>
>>
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
>>  
>>
>>
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
>>  
>>
>> This and some other fixes improved model training time on CPU up to 3x. 
>> There is some interesting lock-free stuff in there. 
>> [...]
>
>
> The code is a work of art: Well done Dmitry!
>

Love the:

 Work PushFront(Work w) {
     unsigned front = front_.load(std::memory_order_relaxed);
     Elem* e = &array_[front & kMask];
     uint8_t s = e->state.load(std::memory_order_relaxed);
     if (s != kEmpty ||
         !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
       return w;
     front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed);
     e->w = std::move(w);
     e->state.store(kReady, std::memory_order_release);
     return Work();
 }

part. It reminds me of the same basic logic in your mpmc queue:

http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue

Ahh, I still like my little tweak of your most excellent algorithm:

https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion

;^)



[lock-free] Re: Experimental Read/Write Mutex..

2019-03-04 Thread Chris M. Thomasson


On Saturday, March 2, 2019 at 7:28:28 AM UTC-8, Manuel Pöter wrote:
>
>
>
> On Friday, 1 March 2019 23:29:41 UTC+1, Chris M. Thomasson wrote:
>>
>> On Friday, March 1, 2019 at 7:11:20 AM UTC-8, Manuel Pöter wrote:
>>>
>>> I could get my hands on an AMD EPYC 7351P system and a Xeon Phi (Knights 
>>> Corner) if you are interested in more results...
>>>
>>
>> I would be grateful if you can find the time to do this Manuel: Thanks. 
>> :^)
>>
>  
> Will do! :) 
>

Thanks. I am interested in all the results you can get.
 

>
> Really like the "difference" between the Intel and SPARC results. Afaict, 
>> std::shared_mutex, on Solaris? Needs to be fixed? I do not like the way it 
>> seems to "tank" the reads and obtain a shi% load of writes. Humm...
>>
>
> Yes, the SPARC system is running Solaris 11.3; the binaries were built 
> using a self-compiled GCC 6.3. I have to admit that I did not really look 
> at the results before posting them here, but now that you pointed it out I 
> agree - the shared_mutex results on Solaris are definitely not what I would 
> have expected!
>

Agreed. Very strange, massive writer preference. Humm...



Re: [lock-free] Experimental Read/Write Mutex..

2019-03-04 Thread Chris M. Thomasson


On Sunday, March 3, 2019 at 1:28:03 AM UTC-8, Chris M. Thomasson wrote:
>
> On Tuesday, February 26, 2019 at 12:16:24 AM UTC-8, Dmitry Vyukov wrote:
>>
>> On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson  
>> wrote: 
>> [...]
>> We can try with some new change :) 
>>
>
> Just as a first start, I changed m_count to unsigned long; it is 
> "integrated" with lock and unlock, and still has no loops. Just a first 
> exploratory experiment. Here is the code:
>
> It still only uses fetch_add. Humm, not sure what to make of it. Also, I 
> am pissed that fetch_or compiles to a CMPXCHG loop on x86.
>
> _
> // Chris M. Thomasson's Experimental Read/Write Mutex
> [...]
>
> // WRITE, more hefty
> void wrlock()
> {
> unsigned long wrcount = m_count.fetch_add(WRITE, 
> std::memory_order_acquire);
>
> if ((wrcount & WRITE_MASK) > 0)
> {
> m_wrmtx.dec();
> }
>
^^

Still thinking about getting rid of the above mutex, and directly 
integrating the waiting into a single semaphore without using loops. 
Humm... I would have to change the way readers can unlock writers and vice 
versa. It is a bit challenging to do without loops.

 

>
> unsigned long count = m_count.fetch_add(WRITE_BIT, 
> std::memory_order_acquire);
>

[...]



Re: [lock-free] Experimental Read/Write Mutex..

2019-03-03 Thread Chris M. Thomasson
On Tuesday, February 26, 2019 at 12:16:24 AM UTC-8, Dmitry Vyukov wrote:
>
> On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson  > wrote: 
> [...]
> We can try with some new change :) 
>

Just as a first start, I changed m_count to unsigned long; it is 
"integrated" with lock and unlock, and still has no loops. Just a first 
exploratory experiment. Here is the code:

It still only uses fetch_add. Humm, not sure what to make of it. Also, I am 
pissed that fetch_or compiles to a CMPXCHG loop on x86.
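For reference, this is roughly the shape of what a compiler has to emit for 
fetch_or when the previous value is used, sketched by hand (illustrative, 
not from any posted code): x86 has LOCK XADD for fetch_add, but no 
or-and-return-previous instruction, so fetch_or turns into a CMPXCHG retry 
loop.

_
#include <atomic>

// morally what x.fetch_or(mask, order) lowers to on x86 when the
// previous value is needed: a compare-exchange retry loop.
unsigned long fetch_or_by_cas(std::atomic<unsigned long>& x,
                              unsigned long mask,
                              std::memory_order order)
{
    unsigned long prev = x.load(std::memory_order_relaxed);
    while (! x.compare_exchange_weak(prev, prev | mask, order,
                                     std::memory_order_relaxed))
    {
        // prev was reloaded by the failed CAS; just retry...
    }
    return prev;
}
_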

_
// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...

// 32-bit layout
#define READ        0x00000001UL // one reader; readers live in the low 16 bits
#define READ_MASK   0x0000FFFFUL
#define WRITE_BIT   0x00010000UL // writer-active flag
#define WRITE       0x00020000UL // one writer; writers live in the high bits
#define WRITE_MASK  0xFFFE0000UL
#define NEUTRAL     0x00000000UL


struct ct_rwmutex
{
// shared state
std::atomic<unsigned long> m_count;
std::atomic<long> m_rdwake;

ct_slow_semaphore m_rdwset;
ct_slow_semaphore m_wrwset;
ct_slow_semaphore m_wrmtx;


ct_rwmutex() :
m_count(NEUTRAL),
m_rdwake(0),
m_rdwset(0),
m_wrwset(0),
m_wrmtx(0) {
}


~ct_rwmutex()
{
RL_ASSERT(m_count.load(std::memory_order_relaxed) == 0);
}


// READ, pretty slim...
void rdlock()
{
if (m_count.fetch_add(READ, std::memory_order_acquire) & WRITE_BIT)
{
m_rdwset.dec();
}
}

void rdunlock()
{
if (m_count.fetch_sub(READ, std::memory_order_release) & WRITE_BIT)
{
if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
{
m_wrwset.inc();
}
}
}


// WRITE, more hefty
void wrlock()
{
unsigned long wrcount = m_count.fetch_add(WRITE, 
std::memory_order_acquire);

if ((wrcount & WRITE_MASK) > 0)
{
m_wrmtx.dec();
}

unsigned long count = m_count.fetch_add(WRITE_BIT, 
std::memory_order_acquire);
unsigned long readers = count & READ_MASK;

if (readers)
{
long rdwake = m_rdwake.fetch_add(readers, 
std::memory_order_acquire);

if (rdwake + readers)
{
m_wrwset.dec();
}
}
}

// write unlock
void wrunlock()
{
unsigned long count = m_count.fetch_sub(WRITE_BIT, 
std::memory_order_release);
unsigned long readers = count & READ_MASK;

if (readers)
{
m_rdwset.add(readers);
}

unsigned long wrcount = m_count.fetch_sub(WRITE, 
std::memory_order_relaxed);

if ((wrcount & WRITE_MASK) > WRITE)
{
m_wrmtx.inc();
}
}
};
_


Humm... Any thoughts? ;^)
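
Fwiw, a minimal usage sketch, just to show the intended pairing (the shared 
structure is a stand-in):

_
ct_rwmutex g_rwmutex;

void reader()
{
    g_rwmutex.rdlock();
    // ... iterate the shared structure, read-only ...
    g_rwmutex.rdunlock();
}

void writer()
{
    g_rwmutex.wrlock();
    // ... mutate the shared structure ...
    g_rwmutex.wrunlock();
}
_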

 

>
> I've never seen any names in Go sources before. And everything is usually 
> very consistent, it's not that everybody can do whatever they want in 
> a new file just because they wrote it and nobody else will care. 
> People actually care a lot about consistency and common style for 
> everything. 
> But now grepping the source I've found an existing precedent: 
> https://github.com/golang/go/blob/master/src/go/printer/nodes.go#L662 
>  
> So we may try too :) 
>

Very cool. Thanks Dmitry. 



[lock-free] Re: Experimental Read/Write Mutex..

2019-03-02 Thread Chris M. Thomasson
I need to create another type of test, one that does not hammer things so 
much wrt creating equal numbers of reader and writer threads. I need to 
basically model a "read-mostly" work environment, instead of a 
hyper-hardcore read-and-write-focused contention massacre...
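
Something like the following sketch is what I have in mind (the 90/10 split 
is an arbitrary placeholder, and ct_rwmutex is the algorithm from the other 
posts in this thread):

_
#include <random>

// a "read-mostly" worker: roughly nine reads for every write, instead
// of equal numbers of dedicated reader and writer threads.
void ct_read_mostly_worker(ct_rwmutex& rwmutex, unsigned long iters)
{
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<unsigned> dist(0, 9);

    for (unsigned long i = 0; i < iters; ++i)
    {
        if (dist(rng) != 0) // ~90% of the passes are reads
        {
            rwmutex.lock_shared();
            // ... iterate the shared list ...
            rwmutex.unlock_shared();
        }
        else // ~10% of the passes are writes
        {
            rwmutex.lock();
            // ... pop and push a node ...
            rwmutex.unlock();
        }
    }
}
_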



[lock-free] Re: Experimental Read/Write Mutex..

2019-03-01 Thread Chris M. Thomasson
On Friday, March 1, 2019 at 7:11:20 AM UTC-8, Manuel Pöter wrote:
>
> I could get my hands on an AMD EPYC 7351P system and a Xeon Phi (Knights 
> Corner) if you are interested in more results...
>

I would be grateful if you can find the time to do this Manuel: Thanks. :^)

Really like the "difference" between the Intel and SPARC results. Afaict, 
std::shared_mutex, on Solaris? Needs to be fixed? I do not like the way it 
seems to "tank" the reads and obtain a shi% load of writes. Humm...



[lock-free] Re: Experimental Read/Write Mutex..

2019-02-28 Thread Chris M. Thomasson
On Wednesday, February 27, 2019 at 6:34:41 AM UTC-8, Manuel Pöter wrote:
>
> Benchmarked this on 2 systems with `THREADS` set to 1, 2 and 4.
>
> Result 1:
>
> 8x Intel(R) Xeon(R) CPU E7- 8850  @ 2.00GHz (80 cores, 2x SMT)
>
> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
> ___
> cpu_threads_n = 160
> threads_n = 160
> writers = 80
> readers = 80
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 186803
> Raw Writes: 1922
> reads_per_tick = 3111
> writes_per_tick = 32
> Ticks = 60.0334
> ___
>
> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
> ___
> cpu_threads_n = 160
> threads_n = 320
> writers = 160
> readers = 160
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 336265
> Raw Writes: 2559
> reads_per_tick = 5596
> writes_per_tick = 42
> Ticks = 60.0848
> ___
>
> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
> ___
> cpu_threads_n = 160
> threads_n = 640
> writers = 320
> readers = 320
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 449283
> Raw Writes: 3718
> reads_per_tick = 7471
> writes_per_tick = 61
> Ticks = 60.1302
> ___
>
> Testing Version 0.1: std::shared_mutex
> ___
> cpu_threads_n = 160
> threads_n = 160
> writers = 80
> readers = 80
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 191840
> Raw Writes: 784
> reads_per_tick = 3194
> writes_per_tick = 13
> Ticks = 60.0533
> ___
>
> Testing Version 0.1: std::shared_mutex
> ___
> cpu_threads_n = 160
> threads_n = 320
> writers = 160
> readers = 160
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 350020
> Raw Writes: 1738
> reads_per_tick = 5826
> writes_per_tick = 28
> Ticks = 60.0688
> ___
>
> Testing Version 0.1: std::shared_mutex
> ___
> cpu_threads_n = 160
> threads_n = 640
> writers = 320
> readers = 320
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 706867
> Raw Writes: 1830
> reads_per_tick = 11752
> writes_per_tick = 30
> Ticks = 60.1452
> ___
>


Okay. Thank you. So, my work is losing wrt reads, but performing more 
writes. This has to be an aspect of my algorithm's strict bakery-style 
fairness. Writers can never starve readers, and vice versa. It has no 
reader or writer preference.

 

>
> Result 2:
>
> 4x SPARC-T5-4 (64 cores, 8x SMT)
>
> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
> ___
> cpu_threads_n = 512
> threads_n = 512
> writers = 256
> readers = 256
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 640255
> Raw Writes: 7999
> reads_per_tick = 10650
> writes_per_tick = 133
> Ticks = 60.1149
> ___
>
> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
> ___
> cpu_threads_n = 512
> threads_n = 1024
> writers = 512
> readers = 512
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 948097
> Raw Writes: 12602
> reads_per_tick = 15746
> writes_per_tick = 209
> Ticks = 60.2094
> ___
>
> Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
> ___
> cpu_threads_n = 512
> threads_n = 2048
> writers = 1024
> readers = 1024
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 1718250
> Raw Writes: 23019
> reads_per_tick = 28402
> writes_per_tick = 380
> Ticks = 60.4974
> ___
>
>
> Testing Version 0.1: std::shared_mutex
> ___
> cpu_threads_n = 512
> threads_n = 512
> writers = 256
> readers = 256
> test duration = 60 seconds
> ___
> ___
> Raw Reads: 4482
> Raw Writes: 2166488
> reads_per_tick = 74
> writes_per_tick = 36045
> Ticks = 60.1037
> ___
>
> Testing Version 0.1: std::shared_mutex
> ___
> cpu_threads_n = 512
> threads_n = 1024
> writers = 512
> readers = 512
> test duration = 60 seconds
> ___
> ___
> Raw 

[lock-free] Re: Abseil/nsync Mutex

2019-02-20 Thread Chris M. Thomasson
On Wednesday, February 20, 2019 at 12:59:28 AM UTC-8, Dmitry Vyukov wrote:
>
> Hi, 
>
> FYI, if you haven't seen this yet: 
>
> https://github.com/abseil/abseil-cpp/blob/master/absl/synchronization/mutex.h 
>
> https://github.com/abseil/abseil-cpp/blob/master/absl/synchronization/mutex.cc
>  
>
> Some minor improvements are possible, but overall it's a piece of art. 
> Originally developed by Mike Burrows 
> (https://en.wikipedia.org/wiki/Michael_Burrows). If you like learning 
> synchronization primitives, this one will be a good time investment. 
> The main design goal I guess is handling all possible usage scenarios 
> gracefully (i.e. not just readers, or writers, or light contention, or 
> heavy contention, or few waiters, or thousands of waiters, etc) which 
> is expected of a primitive used in literally millions of place across 
> thousands of systems. It also has some interesting features like 
> contention profiling and deadlock detection. FWIW it also fits into a 
> single intptr (i.e. 4 bytes on 32-bit systems). 
>
> There is also a bit simplified version of the same algo in nsync library: 
> https://github.com/google/nsync/blob/master/README 
> https://github.com/google/nsync/blob/master/internal/mu.c 
> It's C and does not have contention profiling and deadlock detection, 
> so may be easier for leaning of the mutex algorithm itself. 
>

Interesting! Will take a look. Thank you for the heads up Dmitry. :^) 



[lock-free] crude read/write benchmark code...

2019-02-20 Thread Chris M. Thomasson
Fwiw, I wrote a crude new benchmark that measures how many reads and writes 
can be performed in a given amount of time: my algorithm vs 
std::shared_mutex. So, we are _primarily_ looking for how many reads can be 
performed in this test in 60 seconds. The number of threads is variable and 
is determined by std::hardware_concurrency() * THREADS, with THREADS set to 
8 in the test. This is the latest version 0.1. It will say the version 
number in the output:

https://pastebin.com/raw/1QtPCGhV
(hit refresh if you don't see version 0.1)


I had to change something, basically add a volatile so the compiler would 
not optimize the read iteration over the shared list away into oblivion.

https://groups.google.com/d/msg/comp.lang.c++/DBIG55vCBSA/JZOkaBlYAAAJ
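
The gist of the fix, sketched from memory (the node layout and the names 
here are illustrative, not the exact benchmark code):

_
struct node { node* m_next; };

// writing the result to a volatile is an observable side effect, so
// the optimizer can no longer delete the whole read-only traversal.
static volatile unsigned long g_read_sink = 0;

void read_iterate(node* head)
{
    unsigned long count = 0;
    for (node* n = head; n; n = n->m_next) ++count;
    g_read_sink = g_read_sink + count; // keeps the loop alive
}
_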

Sorry about that... Anyway my system is:
 
___ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___ 



This is going to be different for each system. Readers iterate a shared 
linked list with 100 nodes in this test. Writers pop and push items, 
and never use new or delete. Well, so far, the timings I am getting on my 
end using MSVC 2017 are:

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex 

___ 
Raw Reads: 54195 
Raw Writes: 3232 
reads_per_tick = 902 
writes_per_tick = 53 
Ticks = 60.0432 
___ 



Testing Version 0.1: std::shared_mutex 

___ 
Raw Reads: 23452 
Raw Writes: 1873 
reads_per_tick = 390 
writes_per_tick = 31 
Ticks = 60.0513 
___ 



Well, when you get some time, can you run it?
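
For the curious, the writer side boils down to recycling one node per pass, 
something like this sketch (node layout illustrative, not the exact 
benchmark code):

_
struct node { node* m_next; };

// pop the head node, then push the very same node back, so the list
// stays at 100 nodes and the writers never touch new or delete.
void writer_pass(ct_rwmutex& rwmutex, node*& head)
{
    rwmutex.lock();
    node* n = head;
    if (n) head = n->m_next; // pop
    rwmutex.unlock();

    if (! n) return;

    rwmutex.lock();
    n->m_next = head; // push the same node right back
    head = n;
    rwmutex.unlock();
}
_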



[lock-free] Re: Experimental Read/Write Mutex..

2019-02-19 Thread Chris M. Thomasson
Fwiw, I wrote a crude new benchmark that measures how many reads and writes 
can be performed in a given amount of time: my algorithm vs 
std::shared_mutex. So, we are _primarily_ looking for how many reads can be 
performed in this test in 60 seconds. The number of threads is variable and 
is determined by std::hardware_concurrency() * THREADS, with THREADS set to 
8 in the test. So on my system the setup is: 
___ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___ 

This is going to be different for each system. Readers iterate a shared 
linked list with 100 nodes in this test. Writers pop and push items, 
and never use new or delete. Well, so far, the timings I am getting on my 
end using MSVC 2017 are: 


Testing: Chris M. Thomasson's Experimental Read/Write Mutex 

___ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___ 


___ 
Raw Reads: 18046464 
Raw Writes: 347498 
reads_per_tick = 300693 
writes_per_tick = 5790 
Ticks = 60.0161 
___ 



Testing: std::shared_mutex 

___ 
cpu_threads_n = 4 
threads_n = 32 
writers = 16 
readers = 16 
test duration = 60 seconds 
___ 


___ 
Raw Reads: 650006 
Raw Writes: 168741 
reads_per_tick = 10829 
writes_per_tick = 2811 
Ticks = 60.0193 
___ 



As you can see, my algorithm is performing many more reads than 
std::shared_mutex. Anyway, here is my code: 

When you get some time, can you please give it a go? I want to see timings 
on other systems, the good, bad and ugly. ;^) Thanks. 

https://pastebin.com/raw/1QtPCGhV 
___ 
/* Crude Read/Write Mutex Benchmark 

   This tests my algorithm ct_rwmutex vs. std::shared_mutex. 

   It shows how many reads and writes can be performed within 
   a fixed amount of time. 

   by: Chris M. Thomasson 
__*/ 



#include <iostream> 
#include <thread> 
#include <atomic> 
#include <mutex> 
#include <shared_mutex> 
#include <condition_variable> 
#include <chrono> 
#include <functional> 
#include <vector> 
#include <climits> 
#include <cstddef> 


// undefine to test std::shared_mutex 
#define CT_TEST_FAST_MUTEX 1 


#define THREADS 8 // multiplied by std::hardware_concurrency 
#define NODE_PRIME 100 // number of nodes 
#define CRUDE_CACHE_PAD 256 // crude cache pad 
#define TEST_DURATION_SEC 60 // number of seconds for the test 


// bare bones mutex/condvar based semaphore 
struct ct_slow_semaphore 
{ 
unsigned long m_state; 
std::mutex m_mutex; 
std::condition_variable m_cond; 

ct_slow_semaphore(unsigned long state) : m_state(state) {} 

void inc() 
{ 
m_mutex.lock(); 
++m_state; 
m_mutex.unlock(); 
m_cond.notify_one(); 
} 

void add(unsigned long addend) 
{ 
{ 
std::unique_lock<std::mutex> lock(m_mutex); 
m_state += addend; 
} 

m_cond.notify_all(); 
} 

void dec() 
{ 
std::unique_lock<std::mutex> lock(m_mutex); 
while (m_state == 0) m_cond.wait(lock); 
--m_state; 
} 
}; 




// bin-sema 
struct ct_auto_reset_event 
{ 
bool m_state; 
std::mutex m_mutex; 
std::condition_variable m_cond; 

ct_auto_reset_event() : m_state(false) {} 

void signal() 
{ 
std::unique_lock<std::mutex> lock(m_mutex); 
m_state = true; 
m_cond.notify_one(); 
} 

void wait() 
{ 
std::unique_lock<std::mutex> lock(m_mutex); 
while (m_state == false) m_cond.wait(lock); 
m_state = false; // auto-reset 
} 
}; 




// just a layer over an auto-reset event 
struct ct_fast_mutex 
{ 
std::atomic<unsigned int> m_state; 
ct_auto_reset_event m_waitset; 

ct_fast_mutex() : m_state(0) {} 

void lock() 
{ 
if (m_state.exchange(1, std::memory_order_acquire)) 
{ 
while (m_state.exchange(2, std::memory_order_acquire)) 
{ 
m_waitset.wait(); 
} 
} 
} 

void unlock() 
{ 
if (m_state.exchange(0, std::memory_order_release) == 2) 
{ 
m_waitset.signal(); 
} 
} 
}; 



// Chris M. Thomasson's Experimental Read/Write Mutex 
// Yeah, it is pretty damn fat wrt the state, however 
// it has some interesting properties... 
// The state can be compressed a bit... 
// btw, it has no loops... 
// Take a look at the lock_shared and unlock_shared functions 

// The number of readers _must_ never exceed LONG_MAX! 

#define RWMUTEX_COUNT_MAX LONG_MAX 

struct ct_rwmutex 
{ 
unsigned char m_crude_cache_pad_0[CRUDE_CACHE_PAD]; 

// shared state 
std::atomic<long> m_count; 
unsigned char m_crude_cache_pad_1[CRUDE_CACHE_PAD]; 

std::atomic<long> m_rdwake; 
unsigned char

[lock-free] Re: Experimental Read/Write Mutex..

2019-02-19 Thread Chris M. Thomasson
On Monday, February 18, 2019 at 8:50:21 PM UTC-8, Chris M. Thomasson wrote:
>
> Fwiw, here is a more realistic benchmark, in the Relacy stage for 
> verification purposes:
>
> https://pastebin.com/raw/j4wrg050
>
>

Fwiw, I have almost completed my new benchmark in pure C++11. The timings 
that I am getting measure a different thing: how many reads and writes can 
be completed within a given time frame. Well, here are some initial results:

Testing: Chris M. Thomasson's Experimental Read/Write Mutex

Raw Reads: 3086724
Raw Writes: 59651
reads_per_tick = 308309
writes_per_tick = 5958
Ticks = 10.0118


Testing: std::shared_mutex

Raw Reads: 110922
Raw Writes: 29502
reads_per_tick = 11071
writes_per_tick = 2944
Ticks = 10.019


For a simple test of 10 seconds, my algorithm allows for 3086724 reads and 
59651 writes to be completed, while std::shared_mutex allows for only 
110922 reads and 29502 writes. Damn. Look at the difference in the read 
rate per tick! Wow. I can't wait for you, and others, to test it and tell me 
the results, good or bad. Heck, std::shared_mutex should beat my work, 
wouldn't ya think? Humm... 



Re: [lock-free] Experimental Read/Write Mutex..

2019-02-19 Thread Chris M. Thomasson
On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
>
> On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson  > wrote: 
> > 
> > On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov 
> wrote: 
> >> 
> >> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
> wrote: 
> [...]
> >> I can also test on 2 CPU system with 88 hardware threads in total, but 
> >> the bench hardcodes 16 threads (there is std::hardware_concurrency() 
> >> or something like that). 
> > 
> > 
> > Indeed! Will correct this issue. Make it dynamic. :^) 
> > 
> > Although, it can be beneficial to create more threads just to see how 
> the algorithm handles these cases of "oversubscribed" contention. 
>
> Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(), 
> 4*std::hardware_concurrency(). 
>

Indeed.

 

> >> I've found that it's critical to model realistic/target ratio between 
> >> reads/writes and amount of local work and work inside of critical 
> >> section in such benchmarks. Otherwise in 100% synthetic hammering 
> >> usually the worst mutex shows the best results. Synthetic benchmarks 
> >> produce extreme contention, so a mutex that blocks threads as much as 
> >> possible and allows as few threads as possible to work, shows the best 
> >> result because contention is minimized. The best would be to run all 
> >> threads sequentially one-by-one. But that's counter productive for 
> >> real life because blocked threads don't do useful work too. 
> > 
> > 
> > Agreed. Just wondering why it performs so much better in some high 
> contention scenarios. Load imbalance is a factor for sure. 
> > 
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ 
> > 
> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ 
>
> I don't need to tell you that performance of sync primitives can be 
> very non-linear and counter-intuitive :) 
>
> Perhaps the other mutexes were not super dumb, so they did not perform 
> the best under very high contention. 
>

:^D


>> Also this benchmark has fixed amount of work per thread, so I suspect 
> >> it may be subject to load imbalance when an unlucky outliner delays 
> >> total result. A better option would be to run all threads for X 
> >> seconds and then set a global var for all of them to stop and then 
> >> join all of them and measure number of iterations. 
> > 
> > 
> > Indeed. A benchmark that just modeled a list that readers iterated, and 
> writers pushed and popped from. Then we can show how many operations, reads 
> or writes, they performed per-second. So it would be the number of reads 
> and writes per-second, per-thread. 
> > 
> > This reminds of some of Joe Seighs tests back on 
> comp.programming.threads. 
> > 
> > 
> >> 
> >> 
> >> Also for me it lacked #include <climits>. 
> > 
> > 
> > Fixed that. Thanks again: 
> > 
> > https://pastebin.com/raw/xCBHY9qd 
> > (reload the page...) 
> > 
> > 
> > Fwiw, I am wondering what you think about the algorithm itself? Is it 
> crap, or decent? :^) 
>
>
> I think it's one of the best algorithms out there. 
>

Thanks Dmitry. So far, wrt a general purpose centralized rwmutex, it is not 
all that bad. :^)

 

> The complete fairness for both readers and writers is notable. 
>

Indeed. My algorithm has a strict bakery style about it. In this case, the 
readers get the "main benefit" wrt the wait-free fast-path, assuming 
fetch_add is a single instruction in hardware, like on x86, and not a CAS 
or LL/SC loop in software. Well, that is great because who uses a rwmutex 
for writer performance anyway? ;^)

 

> And the wait-free fast-path for readers. 


Big time. Imvvho, this is the most important aspect. Almost ready with a 
much better benchmark that should show this off quite well.

 

> It still suffers from high 
> cache line contention even on read-only workload, but it's as good as 
> one can get with a centralized design (i.e. not per-thread/cpu 
> distributed stuff which has own problems).

 
Agreed.

 

> I would expect a 
> CAS-loop-based read lock to degrade significantly under high read 
> load. 
>

Agreed. It would not be able to really compete wrt the explicit loop vs a 
loop-free fetch_add under periods of heavy, sustained read activity. Think 
of a shi% load of read-only database searches hitting the system all at 
once.
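
For contrast, this is the shape of the CAS-loop read acquire we are talking 
about, a hand-written sketch (not from any real implementation): every 
reader has to retry until its CAS lands, so a burst of readers keeps 
re-running the loop against the same cache line.

_
#include <atomic>

// CAS-loop read lock: negative count means a writer holds the lock.
void cas_loop_rdlock(std::atomic<long>& count)
{
    for (;;)
    {
        long c = count.load(std::memory_order_relaxed);

        if (c < 0) continue; // writer active: spin (or park) and retry

        if (count.compare_exchange_weak(c, c + 1,
                std::memory_order_acquire, std::memory_order_relaxed))
            return; // acquired

        // another reader or writer moved the count; retry...
    }
}
_

The loop-free fetch_add acquire is one RMW no matter how many readers pile 
on; the loop above can burn an unbounded number of attempts.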

 

> Btw your algorithm is used as the standard Go RWMutex for the past 7+ 
> years: 
> https://github.com/golang/go/commit/daaf29cf932

[lock-free] Re: Experimental Read/Write Mutex..

2019-02-18 Thread Chris M. Thomasson
Fwiw, here is a more realistic benchmark, in the Relacy stage for 
verification purposes:

https://pastebin.com/raw/j4wrg050

Looking good, and passing tests so far. Here is the code, using Relacy:
___

// Experimental Read-Write Mutex Test
// More "realistic" test, in Relacy...
// By: Chris M. Thomasson
//___


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC


#include <relacy/relacy_std.hpp>
#include <iostream>


// Simple macro based redirection of the verbose  std membars.
#define CT_MB_ACQ std::memory_order_acquire
#define CT_MB_REL std::memory_order_release
#define CT_MB_RLX std::memory_order_relaxed
#define CT_MB_ACQ_REL std::memory_order_acq_rel
#define CT_MB_SEQ_CST std::memory_order_seq_cst
#define CT_MB(mp_0) std::atomic_thread_fence(mp_0)


#define READERS 4
#define WRITERS 2
#define THREADS (READERS + WRITERS)
#define ITERS 3







// bare bones mutex/condvar based semaphore
struct ct_slow_semaphore
{
VAR_T(unsigned long) m_state;
std::mutex m_mutex;
std::condition_variable m_cond;

ct_slow_semaphore(unsigned long state) : m_state(state) {}

void inc()
{
m_mutex.lock($);
++VAR(m_state);
m_mutex.unlock($);
m_cond.notify_one($);
}

void add(unsigned long addend)
{
m_mutex.lock($);
VAR(m_state) += addend;
m_mutex.unlock($);
m_cond.notify_all($);
}

void dec()
{
m_mutex.lock($);
while (VAR(m_state) == 0) m_cond.wait(m_mutex, $);
--VAR(m_state);
m_mutex.unlock($);
}
};




// bin-sema
struct ct_auto_reset_event
{
bool m_state;
std::mutex m_mutex;
std::condition_variable m_cond;

ct_auto_reset_event() : m_state(false) {}

void signal()
{
m_mutex.lock($);
m_state = true;
m_cond.notify_one($);
m_mutex.unlock($);
}

void wait()
{
m_mutex.lock($);
while (m_state == false) m_cond.wait(m_mutex, $);
m_state = false; // auto-reset
m_mutex.unlock($);
}
};




// just a layer over an auto-reset event
struct ct_fast_mutex
{
std::atomic<unsigned int> m_state;
ct_auto_reset_event m_waitset;

ct_fast_mutex() : m_state(0) {}

void lock()
{
if (m_state.exchange(1, std::memory_order_acquire))
{
while (m_state.exchange(2, std::memory_order_acquire))
{
m_waitset.wait();
}
}
}

void unlock()
{
if (m_state.exchange(0, std::memory_order_release) == 2)
{
m_waitset.signal();
}
}
};



// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...
// Take a look at the lock_shared and unlock_shared functions

#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
// shared state
std::atomic<long> m_wrstate;
std::atomic<long> m_count;
std::atomic<long> m_rdwake;

ct_slow_semaphore m_rdwset;
ct_slow_semaphore m_wrwset;
ct_fast_mutex m_wrlock;


ct_rwmutex() :
m_wrstate(1),
m_count(RWMUTEX_COUNT_MAX),
m_rdwake(0),
m_rdwset(0),
m_wrwset(0) {
}


// READ, pretty slim...
void lock_shared()
{
if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
{
m_rdwset.dec();
}
}

void unlock_shared()
{
if (m_count.fetch_add(1, std::memory_order_release) < 0)
{
if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
{
m_wrwset.inc();
}
}
}


// WRITE, more hefty
void lock()
{
m_wrlock.lock();

long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, 
std::memory_order_acquire);

if (count < RWMUTEX_COUNT_MAX)
{
long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, 
std::memory_order_acquire);

if (rdwake + RWMUTEX_COUNT_MAX - count)
{
m_wrwset.dec();
}
}
}

// write unlock
void unlock()
{
long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, 
std::memory_order_release);

if (count < 0)
{
m_rdwset.add(-count);
}

m_wrlock.unlock();
}
};






struct ct_simple_stack
{
struct node
{
VAR_T(node*) m_next;
VAR_T(unsigned int) m_tid;

node(unsigned int tid) : m_tid(tid) {}
};

VAR_T(node*) m_head;

ct_simple_stack() : m_head(nullptr) {}

void push(node* n)
{
VAR(n->m_next) = VAR(m_head);
VAR(m_head) = n;
}

node* pop()
{
node* n = VAR(m_head);

if (n)
{
VA

Re: [lock-free] Experimental Read/Write Mutex..

2019-02-17 Thread Chris M. Thomasson


On Sunday, February 17, 2019 at 2:51:04 PM UTC-8, Chris M. Thomasson wrote:
>
> On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
>> wrote: 
>
>  

> [...]
>>
> Thank you so much! Okay, not great, but all that "radically" hardcore 
> terrible when compared to a std::shared_mutex on your end. Also, my work 
> can benefit from many sessions of padding and alignment therapy wrt 
> data-structure layout.
> [...]
>

I meant to say it is _not_ all that "radically" hardcore terrible when 
compared to a std::shared_mutex on your end.

Well, lets wait and see for the next more realistic benchmark. lol. ;^)

However, I do like the way my algorithm acquires and releases read access:

_

// READ, pretty slim...
void lock_shared()
{
if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
{
m_rdwset.dec();
}
}

void unlock_shared()
{
if (m_count.fetch_add(1, std::memory_order_release) < 0)
{
if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
{
m_wrwset.inc();
}
}
}

__


Not all "that" bad? ;^)



>
> Fwiw, I am wondering what you think about the algorithm itself? Is it 
> crap, or decent? :^)
>



Re: [lock-free] Experimental Read/Write Mutex..

2019-02-17 Thread Chris M. Thomasson
On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>
> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  > wrote: 
> > 
> > Fwiw, here is a simple benchmark for an older read/write algorithm I 
> invented: 
> > 
> > 
> > https://pastebin.com/raw/xCBHY9qd 
> > 
> > 
> > It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a 
> macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the code 
> will test my algorithm. If not defined then it will test against 
> std::shared_mutex. I am wondering if anybody can compile and run the code? 
> Test it out on as many operating systems as you can. Can you please show 
> the output? 
> > 
> > 
> > Thanks everybody. :^) 
> > 
> > Here is the code: 
> > 
> > /* Simple, crude read/write mutex test 
> >by: Chris M. Thomasson 
> > __*/ 
> [...]
>
 

>
> Hey Chris! 
>
> Here are my results on i7-8650U (4 HT cores) on Debian 4.19.16-1 gcc 7.3.0 
>
> std::shared_mutex 
> msec = 56118 
> msec = 49149 
> msec = 69416 
> msec = 68737 
> msec = 59174 
> msec = 65915 
>
> ct_rwmutex 
> msec = 62772 
> msec = 71139 
> msec = 68110 
> msec = 66038 
> msec = 60436 
> msec = 74238 
>


Thank you so much! Okay, not great, but all that "radically" hardcore 
terrible when compared to a std::shared_mutex on your end. Also, my work 
can benefit from many sessions of padding and alignment therapy wrt 
data-structure layout.

 

>
> I can also test on 2 CPU system with 88 hardware threads in total, but 
> the bench hardcodes 16 threads (there is std::hardware_concurrency() 
> or something like that). 
>

Indeed! Will correct this issue. Make it dynamic. :^)

Although, it can be beneficial to create more threads just to see how the 
algorithm handles these cases of "oversubscribed" contention.
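
In other words, something along these lines (run_bench is a placeholder for 
whatever the benchmark body becomes):

_
#include <thread>

void run_bench(unsigned threads_n); // placeholder for the benchmark body

int main()
{
    unsigned hc = std::thread::hardware_concurrency();
    if (hc == 0) hc = 4; // hardware_concurrency may legally return 0

    const unsigned mults[] = { 1, 2, 4, 8 };

    for (unsigned mult : mults)
    {
        run_bench(hc * mult); // 4x and 8x give the "oversubscribed" cases
    }

    return 0;
}
_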

 

>
> I've found that it's critical to model realistic/target ratio between 
> reads/writes and amount of local work and work inside of critical 
> section in such benchmarks. Otherwise in 100% synthetic hammering 
> usually the worst mutex shows the best results. Synthetic benchmarks 
> produce extreme contention, so a mutex that blocks threads as much as 
> possible and allows as few threads as possible to work, shows the best 
> result because contention is minimized. The best would be to run all 
> threads sequentially one-by-one. But that's counter productive for 
> real life because blocked threads don't do useful work too. 
>

Agreed. Just wondering why it performs so much better in some high 
contention scenarios. Load imbalance is a factor for sure.

https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ


https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ


 

>
> Also this benchmark has fixed amount of work per thread, so I suspect 
> it may be subject to load imbalance when an unlucky outliner delays 
> total result. A better option would be to run all threads for X 
> seconds and then set a global var for all of them to stop and then 
> join all of them and measure number of iterations. 
>

Indeed. A benchmark that just modeled a list that readers iterated, and 
writers pushed and popped from. Then we can show how many operations, reads 
or writes, they performed per-second. So it would be the number of reads 
and writes per-second, per-thread.

This reminds me of some of Joe Seigh's tests back on comp.programming.threads.
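
A skeleton of that timed-run scheme, as I read the suggestion (a sketch, 
not the final benchmark):

_
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

static std::atomic<bool> g_stop(false);

void ct_timed_worker(std::atomic<unsigned long>& total)
{
    unsigned long local = 0;

    while (! g_stop.load(std::memory_order_relaxed))
    {
        // ... one read or write pass over the shared structure ...
        ++local;
    }

    total.fetch_add(local, std::memory_order_relaxed);
}

int main()
{
    std::atomic<unsigned long> total(0);
    std::vector<std::thread> threads;

    for (unsigned i = 0; i < 32; ++i)
        threads.emplace_back(ct_timed_worker, std::ref(total));

    std::this_thread::sleep_for(std::chrono::seconds(60));
    g_stop.store(true, std::memory_order_relaxed); // everybody winds down

    for (auto& t : threads) t.join();

    // total now holds the iterations completed in the fixed window, so
    // no unlucky outlier thread can stretch the measurement.
    return 0;
}
_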

 

>
> Also for me it lacked #include <climits>. 
>

Fixed that. Thanks again:

https://pastebin.com/raw/xCBHY9qd
(reload the page...) 


Fwiw, I am wondering what you think about the algorithm itself? Is it crap, 
or decent? :^)



[lock-free] Experimental Read/Write Mutex..

2019-02-16 Thread Chris M. Thomasson
Fwiw, here is a simple benchmark for an older read/write algorithm I 
invented:


https://pastebin.com/raw/xCBHY9qd


It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a 
macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the code 
will test my algorithm. If not defined then it will test against 
std::shared_mutex. I am wondering if anybody can compile and run the code? 
Test it out on as many operating systems as you can. Can you please show 
the output?


Thanks everybody. :^)

Here is the code:

/* Simple, crude read/write mutex test
   by: Chris M. Thomasson
__*/



#include <iostream>
#include <thread>
#include <atomic>
#include <mutex>
#include <shared_mutex>
#include <condition_variable>
#include <chrono>
#include <functional>
#include <cstddef>

#define THREADS 16UL
#define ITERS 1000UL
#define COUNT (THREADS * ITERS)


// undefine to test std::shared_mutex
#define CT_TEST_FAST_MUTEX 1


// bare bones mutex/condvar based semaphore
struct ct_slow_semaphore
{
unsigned long m_state;
std::mutex m_mutex;
std::condition_variable m_cond;

ct_slow_semaphore(unsigned long state) : m_state(state) {}

void inc()
{
{
std::unique_lock<std::mutex> lock(m_mutex);
++m_state;
}

m_cond.notify_one();
}

void add(unsigned long addend)
{
{
std::unique_lock<std::mutex> lock(m_mutex);
m_state += addend;
}

m_cond.notify_all();
}

void dec()
{
std::unique_lock<std::mutex> lock(m_mutex);
while (m_state == 0) m_cond.wait(lock);
--m_state;
}
};




// bin-sema
struct ct_auto_reset_event
{
bool m_state;
std::mutex m_mutex;
std::condition_variable m_cond;

ct_auto_reset_event() : m_state(false) {}

void signal()
{
std::unique_lock<std::mutex> lock(m_mutex);
m_state = true;
m_cond.notify_one();
}

void wait()
{
std::unique_lock<std::mutex> lock(m_mutex);
while (m_state == false) m_cond.wait(lock);
m_state = false; // auto-reset
}
};


// just a layer over an auto-reset event
struct ct_fast_mutex
{
std::atomic<unsigned int> m_state;
ct_auto_reset_event m_waitset;

ct_fast_mutex() : m_state(0) {}

void lock()
{
if (m_state.exchange(1, std::memory_order_acquire))
{
while (m_state.exchange(2, std::memory_order_acquire))
{
m_waitset.wait();
}
}
}

void unlock()
{
if (m_state.exchange(0, std::memory_order_release) == 2)
{
m_waitset.signal();
}
}
};



// Chris M. Thomasson's Experimental Read/Write Mutex
// Yeah, it is pretty damn fat wrt the state, however
// it has some interesting properties...
// The state can be compressed a bit...
// btw, it has no loops...
// Take a look at the lock_shared and unlock_shared functions

#define RWMUTEX_COUNT_MAX LONG_MAX

struct ct_rwmutex
{
// shared state
std::atomic<long> m_wrstate;
std::atomic<long> m_count;
std::atomic<long> m_rdwake;

ct_slow_semaphore m_rdwset;
ct_slow_semaphore m_wrwset;
ct_fast_mutex m_wrlock;


ct_rwmutex() :
m_wrstate(1),
m_count(RWMUTEX_COUNT_MAX),
m_rdwake(0),
m_rdwset(0),
m_wrwset(0) {
}


// READ, pretty slim...
void lock_shared()
{
if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
{
m_rdwset.dec();
}
}

void unlock_shared()
{
if (m_count.fetch_add(1, std::memory_order_release) < 0)
{
if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
{
m_wrwset.inc();
}
}
}


// WRITE, more hefty
void lock()
{
m_wrlock.lock();

long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, 
std::memory_order_acquire);

if (count < RWMUTEX_COUNT_MAX)
{
long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, 
std::memory_order_acquire);

if (rdwake + RWMUTEX_COUNT_MAX - count)
{
m_wrwset.dec();
}
}
}

// write unlock
void unlock()
{
long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, 
std::memory_order_release);

if (count < 0)
{
m_rdwset.add(-count);
}

m_wrlock.unlock();
}
};


struct ct_shared
{
std::atomic<unsigned long> m_state;

#if defined (CT_TEST_FAST_MUTEX)
ct_rwmutex m_std_rwmutex;
#else
std::shared_mutex m_std_rwmutex;
#endif

ct_shared() : m_state(0) {}
};


void ct_thread(ct_shared& shared, std::size_t index)
{
for (unsigned int i = 0; i < ITERS; ++i)
{

shared.m_std_rwmutex.lock();
if (i % 256 == 0) std::this_thread::yield();
shared.m_state += 1;
shared.m_std_rwmutex.unlock();


shared.m_std_rwmutex.lock_shared();
if (i % 512 == 0) std::this_thread::yield();
//

[lock-free] Re: Atomic XCHG based Stack, simple for learning...

2019-01-04 Thread Chris M. Thomasson
On Friday, December 28, 2018 at 10:38:50 PM UTC-8, Chris M. Thomasson wrote:
>
> This experimental algorithm uses only XCHG at the cost of having a 
> consumer wait for the next pointer to be set in a node. However, it allows 
> for real work to be done before any waits are performed. So, the "real" 
> work should provide a "natural backoff" that might minimize the waiting. 
> The algorithm can be improved upon quite a bit. I have some "interesting" 
> ideas for it. Well, here is the code in the form of a Relacy unit test:
>
>
Fwiw, I finally found the time to code this experimental implementation up 
in c++11. Here is the code: 

https://pastebin.com/raw/j41cPT9S 
(raw text, C code, no ads...) 
________ 
// Simple XCHG based Atomic Stack 
// By: Chris M. Thomasson 


#include <iostream> 
#include <thread> 
#include <atomic> 
#include <mutex> 
#include <functional> 
#include <cstddef> 


// sorry about the macros... 
#define THREADS 42 
#define ITERS 10 


#define CT_MB_RLX std::memory_order_relaxed 
#define CT_MB_ACQ std::memory_order_acquire 
#define CT_MB_REL std::memory_order_release 


// HACK! Humm... 
#define CT_WAIT ((ct_work*)(0xDEADBEEFU)) 



// Just to track all the dynamic allocations... 
static std::atomic<long> g_allocs(0); 
static std::mutex g_cout_mtx; 


// A work node 
struct ct_work 
{ 
std::atomic<ct_work*> m_next; 
std::thread::id m_data; 
ct_work(std::thread::id data) : m_next(nullptr), m_data(data) {} 


void process() 
{ 
// Simulate just a tiny little work? 
g_cout_mtx.lock(); 
std::this_thread::yield(); 
std::this_thread::yield(); 
std::this_thread::yield(); 

std::thread::id local = std::this_thread::get_id(); 

if (m_data == local) 
{ 
   // std::cout << "processing local = " << m_data << 
   // " from " << std::this_thread::get_id() << "\n"; 
} 

else 
{ 
std::cout << "processing foreign = " << m_data << 
" from " << std::this_thread::get_id() << "\n"; 
} 

std::this_thread::yield(); 
std::this_thread::yield(); 
std::this_thread::yield(); 
g_cout_mtx.unlock(); 
} 


ct_work* get_next() const 
{ 
ct_work* w = nullptr; 

while ((w = m_next.load(CT_MB_RLX)) == CT_WAIT) 
{ 
// we can spin, or even do other work right here... 
std::this_thread::yield(); 
} 

return w; 
} 
}; 



// Easy Stack, only uses XCHG 
struct ct_estack 
{ 
std::atomic<ct_work*> m_head; 
ct_estack() : m_head(nullptr) {} 


void push(ct_work* n) 
{ 
n->m_next.store(CT_WAIT, CT_MB_RLX); 
ct_work* head = m_head.exchange(n, CT_MB_REL); // release 
n->m_next.store(head, CT_MB_RLX); 
} 


ct_work* flush_try() 
{ 
return m_head.exchange(nullptr, CT_MB_ACQ); // acquire 
} 
}; 



// Consume an Easy Stack... 
void ct_consume( 
ct_estack& estack 
) { 
ct_work* w = estack.flush_try(); 

while (w) 
{ 
// Process FIRST! 
w->process(); 

// Now, we can gain the next pointer. 
ct_work* next = w->get_next(); 

// Okay, we can delete the work 
delete w; 
g_allocs.fetch_sub(1, CT_MB_RLX); // dec 

w = next; 
} 
} 



// Our shared state 
struct ct_shared 
{ 
ct_estack m_estack; 
}; 



// Produce some work... 
void ct_produce( 
ct_estack& estack 
) { 
ct_work* w = new ct_work(std::this_thread::get_id()); 
g_allocs.fetch_add(1, CT_MB_RLX); // inc 
estack.push(w); 
} 


// Do some work... 
void ct_worker(ct_shared& shared) 
{ 
for (unsigned int i = 0; i < ITERS; ++i) 
{ 
ct_produce(shared.m_estack); 
ct_produce(shared.m_estack); 
ct_produce(shared.m_estack); 

std::this_thread::yield(); // little spice... 

ct_consume(shared.m_estack); 
} 

ct_consume(shared.m_estack); 
} 



int main(void) 
{ 
{ 
ct_shared shared; 
std::thread threads[THREADS]; 

for (unsigned long i = 0; i < THREADS; ++i) 
{ 
threads[i] = std::thread(ct_worker, std::ref(shared)); 
} 

for (unsigned long i = 0; i < THREADS; ++i) 
{ 
threads[i].join(); 
} 
} 

if (g_allocs.load(CT_MB_RLX) != 0) 
{ 
std::cout << "\n\nLEAKED\n"; 
} 

std::cout << "\n\nFIN!\n"; 

return 0; 
} 
 

Can anybody get this sucker to compile and run to completion? If so, can 
you show me some of the outp

[lock-free] Atomic XCHG based Stack, simple for learning...

2018-12-28 Thread Chris M. Thomasson
This experimental algorithm uses only XCHG at the cost of having a consumer 
wait for the next pointer to be set in a node. However, it allows for real 
work to be done before any waits are performed. So, the "real" work should 
provide a "natural backoff" that might minimize the waiting. The algorithm 
can be improved upon quite a bit. I have some "interesting" ideas for it. 
Well, here is the code in the form of a Relacy unit test:

Notice the "hand off" comments and 0xDEADBEEF; this can be made to work 
much better...
__
// Simple Atomic Stack
// For Academic and Experimental things... 
// Beginner, Moderate?
// In Good Ol' Relacy! Nice as always.
//___


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC


#include <relacy/relacy_std.hpp>
#include <iostream>


// Simple macro based redirection of the verbose  std membars.
#define CT_MB_ACQ std::memory_order_acquire
#define CT_MB_REL std::memory_order_release
#define CT_MB_RLX std::memory_order_relaxed
#define CT_MB_ACQ_REL std::memory_order_acq_rel
#define CT_MB_SEQ_CST std::memory_order_seq_cst
#define CT_MB(mp_0) std::atomic_thread_fence(mp_0)


// Some global vars directing the show...
#define THREADS 3
#define ITERS 2


// Experimental Stack
struct ct_ecstack
{
#define WAITNEXT ((node*)0xDEADBEEF)

struct node
{
std::atomic<node*> m_next;
VAR_T(unsigned int) m_data;

node(unsigned int data) : m_next(nullptr), m_data(data) {}

void process()
{
// simulate some work?
rl::yield(1 + rl::rand(2), $); // ;^)
}

node* next_wait()
{
node* next = nullptr;

while ((next = m_next.load(CT_MB_RLX)) == WAITNEXT)
{
// Humm, we can actually do other work right here...
// Hand off...
rl::yield(1, $);
}

return next;
}
};


std::atomic<node*> m_head;

ct_ecstack() : m_head(nullptr) {}


void push(node* n)
{
n->m_next.store(WAITNEXT, CT_MB_RLX);
node* head = m_head.exchange(n, CT_MB_REL); // release
n->m_next.store(head, CT_MB_RLX);
}


node* flush_try()
{
return m_head.exchange(nullptr, CT_MB_ACQ); // acquire
}
};




// Relacy Stack Test...
struct ct_ecstack_test
: rl::test_suite<ct_ecstack_test, THREADS>
{
ct_ecstack g_ecstack;

void process()
{
ct_ecstack::node* n = g_ecstack.flush_try(); // flush all

while (n)
{
// Process n first, acts like a backoff for the next wait
// Hand off some other nodes? Future note...
n->process();

// Wait for the next pointer, or hand off?
ct_ecstack::node* next = n->next_wait();

// Destroy
delete n;

// Loop on...
n = next;
}
}

void thread(unsigned int tidx)
{
for (unsigned int i = 0; i < ITERS; ++i)
{
g_ecstack.push(new ct_ecstack::node(tidx));
g_ecstack.push(new ct_ecstack::node(tidx));
g_ecstack.push(new ct_ecstack::node(tidx));

process();

g_ecstack.push(new ct_ecstack::node(tidx));
}

process();
}
};



// Test away... Or fly? Humm...
int main()
{
{
rl::test_params p;

p.iteration_count = 1000;
//p.execution_depth_limit = 3;
//p.search_type = rl::sched_bound;
//p.search_type = rl::fair_full_search_scheduler_type;
//p.search_type = rl::fair_context_bound_scheduler_type;

rl::simulate(p);
}

return 0;
}
__


Any thoughts? This should be fairly basic in nature.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/1a2de2c4-9c5c-4bd9-a663-a084e16e3bcd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Re: Eventcount with timeout

2018-12-28 Thread Chris M. Thomasson


On Thursday, December 27, 2018 at 9:55:57 PM UTC-8, Artur Brugeman wrote:
>
> Hi Chris,
>
> Thank you for your thoughts on this, my thinking was in line with yours 
> and that's why I tried to add timed waiting in Dmitry's eventcount. I guess 
> I need to spend some more time there and figure it out. 
>

Iirc, Dmitry's algo used some per-thread waitsets for an eventcount. I 
cannot remember right now if it is the same one you are referring to.

Fwiw, one can even spin on the lock for wait_begin, wrt something like:

bool wait_begin_try()
{
if (! m_mutex.try_lock($)) return false;
m_waitbit.store(1, CT_MB_RLX);
CT_MB(CT_MB_SEQ_CST);
return true;
}


The predicate loop can look like:

unsigned int signal = 0;

while ((signal = m_signal.load(CT_MB_RLX)) != 2)
{
if (! g_ecount.wait_begin_try())
{
rl::yield(1, $);
continue;
}

signal = m_signal.load(CT_MB_RLX);

if (signal == 2)
{
g_ecount.wait_cancel();
break;
}

g_ecount.wait_commit();
}


It can even be adaptive in nature where one can fail the try_lock a couple 
of times, then just lock the sucker!

;^)
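
Something like this minimal sketch wrt the same ct_ecount members as above; 
the spin count is just a made-up tuning knob:

bool wait_begin_adaptive(unsigned int spins)
{
    for (unsigned int i = 0; i < spins; ++i)
    {
        if (m_mutex.try_lock($))
        {
            m_waitbit.store(1, CT_MB_RLX);
            CT_MB(CT_MB_SEQ_CST);
            return true;
        }

        rl::yield(1, $); // back off a bit between tries
    }

    // failed too many times: just lock the sucker!
    m_mutex.lock($);
    m_waitbit.store(1, CT_MB_RLX);
    CT_MB(CT_MB_SEQ_CST);
    return true;
}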
 

>
>
> On Friday, December 28, 2018 at 4:32:04 AM UTC+5, Chris M. Thomasson wrote:
>>
>> On Sunday, December 23, 2018 at 9:30:03 PM UTC-8, Artur Brugeman wrote:
>>>
>>> Hi Dmitry,
>>>
>>> I want to use your eventcount (took the source from intel forum). 
>>>
>>> Currently I was using semaphores, which allowed me to set waiting 
>>> timeout. 
>>>
>>> Questions:
>>> 1. Is the source from intel forum 'the latest and stable'? You had a 
>>> pretty long discussion there and I'm not sure the posted sources 
>>> incorporated all the fixes.
>>> 2. Can eventcount support waiting timeouts? Can I just add 'timeout' 
>>> param to prepare_wait and commit_wait and call 'sema.timedwait' instead of 
>>> 'wait'? In fact I did just that and now I get segfaults here and there, so 
>>> not sure it's the way to go.
>>>
>>> Can you please share your thoughts on this?
>>>
>>> Thanks a lot!
>>>
>>>
>> Fwiw, a timed wait on an eventcount works as easily as spinning on it wrt 
>> not waiting on a kernel waitset at all. The predicate is the user algorithm 
>> itself. So, imagine if the waits never actually block, but spin around. 
>> Yet, everything still works. Fwiw, the following code, that should compile 
>> with Relacy, is the most simplistic eventcount I can imagine:
>> _
>> ...
>>
>> think if g_ecount.wait_commit(); was timed... It would still work as is. 
>> So, yes an eventcount can easily handle timed waits.
>>
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/b67bbb52-d353-416e-85fa-7399c69e9349%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Re: Eventcount with timeout

2018-12-27 Thread Chris M. Thomasson
On Sunday, December 23, 2018 at 9:30:03 PM UTC-8, Artur Brugeman wrote:
>
> Hi Dmitry,
>
> I want to use your eventcount (took the source from intel forum). 
>
> Currently I was using semaphores, which allowed me to set waiting timeout. 
>
> Questions:
> 1. Is the source from intel forum 'the latest and stable'? You had a 
> pretty long discussion there and I'm not sure the posted sources 
> incorporated all the fixes.
> 2. Can eventcount support waiting timeouts? Can I just add 'timeout' param 
> to prepare_wait and commit_wait and call 'sema.timedwait' instead of 
> 'wait'? In fact I did just that and now I get segfaults here and there, so 
> not sure it's the way to go.
>
> Can you please share your thoughts on this?
>
> Thanks a lot!
>
>
Fwiw, a timed wait on an eventcount works as easily as spinning on it wrt 
not waiting on a kernel waitset at all. The predicate is the user algorithm 
itself. So, imagine if the waits never actually block, but spin around. 
Yet, everything still works. Fwiw, the following code, that should compile 
with Relacy, is the most simplistic eventcount I can imagine:
_
// Simple Event Count
// For Academic and Experimental things... 
// Beginner, Moderate?
// In Good Ol' Relacy! Nice as always.
//___


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC


#include <relacy/relacy_std.hpp>
#include <iostream>


// Simple macro based redirection of the verbose  std membars.
#define CT_MB_ACQ std::memory_order_acquire
#define CT_MB_REL std::memory_order_release
#define CT_MB_RLX std::memory_order_relaxed
#define CT_MB_ACQ_REL std::memory_order_acq_rel
#define CT_MB_SEQ_CST std::memory_order_seq_cst
#define CT_MB(mp_0) std::atomic_thread_fence(mp_0)


// Some global vars directing the show...
#define WORKERS 5
#define THREADS (WORKERS)


// old school
struct ct_ecount
{
std::mutex m_mutex;
std::condition_variable m_cond;
std::atomic<unsigned int> m_waitbit;

ct_ecount() : m_waitbit(0) {}

void broadcast()
{
CT_MB(CT_MB_SEQ_CST);

unsigned int waitbit = m_waitbit.load(CT_MB_RLX);
if (! waitbit) return;

m_mutex.lock($);
m_waitbit.store(0, CT_MB_RLX);
m_cond.notify_all($);
m_mutex.unlock($);
}


void wait_begin()
{
m_mutex.lock($);
m_waitbit.store(1, CT_MB_RLX);
CT_MB(CT_MB_SEQ_CST);
}


void wait_cancel()
{
m_mutex.unlock($);
}


void wait_commit()
{
m_cond.wait(m_mutex, $);
m_mutex.unlock($);
}
};


// Relacy Multex Test...
struct ct_ecount_test
: rl::test_suite<ct_ecount_test, THREADS>
{
std::atomic<unsigned int> m_signal;
ct_ecount g_ecount;

ct_ecount_test() : m_signal(0) {}

void thread(unsigned int tidx)
{
if (tidx < 2)
{
if (m_signal.fetch_add(1, CT_MB_RLX) == 1)
{
g_ecount.broadcast();
}
}

else
{
unsigned int signal = 0;

while ((signal = m_signal.load(CT_MB_RLX)) != 2)
{
g_ecount.wait_begin();
signal = m_signal.load(CT_MB_RLX);

if (signal == 2)
{
g_ecount.wait_cancel();
break;
}

g_ecount.wait_commit();
}
}
}
};



// Test away... Or fly? Humm...
int main()
{
{
rl::test_params p;

p.iteration_count = 500;
//p.execution_depth_limit = 3;
//p.search_type = rl::sched_bound;
//p.search_type = rl::fair_full_search_scheduler_type;
//p.search_type = rl::fair_context_bound_scheduler_type;

rl::simulate(p);
}

return 0;
}
_

think if g_ecount.wait_commit(); was timed... It would still work as is. 
So, yes an eventcount can easily handle timed waits.
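
Fwiw, to make that concrete, here is a minimal sketch of a timed commit in 
plain C++11, not Relacy, assuming a std::mutex m_mutex and 
std::condition_variable m_cond flavor of ct_ecount (needs <mutex>, 
<condition_variable> and <chrono>). The caller re-checks its own predicate 
either way, which is exactly why a timeout is harmless:

template<class Rep, class Period>
bool wait_commit_for(std::chrono::duration<Rep, Period> const& timeout)
{
    // we arrive here owning m_mutex from wait_begin()
    std::unique_lock<std::mutex> lock(m_mutex, std::adopt_lock);

    // false on timeout; the caller just loops and re-checks its predicate
    return m_cond.wait_for(lock, timeout) != std::cv_status::timeout;

    // lock releases m_mutex on scope exit, just like wait_commit()
}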

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/5828d4b0-ead6-4160-a9f1-ba68d08ae29c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-27 Thread Chris M. Thomasson


On Saturday, December 22, 2018 at 11:53:07 PM UTC-8, Dmitry Vyukov wrote:
>
> On Thu, Dec 20, 2018 at 12:17 AM Chris M. Thomasson  > wrote: 
> > 
> > On Wednesday, December 19, 2018 at 2:02:31 AM UTC-8, Dmitry Vyukov 
> wrote: 
> >> 
> >> On Wed, Dec 19, 2018 at 7:05 AM Chris M. Thomasson  
> wrote: 
> >> > 
> >> > On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson 
> wrote: 
> >> >> 
> >> >> If interested, I can give more details. For now, here is the code in 
> the form of a Relacy Test Unit: 
> >> 
> >> Can you give a short executive summary? What are the 
> >> advantages/disadvantages/novelty? 
> > 
> > 
> > Very quick, sorry: Should have some more time tonight. 
> > 
> > A simple proxy with per-thread mutexes. Threads enter a protected region 
> by taking their own lock, and leave it by releasing said lock. Very basic. 
> When a thread wants to defer a node for deletion it pushes it onto a global 
> lock-free stack via CAS. There is a reaper thread that flushes all of the 
> nodes with XCHG, and keeps it on an internal list. The reaper thread loop 
> then tries to acquire and release all of the per-thread locks. Once it has 
> done this, it says quiescent state. It keeps nodes in a way that ensures at 
> least two periods of this state have occurred before it actually calls 
> delete and destroys memory. Since the reapers use a try_lock to detect a 
> quiescent state, it can livelock in a reaper in the sense that it never 
> gains a lock. However, Relacy should never detect the condition because of 
> the limited iteration loop for workers wrt the test code itself. There is a 
> work around. We can let a reaper fail for a little while before it actually 
> ditches try_lock and just locks the per-thread quiescence lock. It would 
> act just like an adaptive mutex, in a sense... 
>
> So the main advantage is simplicity and ability for mere mortals to 
> understand how it works :) 
> Should be especially advantageous in teaching since it does not 
> involve atomic operations. 
>

Yeah, agreed. Fwiw, I still cannot get the program as-is to crash in Relacy.

It is very simple wrt the reaper thread detecting a quiescent state wrt 
trying to gain all of the per-thread locks, and deferring for two quiescent 
states before destroying memory.

 

>
> [...]

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/aa681e32-ad2e-4a4b-a3ab-fa8adfec7f56%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-22 Thread Chris M. Thomasson
On Thursday, December 20, 2018 at 3:03:05 PM UTC-8, Chris M. Thomasson 
wrote:
>
> On Wednesday, December 19, 2018 at 3:17:18 PM UTC-8, Chris M. Thomasson 
> wrote:
>>
>> On Wednesday, December 19, 2018 at 2:02:31 AM UTC-8, Dmitry Vyukov wrote:
>>>
>>> On Wed, Dec 19, 2018 at 7:05 AM Chris M. Thomasson  
>>> wrote: 
>>> > 
>>> > On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson 
>>> wrote: 
>>> >> 
>>> >> If interested, I can give more details. For now, here is the code in 
>>> the form of a Relacy Test Unit: 
>>>
>>> Can you give a short executive summary? What are the 
>>> advantages/disadvantages/novelty? 
>>>
>>
>> Very quick, sorry: Should have some more time tonight.
>>
>> A simple proxy with per-thread mutexes. Threads enter a protected region 
>> by taking their own lock, and leave it by releasing said lock. Very basic. 
>> When a thread wants to defer a node for deletion it pushes it onto a global 
>> lock-free stack via CAS. There is a reaper thread that flushes all of the 
>> nodes with XCHG, and keeps it on an internal list. The reaper thread loop 
>> then tries to acquire and release all of the per-thread locks. Once it has 
>> done this, it says quiescent state. It keeps nodes in a way that ensures at 
>> least two periods of this state have occurred before it actually calls 
>> delete and destroys memory. Since the reapers use a try_lock to detect a 
>> quiescent state, it can livelock in a reaper in the sense that it never 
>> gains a lock. However, Relacy should never detect the condition because of 
>> the limited iteration loop for workers wrt the test code itself. There is a 
>> work around. We can let a reaper fail for a little while before it actually 
>> ditches try_lock and just locks the per-thread quiescence lock. It would 
>> act just like an adaptive mutex, in a sense...
>>
>>
>> [...] 
>
>> So far, so good. It passes 1000 iterations. I am letting 
>> rl::sched_bound run for around 65 minutes now, no problems at iteration:
>>
>>
>>
>
> Fwiw, its been running overnight using rl:sched_bound, I am at iteration:
>
> 99% (3506831360/3506831361)
> 99% (3506896896/3506896897)
> 99% (3506962432/3506962433)
> 99% (3507027968/3507027969)
> 99% (3507093504/3507093505)
> 99% (3507159040/3507159041)
> 99% (3507224576/3507224577)
> 99% (3507290112/3507290113)
>
>
> No problems found so far at 3.5 billion iterations.
>


No problems so far; however, the program has still not completed, and the 
damn battery accidentally went dead on the testing laptop! This means that 
I have to run it again.

Isn't there a way to start from a given iteration count?



> I need to code up a scenario where a thread actually iterates through all 
> of the unode's in g_worker_stack. I think it should fail in this case. 
> Humm...
>
>  
>
>>
>>
>> Also, need to get the latest version of Relacy. I am using version: 2.3.
>>
>> Does it bite the dust in the latest version? ;^o
>>
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/e90d82db-9031-4df2-845d-ccdb428d063a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-20 Thread Chris M. Thomasson
On Wednesday, December 19, 2018 at 3:17:18 PM UTC-8, Chris M. Thomasson 
wrote:
>
> On Wednesday, December 19, 2018 at 2:02:31 AM UTC-8, Dmitry Vyukov wrote:
>>
>> On Wed, Dec 19, 2018 at 7:05 AM Chris M. Thomasson  
>> wrote: 
>> > 
>> > On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson 
>> wrote: 
>> >> 
>> >> If interested, I can give more details. For now, here is the code in 
>> the form of a Relacy Test Unit: 
>>
>> Can you give a short executive summary? What are the 
>> advantages/disadvantages/novelty? 
>>
>
> Very quick, sorry: Should have some more time tonight.
>
> A simple proxy with per-thread mutexes. Threads enter a protected region 
> by taking their own lock, and leave it by releasing said lock. Very basic. 
> When a thread wants to defer a node for deletion it pushes it onto a global 
> lock-free stack via CAS. There is a reaper thread that flushes all of the 
> nodes with XCHG, and keeps it on an internal list. The reaper thread loop 
> then tries to acquire and release all of the per-thread locks. Once it has 
> done this, it says quiescent state. It keeps nodes in a way that ensures at 
> least two periods of this state have occurred before it actually calls 
> delete and destroys memory. Since the reapers use a try_lock to detect a 
> quiescent state, it can livelock in a reaper in the sense that it never 
> gains a lock. However, Relacy should never detect the condition because of 
> the limited iteration loop for workers wrt the test code itself. There is a 
> work around. We can let a reaper fail for a little while before it actually 
> ditches try_lock and just locks the per-thread quiescence lock. It would 
> act just like an adaptive mutex, in a sense...
>
>
> [...] 

> So far, so good. It passes 1000 iterations. I am letting 
> rl::sched_bound run for around 65 minutes now, no problems at iteration:
>
>
>

Fwiw, it's been running overnight using rl::sched_bound; I am at iteration:

99% (3506831360/3506831361)
99% (3506896896/3506896897)
99% (3506962432/3506962433)
99% (3507027968/3507027969)
99% (3507093504/3507093505)
99% (3507159040/3507159041)
99% (3507224576/3507224577)
99% (3507290112/3507290113)


No problems found so far at 3.5 billion iterations.

I need to code up a scenario where a thread actually iterates through all 
of the unode's in g_worker_stack. I think it should fail in this case. 
Humm...

 

>
>
> Also, need to get the latest version of Relacy. I am using version: 2.3.
>
> Does it bite the dust in the latest version? ;^o
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/c888539c-4ccf-429e-84ea-ebc7512981bc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-19 Thread Chris M. Thomasson
On Wednesday, December 19, 2018 at 2:02:31 AM UTC-8, Dmitry Vyukov wrote:
>
> On Wed, Dec 19, 2018 at 7:05 AM Chris M. Thomasson  > wrote: 
> > 
> > On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson 
> wrote: 
> >> 
> >> If interested, I can give more details. For now, here is the code in 
> the form of a Relacy Test Unit: 
>
> Can you give a short executive summary? What are the 
> advantages/disadvantages/novelty? 
>

Very quick, sorry: Should have some more time tonight.

A simple proxy with per-thread mutexes. Threads enter a protected region by 
taking their own lock, and leave it by releasing said lock. Very basic. 
When a thread wants to defer a node for deletion it pushes it onto a global 
lock-free stack via CAS. There is a reaper thread that flushes all of the 
nodes with XCHG, and keeps it on an internal list. The reaper thread loop 
then tries to acquire and release all of the per-thread locks. Once it has 
done this, it says quiescent state. It keeps nodes in a way that ensures at 
least two periods of this state have occurred before it actually calls 
delete and destroys memory. Since the reapers use a try_lock to detect a 
quiescent state, it can livelock in a reaper in the sense that it never 
gains a lock. However, Relacy should never detect the condition because of 
the limited iteration loop for workers wrt the test code itself. There is a 
work around. We can let a reaper fail for a little while before it actually 
ditches try_lock and just locks the per-thread quiescence lock. It would 
act just like an adaptive mutex, in a sense...

___
fresh = new generation
old = old generation

// reaper loop

fresh = gain new nodes

if (quiescent)
{
dump = old;
old = fresh;
fresh = empty;

dump.destroy(); // reclaim memory...
}
___
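
The workaround could look something like this sketch, assuming the 
per-thread m_quiesce lock from the test code; the retry limit is a made-up 
knob:

void quiesce_adaptive(unsigned int tries)
{
    for (unsigned int i = 0; i < tries; ++i)
    {
        if (m_quiesce.try_lock($))
        {
            m_quiesce.unlock($); // quiescent state observed
            return;
        }

        rl::yield(1, $);
    }

    // ditched try_lock: behave like an adaptive mutex and just block
    m_quiesce.lock($);
    m_quiesce.unlock($);
}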

 

>
>
> >> https://pastebin.com/raw/FpnvTqvM 
> >> (raw link, no ads! :^) 
> >> 
> >> ___ 
> >> 
> >> [...] 
> >> 
> >> ___ 
> >> 
> >> 
> >> 
> >> Can anybody run it with Relacy? If not, I need to learn why: what 
> problems were encountered? 
>
> Did you succeed with running it with Relacy? :) 
>

So far, so good. It passes 1000 iterations. I am letting 
rl::sched_bound run for around 65 minutes now, no problems at iteration:


99% (129564672/129564673)
99% (129630208/129630209)
99% (129695744/129695745)
99% (129761280/129761281)
99% (129826816/129826817)
99% (129892352/129892353)
99% (129957888/129957889)
99% (130023424/130023425)

[...]



Also, need to get the latest version of Relacy. I am using version: 2.3.

Does it bite the dust in the latest version? ;^o

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/1cfa7e79-7bcc-42e0-ac0c-06188176b1ae%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-18 Thread Chris M. Thomasson
On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson wrote:
>
> If interested, I can give more details. For now, here is the code in the 
> form of a Relacy Test Unit:
>
> https://pastebin.com/raw/FpnvTqvM
> (raw link, no ads! :^)
>
> ___
>
> [...]
>
> ___
>
>
>
> Can anybody run it with Relacy? If not, I need to learn why: what problems 
> were encountered?
>
>
Keep in mind that this does not have a proper back-link, so it should not 
work for iterating a data-structure; however, it does work for ABA. 

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/a386a3da-2bb1-4c9f-809e-9bc6dee2d792%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-17 Thread Chris M. Thomasson
If interested, I can give more details. For now, here is the code in the 
form of a Relacy Test Unit:

https://pastebin.com/raw/FpnvTqvM
(raw link, no ads! :^)

___

// Simple Per-Thread Mutex Based Proxy Collector
// For Academic and Experimental things... 
// Beginner, Moderate?
// In Good Ol' Relacy! Nice as always.
//___


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC


#include <relacy/relacy_std.hpp>
#include <iostream>
#include <cstddef>
#include <cstdint>


// Simple macro based redirection of the verbose std membars.
#define CT_MB_ACQ std::memory_order_acquire
#define CT_MB_REL std::memory_order_release
#define CT_MB_RLX std::memory_order_relaxed
#define CT_MB_ACQ_REL std::memory_order_acq_rel
#define CT_MB(mp_0) std::atomic_thread_fence(mp_0)


// Some global vars directing the show...
#define WORKERS 4
#define REAPERS 2 // set it to one for now...
#define ITERS 5
#define THREADS (WORKERS + REAPERS)


// A User Node
struct unode
{
std::atomic<unode*> m_next;
VAR_T(unsigned long) m_data;
unode(unsigned long data, unode* next = nullptr) : m_data(data), 
m_next(next) {}
};


// The global ptex state
template<std::size_t T_threads>
struct ptex
{

// Per-thread state
struct per_thread
{
std::mutex m_quiesce; // essential for this scheme
VAR_T(unsigned long) m_ver_local; // for future use
std::atomic<unsigned long> m_ver_reap; // for future use


per_thread() : m_ver_local(0), m_ver_reap(0)
{
//std::cout << "ptex::per_thread::per_thread()\n";
}

~per_thread() throw()
{
//std::cout << "ptex::per_thread::~per_thread()\n";
}


// A thread enters a region
void acquire()
{
unsigned long nver = VAR(m_ver_local)++;
m_quiesce.lock($);
m_ver_reap.store(nver, CT_MB_RLX);
CT_MB(CT_MB_ACQ);
}


// A thread exits a region
void release()
{
CT_MB(CT_MB_REL);
unsigned long ver = VAR(m_ver_local);
m_ver_reap.store(ver + 1, CT_MB_RLX);
m_quiesce.unlock($);
VAR(m_ver_local) = ver + 1;
}


void quiesce_brute()
{
// Ensure quiescence 
m_quiesce.lock($);
m_quiesce.unlock($);
}
};


per_thread m_threads[T_threads];
std::atomic<unode*> m_collected;


ptex() : m_collected(nullptr)
{
//std::cout << "ptex::ptex()\n";
}


~ptex() throw()
{
RL_ASSERT(!m_collected.load(CT_MB_RLX));
//std::cout << "ptex::~ptex()\n";
}

// Gain a ref to our threads per-thread struct
per_thread& get_ref() throw()
{
return m_threads[rl::thread_index()];
}

// delete everything...
void dump(unode* n)
{
while (n)
{
unode* next = n->m_next.load(CT_MB_RLX);
delete n;
n = next;
}
}


// We are finished, collect the node...
void collect(unode* n) throw()
{
unode* head = m_collected.load(CT_MB_RLX);

do
{
n->m_next.store(head, CT_MB_RLX);
CT_MB(CT_MB_REL); // release

} while (!m_collected.compare_exchange_weak(
head, n, CT_MB_RLX));
}


// Flush the stack
unode* reap() throw()
{
unode* head = m_collected.exchange(nullptr, CT_MB_RLX);
if (head) CT_MB(CT_MB_ACQ); // acquire
return head;
}


// Wait for a quiesce
void quiesce_brute() throw()
{
for (std::size_t i = 0; i < T_threads; ++i)
{
per_thread& pt = m_threads[i];
//pt.quiesce();
pt.quiesce_brute();
}
}
};


// Relacy Multex Test...
struct ct_multex_test
: rl::test_suite<ct_multex_test, THREADS>
{
typedef ptex<THREADS> ptex_t;
ptex_t g_ptex; // Global ptex
std::atomic<unode*> g_worker_stack; // Global User Stack
unsigned long g_contrived_shutdown; // Just for shutdown


void before()
{
g_worker_stack.store(nullptr, CT_MB_RLX);
g_contrived_shutdown = WORKERS;
}


void after()
{

}


// Add a user item to the stack
void worker_push(unode* n)
{
unode* head = g_worker_stack.load(CT_MB_RLX);

do
{
n->m_next.store(head, CT_MB_RLX);
CT_MB(CT_MB_REL); // release
} while (!g_worker_stack.compare_exchange_weak(head, n, CT_MB_RLX));
}

// Remove a user item from the stack
unode* worker_pop_try()
{
unode* head = g_worker_stack.load(CT_MB_RLX);
unode* next = nullptr;

do
{
if (!head) break;
CT_MB(CT_MB_ACQ); // acquire
next = head->m_next.load(CT_MB_RLX); // critical load! ;^o

} while (!g_worker_stack.compare_exchange_weak(head, next, CT_MB_RLX));

return head;
}


// A User Worker Example. The main point!
   

[lock-free] Re: wait-free combiner lock

2018-12-02 Thread Chris M. Thomasson
Need to check this out. Thanks for the heads up.


On Friday, November 9, 2018 at 4:13:13 PM UTC-8, Dmitry Vyukov wrote:
>
> Hi, 
>
> An interesting wait-free combiner lock with 2 XCHG to enqueue work: 
> https://github.com/romkatv/action-chain/blob/master/src/action_chain.h 
>
> This is somewhat similar to the XCHG-based enqueue here: 
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue
>  
>
> But the algorithm does not have that nasty inconsistency window 
> between XCHG and store when the queue links are broken. The novel part 
> is that it uses another XCHG to restore the next link and figure out 
> if it managed to do so before anybody observed the broken link or not. 
>
> This is more expensive than 1 CAS to enqueue for uncontended case, but 
> I guess should sustain high contention much better. 
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/6b4b45b8-9a37-40e3-b7ff-7760a973b42d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] One reason why I like atomic_thread_fence...

2018-04-23 Thread Chris M. Thomasson
On Friday, April 20, 2018 at 2:06:22 AM UTC-7, Dmitry Vyukov wrote:
>
> On Mon, Apr 16, 2018 at 12:32 AM, Chris M. Thomasson 
> <cri...@charter.net > wrote: 
> > 
> > 
> > On Friday, April 13, 2018 at 11:45:51 PM UTC-7, Dmitry Vyukov wrote: 
> >> 
> >> On Mon, Apr 9, 2018 at 3:38 AM, Chris M. Thomasson <cri...@charter.net> 
>
> >> wrote: 
> >> > On Saturday, April 7, 2018 at 1:46:20 AM UTC-7, Dmitry Vyukov wrote: 
> >> >> 
> >> >> On Thu, Apr 5, 2018 at 10:03 PM, Chris M. Thomasson 
> >> >> <cri...@charter.net> 
> >> >> wrote: 
> >> >> > On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov 
> wrote: 
> >> >> >> 
> >> >> >> On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson 
> >> >> >> <cri...@charter.net> wrote: 
> [...]
> > D should be a pure relaxed store, and C should not be covered by the 
> > RELEASE. Iirc, it works this way on SPARC RMO mode. However, on x86, C 
> will 
> > be covered because each store has implied release characteristics, wb 
> memory 
> > aside for a moment. 
>
> C and D are completely symmetric wrt the RELEASE. Later you can 
> discover that there is also a thread that does: 
>
> // consumer 2 
> while (C != 3) backoff; 
> ACQUIRE 
> assert(A == 1 && B == 2); 
>
> And now suddenly C is release operation exactly the same way D is. 
>

Argh! D is the "control" signal. C should not be used as a signal to 
acquire. Damn. C should not have to be involved because the RELEASE is 
_before_ C was assigned.

Argh.


> >> At lease this is how this is defined in C/C++ standards. 
> >> ACQUIRE/RELEASE fences do not establish any happens-before relations 
> >> themselves. You still need a load in one thread to observe a value 
> >> stored in another thread. And only that "materializes" standalone 
> >> fence synchronization. So a store that materializes RELEASE fence will 
> >> always be a subsequent store. 
> > 
> > 
> > Humm... That is too strict, and has to be there whether we use 
> standalone 
> > fences or not. 
>
> No, in C/C++ memory ordering constrains tied to memory operations act 
> only on that memory operation. 


> Consider: 
>
> DATA = 1; 
> C.store(1, memory_order_release); 
> D.store(1, memory_order_relaxed); 
>
> vs: 
>
> DATA = 1; 
> atomic_thread_fence(memory_order_release); 
> C.store(1, memory_order_relaxed); 
> D.store(1, memory_order_relaxed); 
>
>
> And 2 consumers: 
>
> // consumer 1 
> while (C.load(memory_order_acquire) == 0) backoff(); 
> assert(DATA == 1); 
>
> // consumer 2 
> while (D.load(memory_order_acquire) == 0) backoff(); 
> assert(DATA == 1); 
>
> Both consumers are correct wrt the atomic_thread_fence version of 
> producer. But only the first one is correct wrt the store(1, 
> memory_order_release) version of producer. 
>
> And this can actually break on x86 because: 
>

Iirc, x86 has implied release semantics on stores, and acquire on loads. If 
we load C and observe 1, we will see DATA as 1. However, D is not in sync 
at all. If we load D and see it as being 1, then C = 1 and DATA = 1.

Consumer 1 will see C = 1 and DATA = 1: D will be out of sync
Consumer 2 will see D = 1, C = 1 and DATA = 1: They will all be in sync


 [...]

> > Yes! The stand alone fence can say, we want to perform an acquire 
> barrier 
> > wrt m_head. Something like that should be able to create more fine grain 
> > setups. Perhaps even something like the following pseudo-code: 
> > __ 
> > // setup 
> > int a = 0; 
> > int b = 0; 
> > int c = 0; 
> > signal = false; 
> > 
> > // producer 
> > a = 1; 
> > b = 2; 
> > RELEASE(, , ); 
> > c = 3; 
> > STORE_RELAXED(, true); 
> > 
> > // consumers 
> > while (LOAD_RELAXED() != true) backoff; 
> > ACQUIRE(, , ); 
> > assert(a == 1 && b == 2); 
> > __ 
> > 
> > The consumers would always see a and b as 1 and 2, however, c was not 
> > covered, so it is an incoherent state wrt said consumers. 
> > 
> > The acquire would only target a and b, as would the release. 
> > 
> > Hummm... Just thinking out loud here. :^) 
>
> This would help verification tools tremendously... but unfortunately 
> it's not the reality we are living in :) 
>

Shi% happens! However, it would allow one to combine the flexibility of 
standalone fences and targeted memory addresses all in one.

Need to think some more on this.

Thank you! :^D

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/24cd3cc6-49bc-4e9c-b045-4eb7650ddf7b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] One reason why I like atomic_thread_fence...

2018-04-15 Thread Chris M. Thomasson


On Friday, April 13, 2018 at 11:45:51 PM UTC-7, Dmitry Vyukov wrote:
>
> On Mon, Apr 9, 2018 at 3:38 AM, Chris M. Thomasson <cri...@charter.net 
> > wrote: 
> > On Saturday, April 7, 2018 at 1:46:20 AM UTC-7, Dmitry Vyukov wrote: 
> >> 
> >> On Thu, Apr 5, 2018 at 10:03 PM, Chris M. Thomasson <cri...@charter.net> 
>
> >> wrote: 
> >> > On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov wrote: 
> >> >> 
> >> >> On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson 
> >> >> <cri...@charter.net> wrote: 
> >> >> > Notice how there is an acquire barrier inside of the CAS loop 
> within 
> >> >> > the 
> >> >> > enqueue and dequeue functions of: 
> >> >> > 
> >> >> > 
> >> >> > 
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue 
> >> [...] 
> >> > Executing an acquire barrier on every iteration of the CAS loop is 
> not 
> >> > necessary. The actual version count keeps everything in order. 
> >> > 
> >> > However, you do need a single acquire fence _after_ the CAS loop 
> >> > succeeds in 
> >> > order to get a clear view of the element. 
> >> 
> >> This is true. 
> > 
> > 
> > Agreed. I personally like the ability to see the membars being separated 
> out 
> > and 
> > 
> > standing alone. It is a habit of mine from SPARC. Now, tagged standalone 
> > membars 
> > 
> > aside for a moment, perhaps ones that can include memory locations they 
> are 
> > 
> > interested in... ;^) 
> > 
> > 
> >> 
> >> 
> >> I don't like standalone fences because they are plague for 
> >> verification. Consider, a single release fences turns _all_ subsequent 
> >> relaxed atomic stores ever executed by the thread into release 
> >> operations (they release memory state up to the fence point) and 
> >> handling of acquire/release operations is an O(N) operation (and 
> >> generally done under a mutex). 
> > 
> > 
> > A release operation should make sure all _prior_ operations are visible 
> > _before_ 
> > 
> > they are visible to another thread. They have no effect on subsequent 
> > relaxed 
> > 
> > operations. For instance: 
> > 
> > 
> > // producer 
> > 
> > A = 1 
> > 
> > B = 2 
> > 
> > RELEASE 
> > 
> > C = 3 
> > 
> > D = 4 
> > 
> > 
> > // consumer 
> > 
> > while (D != 4) backoff; 
> > 
> > ACQUIRE 
> > 
> > assert(A == 1 && B == 2); 
> > 
> > 
> > Well, A and B are going be in sync with an acquire such that the assert 
> will 
> > never 
> > fail, however C can be hoisted up and not be in sync at all! C is 
> incoherent 
> > wrt the 
> > consumer because it was not covered by the standalone release barrier. 
>
>
> In this case the RELEASE turned store to D into a release-store (a 
> subsequent store). 
> And ACQUIRE turned load of D into an acquire-load (a preceding load). 
>

D should be a pure relaxed store, and C should not be covered by the 
RELEASE. Iirc, it works this way on SPARC RMO mode. However, on x86, C will 
be covered because each store has implied release characteristics, wb 
memory aside for a moment.

 

>
> At lease this is how this is defined in C/C++ standards. 
> ACQUIRE/RELEASE fences do not establish any happens-before relations 
> themselves. You still need a load in one thread to observe a value 
> stored in another thread. And only that "materializes" standalone 
> fence synchronization. So a store that materializes RELEASE fence will 
> always be a subsequent store. 
>

Humm... That is too strict, and has to be there whether we use standalone 
fences or not. The store to D = 4 makes A and B wrt the RELEASE visible to 
the consumer threads that look for D = 4 and execute the ACQUIRE barrier 
after that fact has been observed. Afaict, C should NOT be covered.

 

>
>
> >> The same for acquire fences: a single 
> >> acquire fences turns _all_ loads ever executed by the thread into 
> >> acquire operations ton he corresponding memory locations, which means 
> >> that you need to handle all relaxed loads as a "shadow" acquire loads 
> >> for the case they will be materialized by a subsequent acquire fence. 
>

That sounds too coarse.

 

> > 
> > 
> > An acquire operation should make sure all operations wrt

Re: [lock-free] One reason why I like atomic_thread_fence...

2018-04-08 Thread Chris M. Thomasson
On Saturday, April 7, 2018 at 1:46:20 AM UTC-7, Dmitry Vyukov wrote:
>
> On Thu, Apr 5, 2018 at 10:03 PM, Chris M. Thomasson <cri...@charter.net 
> > wrote: 
> > On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov wrote: 
> >> 
> >> On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson 
> >> <cri...@charter.net> wrote: 
> >> > Notice how there is an acquire barrier inside of the CAS loop within 
> the 
> >> > enqueue and dequeue functions of: 
> >> > 
> >> > 
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue 
> [...]
> > Executing an acquire barrier on every iteration of the CAS loop is not 
> > necessary. The actual version count keeps everything in order. 
> > 
> > However, you do need a single acquire fence _after_ the CAS loop 
> succeeds in 
> > order to get a clear view of the element. 
>
> This is true. 
>

Agreed. I personally like the ability to see the membars being separated 
out and standing alone. It is a habit of mine from SPARC. Now, tagged 
standalone membars aside for a moment, perhaps ones that can include memory 
locations they are interested in... ;^)
 

>
> I don't like standalone fences because they are plague for 
> verification. Consider, a single release fences turns _all_ subsequent 
> relaxed atomic stores ever executed by the thread into release 
> operations (they release memory state up to the fence point) and 
> handling of acquire/release operations is an O(N) operation (and 
> generally done under a mutex).


A release operation should make sure all _prior_ operations are visible 
_before_ they are visible to another thread. They have no effect on 
subsequent relaxed operations. For instance:


// producer
A = 1
B = 2
RELEASE
C = 3
D = 4

// consumer
while (D != 4) backoff;
ACQUIRE
assert(A == 1 && B == 2);


Well, A and B are going to be in sync with an acquire such that the assert 
will never fail; however, C can be hoisted up and not be in sync at all! C 
is incoherent wrt the consumer because it was not covered by the standalone 
release barrier. 


> The same for acquire fences: a single 
> acquire fence turns _all_ loads ever executed by the thread into 
> acquire operations to the corresponding memory locations, which means 
> that you need to handle all relaxed loads as "shadow" acquire loads 
> for the case they will be materialized by a subsequent acquire fence. 
>

An acquire operation should make sure all operations wrt the release are 
visible _before_ any subsequent operations can be performed _after_ that 
fact is accomplished.


Well, fwiw, the membars that can be embedded into the CAS wrt acquire and 
release do affect prior and subsequent activity anyway, standalone or not. 
A release will dump prior stores such that an acquire barrier will see them 
all. Now, when we are dealing with a consume barrier, well, that is 
targeting the release dependency chain wrt the pointer. A consume barrier 
is more precisely targeted when compared to the wider spectrum of an 
acquire. Btw, iirc consume is emulated in Relacy as acquire, right?


Also, think of popping all nodes at once from an atomic LIFO:


https://groups.google.com/d/topic/comp.lang.c++/V0s__czQwa0/discussion


Well, how can we accomplish the following without using standalone fences?:


  // try to flush all of our nodes 
  node* flush() 
  { 
  node* n = m_head.exchange(nullptr, mb_relaxed); 

  if (n) 
  { 
  mb_fence(mb_acquire); 
  } 

  return n; 
  }   
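
Afaict, without the standalone fence the XCHG itself has to carry the 
acquire, paying for it even on the common empty case where nothing was 
published, along the lines of:

  node* flush() 
  { 
      return m_head.exchange(nullptr, mb_acquire); 
  } 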

 

>
> The same is actually true for human reasoning. Say, I am reading your 
> code. We have 3 load operations in the loop and an acquire fences 
> after the loop. Now the question is: which of the loads we wanted to 
> turn into acquire by adding the fence? Or is it maybe 2 of them? 
> Which? Or maybe 1 in the loop and 1 somewhere before the loop, in a 
> different function? 
> One can, of course, comment that, but Relacy won't check comments, so 
> I won't trust them ;) 
>
 

Interesting. Still makes me think of tagged membars. I will get back to you 
with a more detailed response.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/22caa87d-eef7-462e-a7a2-8e54b5436fab%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] One reason why I like atomic_thread_fence...

2018-03-31 Thread Chris M. Thomasson


Notice how there is an acquire barrier inside of the CAS loop within the 
enqueue and dequeue functions of:


http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue


?


Well, fwiw we can get rid of them by using stand alone fences:


Btw, sorry for the membar abstraction. I can quickly get sick and tired of 
having to type out (std::memory_order_relaxed) all the damn time. ;^)


Anyway, here is the simple code using Relacy:

___

/* Membar Abstraction
___*/
#define mbrlx std::memory_order_relaxed
#define mbacq std::memory_order_acquire
#define mbrel std::memory_order_release
#define mb(mb_membar) std::atomic_thread_fence(mb_membar)


template<typename T, unsigned int T_N>
struct fifo
{
    struct cell
    {
        VAR_T(T) m_data;
        std::atomic<unsigned int> m_ver;
    };


    std::atomic<unsigned int> m_head;
    std::atomic<unsigned int> m_tail;
    cell m_cells[T_N];


    fifo() : m_head(0), m_tail(0)
    {
        for (unsigned int i = 0; i < T_N; ++i)
        {
            m_cells[i].m_ver.store(i, mbrlx);
        }
    }


    bool push_try(T const& data)
    {
        cell* c = nullptr;
        unsigned int ver = m_head.load(mbrlx);

        do
        {
            c = &m_cells[ver & (T_N - 1)];
            if (c->m_ver.load(mbrlx) != ver) return false;

        } while (! m_head.compare_exchange_weak(ver, ver + 1, mbrlx));

        mb(mbacq);
        VAR(c->m_data) = data;
        mb(mbrel);

        c->m_ver.store(ver + 1, mbrlx);

        return true;
    }


    bool pop_try(T& data)
    {
        cell* c = nullptr;
        unsigned int ver = m_tail.load(mbrlx);

        do
        {
            c = &m_cells[ver & (T_N - 1)];
            if (c->m_ver.load(mbrlx) != ver + 1) return false;

        } while (! m_tail.compare_exchange_weak(ver, ver + 1, mbrlx));

        mb(mbacq);
        data = VAR(c->m_data);
        mb(mbrel);

        c->m_ver.store(ver + T_N, mbrlx);

        return true;
    }
};

___



There is no need for any membars within the loops. The version count keeps 
everything in order.


:^)
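
If anybody wants to kick the tires, a trivial, hypothetical usage would be 
along the following lines (note that T_N must be a power of two for the 
(T_N - 1) mask to work):

fifo<unsigned int, 4> q;

unsigned int out = 0;
if (q.push_try(42) && q.pop_try(out))
{
    RL_ASSERT(out == 42);
}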

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/6da927c5-173f-4c16-b5df-4da539882549%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Simple DWCAS based Proxy Collector, refined... ;^)

2018-03-29 Thread Chris M. Thomasson
On Wednesday, March 28, 2018 at 11:20:58 PM UTC-7, Dmitry Vyukov wrote:
>
> On Tue, Mar 27, 2018 at 11:23 PM, Chris M. Thomasson 
> <cri...@charter.net > wrote: 
> > > Fwiw Dmitry, I have been working with fractals a lot lately, and was 
> > > wondering if I could still program a proxy collector from scratch. 
> Remember 
> > > my old collector here: 
> [...]
>
> > Old habits... :) 
>
> > Is there anything new, of just an implementation of old ideas? 
>

Unfortunately, not anything new. I just wanted to see if I could code it 
from scratch.  

Btw, there might be something a little newish coming up. Keep in mind that 
I have not worked on this stuff in a while.

At least Relacy still works for me. Thank you for creating it. :^)

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/661f1264-2b1f-40b6-aa2a-83a1e3bbab88%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Simple DWCAS based Proxy Collector, refined... ;^)

2018-03-29 Thread Chris M. Thomasson


On Wednesday, March 28, 2018 at 11:20:58 PM UTC-7, Dmitry Vyukov wrote:
>
> On Tue, Mar 27, 2018 at 11:23 PM, Chris M. Thomasson 
> <cri...@charter.net > wrote: 
> > Fwiw Dmitry, I have been working with fractals a lot lately, and was 
> > wondering if I could still program a proxy collector from scratch. 
> Remember 
> > my old collector here: 
> > 
> > http://webpages.charter.net/appcore/misc/pc_sample_h_v1.html 
> > 
> > ? 
> > 
> > You are using it as an example in Relacy. 
> > 
> > Wrt the following code, all of the membars are standalone fence 
> operations 
> > std::atomic_thread_fence. All of the membars in the atomic operations 
> are 
> > relaxed. It uses DWCAS for the acquire function because I wanted to 
> start 
> > out simple. I have also replaced a CAS with XCHG in the collect, or 
> mutate 
> > function if you will.  So far, it works in Relacy. :^) 
> > 
> > Here is a link to my code: 
> > 
> > https://pastebin.com/raw/RKTLyXWt 
> > 
> > ___ 
> > 
> > /* Simple Proxy Collector (DWCAS) - Relacy Version 
> > 
> > http://www.1024cores.net/home/relacy-race-detector 
> > 
> > Copyright 3/27/2018 
> > ___*/ 
> > 
> > 
> > #include <relacy/relacy_std.hpp> 
> > #include <iostream> 
> > #include <cstdint> 
> > 
> > 
> > /* Membar Abstraction 
> > ___*/ 
> > #define ct_mb_relaxed std::memory_order_relaxed 
> > #define ct_mb_acquire std::memory_order_acquire 
> > #define ct_mb_release std::memory_order_release 
> > #define ct_mb_seq_cst std::memory_order_seq_cst 
> > #define ct_mb_fence(mb_membar) std::atomic_thread_fence(mb_membar) 
> > 
> > 
> > /* Simple Proxy Collector C++11 
> > ___*/ 
> > namespace ct { 
> > namespace proxy { 
> > 
> > // User object base 
> > struct object 
> > { 
> > virtual ~object() {} 
> > }; 
> > 
> > 
> > // Proxy node 
> > struct node 
> > { 
> > std::atomic<std::intptr_t> count; 
> > VAR_T(node*) next; 
> > VAR_T(object*) obj; 
> > 
> > node(std::intptr_t count_, node* next_, object* obj_) 
> > : count(count_), next(next_), obj(obj_) 
> > { 
> > 
> > } 
> > 
> > ~node() 
> > { 
> > 
> > } 
> > }; 
> > 
> > 
> > // DWCAS target 
> > struct anchor 
> > { 
> > std::intptr_t count; 
> > node* head; 
> > 
> > anchor() 
> > { 
> > 
> > } 
> > 
> > anchor(std::intptr_t count_, node* head_) 
> > { 
> > count = count_; 
> > head = head_; 
> > } 
> > 
> > bool operator == (anchor const& right) const 
> > { 
> > return count == right.count && head == right.head; 
> > } 
> > 
> > anchor operator + (anchor const& right) const 
> > { 
> > anchor res; 
> > res.count = count + right.count; 
> > res.head = right.head; 
> > return res; 
> > } 
> > 
> > anchor operator - (anchor const& right) const 
> > { 
> > anchor res; 
> > res.count = count - right.count; 
> > res.head = right.head; 
> > return res; 
> > } 
> > }; 
> > 
> > std::ostream& operator << (std::ostream& s, anchor const& right) 
> > { 
> > return s << "{" << right.count << "," << right.head << "}"; 
> > } 
> > 
> > 
> > // Collector Logic 
> > struct gc 
> > { 
> > std::atomic<anchor> head; 
> > 
> > gc() : head(anchor{ 0, new node(0, nullptr, nullptr) }) {} 
> > 
> > ~gc() 
> > { 
> > anchor cmp = head.load(ct_mb_relaxed); 
> > RL_ASSERT(cmp.count > -1); 
> > RL_ASSERT(VAR(cmp.head->next) == nullptr); 
> > prv_dtor(cmp.head); 
> > } 
> > 
> > // Drops a reference count on a node 
> > bool prv_release(node* n) 
> > { 
> >

[lock-free] Re: Simple DWCAS based Proxy Collector, refined... ;^)

2018-03-27 Thread Chris M. Thomasson
Fwiw, the following is some C++11 code that should compile and run fine on 
systems that support lock-free DWCAS. I like the Relacy code because it 
helps make sure everything is Kosher. And, I like the C++11 code because it 
is an example of a real implementation. Anyway, here is the code:

https://pastebin.com/raw/KAt4nhCj
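
Btw, a quick standalone sanity check (my naming, plain C++11) reports 
whether a double-word anchor really maps to a lock-free DWCAS on the 
target; on x86-64, gcc/clang typically need -mcx16 for this to come back 1:

#include <atomic>
#include <cstdint>
#include <cstdio>

struct anchor
{
    std::intptr_t count;
    void* head;
};

int main()
{
    // trivially copyable two-word struct, so std::atomic accepts it
    std::atomic<anchor> a(anchor{ 0, nullptr });
    std::printf("anchor lock-free: %d\n", (int)a.is_lock_free());
    return 0;
}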

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/1e1cc3e9-1b68-431a-8299-c11680f18a56%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Simple DWCAS based Proxy Collector, refined... ;^)

2018-03-27 Thread Chris M. Thomasson
Fwiw Dmitry , I have been 
working with fractals a lot lately, and was wondering if I could still 
program a proxy collector from scratch. Remember my old collector here:

http://webpages.charter.net/appcore/misc/pc_sample_h_v1.html

?

You are using it as an example in Relacy.

Wrt the following code, all of the membars are standalone fence operations 
std::atomic_thread_fence. 
All of the membars in the atomic operations are relaxed. It uses DWCAS for 
the acquire function because I wanted to start out simple. I have also 
replaced a CAS with XCHG in the collect, or mutate function if you will.  So 
far, it works in Relacy. :^)

Here is a link to my code:

https://pastebin.com/raw/RKTLyXWt

___

/* Simple Proxy Collector (DWCAS) - Relacy Version

http://www.1024cores.net/home/relacy-race-detector

Copyright 3/27/2018
___*/


#include <relacy/relacy_std.hpp>
#include <iostream>
#include <cstdint>


/* Membar Abstraction
___*/
#define ct_mb_relaxed std::memory_order_relaxed
#define ct_mb_acquire std::memory_order_acquire
#define ct_mb_release std::memory_order_release
#define ct_mb_seq_cst std::memory_order_seq_cst
#define ct_mb_fence(mb_membar) std::atomic_thread_fence(mb_membar) 


/* Simple Proxy Collector C++11
___*/
namespace ct { 
namespace proxy {

// User object base
struct object
{
virtual ~object() {}
};


// Proxy node
struct node
{
std::atomic<std::intptr_t> count;
VAR_T(node*) next;
VAR_T(object*) obj;

node(std::intptr_t count_, node* next_, object* obj_)
: count(count_), next(next_), obj(obj_)
{

}

~node()
{

}
};


// DWCAS target
struct anchor
{
std::intptr_t count;
node* head;

anchor()
{

}

anchor(std::intptr_t count_, node* head_)
{
count = count_;
head = head_;
}

bool operator == (anchor const& right) const
{
return count == right.count && head == right.head;
}

anchor operator + (anchor const& right) const
{
anchor res;
res.count = count + right.count;
res.head = right.head;
return res;
}

anchor operator - (anchor const& right) const
{
anchor res;
res.count = count - right.count;
res.head = right.head;
return res;
}
};

std::ostream& operator << (std::ostream& s, anchor const& right)
{
return s << "{" << right.count << "," << right.head << "}";
}


// Collector Logic
struct gc
{
std::atomic<anchor> head;

gc() : head(anchor{ 0, new node(0, nullptr, nullptr) }) {}

~gc()
{
anchor cmp = head.load(ct_mb_relaxed);
RL_ASSERT(cmp.count > -1);
RL_ASSERT(VAR(cmp.head->next) == nullptr);
prv_dtor(cmp.head);
}

// Drops a reference count on a node
bool prv_release(node* n)
{
std::intptr_t count = n->count.fetch_sub(2, ct_mb_relaxed);
if (count == 3) return true;
return false;
}

// Deletes a node
void prv_dtor(node* n)
{
object* obj = VAR(n->obj);
if (obj != nullptr) delete obj;
delete n;
}

// Dumps backlinked nodes
void prv_dump(node* n)
{
node* cur = VAR(n->next);
VAR(n->next) = nullptr;

// Release and collect in cur LIFO
while (prv_release(cur))
{
ct_mb_fence(ct_mb_acquire);
node* next = VAR(cur->next);
VAR(cur->next) = n;
n = cur;
cur = next;
}

// Destroy cur LIFO
while (n)
{
node* next = VAR(n->next);
prv_dtor(n);
n = next;
}
}

// Collects a node
void prv_collect(node* n)
{
anchor xchg = { 0, n };

ct_mb_fence(ct_mb_release);

// Swap in new node
anchor cmp = head.exchange(xchg, ct_mb_relaxed);

// Backlink
ct_mb_fence(ct_mb_acquire);
VAR(cmp.head->next) = xchg.head;
ct_mb_fence(ct_mb_release);

// Release and mark as collected
std::intptr_t count = cmp.head->count.fetch_add(cmp.count + 1, 
ct_mb_relaxed);

if (count + cmp.count == 0)
{
prv_dump(cmp.head);
}
}


// Acquires current node
node* acquire()
{
anchor cmp = head.load(ct_mb_relaxed);
anchor xchg =