Wait-free thread communication

2016-01-08 Thread Jin via Digitalmars-d
Idea: no mutex, no CAS, only one-direction queues without any 
locks.


My prototype (https://github.com/nin-jin/go.d) is up to 10x 
faster than std.concurrency.send/receive


---
writers =512
readers =1
std.concurrency milliseconds=1313
jin.go milliseconds=113
---

Implementation:

Queue! - a buffered, statically typed channel. One and only one 
thread can push data to the queue, and one and only one can take 
data from it. A thread blocks when it takes from an empty queue 
or pushes to a full queue.


Queues! - a list of queues with round-robin balancing

go! - starts a new thread; in the future I want to implement 
something like goroutines, but I do not understand how yet.
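
Rough usage sketch (names are illustrative only; see the repo for 
the real API):

import std.stdio : writeln;

void produce( Queue!int output )
{
    foreach( i ; 0 .. 1000 )
        output.put( i * i );   // blocks while the queue is full
}

void main()
{
    auto queue = new Queue!int;
    go!produce( queue );       // runs produce in a new thread

    foreach( _ ; 0 .. 1000 )
        writeln( queue.take ); // blocks while the queue is empty
}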


I need some good advice, I am a newbie in D :-)

Problems:

For blocking a thread I use a loop with Thread.sleep - this is a 
bad decision IMHO.


On my system I can create only up to 1000 threads without errors. 
Fibers from core.thread or Tasks from std.parallelism could 
potentially solve this problem, but how do I integrate them 
gracefully?


Could the API of jin-go be better?

Which dub modules can help me?


Re: Wait-free thread communication

2016-01-08 Thread Nordlöw via Digitalmars-d

On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
Idea: no mutex, no CAS, only one-direction queues without any 
locks.


My prototype (https://github.com/nin-jin/go.d) is up to 10x 
faster than std.concurrency.send/receive


Very interesting. D has built-in unittests. You should add stress 
unittests to assure that your logic is correct. You can start by 
searching for the keyword `unittest` in the std.concurrency 
module for advice on how to do this.
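
For instance, something like this (a sketch that uses 
std.concurrency itself as the reference pattern; swap in your 
queue to stress it):

unittest
{
    // Stress sketch: many writers, one reader.
    import std.concurrency : receive, send, spawn, thisTid, Tid;

    enum writers = 8;
    enum perWriter = 1_000;

    auto owner = thisTid;
    foreach (w; 0 .. writers)
        spawn(function(Tid o) {
            foreach (i; 0 .. perWriter)
                o.send(i);
        }, owner);

    long sum;
    foreach (_; 0 .. writers * perWriter)
        receive((int i) { sum += i; });

    // each writer sends 0 + 1 + ... + (perWriter - 1)
    assert(sum == cast(long) writers * (perWriter - 1) * perWriter / 2);
}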


Re: Wait-free thread communication

2016-01-08 Thread Jin via Digitalmars-d

On Friday, 8 January 2016 at 18:07:49 UTC, Nordlöw wrote:

On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
Idea: no mutex, no CAS, only one-direction queues without any 
locks.


My prototype (https://github.com/nin-jin/go.d) is up to 10x 
faster than std.concurrency.send/receive


Very interesting. D has built-in unittests. You should add 
stress unittests to assure that your logic is correct. You can 
start by searching for the keyword `unittest` in the 
std.concurrency module for advice on how to do this.


I just added unit tests. But how do I idiomatically implement 
benchmarks (compilation must be in release mode)? Currently I 
implement a main function in app.d and run it with "dub 
--build=release", but then nobody can import my module.
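
One option I see (just a sketch): gate the benchmark entry point 
behind a version flag that a dedicated dub configuration sets, so 
the library itself stays importable:

version( benchmark )
{
    void main()
    {
        import std.datetime : StopWatch;
        import std.stdio : writeln;

        StopWatch sw;
        sw.start();
        // ... run the writers/readers scenario here ...
        sw.stop();
        writeln( "milliseconds=", sw.peek.msecs );
    }
}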


Re: Wait-free thread communication

2016-01-09 Thread thedeemon via Digitalmars-d

On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
Idea: no mutex, no CAS, only one-direction queues without any 
locks.


My prototype (https://github.com/nin-jin/go.d) is up to 10x 
faster than std.concurrency.send/receive


Sorry, that's not looking any good.
You call it wait-free when in fact it's just the opposite: if a 
queue buffer is full on push, it just waits in Thread.sleep, 
which is

1) not wait-free at all
2) very heavy (a call into the kernel, a context switch)

And when the buffer is not full/empty it works as a simple 
single-threaded queue, which means it's fine for use with fibers 
inside one thread but will not work correctly in a multithreaded 
setting.


Your benchmarks show a time difference in your favor just because 
you compare very different things: your queue is benchmarked in a 
single thread with fibers while std.concurrency is measured with 
multiple threads communicating with each other. Doing many 
switches between threads is much slower than switching between 
fibers in one thread, hence the time difference. It's not that 
your queue is any good, it's just that you measure it wrong.


Re: Wait-free thread communication

2016-01-09 Thread Jin via Digitalmars-d

On Saturday, 9 January 2016 at 08:28:33 UTC, thedeemon wrote:
Your benchmarks show a time difference in your favor just because 
you compare very different things: your queue is benchmarked in a 
single thread with fibers while std.concurrency is measured 
with multiple threads communicating with each other. Doing many 
switches between threads is much slower than switching between 
fibers in one thread, hence the time difference. It's not that 
your queue is any good, it's just that you measure it wrong.


No, jin.go creates a new native thread on every call. And this is 
a problem :-) We cannot create thousands of threads without errors.


std.concurrency uses a mutex to synchronize access to the message 
queue. The cost of synchronization is proportional to the number 
of threads.


On Saturday, 9 January 2016 at 08:28:33 UTC, thedeemon wrote:
You call it wait-free when in fact it's just the opposite: if a 
queue buffer is full on push, it just waits in Thread.sleep, 
which is

1) not wait-free at all
2) very heavy (a call into the kernel, a context switch)
And when the buffer is not full/empty it works as a simple 
single-threaded queue, which means it's fine for use with fibers 
inside one thread but will not work correctly in a multithreaded 
setting.


Taking data from an empty queue and pushing data to a full queue 
is logically impossible for queues. We have 3 strategies for 
these cases:
1. Throw a runtime exception, like std.concurrency.send does on a 
full message box by default.
2. Block the thread, like std.concurrency.receive does on an 
empty message box.
3. Skip the action.

jin-go uses the blocking strategy by default, and you can check 
queue.empty and queue.full if you want to implement another 
strategy.
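
The three strategies, sketched with the current API (names as 
above, details may differ):

// 1. throw: fail fast when the queue is full
if( queue.full )
    throw new Exception( "queue overflow" );
queue.put( value );

// 2. block (the default): put simply waits until there is room
queue.put( value );

// 3. skip: silently drop the value when there is no room
if( !queue.full )
    queue.put( value );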


Re: Wait-free thread communication

2016-01-09 Thread thedeemon via Digitalmars-d

On Saturday, 9 January 2016 at 11:00:01 UTC, Jin wrote:

No, jin.go creates a new native thread on every call. And this is 
a problem :-) We cannot create thousands of threads without 
errors.


Ah, sorry, I misread the source. FiberScheduler got me 
distracted. Why is it there?


Re: Wait-free thread communication

2016-01-09 Thread Andy Smith via Digitalmars-d

On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
Idea: no mutex, no CAS, only one-direction queues without any 
locks.


My prototype (https://github.com/nin-jin/go.d) is up to 10x 
faster than std.concurrency.send/receive


---
writers =512
readers =1
std.concurrency milliseconds=1313
jin.go milliseconds=113
---

Implementation:

Queue! - a buffered, statically typed channel. One and only one 
thread can push data to the queue, and one and only one can take 
data from it. A thread blocks when it takes from an empty queue 
or pushes to a full queue.


Queues! - a list of queues with round-robin balancing

go! - starts a new thread; in the future I want to implement 
something like goroutines, but I do not understand how yet.


I need some good advice, I am a newbie in D :-)

Problems:

For blocking a thread I use a loop with Thread.sleep - this is a 
bad decision IMHO.


On my system I can create only up to 1000 threads without 
errors. Fibers from core.thread or Tasks from std.parallelism 
could potentially solve this problem, but how do I integrate 
them gracefully?


Could the API of jin-go be better?

Which dub modules can help me?



I'm a little worried that you have no volatile writes or fences 
around your code when you 'publish' an event using head/tail etc. 
It looks like it's working, but how are you ensuring that no 
compiler/CPU reordering is occurring? Does x86_64 actually allow 
you to get away with this? I know its memory model is stricter 
than others...


Cheers,

A.


Re: Wait-free thread communication

2016-01-09 Thread Jin via Digitalmars-d

On Saturday, 9 January 2016 at 13:55:16 UTC, thedeemon wrote:

On Saturday, 9 January 2016 at 11:00:01 UTC, Jin wrote:

No, jin.go creates new native thread on every call. And this 
is problem :-) We can not create thousand of threads without 
errors.


Ah, sorry, I misread the source. FiberScheduler got me 
distracted. Why is it there?


This is an artefact of my experiments with fibers :-)


Re: Wait-free thread communication

2016-01-09 Thread Ola Fosheim Grøstad via Digitalmars-d

On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
I'm a little worried that you have no volatile writes or fences 
around your code when you 'publish' an event using head/tail 
etc. It looks like it's working, but how are you ensuring that 
no compiler/CPU reordering is occurring? Does x86_64 actually 
allow you to get away with this? I know its memory model is 
stricter than others...


But not on ARM, so he should use atomic acquire-release semantics 
on indices for push/pull.


I suggest porting over spsc_queue from Boost.



Re: Wait-free thread communication

2016-01-09 Thread Jin via Digitalmars-d

On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
I'm a little worried that you have no volatile writes or fences 
around your code when you 'publish' an event using head/tail 
etc. It looks like it's working, but how are you ensuring that 
no compiler/CPU reordering is occurring? Does x86_64 actually 
allow you to get away with this? I know its memory model is 
stricter than others...


This works fine in my tests, but how do I enforce a memory 
barrier in D?


Re: Wait-free thread communication

2016-01-09 Thread Jin via Digitalmars-d

On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
I'm a little worried that you have no volatile writes or fences 
around your code when you 'publish' an event using head/tail 
etc. It looks like it's working, but how are you ensuring that 
no compiler/CPU reordering is occurring? Does x86_64 actually 
allow you to get away with this? I know its memory model is 
stricter than others...


I just added an atomic fence for push and take:

this.messages[ this.tail ] = value;
atomicFence;
this.tail = ( this.tail + 1 ) % this.size;



Re: Wait-free thread communication

2016-01-09 Thread Ola Fosheim Grøstad via Digitalmars-d

On Saturday, 9 January 2016 at 15:51:51 UTC, Jin wrote:

On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
I'm a little worried that you have no volatile writes or fences 
around your code when you 'publish' an event using head/tail 
etc. It looks like it's working, but how are you ensuring that 
no compiler/CPU reordering is occurring? Does x86_64 actually 
allow you to get away with this? I know its memory model is 
stricter than others...


I just added an atomic fence for push and take:

this.messages[ this.tail ] = value;
atomicFence;
this.tail = ( this.tail + 1 ) % this.size;


You need it in the tests.

I haven't used atomics in D, but you have

  atomicLoad(MemoryOrder ms = MemoryOrder.seq, T)(ref const shared T val)
  atomicStore(MemoryOrder ms = MemoryOrder.seq, T, V1)(ref shared T val, V1 newval)


The compiler should then be able to ignore it if the CPU handles 
it well enough, assuming the D implementation works. Something 
along the lines of:


  immutable i = atomicLoad!(MemoryOrder.raw)(tail);
  immutable n = (i + 1) % size;

  if( n == atomicLoad!(MemoryOrder.acq)(head) ) return QUEUE_FULL;

  buffer[i] = ...;                         // write the slot first
  atomicStore!(MemoryOrder.rel)(tail, n);  // then publish it

Or something like that.


Re: Wait-free thread communication

2016-01-09 Thread Jin via Digitalmars-d
On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim Grøstad 
wrote:

You need it in the tests.


If memory writes were reordered, the current tests would fail. 
Memory reads can be in any order.


On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim Grøstad 
wrote:

I haven't used atomics in D, but you have

  atomicLoad(MemoryOrder ms = MemoryOrder.seq, T)(ref const shared T val)
  atomicStore(MemoryOrder ms = MemoryOrder.seq, T, V1)(ref shared T val, V1 newval)


I do not understand how to use them correctly, so I simply use 
atomicFence. Performance does not degrade.


Re: Wait-free thread communication

2016-01-09 Thread Ola Fosheim Grøstad via Digitalmars-d

On Saturday, 9 January 2016 at 16:22:12 UTC, Jin wrote:
On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim 
Grøstad wrote:

You need it in the tests.


If memory writes were reordered, the current tests would fail. 
Memory reads can be in any order.


They have to be atomic if you want your code to be portable.



Re: Wait-free thread communication

2016-01-09 Thread Jin via Digitalmars-d
On Saturday, 9 January 2016 at 16:29:07 UTC, Ola Fosheim Grøstad 
wrote:

They have to be atomic if you want your code to be portable.


I do not want that yet :-)


Re: Wait-free thread communication

2016-01-09 Thread David Nadlinger via Digitalmars-d

On Saturday, 9 January 2016 at 16:22:12 UTC, Jin wrote:

So I simply use atomicFence. Performance does not degrade.


Either you are not calling it in the way you think you are, then, 
or your code is otherwise very unoptimized. You can definitely 
measure the impact of a full mfence on otherwise tight lock-free 
code.


 — David


Re: Wait-free thread communication

2016-01-09 Thread Andy Smith via Digitalmars-d

On Saturday, 9 January 2016 at 17:24:47 UTC, Jin wrote:
On Saturday, 9 January 2016 at 16:29:07 UTC, Ola Fosheim 
Grøstad wrote:

They have to be atomic if you want your code to be portable.


I do not want that yet :-)


Okay, I've just refreshed my mental x86/64 memory model :-) What 
you've done should be okay from a CPU-reordering perspective: on 
x86/64, loads are not reordered with other loads, and stores are 
not reordered with other stores. So from a CPU perspective you 
should be okay on x86/64.


There's still a grey area for me around possible *compiler* 
re-orderings... there's nothing to stop the compiler reordering 
those two lines with respect to each other, because from an 
'as-if-sequential' perspective those two operations are 
commutative. Unfortunately I'm not well versed enough in the 
internals of the three main compilers to give an opinion on 
whether this will be a problem or not :-(


If your version is working for you on your platform/compiler, I'd 
recommend being a bit conservative and wrapping your code in 
appropriate version blocks until you know the answer to these 
questions for the other platforms/compilers (ARM, GDC, LDC, 
etc.), and putting an appropriate explanation of why that's the 
case in your README.md. I think that's just basic courtesy to 
other users: don't let them walk over a minefield without fair 
warning :-)
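
Something along these lines, maybe (just a sketch):

// Sketch: only enable the fence-free fast path on combinations
// that have actually been verified; fail the build loudly
// everywhere else.
version( X86_64 )
{
    version( DigitalMars )
        enum fenceFreeVerified = true;   // the tested combination
    else
        enum fenceFreeVerified = false;  // GDC/LDC: optimizer may reorder
}
else
{
    enum fenceFreeVerified = false;      // ARM etc.: weaker memory model
}

static assert( fenceFreeVerified,
    "jin-go's lock-free path is not verified on this target - see README" );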


Cheers,

A.


Re: Wait-free thread communication

2016-01-09 Thread David Nadlinger via Digitalmars-d
On Saturday, 9 January 2016 at 15:16:43 UTC, Ola Fosheim Grøstad 
wrote:

On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
I'm a little worried that you have no volatile writes or fences 
around your code when you 'publish' an event using head/tail 
etc. It looks like it's working, but how are you ensuring that 
no compiler/CPU reordering is occurring? Does x86_64 actually 
allow you to get away with this? I know its memory model is 
stricter than others...


But not on ARM, so he should use atomic acquire-release 
semantics on indices for push/pull.


Not only that. It's a problem on x86 as well, because advanced 
optimizers like those in GDC or LDC will happily assume that the 
members are not written to by another thread (because there would 
be a race otherwise) and cache the loads or even eliminate some 
stores. Any guarantees the processor would offer are irrelevant 
if the compiler has already reordered your operations. It should 
be quite easy to see such effects: just compile a simple test 
case on GDC or LDC with optimizations on.



I suggest porting over spsc_queue from Boost.


That would certainly be a starting point, although a 
high-performance single-producer single-consumer queue is trivial 
to implement once you understand atomics.


Then again, I couldn't convince Andrei yet that a well-defined 
memory model (in which it even makes sense to talk about atomics 
in the first place) is important for D. Right now, you basically 
have to hope that everything works as in C++ – which is not a bad 
bet on GDC and LDC, and the weaker optimizer in DMD hides a lot 
of potential issues there – and stay away from things like 
`consume` loads that would depend on language semantics.


 — David


Re: Wait-free thread communication

2016-01-09 Thread David Nadlinger via Digitalmars-d

On Saturday, 9 January 2016 at 17:44:26 UTC, Andy Smith wrote:
Unfortunately I'm not well versed enough in the internals of the 
three main compilers to give an opinion on whether this will be 
a problem or not :-(


It might work on DMD, but I know from experience (one-to-many 
queues, i.e. single-consumer-multi-producer or the other way 
round) that the LDC optimizer will ruthlessly break code written 
in that "I know that it would work in assembly, let's just hope 
the compiler doesn't touch it" style.


 — David


Re: Wait-free thread communication

2016-01-09 Thread Andy Smith via Digitalmars-d
On Saturday, 9 January 2016 at 17:49:46 UTC, David Nadlinger 
wrote:

On Saturday, 9 January 2016 at 17:44:26 UTC, Andy Smith wrote:
Unfortunately I'm not well versed enough in the internals of the 
three main compilers to give an opinion on whether this will be 
a problem or not :-(


It might work on DMD, but I know from experience (one-to-many 
queues, i.e. single-consumer-multi-producer or the other way 
round) that the LDC optimizer will ruthlessly break code 
written in that "I know that it would work in assembly, let's 
just hope the compiler doesn't touch it" style.


 — David


Ah Cheers David... I think our messages crossed :-)

Cheers,

A.



Re: Wait-free thread communication

2016-01-09 Thread Ola Fosheim Grøstad via Digitalmars-d
On Saturday, 9 January 2016 at 17:42:41 UTC, David Nadlinger 
wrote:
Not only that. It's a problem on x86 as well, because advanced 
optimizers like those in GDC or LDC will happily assume that 
the members are not written to by another thread (because there 
would be a race otherwise) and cache the loads or even 
eliminate some stores. Any guarantees the processor would offer 
are irrelevant if the compiler has already reordered your 
operations. It should be quite easy to see such effects: just 
compile a simple test case on GDC or LDC with optimizations on.


Yes, well, he had a sleep() in there that might prevent the 
compiler from moving instructions etc., but even so... it is a 
bad thing to rely on.


Explicit atomics document what is going on (even when not 
technically needed), and should always be used where you want 
atomic operations, I think.


That would certainly be a starting point, although a 
high-performance single-producer single-consumer queue is 
trivial to implement once you understand atomics.


Yes. But my experience from writing custom multi-single queues is 
that it can end up harder than it looks to get it working and 
efficient. Don't be surprised if it takes 5-10x more time than 
anticipated to get it 100% right...


So why not start with something that is correct? You still have 
to spend quite a bit of time verifying that the code is 
identical... but that is much easier than "formally" proving 
that your own implementation is correct.


(Intuition is often wrong in this area...)


Re: Wait-free thread communication

2016-01-10 Thread David Nadlinger via Digitalmars-d
On Saturday, 9 January 2016 at 22:44:53 UTC, Ola Fosheim Grøstad 
wrote:
Yes. But my experience from writing custom multi-single queues 
is that it can end up harder than it looks to get it working 
and efficient. […] (Intuition is often wrong in this area...)


I wholeheartedly agree with that statement. However, if one 
doesn't understand atomics well enough to correctly implement 
even a simple single-single queue, I'm not sure whether one would 
be able to correctly port an existing implementation either.


Then again, the primitives might be similar enough between C++11 
and D so that a literal translation is possible without 
understanding what is going on.


 — David


Re: Wait-free thread communication

2016-01-10 Thread Martin Nowak via Digitalmars-d
On 01/08/2016 05:58 PM, Jin wrote:
> Idea: no mutex, no CAS, only one-direction queues without any locks.
> 
> My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than
> std.concurrency.send/receive

Yes, single-reader single-writer queues are the fastest way for
inter-thread communication.
You might have a look at my
[lock-free](http://code.dlang.org/packages/lock-free) package for
a correct implementation.

> Problems:
> 
> For blocking a thread I use a loop with Thread.sleep - this is a bad
> decision IMHO.

Have a look at this exponential backoff implementation from my GC
spinlock PR:
https://github.com/D-Programming-Language/druntime/pull/1447/files#diff-fb5cbe06e1aaf83814ccf5ff08f05519R34

In general you need some sort of configurable or adaptive backoff,
or you'll waste too much time on context switching.
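
The idea, as a minimal sketch (illustrative only, not the actual
PR code):

import core.thread : Thread;
import core.time : usecs;
import std.algorithm : min;

// Adaptive backoff: yield the time slice first, then sleep with
// an exponentially growing, capped interval.
struct Backoff
{
    private uint round;

    void pause()
    {
        if (round < 8)
            Thread.yield();                          // cheap at first
        else
            // exponentially growing sleep, capped at ~1ms
            Thread.sleep(usecs(1u << min(round - 8, 10)));
        ++round;
    }

    void reset() { round = 0; }  // call after a successful operation
}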

-Martin


Re: Wait-free thread communication

2016-01-10 Thread Martin Nowak via Digitalmars-d
On 01/09/2016 04:51 PM, Jin wrote:
> I just added an atomic fence for push and take:
> 
> this.messages[ this.tail ] = value;
> atomicFence;
> this.tail = ( this.tail + 1 ) % this.size;

Don't do this, memory fences are expensive.

This is what you need for a spsc queue.
https://github.com/MartinNowak/lock-free/commit/233739262c14e00866a60f4a6a86c1b979ac968b
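
In essence (a simplified sketch of the idea, not the linked code):

import core.atomic : atomicLoad, atomicStore, MemoryOrder;

// Bounded single-producer single-consumer ring buffer using
// acquire/release atomics instead of full fences.
struct SpscQueue(T, size_t capacity = 64)
{
    private T[capacity] buffer;
    private shared size_t head;  // next read slot, advanced by the consumer
    private shared size_t tail;  // next write slot, advanced by the producer

    bool tryPut(T value)  // producer thread only
    {
        immutable t = atomicLoad!(MemoryOrder.raw)(tail);
        immutable n = (t + 1) % capacity;
        if (n == atomicLoad!(MemoryOrder.acq)(head))
            return false;                        // full
        buffer[t] = value;
        atomicStore!(MemoryOrder.rel)(tail, n);  // publish the element
        return true;
    }

    bool tryTake(out T value)  // consumer thread only
    {
        immutable h = atomicLoad!(MemoryOrder.raw)(head);
        if (h == atomicLoad!(MemoryOrder.acq)(tail))
            return false;                        // empty
        value = buffer[h];
        atomicStore!(MemoryOrder.rel)(head, (h + 1) % capacity);
        return true;
    }
}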


Re: Wait-free thread communication

2016-01-13 Thread Jin via Digitalmars-d

On Sunday, 10 January 2016 at 21:25:34 UTC, Martin Nowak wrote:
For blocking a thread I use a loop with Thread.sleep - this is a 
bad decision IMHO.

Have a look at this exponential backoff implementation from my GC
spinlock PR:
https://github.com/D-Programming-Language/druntime/pull/1447/files#diff-fb5cbe06e1aaf83814ccf5ff08f05519R34
In general you need some sort of configurable or adaptive 
backoff, or you'll waste too much time on context switching.


I am using a Waiter for this: 
https://github.com/nin-jin/go.d/blob/master/source/jin/go.d#L171


Re: Wait-free thread communication

2016-03-27 Thread Jin via Digitalmars-d
I just used these queues to implement a Go-like API for 
concurrency (coroutines + channels): 
http://forum.dlang.org/thread/lcfnfnhjzonkdkeau...@forum.dlang.org


Re: Wait-free thread communication

2016-03-28 Thread Dmitry Olshansky via Digitalmars-d

On 27-Mar-2016 21:23, Jin wrote:

I just used these queues to implement a Go-like API for concurrency
(coroutines + channels):
http://forum.dlang.org/thread/lcfnfnhjzonkdkeau...@forum.dlang.org


If nothing has changed implementation-wise, these are just data-racy queues :)

--
Dmitry Olshansky


Re: Wait-free thread communication

2016-03-28 Thread Jin via Digitalmars-d

On Monday, 28 March 2016 at 16:39:45 UTC, Dmitry Olshansky wrote:

On 27-Mar-2016 21:23, Jin wrote:

I just used these queues to implement a Go-like API for concurrency
(coroutines + channels):
http://forum.dlang.org/thread/lcfnfnhjzonkdkeau...@forum.dlang.org


If nothing has changed implementation-wise, these are just 
data-racy queues :)


Why?


Re: Wait-free thread communication

2016-03-28 Thread Dmitry Olshansky via Digitalmars-d

On 28-Mar-2016 20:03, Jin wrote:

On Monday, 28 March 2016 at 16:39:45 UTC, Dmitry Olshansky wrote:

On 27-Mar-2016 21:23, Jin wrote:

I just used these queues to implement a Go-like API for concurrency
(coroutines + channels):
http://forum.dlang.org/thread/lcfnfnhjzonkdkeau...@forum.dlang.org


If nothing has changed implementation-wise, these are just data-racy queues :)


Why?


All I see is a ring buffer with a hand-wavy atomicFence on one of the 
mutating operations. popFront is not protected at all.


Also, force-yielding a thread is not a sane synchronization technique.

Overall - I suggest not labelling this as "wait free" code, as it's waaay 
far from what it takes to get that.


--
Dmitry Olshansky


Re: Wait-free thread communication

2016-03-28 Thread Jin via Digitalmars-d

On Monday, 28 March 2016 at 17:16:14 UTC, Dmitry Olshansky wrote:
If nothing has changed implementation-wise, these are just 
data-racy queues :)


Why?


All I see is a ring buffer with a hand-wavy atomicFence on one of 
the mutating operations. popFront is not protected at all.


popFront does not need protection. It is atomic for the producer.

Also, force-yielding a thread is not a sane synchronization 
technique.


Fibers are used here.

Overall - I suggest not labelling this as "wait free" code, as 
it's waaay far from what it takes to get that.


Each operation (clear, front, popFront, full, put) takes a fixed 
number of steps if you check for clear before accessing front and 
check for full before put. If you do not check this, you will be 
blocked, of course. What do you expect? An exception? Ignoring 
the action?