Re: Threadpools, difference between DMD and LDC

2014-08-05 Thread Russel Winder via Digitalmars-d-learn
On Mon, 2014-08-04 at 18:34 +, Dicebot via Digitalmars-d-learn
wrote:
[…]
 Well it is a territory not completely alien to me either ;) I am 
 less aware of academic research on the topic though, I just happen 
 to work in an industry where it matters.

I have been out of academia now for 14 years, but tracking the various
lists and blogs, not to mention SuperComputing conferences, there is
very little new stuff; the last 10 years have been about improving what
already exists. The one genuinely new thing is GPGPU, which started out
as an interesting side show but has now come front and centre for data
parallelism.

 I think the initial spread of the multi-threading approach happened 
 because it was so temptingly easy - no need to worry about 
 actually modelling the concurrent execution flow, blocking I/O 
 or scheduling; just write the code as usual and the OS will take 
 care of it. But there is no place for magic in the programming 
 world, and the approach fell hard once network services started to 
 scale.

Threads are infrastructure just like stack and heap: very, very, very
few people actually worry about and manage these resources explicitly;
most just leave the runtime system to handle it. OK, so the usual GC
argument can be plopped in here; let's not bother though, as we've been
through it three times this quarter :-)

 Right now is the glorious moment when engineers are finally 
 starting to appreciate how previous academia research can help 
 them solve practical issues and all this good stuff goes 
 mainstream :)

Actors are mid-1960s, dataflow early 1970s, CSP mid-1970s; it has taken
a long time for the fiasco of explicit shared-memory multithreading in
applications to pass. I can think of some applications which are
effectively operating systems and so need all the best shared-memory
multithreading techniques (I was involved in one 1999–2004), but most
applications people should be using actors, dataflow, CSP or data
parallelism as their application model, supported by library
frameworks/infrastructure.

[…]
 I doubt the programming / engineering community will ever accept 
 research stating that choosing an architecture can be done on a 
 purely theoretical basis :) It simply contradicts too much of the 
 daily experience which says that every concurrent application has 
 some unique traits to consider and only profiling can rule them 
 all.

Most solutions to problems or subproblems can be slotted into one of
actors, dataflow, pipeline, MVC, data parallelism or event loop for the
main picture. If tweaking is needed, profiling and small localized
tinkering can do the trick. I have yet to find many cases in my
(computation-oriented) world where that is needed. Maybe in an I/O
world there are different constraints.

-- 
Russel.
=
Dr Russel Winder  t: +44 20 7585 2200   voip: sip:russel.win...@ekiga.net
41 Buckmaster Roadm: +44 7770 465 077   xmpp: rus...@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder




Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Kapps via Digitalmars-d-learn
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:


 I have another question: it seems I can spawn hundreds of threads 
 (Heck, even 10_000 is accepted), even when I have 4-8 cores. Is 
 there a limit to the number of threads? I tried a threadpool 
 because in my application I feared having to spawn ~100-200 
 threads, but if that's not the case, I can drastically simplify my 
 code. Is spawning a thread a slow operation in general?


Without going into much detail: threads are heavy, and creating a 
thread is an expensive operation (which is partially why virtually 
every standard library includes a thread pool). Along with the 
overhead of creating the thread, you also get the overhead of 
additional context switches for each thread you have actively 
running. Context switches are expensive and a significant waste of 
time: your CPU sits there doing effectively nothing while the OS 
decides which thread runs next and restores its context. Even if 
10,000 threads doesn't run into a hard limit on thread count, it 
will add very significant overhead.


I haven't looked at your code in detail, but consider using the 
TaskPool if you just want to schedule some tasks to run amongst a 
few threads, or potentially using Fibers (which are fairly 
light-weight) instead of Threads.
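For what it's worth, a minimal sketch of that approach with std.parallelism might look like this (the fib function and its arguments are made up purely as a workload):

```d
import std.parallelism;
import std.stdio;

// A deliberately slow, made-up workload.
int fib(int n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

void main()
{
    // task! wraps a call without starting a thread of its own;
    // taskPool (by default totalCPUs - 1 workers) schedules it.
    auto a = task!fib(30);
    auto b = task!fib(31);
    taskPool.put(a);
    taskPool.put(b);

    // yieldForce blocks until the result is available,
    // working on other pool tasks in the meantime.
    writeln(a.yieldForce + b.yieldForce);
}
```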


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread David Nadlinger via Digitalmars-d-learn
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 This is correct – the LLVM optimizer indeed gets rid of the 
 loop completely.

 OK, that's clever. But I get this even when I put a writeln(some 
 msg) inside the task. I thought a write couldn't be optimized away 
 that way, and that it's a slow operation?


You need the _result_ of the computation for the writeln. LLVM's 
optimizer recognizes what the loop tries to compute, though, and 
replaces it with an equivalent expression for the sum of the 
series, as Trass3r alluded to.


Cheers,
David


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Philippe Sigaud via Digitalmars-d-learn
 Without going into much detail: Threads are heavy, and creating a thread is
 an expensive operation (which is partially why virtually every standard
 library includes a ThreadPool).

 I haven't looked into detail your code, but consider using the TaskPool if
 you just want to schedule some tasks to run amongst a few threads, or
 potentially using Fibers (which are fairly light-weight) instead of Threads.

OK, I get it. Just to be sure, there is no ThreadPool in Phobos or in
core, right?
IIRC, there are fibers somewhere in core, I'll have a look. I also
heard that vibe.d has them.


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Chris Cain via Digitalmars-d-learn
On Monday, 4 August 2014 at 12:05:31 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 OK, I get it. Just to be sure, there is no ThreadPool in Phobos 
 or in core, right?
 IIRC, there are fibers somewhere in core, I'll have a look. I 
 also heard that vibe.d has them.


There is. It's called taskPool, though:

http://dlang.org/phobos/std_parallelism.html#.taskPool


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Philippe Sigaud via Digitalmars-d-learn
On Mon, Aug 4, 2014 at 2:13 PM, Chris Cain via Digitalmars-d-learn
digitalmars-d-learn@puremagic.com wrote:

 OK, I get it. Just to be sure, there is no ThreadPool in Phobos or in
 core, right?

 There is. It's called taskPool, though:

 http://dlang.org/phobos/std_parallelism.html#.taskPool

Ah, std.parallelism. I stoopidly searched in std.concurrency and core.*
Thanks!


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Dicebot via Digitalmars-d-learn
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 I have another question: it seems I can spawn hundreds of threads 
 (Heck, even 10_000 is accepted), even when I have 4-8 cores. Is 
 there a limit to the number of threads? I tried a threadpool 
 because in my application I feared having to spawn ~100-200 
 threads, but if that's not the case, I can drastically simplify my 
 code. Is spawning a thread a slow operation in general?


Most likely those threads either do nothing or are short-lived, 
so you don't actually get 10,000 threads running simultaneously. 
In general you should expect your operating system to start 
stalling at a few thousand concurrent threads competing for 
context switches and system resources. Creating a new thread is 
a rather costly operation, though you may not spot it in synthetic 
snippets, only under actual load.


The modern default approach is to have a number of worker threads 
equal or close to the number of CPU cores, and to handle internal 
scheduling manually via fibers or some similar solution.


If you are totally new to the topic of concurrent services, 
getting familiar with http://en.wikipedia.org/wiki/C10k_problem 
may be useful :)


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Dicebot via Digitalmars-d-learn
On Monday, 4 August 2014 at 12:05:31 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 IIRC, there are fibers somewhere in core, I'll have a look. I 
 also heard that vibe.d has them.


http://dlang.org/phobos/core_thread.html#.Fiber

vibe.d adds some abstractions of its own on top, for example the 
Task concept and the notion of Isolated types for message passing, 
but the basic Fiber comes from druntime's core.thread.
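As a rough illustration of what core.thread's Fiber gives you (the loop body here is invented, not from the thread):

```d
import core.thread;
import std.stdio;

void main()
{
    // A fiber runs on the caller's thread until it yields;
    // call() resumes it exactly where it left off.
    auto f = new Fiber({
        foreach (i; 0 .. 3)
        {
            writeln("step ", i);
            Fiber.yield();  // hand control back to the caller
        }
    });

    // Drive the fiber to completion cooperatively.
    while (f.state != Fiber.State.TERM)
        f.call();
}
```

Note that nothing here is parallel: scheduling is entirely cooperative, which is exactly why fibers are so cheap compared to kernel threads.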


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Philippe Sigaud via Digitalmars-d-learn
On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
digitalmars-d-learn@puremagic.com wrote:

 Most likely those threads either do nothing or are short-lived, so you don't
 actually get 10,000 threads running simultaneously. In general you should
 expect your operating system to start stalling at a few thousand concurrent
 threads competing for context switches and system resources. Creating a new
 thread is a rather costly operation, though you may not spot it in synthetic
 snippets, only under actual load.

 The modern default approach is to have a number of worker threads equal or
 close to the number of CPU cores, and to handle internal scheduling manually
 via fibers or some similar solution.

That's what I guessed. It's just that I have tasks that will generate
other (linked) tasks, in a DAG. I can use a thread pool of 2-8
threads, but that means storing tasks and their relationships (which
is waiting on which, etc.). I rather liked the idea of spawning new
threads when I needed them ;)




 If you are totally new to the topic of concurrent services, getting familiar
 with http://en.wikipedia.org/wiki/C10k_problem may be useful :)

I'll have a look. I'm quite new, my only knowledge comes from reading
the concurrency threads here, std.concurrency, std.parallelism and
TDPL :)


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread via Digitalmars-d-learn
On Monday, 4 August 2014 at 14:56:36 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:

On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
digitalmars-d-learn@puremagic.com wrote:
 The modern default approach is to have a number of worker threads 
 equal or close to the number of CPU cores, and to handle internal 
 scheduling manually via fibers or some similar solution.

 That's what I guessed. It's just that I have tasks that will 
 generate other (linked) tasks, in a DAG. I can use a thread pool 
 of 2-8 threads, but that means storing tasks and their 
 relationships (which is waiting on which, etc.). I rather liked 
 the idea of spawning new threads when I needed them ;)


If you can live with the fact that your tasks might not be truly 
parallel (i.e. don't use busy waiting or other things that assume 
other tasks make progress while a specific task is running), and 
you only use them for computation (no synchronous I/O), you can 
still use the fibers in core.thread:


http://dlang.org/phobos/core_thread.html#.Fiber


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Dicebot via Digitalmars-d-learn
On Monday, 4 August 2014 at 14:56:36 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:

On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
digitalmars-d-learn@puremagic.com wrote:

 Most likely those threads either do nothing or are short-lived, so 
 you don't actually get 10,000 threads running simultaneously. In 
 general you should expect your operating system to start stalling 
 at a few thousand concurrent threads competing for context 
 switches and system resources. Creating a new thread is a rather 
 costly operation, though you may not spot it in synthetic 
 snippets, only under actual load.

 The modern default approach is to have a number of worker threads 
 equal or close to the number of CPU cores, and to handle internal 
 scheduling manually via fibers or some similar solution.

 That's what I guessed. It's just that I have tasks that will 
 generate other (linked) tasks, in a DAG. I can use a thread pool 
 of 2-8 threads, but that means storing tasks and their 
 relationships (which is waiting on which, etc.). I rather liked 
 the idea of spawning new threads when I needed them ;)


vibe.d additions may help here:

http://vibed.org/api/vibe.core.core/runTask
http://vibed.org/api/vibe.core.core/runWorkerTask
http://vibed.org/api/vibe.core.core/workerThreadCount

The task abstraction allows exactly that - spawning a new execution 
context and having it scheduled automatically via an underlying 
fiber/thread pool. However, I am not aware of any good tutorials 
about using those, so jump in at your own risk.




 If you are totally new to the topic of concurrent services, 
 getting familiar with http://en.wikipedia.org/wiki/C10k_problem 
 may be useful :)

 I'll have a look. I'm quite new, my only knowledge comes from 
 reading the concurrency threads here, std.concurrency, 
 std.parallelism and TDPL :)


Have fun :P It is a rapidly changing topic though; best practices 
may be out of date by the time you have read them :)


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Russel Winder via Digitalmars-d-learn
Sorry, I missed this thread (!) till now.

On Mon, 2014-08-04 at 13:36 +, Dicebot via Digitalmars-d-learn
wrote:
 On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
 Digitalmars-d-learn wrote:
  I have another question: it seems I can spawn hundreds of threads 
  (Heck, even 10_000 is accepted), even when I have 4-8 cores. Is 
  there a limit to the number of threads? I tried a threadpool 
  because in my application I feared having to spawn ~100-200 
  threads, but if that's not the case, I can drastically simplify my 
  code. Is spawning a thread a slow operation in general?

Are these std.concurrency threads or std.parallelism tasks?

A std.parallelism task is not a thread. Like Erlang or Java Fork/Join
framework, the program specifies units of work and then there is a
thread pool underneath that works on tasks as required. So you can have
zillions of tasks but there will only be a few actual threads working on
them.

 Most likely those threads either do nothing or are short-lived, so 
 you don't actually get 10,000 threads running simultaneously.

I suspect it is actually impossible to start this number of kernel
threads on any current kernel.

 In general you should expect your operating system to start 
 stalling at a few thousand concurrent threads competing for 
 context switches and system resources. Creating a new thread is a 
 rather costly operation, though you may not spot it in synthetic 
 snippets, only under actual load.

 The modern default approach is to have a number of worker threads 
 equal or close to the number of CPU cores, and to handle internal 
 scheduling manually via fibers or some similar solution.

I have no current data, but it used to be that for a single system it
was best to have one or two more threads than the number of cores.
Processor architectures and caching change, so new data is required; I
am sure someone somewhere has it though.

 If you are totally new to the topic of concurrent services, 
 getting familiar with http://en.wikipedia.org/wiki/C10k_problem 
 may be useful :)

I thought they'd moved on to the 100k problem.

There is an issue here that I/O bound concurrency and CPU bound
concurrency/parallelism are very different beasties. Clearly tools and
techniques can apply to either or both.



Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Dicebot via Digitalmars-d-learn
On Monday, 4 August 2014 at 16:38:24 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 The modern default approach is to have a number of worker threads 
 equal or close to the number of CPU cores, and to handle internal 
 scheduling manually via fibers or some similar solution.

 I have no current data, but it used to be that for a single system 
 it was best to have one or two more threads than the number of 
 cores. Processor architectures and caching change, so new data is 
 required; I am sure someone somewhere has it though.


This is why I added the "or close" remark :) The exact number almost 
always depends on the exact deployment layout - i.e. what other 
processes are running in the system, how hardware interrupts are 
handled and so on. It is something to decide for each specific 
application. Sometimes it is even best to have _fewer_ worker 
threads than CPU cores, if affinity is to be used for some other 
background service, for example.


 If you are totally new to the topic of concurrent services, 
 getting familiar with http://en.wikipedia.org/wiki/C10k_problem 
 may be useful :)

 I thought they'd moved on to the 100k problem.


True, C10K is a solved problem, but it is the best place to start 
if you want to understand why people even bother with all the 
concurrency complexity - the details can be a bit overwhelming if 
one starts completely from scratch.



 There is an issue here that I/O bound concurrency and CPU bound 
 concurrency/parallelism are very different beasties. Clearly tools 
 and techniques can apply to either or both.


Actually, with the CSP / actor model one can simply treat a 
long-running CPU computation as a form of I/O and apply the same 
asynchronous design techniques. For example, have a separate 
dedicated thread running the computation and send input there via 
message passing - the response message will act much like an I/O 
notification from the OS.
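A minimal std.concurrency sketch of that pattern (the summation workload and the message layout are invented for illustration):

```d
import std.concurrency;
import std.stdio;

// Dedicated computation thread: each incoming message is treated
// like an I/O request; the reply acts like an I/O completion.
void worker()
{
    receive((Tid requester, int n) {
        long sum = 0;
        foreach (i; 0 .. n)
            sum += i;            // stand-in for a long computation
        requester.send(sum);
    });
}

void main()
{
    auto tid = spawn(&worker);
    tid.send(thisTid, 1_000_000);       // "start the I/O"
    auto result = receiveOnly!long();   // "completion notification"
    writeln(result);
}
```

The requesting thread stays free to do other work between the send and the receive, which is exactly the asynchronous-I/O shape described above.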


Choosing the optimal concurrency architecture for an application is 
probably an even harder problem than naming identifiers.


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Russel Winder via Digitalmars-d-learn
On Mon, 2014-08-04 at 16:57 +, Dicebot via Digitalmars-d-learn
wrote:
[…]
 This is why I added the "or close" remark :) The exact number 
 almost always depends on the exact deployment layout - i.e. what 
 other processes are running in the system, how hardware interrupts 
 are handled and so on. It is something to decide for each specific 
 application. Sometimes it is even best to have _fewer_ worker 
 threads than CPU cores, if affinity is to be used for some other 
 background service, for example.

David chose to have the pool default to (number-of-cores - 1) threads,
if I remember correctly. I am not sure he manipulated affinity. This
ought to be on the list of things for a review of std.parallelism.

[…]

 Actually, with the CSP / actor model one can simply treat a 
 long-running CPU computation as a form of I/O and apply the same 
 asynchronous design techniques. For example, have a separate 
 dedicated thread running the computation and send input there via 
 message passing - the response message will act much like an I/O 
 notification from the OS.

Now you are on my territory :-) I have been banging on about message
passing parallelism architectures for 25 years, but sadly shared-memory
multi-threading became the standard model for some totally bizarre
reason. Probably everyone was taught they had to use all the wonderful
OS implementation concurrency techniques in all their application
code.

CSP is great, cf. Go, Python-CSP, GPars, actors are great, cf. Erlang,
Akka, GPars, but do not forget dataflow, cf. GPars, Actian DataRush.

There have been a number of PhDs trying to provide tools for deciding
which parallelism architecture is best suited to a given problem. Sadly
most of them have been ignored by the programming language community at
large.

 Choosing the optimal concurrency architecture for an application 
 is probably an even harder problem than naming identifiers.

'Fraid not, it's actually a lot easier.



Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Dicebot via Digitalmars-d-learn
On Monday, 4 August 2014 at 18:22:47 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 Actually, with the CSP / actor model one can simply treat a 
 long-running CPU computation as a form of I/O and apply the same 
 asynchronous design techniques. For example, have a separate 
 dedicated thread running the computation and send input there via 
 message passing - the response message will act much like an I/O 
 notification from the OS.

 Now you are on my territory :-) I have been banging on about 
 message passing parallelism architectures for 25 years, but sadly 
 shared-memory multi-threading became the standard model for some 
 totally bizarre reason. Probably everyone was taught they had to 
 use all the wonderful OS implementation concurrency techniques in 
 all their application code.


Well it is a territory not completely alien to me either ;) I am 
less aware of academic research on the topic though, I just happen 
to work in an industry where it matters.


I think the initial spread of the multi-threading approach happened 
because it was so temptingly easy - no need to worry about 
actually modelling the concurrent execution flow, blocking I/O 
or scheduling; just write the code as usual and the OS will take 
care of it. But there is no place for magic in the programming 
world, and the approach fell hard once network services started to 
scale.


Right now is the glorious moment when engineers are finally 
starting to appreciate how previous academia research can help 
them solve practical issues and all this good stuff goes 
mainstream :)


 There have been a number of PhDs trying to provide tools for 
 deciding which parallelism architecture is best suited to a given 
 problem. Sadly most of them have been ignored by the programming 
 language community at large.


I doubt the programming / engineering community will ever accept 
research stating that choosing an architecture can be done on a 
purely theoretical basis :) It simply contradicts too much of the 
daily experience which says that every concurrent application has 
some unique traits to consider and only profiling can rule them 
all.


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Philippe Sigaud via Digitalmars-d-learn
On Mon, Aug 4, 2014 at 6:21 PM, Dicebot via Digitalmars-d-learn
digitalmars-d-learn@puremagic.com wrote:

 vibe.d additions may help here:

 http://vibed.org/api/vibe.core.core/runTask
 http://vibed.org/api/vibe.core.core/runWorkerTask
 http://vibed.org/api/vibe.core.core/workerThreadCount

 The task abstraction allows exactly that - spawning a new execution context
 and having it scheduled automatically via an underlying fiber/thread pool.
 However, I am not aware of any good tutorials about using those, so jump in
 at your own risk.

Has anyone used (the fibers/tasks of) vibe.d for something other than
powering websites?


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Philippe Sigaud via Digitalmars-d-learn
On Mon, Aug 4, 2014 at 6:38 PM, Russel Winder via Digitalmars-d-learn
digitalmars-d-learn@puremagic.com wrote:

 Are these std.concurrency threads or std.parallelism tasks?

 A std.parallelism task is not a thread. Like Erlang or Java Fork/Join
 framework, the program specifies units of work and then there is a
 thread pool underneath that works on tasks as required. So you can have
 zillions of tasks but there will only be a few actual threads working on
 them.

That's it. Many tasks, a few working threads. That's what I'm
converging to. They are not particularly 'concurrent', but they can
depend on one another.

My only gripe with std.parallelism is that I cannot tell
whether it's worth using the module if tasks can create other
tasks and depend on them in a deeply interconnected graph. I mean, if
I have to write lots of scaffolding just to manage dependencies
between tasks, I might as well build it on core.thread and message
passing directly. I'm becoming quite enamoured of message passing,
maybe because it's a new shiny toy for me :)

That's for parsing, btw. I'm trying to write an n-core engine for my
Pegged parser generator project.



 Most likely those threads either do nothing or are short-lived,
 so you don't actually get 10,000 threads running simultaneously.

 I suspect it is actually impossible to start this number of kernel
 threads on any current kernel.

So, what happens when I do

void doWork() { ... }

Tid[] children;
foreach (_; 0 .. 10_000)
    children ~= spawn(&doWork);

?

I mean, it compiles and runs happily.
In my current tests, I end the application by sending all threads a
CloseDown message and waiting for an answer from each of them. That
takes about 1s on my machine.
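That shutdown pattern can be sketched with std.concurrency roughly like this (CloseDown and Done are hypothetical message types standing in for whatever the real code uses, and the worker body is elided):

```d
import std.concurrency;

struct CloseDown {}
struct Done {}

void doWork()
{
    // ... real work would go here ...
    receiveOnly!CloseDown();   // block until asked to shut down
    ownerTid.send(Done());     // acknowledge to the spawning thread
}

void main()
{
    Tid[] children;
    foreach (_; 0 .. 100)      // 10_000 also compiles, but stresses the OS
        children ~= spawn(&doWork);

    foreach (tid; children)
        tid.send(CloseDown());
    foreach (_; children)
        receiveOnly!Done();    // wait for every acknowledgement
}
```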

 I have no current data, but it used to be that for a single system it
 was best to have one or two more threads than the number of cores.
 Processor architectures and caching change, so new data is required; I
 am sure someone somewhere has it though.

I can add that, depending on the tasks I'm running, it's sometimes
better to use 4, 6, 8 or 10 threads, repeatedly for a given task. I'm
using a Core i7, which Linux sees as an 8-core.
So, well, I'll try and see.


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Dicebot via Digitalmars-d-learn
On Monday, 4 August 2014 at 21:19:14 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 Has anyone used (the fibers/tasks of) vibe.d for something other 
 than powering websites?


Atila has implemented an MQTT broker with it: 
https://github.com/atilaneves/mqtt
It is still a networking application though - I don't know of any 
pure offline usage.


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Sean Kelly via Digitalmars-d-learn

On Monday, 4 August 2014 at 21:19:14 UTC, Philippe Sigaud via
Digitalmars-d-learn wrote:


Has anyone used (the fibers/tasks of) vibe.d for something other 
than powering websites?


https://github.com/D-Programming-Language/phobos/pull/1910


Re: Threadpools, difference between DMD and LDC

2014-08-04 Thread Philippe Sigaud via Digitalmars-d-learn
 https://github.com/D-Programming-Language/phobos/pull/1910

Very interesting discussion, thanks. I'm impressed by the amount of
work you guys do on github.


Threadpools, difference between DMD and LDC

2014-08-03 Thread Philippe Sigaud via Digitalmars-d-learn

I'm trying to grok message passing. That's my very first foray
into this, so I'm probably making every mistake in the book :-)

I wrote a small threadpool test, it's there:

http://dpaste.dzfl.pl/3d3a65a00425

I'm playing with the number of threads and the number of tasks,
and getting a feel about how message passing works. I must say I
quite like it: it's a bit like suddenly being able to safely
return different types from a function.

What I don't get is the difference between DMD (I'm using 2.065)
and LDC (0.14-alpha1).

For DMD, I compile with -O -inline -noboundscheck
For LDC, I use -O3 -inline

LDC gives me smaller executables than DMD (also, 3 to 5 times
smaller than 0.13, good job!) but above all else incredibly,
astoundingly faster. I'm used to LDC producing 20-30% faster
programs, but here it's 1000 times faster!

8 threads, 1000 tasks: DMD:  4000 ms, LDC: 3 ms (!)

So my current hypothesis is a) I'm doing something wrong or b)
the tasks are optimized away or something.

Can someone confirm the results and tell me what I'm doing wrong?



Re: Threadpools, difference between DMD and LDC

2014-08-03 Thread safety0ff via Digitalmars-d-learn

On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:


Can someone confirm the results and tell me what I'm doing 
wrong?


LDC is likely optimizing the summation:

int sum = 0;
foreach(i; 0..task.goal)
sum += i;

To something like:

int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);


Re: Threadpools, difference between DMD and LDC

2014-08-03 Thread David Nadlinger via Digitalmars-d-learn

On Sunday, 3 August 2014 at 22:24:22 UTC, safety0ff wrote:

On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:


Can someone confirm the results and tell me what I'm doing 
wrong?


LDC is likely optimizing the summation:

int sum = 0;
foreach(i; 0..task.goal)
sum += i;

To something like:

int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);


This is correct – the LLVM optimizer indeed gets rid of the loop 
completely.


Although I'd be more than happy to be able to claim a 
thousandfold speedup over DMD on real-world applications. ;)


Cheers,
David


Re: Threadpools, difference between DMD and LDC

2014-08-03 Thread Philippe Sigaud via Digitalmars-d-learn
 This is correct – the LLVM optimizer indeed gets rid of the loop completely.

OK, that's clever. But I get this even when I put a writeln(some msg)
inside the task. I thought a write couldn't be optimized away that way
and that it's a slow operation?

Anyway, I discovered Thread.sleep() in core.thread in the meantime, I'll
use that. I just wanted to have tasks taking a different amount of time
each time.

I have another question: it seems I can spawn hundreds of threads
(Heck, even 10_000 is accepted), even when I have 4-8 cores. Is there
a limit to the number of threads? I tried a threadpool
because in my application I feared having to spawn ~100-200 threads,
but if that's not the case, I can drastically simplify my code.
Is spawning a thread a slow operation in general?