@araq:

> > My preferred option is to replace the whole file with something that works
> 
> Yes.

I haven't dropped this project; I have spent extra time stress testing and 
considering how the `std/channels` library will be used in its eventual 
applications, especially in a new `std/threadpools` library.

Current status is as follows:

  1. The **serious data race problems have been eliminated**, as far as I can 
tell from extensive stress testing and code review. To do this, I moved the 
whole algorithm away from the original file, which was based on the lazy-list 
implementation from the paper (that implementation worked almost identically 
to the `MVar`-based `Chan`s used by Haskell). The original file had been 
changed to use buffers, which made the double-wait-condition algorithm 
unworkable and was the cause of the data race.
  2. I **kept the ability to move the channel freely between threads** as a 
reference-based (actually pointer-based) object that is tracked by atomic 
reference counting.
  3. Since it is now a reference-based object, **it needs to be "new upped" 
with the required buffer size** (the default is an infinite buffer; a buffer 
size of one gives an unbuffered channel) and can't be initialized as a value 
struct the way the original built-in version was (see the usage sketch after 
this list).
  4. I **dropped the `open`/`close` procs**, which originally initialized the 
value-based object from the API. They can no longer do anything, and if we did 
use them to set a meaningful flag that stops sending/receiving, it would be 
dangerous while the channel could be referenced by multiple threads, as the 
API has no means of indicating that a send/receive failed outright (even the 
"try..." procs only indicate whether there is room/data).
  5. As per your instruction, I **eliminated the whole idea of the global 
memory pool from which the channel objects and channel buffers were formerly 
allocated, along with the API items required to control it**. This was a 
premature optimization that would have had no effect on the actual use of the 
library. It was likely effective in the paper the former code referenced, 
where allocating/deallocating a chain of lazy-list elements would take 
significant time, but my implementation uses either a fixed buffer or 
allocates a series of significantly sized buffers (for the "infinite" buffer), 
which takes insignificant time compared to normal operation time. More on 
this at the end.
  6. This **project is complete, and I will post the PR** once I finish the 
documentation.



> > However, the need for allocation from a sized memory pool is frequent 
> > enough that I think that there should exist a `sys/mempool` library
> 
> I think it's too early for that. We can always add it later and the initial 
> channel implementation can start without such an optimization too.

**Your intuition on this was absolutely correct: the only thing the memory 
pool optimization did was increase the speed of "new ups" of channels and 
channel buffers, and that is already not the bottleneck**, with a speed on the 
order of ten million per second for buffered channels. For inter-thread 
communication, the speed of sending and receiving a single element 
(non-buffered channel), which is the same cost as a new `FlowVar[T]`, is on 
the order of 150,000 per second. That is the bottleneck on the channels' use, 
and a memory pool would not help with it, as it is caused purely by the 
overhead of more frequent use of condition variables.

**So, moving on to considerations about the design of the new 
`std/threadpools` library**, as follows:

  1. Most **other languages have thread pools built into the language 
runtime**, either enabled by a flag such as our `--threads:on` or Haskell's 
`-threaded`, or always available, and then accessed as required by importing a 
library or two rather than having those libraries automatically imported as we 
currently do. Other languages then allow the thread pool to run in global 
space and just be accessed as required. **It seems to me that we might 
consider not automatically importing any thread or channel libraries for 
version 2.0** and just importing them as required. **Having the thread pool 
implementation in the runtime when `--threads:on` is set would then make it 
available globally**, and it could be resized as necessary (or an infinite 
channel could be used as the pool of awaiting threads).
  2. **An alternative is to have the thread pool library create a local thread 
pool**, which could even be passed around as necessary; as many thread pools 
as desired could be created for different purposes.
  3. Another consideration is that currently **we do not allow exceptions that 
occurred in a thread to be caught and made available to the `spawn`ing thread, 
nor do we build in a means of cancelling threads** in the thread pool 
(co-operatively, of course, as the running thread needs to check for 
cancellation periodically).
  4. Some contributors seem to like **the idea of `spawn`ed threads returning 
`async` `Future[T]`s that include exception status**, but I don't really agree 
with this, as using `async` will mean they should only be used with 
`--gc:orc`, because the `async` library has cyclic references (at least in 
order to avoid memory leaks). I propose that our new concept of `FlowVar[T]` 
can also pass back exception information; then it's easy enough for anyone who 
wants a `Future[T]` to just wrap the `FlowVar[T]` (with exceptions), as in the 
sketch after this list - we could even provide yet another library to make 
this easy for those who frequently require it.



One thing that should be made clear in the documentation is the proper use of 
`spawn` and/or `parallel`: the Internet (StackOverflow, etc.) is full of 
newbie questions asking why, when they `spawn` a series of "two-plus-two" 
calculations as a million tasks on the thread pool (this also applies to 
Golang goroutines, even though Golang is about twenty times faster at spawning 
than even my new implementation), the resulting performance is no faster, or 
even slower, than not multi-threading. We really should explain that one needs 
to assign enough work per `spawn` to be worth the overhead, such as ten 
thousand "two-plus-two" calculations per `spawn` (as in the sketch below), or 
else look into work stealing/sharing schemes such as @mratsim's Weave/Picasso 
project.
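
For instance, a minimal batching sketch using the existing `std/threadpool` 
(the chunk size of ten thousand is just the illustrative figure from above):

```nim
import std/threadpool  # compile with --threads:on

proc sumChunk(n: int): int =
  # Ten thousand trivial calculations per task, so the real work
  # dominates the per-spawn/FlowVar overhead.
  for _ in 1 .. n:
    result += 2 + 2

proc main() =
  var results: seq[FlowVar[int]]
  for _ in 1 .. 100:                        # 100 tasks of 10_000 ops each,
    results.add(spawn sumChunk(10_000))     # not 1_000_000 tasks of one op
  var total = 0
  for fv in results:
    total += ^fv                            # blocking read of each FlowVar
  echo total

main()
```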

**I have written a Proof of Concept showing that the new thread pool 
implementation works with my new `std/channels` implementation at about the 
same speed as the old built-in libraries, or maybe a little faster, but I 
would appreciate your consideration** of the above items so that my work fits 
your "vision"...
