And for the new year, 0.3.0: 
[https://github.com/mratsim/weave/releases/tag/v0.3.0](https://github.com/mratsim/weave/releases/tag/v0.3.0)

Next developments will probably take a while, as the low-hanging fruit (i.e. everything from my July/August proof of concept) is done. If someone wants to add something like Graphviz output to [Synthesis](https://github.com/mratsim/Synthesis), that would be helpful for displaying Weave's internals and control flow visually.

Changelog

  * Weave can now be compiled with the Microsoft Visual C++ compiler, in C++ mode (some atomics are unsupported with VCC in C mode).
  * For-loops are now awaitable. Note that only the spawning thread is blocked; the others continue on other tasks. Being blocked means that the thread "stays still" in the code, but it still completes tasks pending in its queue while waiting for the awaited loop to be resolved (by itself or another worker). A minimal sketch follows after this list.
  * Research flags for stealing early and for thief targeting have been added.
  * Weave now uses Synthesis state machines in several places. I am still unsure about the readability benefits (maybe I'm too familiar with the codebase by now), but if visual graphs/descriptions were added to Synthesis, that would definitely tip the scale.
  * The memory pool now has the exact same API as malloc/free (previously, freeing required specifying the caller's thread ID). The scheme to retrieve a unique thread identifier without expensive syscalls is probably worthwhile in Nim (a sketch of the general trick follows after this list): 
[https://github.com/mratsim/weave/blob/v0.3.0/weave/memory/thread_id.nim#L8-L66](https://github.com/mratsim/weave/blob/v0.3.0/weave/memory/thread_id.nim#L8-L66)
  * The internal memory pool and lookaside list have been annotated for use with LLVM AddressSanitizer, a memory-access debugging tool (a generic compile command follows after this list). There are warnings; I haven't checked yet whether they are spurious. Some are due to Nim internals.
  * Significant performance bugs were identified and fixed in data parallelism (for example, parallel loops not being split in some cases). Weave is now competitive with OpenMP for coarse-grained loops (a large amount of work) and hopefully doesn't suffer from OpenMP's issues on loops that are too small and shouldn't be parallelized, since Weave splits loops lazily, on an as-needed basis (to be benchmarked; see 
[https://github.com/zy97140/omp-benchmark-for-pytorch](https://github.com/zy97140/omp-benchmark-for-pytorch)).
  * The highlight is that on my 18-core machine, a pure Nim matrix multiplication, with no assembly and no OpenMP, is now much faster than OpenBLAS and competitive with Intel MKL and Intel MKL-DNN, which are 90% assembly, processor-tuned, and the result of decades of dedicated development. Looking just at parallelization efficiency, i.e. the speedup (time on 1 core / time on 18 cores):
    * Weave with Backoff achieves the same speedup as Intel MKL + Intel OpenMP, at 15.0~15.5x
    * Weave is much better than Intel MKL + GCC OpenMP, which gets a 14x speedup
    * Weave without Backoff achieves a 16.9x speedup
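
Here is a minimal sketch of an awaitable loop, based on the v0.3.0 README syntax (the buffer and the per-iteration work are made up for illustration; the exact placement of the `captures`/`awaitable` sections and the return value of `sync` on a loop handle may differ slightly by version):

```nim
import weave

proc main() =
  init(Weave)

  var data = newSeq[int](100)
  # Pass a raw pointer into the loop body; captures take ptr/value types.
  let buf = cast[ptr UncheckedArray[int]](data[0].addr)

  parallelFor i in 0 ..< 100:
    captures: {buf}
    awaitable: iLoop       # name the loop so it can be awaited
    buf[i] = i * i

  # Only this (spawning) thread blocks here; while "blocked" it keeps
  # completing tasks pending in its own queue until the loop is resolved.
  discard sync(iLoop)

  echo data[99]            # safe to read after the sync: 9801
  exit(Weave)

main()
```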

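On the thread-ID point, the linked file is the actual implementation; the general idea can be approximated portably by taking the address of a thread-local variable, since each thread gets its own copy. A rough sketch of the trick (not Weave's exact code):

```nim
# Each thread has its own copy of a threadvar, so the address of that
# copy is a unique identifier for the live thread, with no syscall.
# Note: an ID can be reused after its thread exits.
var tlsMarker {.threadvar.}: byte

proc cheapThreadID(): int {.inline.} =
  cast[int](tlsMarker.addr)

when isMainModule:
  echo "thread id: ", cheapThreadID()
```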

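To try AddressSanitizer on a Nim program yourself, the generic recipe with Clang is to forward the sanitizer flags to the C compiler and linker (standard Nim/Clang flags, not a Weave-specific setup):

```sh
# Compile and run a Nim program under AddressSanitizer via Clang.
nim c --cc:clang \
  --passC:"-fsanitize=address -fno-omit-frame-pointer" \
  --passL:"-fsanitize=address" \
  -r my_program.nim
```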

One thing of note: measuring performance on a busy system is approximate at 
best; you need a lot of runs to get even a ballpark figure. Furthermore, in a 
multithreading runtime, workers often "yield" or "sleep" when they fail to 
steal work, and in that case the OS might give the timeslice to other 
processes (and not to other threads in the runtime). If a process like 
nimsuggest hogs a core at 100%, it will receive a lot of those yielded and 
slept timeslices, even though your other 17 threads would have made useful 
progress. The result is that while nimsuggest (or any other application) is 
stuck at 100%, Weave gets worse-than-sequential performance, and I don't think 
I can do anything about it.
