And for the new year, 0.3.0: 
[https://github.com/mratsim/weave/releases/tag/v0.3.0](https://github.com/mratsim/weave/releases/tag/v0.3.0)

Next developments will probably take a while, as the low-hanging fruit (i.e. everything from my July/August proof of concept) is done. If someone wants to add something like Graphviz output to [Synthesis](https://github.com/mratsim/Synthesis), that would be helpful for displaying Weave's internals and control flow visually.

Changelog

  * Weave can now be compiled with the Microsoft Visual C++ compiler, in C++ mode (some atomics are unsupported with VCC in C mode).
  * For-loops are now awaitable. Note that only the spawning thread is blocked; the others continue on other tasks. Being blocked means that the thread "stays still" in the code, but it still completes tasks pending in its queue while waiting for the awaited loop to be resolved (by itself or another worker). A minimal sketch follows after this list.
  * Research flags for stealing early and for thief targeting have been added.
  * Weave now uses Synthesis state machines in several places. I am still unsure about the readability benefits (maybe I'm too familiar with the codebase by now), but if visual graphs/descriptions were added to Synthesis, that would definitely tip the scale.
  * The memory pool now has the exact same API as malloc/free (previously, freeing required specifying the caller's thread ID). The scheme to retrieve a unique thread identifier without expensive syscalls is probably worthwhile in Nim (a sketch of the general trick follows after this list): 
[https://github.com/mratsim/weave/blob/v0.3.0/weave/memory/thread_id.nim#L8-L66](https://github.com/mratsim/weave/blob/v0.3.0/weave/memory/thread_id.nim#L8-L66)
  * The internal memory pool and lookaside list have been annotated for use with LLVM AddressSanitizer, a memory-access debugging tool (a generic compile command follows after this list). There are warnings; I haven't checked yet whether they are spurious. Some are due to Nim internals.
  * Significant performance bugs were identified and fixed in data parallelism (for example, parallel loops not being split in some cases). Weave is now competitive with OpenMP for coarse-grained loops (a large amount of work) and hopefully doesn't suffer from OpenMP's issues on loops that are too small and shouldn't be parallelized, since Weave splits loops lazily, on an as-needed basis (to be benchmarked; see 
[https://github.com/zy97140/omp-benchmark-for-pytorch](https://github.com/zy97140/omp-benchmark-for-pytorch)).
  * The highlight is that on my 18-core machine, a pure Nim matrix multiplication, with no assembly and no OpenMP, is now much faster than OpenBLAS and competitive with Intel MKL and Intel MKL-DNN, which are 90% assembly, processor-tuned, and the result of decades of dedicated development. Looking just at parallelization efficiency, i.e. the speedup (time on 1 core / time on 18 cores):
    * Weave with Backoff achieves the same speedup as Intel MKL + Intel OpenMP, at 15.0~15.5x
    * Weave is much better than Intel MKL + GCC OpenMP, which gets a 14x speedup
    * Weave without Backoff achieves a 16.9x speedup
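
Here is a minimal sketch of an awaitable loop, based on the v0.3.0 README syntax (the buffer and the per-iteration work are made up for illustration; the exact placement of the `captures`/`awaitable` sections and the return value of `sync` on a loop handle may differ slightly by version):

```nim
import weave

proc main() =
  init(Weave)

  var data = newSeq[int](100)
  # Pass a raw pointer into the loop body; captures take ptr/value types.
  let buf = cast[ptr UncheckedArray[int]](data[0].addr)

  parallelFor i in 0 ..< 100:
    captures: {buf}
    awaitable: iLoop       # name the loop so it can be awaited
    buf[i] = i * i

  # Only this (spawning) thread blocks here; while "blocked" it keeps
  # completing tasks pending in its own queue until the loop is resolved.
  discard sync(iLoop)

  echo data[99]            # safe to read after the sync: 9801
  exit(Weave)

main()
```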

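On the thread-ID point, the linked file is the actual implementation; the general idea can be approximated portably by taking the address of a thread-local variable, since each thread gets its own copy. A rough sketch of the trick (not Weave's exact code):

```nim
# Each thread has its own copy of a threadvar, so the address of that
# copy is a unique identifier for the live thread, with no syscall.
# Note: an ID can be reused after its thread exits.
var tlsMarker {.threadvar.}: byte

proc cheapThreadID(): int {.inline.} =
  cast[int](tlsMarker.addr)

when isMainModule:
  echo "thread id: ", cheapThreadID()
```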

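To try AddressSanitizer on a Nim program yourself, the generic recipe with Clang is to forward the sanitizer flags to the C compiler and linker (standard Nim/Clang flags, not a Weave-specific setup):

```sh
# Compile and run a Nim program under AddressSanitizer via Clang.
nim c --cc:clang \
  --passC:"-fsanitize=address -fno-omit-frame-pointer" \
  --passL:"-fsanitize=address" \
  -r my_program.nim
```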

One thing of note: measuring performance on a busy system is approximate at 
best; you need a lot of runs to get even a ballpark figure. Furthermore, in a 
multithreading runtime, workers often "yield" or "sleep" when they fail to 
steal work, and in that case the OS might give the timeslice to other 
processes (and not to other threads in the runtime). If a process like 
nimsuggest hogs a core at 100%, it will receive a lot of those yielded and 
slept timeslices, even though your other 17 threads would have made useful 
progress. The result is that while nimsuggest (or any other application) is 
stuck at 100%, Weave gets worse-than-sequential performance, and I don't think 
I can do anything about it.
