And for the new year, v0.3.0: [https://github.com/mratsim/weave/releases/tag/v0.3.0](https://github.com/mratsim/weave/releases/tag/v0.3.0)
Next developments will probably take a while, as the "low-hanging fruit" (i.e. from my PoC in July/August) is done. If someone wants to add something like Graphviz output to [Synthesis](https://github.com/mratsim/Synthesis), that would be helpful for displaying Weave's internals and control flow visually.

Changelog:

* Weave can now compile with the Microsoft Visual C++ compiler, in C++ mode (MSVC lacks the required atomics support in C mode).
* For-loops are now awaitable. Note that only the spawning thread is blocked; the others continue with other tasks. "Blocked" means the thread stays still at that point in the code, but it keeps completing tasks pending in its queue while waiting for the blocking loop to be resolved (by itself or by another worker).
* Research flags for stealing early and for thief targeting have been added.
* Weave now uses Synthesis state machines in several places. I am still unsure about the readability benefits (maybe I'm too familiar with the codebase by now), but if visual graphs/descriptions were added to Synthesis, that would definitely tip the scale.
* The memory pool now has exactly the same API as malloc/free (previously, freeing required specifying the caller's thread ID). The scheme used to retrieve a unique thread identifier without expensive syscalls is probably worthwhile in Nim generally: [https://github.com/mratsim/weave/blob/v0.3.0/weave/memory/thread_id.nim#L8-L66](https://github.com/mratsim/weave/blob/v0.3.0/weave/memory/thread_id.nim#L8-L66)
* The internal memory pool and lookaside list have been annotated for use with LLVM AddressSanitizer, a memory-access debugging tool. There are warnings; I haven't checked yet whether they are spurious. Some are due to Nim internals.
* Significant performance bugs and improvements were identified in data parallelism (for example, parallel loops not being split in some cases). Weave is now competitive with OpenMP on coarse-grained loops (with a large amount of work per iteration), and hopefully it doesn't suffer from OpenMP's issues on loops that are too small and shouldn't be parallelized at all, since Weave parallelizes lazily, on an as-needed basis (to be benchmarked, see [https://github.com/zy97140/omp-benchmark-for-pytorch](https://github.com/zy97140/omp-benchmark-for-pytorch)).
* The highlight is that on my 18-core machine, a pure Nim matrix multiplication, with no assembly and no OpenMP, is now much faster than OpenBLAS and competitive with Intel MKL and Intel MKL-DNN, which are 90% assembly, processor-tuned, and the result of decades of dedicated development. If we look just at parallelization efficiency (time on 1 core / time on 18 cores):
  * Weave with Backoff achieves the same speedup as Intel MKL + Intel OpenMP, at 15.0~15.5x.
  * Weave is much better than Intel MKL + GCC OpenMP, which reaches a 14x speedup.
  * Weave without Backoff achieves a 16.9x speedup.

One thing of note: measuring performance on a busy system is approximate at best; you need many runs to get even a ballpark figure. Furthermore, in a multithreading runtime, workers often "yield" or "sleep" when they fail to steal work. In that case, the OS may give the timeslice to other processes rather than to other threads in the runtime. If a process like nimsuggest hogs a core at 100%, it will soak up a lot of those yielded and slept timeslices, even though your other 17 threads would have made useful progress. The result is that while nimsuggest (or any other application) is stuck at 100%, Weave gets worse-than-sequential performance, and I don't think I can do anything about it.