On code I've optimized, I'm usually as fast as pure Assembly, C, C++ or Fortran.
Example:
* benchmarks of matrix multiplication:
[https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/gemm/gemm_bench_float32.nim#L418-L465](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/gemm/gemm_bench_float32.nim#L418-L465),
sometimes Laser is faster, sometimes slower.
* OpenBLAS is pure Assembly:
[https://github.com/xianyi/OpenBLAS](https://github.com/xianyi/OpenBLAS)
* Intel MKL-DNN is C++ and JIT-ed Assembly:
[https://github.com/intel/mkl-dnn](https://github.com/intel/mkl-dnn)
* Nim implementation:
[https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm.nim](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm.nim)
Note that the temporary buffers are managed by the GC (a ref object):
[https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm_tiling.nim#L251](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm_tiling.nim#L251)
In short, if you don't use ref/seq/string (GC-ed types) in a critical path,
the GC won't trouble you.
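To illustrate the point (this is my own minimal sketch, not code from Laser): allocate the GC-ed seq once up front, then hand the hot kernel a raw pointer, so nothing inside the loop touches the GC:

```nim
# Hypothetical example: sum a buffer without touching the GC in the hot loop.
proc sumKernel(data: ptr UncheckedArray[float32]; len: int): float32 =
  # Plain pointer indexing: no ref/seq/string here, so no GC interaction.
  for i in 0 ..< len:
    result += data[i]

proc main() =
  # The seq (a GC-ed type) is allocated once, outside the critical path.
  var buf = newSeq[float32](1024)
  for i in 0 ..< buf.len:
    buf[i] = 1.0'f32
  echo sumKernel(cast[ptr UncheckedArray[float32]](addr buf[0]), buf.len)

main()
```

The GC only runs on allocation of GC-ed types, so a kernel that allocates nothing can't be interrupted by it.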
Another example, a multithreading runtime:
* benchmarks:
[https://github.com/mratsim/weave/tree/2bb0284e5555ec8882412886fb15c948b2826b81/benchmarks/fibonacci](https://github.com/mratsim/weave/tree/2bb0284e5555ec8882412886fb15c948b2826b81/benchmarks/fibonacci)
* OpenMP is pure C from GCC or Clang
* Intel TBB is pure C++
* Weave
([https://github.com/mratsim/weave/blob/2bb0284e5555ec8882412886fb15c948b2826b81/e04_channel_based_work_stealing/tests/fib.nim](https://github.com/mratsim/weave/blob/2bb0284e5555ec8882412886fb15c948b2826b81/e04_channel_based_work_stealing/tests/fib.nim))
is a pure Nim reimplementation of C code from a paper.

Weave is as fast as the original C code and faster than both GCC/Clang OpenMP
and Intel Threading Building Blocks.
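Weave has its own API; as a rough illustration of the same fork-join fib pattern in stock Nim, here is a sketch using std/threadpool (my example, not from the Weave benchmarks; compile with --threads:on, and note the serial cutoff to avoid flooding the pool):

```nim
import std/threadpool  # stock Nim threadpool, not Weave's runtime

proc fib(n: int): int =
  if n < 2: return n
  if n < 16:
    # Below the cutoff, recurse serially instead of spawning tiny tasks.
    return fib(n - 1) + fib(n - 2)
  # Fork one branch as a task; compute the other inline, then join.
  let x = spawn fib(n - 1)
  let y = fib(n - 2)
  result = ^x + y

echo fib(20)  # 6765
```

The cutoff mirrors what real runtimes do: parallelism pays off only when each task amortizes its scheduling overhead.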
Last example, raytracing: see
[https://forum.nim-lang.org/t/5124#32243](https://forum.nim-lang.org/t/5124#32243)
for how to get Nim as fast as the original C code.
In other words, everything you learned about C++ performance applies to Nim (just
replace std::vector with seq). If you write Nim code like C code, you get C
speed.
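For instance, a C-style dot product over seq reads almost exactly like its std::vector counterpart (my sketch; compile with -d:danger or --checks:off to drop runtime checks, matching C codegen):

```nim
# A C-style dot product: seq plays the role of std::vector.
proc dot(a, b: seq[float64]): float64 =
  assert a.len == b.len
  for i in 0 ..< a.len:
    result += a[i] * b[i]

var
  a = newSeq[float64](3)
  b = newSeq[float64](3)
for i in 0 ..< 3:
  a[i] = float64(i + 1)  # a = [1.0, 2.0, 3.0]
  b[i] = float64(i + 1)  # b = [1.0, 2.0, 3.0]
echo dot(a, b)           # 1*1 + 2*2 + 3*3 = 14.0
```

Like std::vector, a seq is a contiguous heap buffer with length and capacity, so the compiler can vectorize this loop the same way a C compiler would.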
