Usually my Nim code is as fast as pure Assembly, C, C++ or Fortran on the paths I optimized.

Example:

  * benchmarks of matrix multiplication: 
[https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/gemm/gemm_bench_float32.nim#L418-L465](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/benchmarks/gemm/gemm_bench_float32.nim#L418-L465); sometimes Laser is faster, sometimes slower.
  * OpenBLAS is pure Assembly: 
[https://github.com/xianyi/OpenBLAS](https://github.com/xianyi/OpenBLAS)
  * Intel MKL-DNN is C++ and JIT Assembly: 
[https://github.com/intel/mkl-dnn](https://github.com/intel/mkl-dnn)
  * Nim implementation: 
[https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm.nim](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm.nim)

Note that the temporary buffers are managed by the GC (a ref object): 
[https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm_tiling.nim#L251](https://github.com/numforge/laser/blob/2f619fdbb2496aa7a5e5538035a8d42d88db8c10/laser/primitives/matrix_multiplication/gemm_tiling.nim#L251)




In short, if you don't use ref/seq/string (GC-ed types) in a critical path, 
the GC won't trouble you.
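To make that concrete, here is a minimal hypothetical sketch (not taken from Laser) of a hot loop that touches no GC-managed types: the buffer is allocated manually once, and the inner loop only sees a raw pointer and value types, so the GC never has any reason to run inside it.

```nim
# Hypothetical sketch: a GC-free hot path.
# alloc0 returns manually managed, zero-initialized memory;
# ptr UncheckedArray gives C-style indexing with no bounds metadata.

proc sumBuffer(data: ptr UncheckedArray[float32]; len: int): float32 =
  ## Inner loop: only pointers and value types, no ref/seq/string.
  for i in 0 ..< len:
    result += data[i]

let n = 1000
let buf = cast[ptr UncheckedArray[float32]](alloc0(n * sizeof(float32)))
for i in 0 ..< n:
  buf[i] = 1.0'f32
echo sumBuffer(buf, n)
dealloc(buf)   # manual memory is freed manually
```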

Another example, a multithreading runtime:

  * benchmarks: 
[https://github.com/mratsim/weave/tree/2bb0284e5555ec8882412886fb15c948b2826b81/benchmarks/fibonacci](https://github.com/mratsim/weave/tree/2bb0284e5555ec8882412886fb15c948b2826b81/benchmarks/fibonacci)
  * OpenMP is pure C from GCC or Clang
  * Intel TBB is pure C++
  * Weave 
([https://github.com/mratsim/weave/blob/2bb0284e5555ec8882412886fb15c948b2826b81/e04_channel_based_work_stealing/tests/fib.nim](https://github.com/mratsim/weave/blob/2bb0284e5555ec8882412886fb15c948b2826b81/e04_channel_based_work_stealing/tests/fib.nim)) is a pure Nim reimplementation of C code from a paper



Weave is as fast as the original C code and faster than both GCC/Clang OpenMP 
and Intel Threading Building Blocks.
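Weave has its own API, but the fork/join pattern that the fibonacci benchmark exercises can be sketched with Nim's standard `std/threadpool` (a hypothetical illustration, not Weave code; compile with `--threads:on` on older Nim versions):

```nim
# Hypothetical sketch of the fork/join fibonacci pattern,
# using std/threadpool rather than Weave's own scheduler.
import std/threadpool

proc fibSerial(n: int): int =
  if n < 2: n else: fibSerial(n - 1) + fibSerial(n - 2)

proc fib(n: int): int =
  # Fork one branch onto the pool, compute the other on this thread,
  # then join on the FlowVar with `^`.
  let left = spawn fibSerial(n - 1)
  let right = fibSerial(n - 2)
  result = ^left + right

echo fib(20)   # 6765
```

A real task-parallel runtime like Weave spawns tasks recursively and steals work between threads; this sketch forks only at the top level to keep it safe on a fixed-size pool.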

Last example, raytracing: see 
[https://forum.nim-lang.org/t/5124#32243](https://forum.nim-lang.org/t/5124#32243) for how to get Nim as fast as the original C code.

I.e. everything you learned about C++ performance applies to Nim (just 
replace std::vector with seq). If you write Nim code like C code, you get C 
speed. 
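As a sketch of what "Nim like C" means in practice (a hypothetical kernel, not code from the linked thread): `seq` plays the role of `std::vector`, and `openArray` parameters accept seqs and fixed-size arrays alike, compiling down to the same pointer-plus-length loop a C compiler would emit.

```nim
# Hypothetical sketch: a C-style kernel written in idiomatic Nim.
# y <- a*x + y, the classic BLAS level-1 saxpy.
proc saxpy(a: float32; x: openArray[float32]; y: var openArray[float32]) =
  for i in 0 ..< min(x.len, y.len):
    y[i] += a * x[i]

var y = newSeq[float32](4)          # zero-initialized, like calloc
let x = @[1.0'f32, 2, 3, 4]
saxpy(2.0'f32, x, y)
echo y                              # scaled copy of x
```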
