If you have instructions to reproduce the benchmark in Nim and C++ I can help.
I just need a repo to clone, the scripts to run, and the dataset. I already have Vapoursynth working.

Ideally you have a profiler like Intel VTune or Apple Instruments to dive into the assembly. For example, this is my approach to debugging a performance issue: [https://github.com/nim-lang/Nim/issues/9514](https://github.com/nim-lang/Nim/issues/9514)

For memory bottlenecks it's a bit different: I use the roofline model, as mentioned in my convolution optimization resources: [https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity](https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity)

For example, I know that matrix multiplication and convolution can reach 90% of the peak CPU GFlops because their arithmetic intensity is high (i.e. you do over 10 operations (add/mul) per byte), so if you don't reach that perf it's because you are moving bytes too much instead of computing with them.

The theoretical peak of your CPU is easy to compute:

* single threaded:

  > `CpuGhz * VectorWidth * InstrCycle * FlopInstr`
  >
  > For a CPU that supports AVX2 on float32 (so packing 8 float32) that can issue 2 Fused-Multiply-Add per cycle at 3GHz we have `3 (GHz) * 8 (packed float32 in AVX) * 2 (FMA per cycle) * 2 (FMA = 1 add + 1 mul) = 96 GFlops`

* multithreaded: Just multiply the single-threaded result by the number of cores. For example, 10 cores would be 960 GFlops or 0.96 TFlops.

And then the usual way to benchmark a numerical algorithm is: you know the number of operations required by your algorithm, you divide that by the time spent doing them, and you have your actual Flops. Then you compare your actual Flops with the theoretical peak. If you only reach 20% of the peak, you have a memory bottleneck and probably need to repack before processing to optimize cache usage; if not, you need to look into SIMD vectorization, prefetching, ...
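To make the arithmetic concrete, here is a rough sketch in Python of the peak calculation and the "actual vs. peak" comparison. The function names are mine and purely illustrative. The `2 * M * N * K` operation count is the standard one for a dense matrix multiplication, and the 0.05 s timing is a made-up placeholder, not a measurement.

```python
def peak_gflops(cpu_ghz, vector_width, instr_per_cycle, flop_per_instr, cores=1):
    """Theoretical peak: CpuGhz * VectorWidth * InstrCycle * FlopInstr (* cores)."""
    return cpu_ghz * vector_width * instr_per_cycle * flop_per_instr * cores


def matmul_flops(m, n, k):
    """Operation count of an (m x k) by (k x n) matmul: one mul + one add per term."""
    return 2 * m * n * k


def achieved_gflops(flop_count, seconds):
    """Actual throughput: operations required divided by time spent."""
    return flop_count / seconds / 1e9


if __name__ == "__main__":
    # AVX2 float32 (8 lanes), 2 FMA issued per cycle, FMA = 2 flops, 3 GHz.
    single = peak_gflops(3.0, 8, 2, 2)
    multi = peak_gflops(3.0, 8, 2, 2, cores=10)
    print(f"single-threaded peak: {single} GFlops")
    print(f"10-core peak:         {multi} GFlops")

    # Placeholder timing: replace 0.05 with your own measured wall-clock time.
    elapsed_seconds = 0.05
    actual = achieved_gflops(matmul_flops(1000, 1000, 1000), elapsed_seconds)
    print(f"fraction of single-threaded peak: {actual / single:.0%}")
```

Swap in the clock speed, vector width and issue rate of your own CPU, and the operation count of your own algorithm, to see how far the measured time is from the peak.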
All of that is quite complex, so what I can do is reach the naive C++ implementation's performance. Going beyond that is something I want to do, but it's time-consuming and I feel it would be better to spend my time on an image processing compiler similar to what's discussed here: [https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890](https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890) and with a proof of concept here:

* [https://github.com/numforge/laser/tree/master/laser/lux_compiler](https://github.com/numforge/laser/tree/master/laser/lux_compiler)
* [https://github.com/numforge/laser/tree/master/laser/lux_compiler/core](https://github.com/numforge/laser/tree/master/laser/lux_compiler/core)