Those are all fine points. Asm can sometimes make a bigger difference than people conditioned not to question compiler output expect (and there are many such people). This is especially true with vector units. A few years back, I wrote an AVX2 vectorized "minimum" function that ran 24x (twenty-four times) faster than a C one. That's a much bigger time ratio than most "typical use" comparisons of _many_ programming language pairs (though obviously most programs do more than compute minima). Auto-vectorization in gcc (at least) has since gotten better for that precise problem, though it can still miss many opportunities.
If you ever do need to write asm, "intrinsic functions" like `_mm256_add_ps` (there are hundreds of others) are often an easier entry point and an easier way to integrate with the rest of your program than raw asm/`.s` files. You can use them from Nim, too. :-) See, e.g., the README and code at [https://github.com/numforge/laser](https://github.com/numforge/laser) for how to use SIMD intrinsics.
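To give a flavor of the intrinsics route, here is a minimal sketch of a vectorized minimum in C. It is not the routine I mentioned above, just an illustration: the `int32_t` element type and the names `min_scalar`, `min_avx2`, and `min_i32` are my assumptions for the example, and it assumes x86 with GCC or Clang (for the `target` attribute and `__builtin_cpu_supports`).

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Plain scalar reference. Precondition: n >= 1. */
static int32_t min_scalar(const int32_t *a, size_t n) {
    int32_t m = a[0];
    for (size_t i = 1; i < n; i++)
        if (a[i] < m) m = a[i];
    return m;
}

/* AVX2 version: keeps 8 running minima, one per 32-bit lane.
   The target attribute lets this compile without -mavx2. */
__attribute__((target("avx2")))
static int32_t min_avx2(const int32_t *a, size_t n) {
    if (n < 8) return min_scalar(a, n);

    /* Seed the lanes with the first 8 elements (unaligned load). */
    __m256i vmin = _mm256_loadu_si256((const __m256i *)a);
    size_t i = 8;
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
        vmin = _mm256_min_epi32(vmin, v);   /* lane-wise signed min */
    }

    /* Horizontal reduction: spill the lanes and reduce in scalar code. */
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, vmin);
    int32_t m = lanes[0];
    for (int k = 1; k < 8; k++)
        if (lanes[k] < m) m = lanes[k];

    /* Leftover tail of fewer than 8 elements. */
    for (; i < n; i++)
        if (a[i] < m) m = a[i];
    return m;
}

/* Runtime dispatch: use AVX2 only if the CPU reports it. */
static int32_t min_i32(const int32_t *a, size_t n) {
    return __builtin_cpu_supports("avx2") ? min_avx2(a, n)
                                          : min_scalar(a, n);
}
```

Calling `min_i32(buf, len)` on any non-empty buffer should give the same answer as `min_scalar(buf, len)`. How much faster it runs (if at all) depends on the compiler, flags, and data size, so measure before drawing conclusions.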