Those are all fine points. Asm can sometimes make a bigger difference than 
people conditioned to not question compiler output expect (and there are many 
such people). This is especially true with vector units. A few years back, I wrote 
an AVX2 vectorized "minimum" function that ran 24x (twenty four times) faster 
than a C one. That's a much bigger time ratio than most "typical use" 
comparisons of _many_ programming language pairs (though obviously most 
programs do more things than compute minima). Auto-vectorization in gcc (at 
least) has since gotten better at that precise problem, though it can still 
miss many opportunities.

If you ever do need to write asm, "intrinsic functions" like 
`_mm256_add_ps` (there are hundreds of others) are often an easier entry 
point and an easier way to integrate with your program than raw asm/`.s` files. You can use 
them from Nim, too. :-) See, e.g., 
[https://github.com/numforge/laser](https://github.com/numforge/laser) 
README/code/etc. for how to use SIMD intrinsics.
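
To give a flavor of the intrinsics route, here is a minimal sketch in C of a vectorized minimum over `int32_t` values. It is not the function I benchmarked above, and I make no speed claim for it. The names `min_scalar`, `min_avx2` and `min_i32` are made up for illustration, and it assumes gcc or clang on x86-64 (for `__attribute__((target("avx2")))` and `__builtin_cpu_supports`).

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Plain scalar reference; also the fallback on CPUs without AVX2. */
static int32_t min_scalar(const int32_t *a, size_t n) {
    assert(n > 0);                 /* minimum of an empty array is undefined */
    int32_t m = a[0];
    for (size_t i = 1; i < n; i++)
        if (a[i] < m) m = a[i];
    return m;
}

/* Assumption: GCC/Clang. The target attribute enables AVX2 for this one
   function, so the file does not need to be built with -mavx2. */
__attribute__((target("avx2")))
static int32_t min_avx2(const int32_t *a, size_t n) {
    assert(n > 0);
    __m256i vmin = _mm256_set1_epi32(a[0]);   /* 8 lanes, all a[0] */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Unaligned load of 8 ints, then a lane-wise signed minimum. */
        __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
        vmin = _mm256_min_epi32(vmin, v);
    }
    /* Reduce the 8 lanes to one value, then handle the leftover tail. */
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, vmin);
    int32_t m = lanes[0];
    for (int k = 1; k < 8; k++)
        if (lanes[k] < m) m = lanes[k];
    for (; i < n; i++)
        if (a[i] < m) m = a[i];
    return m;
}

/* Pick the AVX2 path only when the running CPU supports it. */
int32_t min_i32(const int32_t *a, size_t n) {
    if (__builtin_cpu_supports("avx2"))
        return min_avx2(a, n);
    return min_scalar(a, n);
}
```

Calling `min_i32(data, len)` on a non-empty array gives the same answer as the scalar loop. Whether it is actually faster depends on your compiler version and flags, since the scalar loop may itself get auto-vectorized.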
