Update: The results I reported yesterday were obtained with the Nim code simply compiled with -d:release, whereas the C reference code had quite some optimization applied.
Today I cleaned up some minor loose ends and did some polishing (for both C and Nim), and set Nim to compile with --opt:speed plus some checks disabled (those checks are unnecessary in this case, and disabling them is fair because C has none of them at all). And, I hope you are sitting down, bang: the algorithm implemented in Nim is on average 2% to 3% **faster than the C version!** And no, that's not due to an error. I cross-checked over 100K test vectors, and the Nim implementation is correct. Kudos to @Araq and the Nim team!