On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
I'm not experienced with this kind of programming, so I'm doubting these results. Have I done something wrong? Am I overlooking something?

You've just discovered that one can rarely be careful enough about what exactly is being benchmarked, and about gathering enough statistics.

For example, check out the following output from running your program on macOS 10.12, compiled with LDC 1.8.0:

---
$ ./test
memcpyD: 2 ms, 570 μs, and 9 hnsecs
memcpyDstdAlg: 77 μs and 2 hnsecs
memcpyC: 74 μs and 1 hnsec
memcpyNaive: 76 μs and 4 hnsecs
memcpyASM: 145 μs and 5 hnsecs
$ ./test
memcpyD: 3 ms and 376 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 104 μs and 4 hnsecs
memcpyNaive: 72 μs and 2 hnsecs
memcpyASM: 181 μs and 8 hnsecs
$ ./test
memcpyD: 2 ms and 565 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 73 μs and 2 hnsecs
memcpyNaive: 71 μs and 9 hnsecs
memcpyASM: 145 μs and 3 hnsecs
$ ./test
memcpyD: 2 ms, 813 μs, and 8 hnsecs
memcpyDstdAlg: 81 μs and 2 hnsecs
memcpyC: 99 μs and 2 hnsecs
memcpyNaive: 74 μs and 2 hnsecs
memcpyASM: 149 μs and 1 hnsec
$ ./test
memcpyD: 2 ms, 593 μs, and 7 hnsecs
memcpyDstdAlg: 77 μs and 3 hnsecs
memcpyC: 75 μs
memcpyNaive: 77 μs and 2 hnsecs
memcpyASM: 145 μs and 5 hnsecs
---

Because of the large amount of noise, the only conclusion one can draw from this is that memcpyD is the slowest, followed by the ASM implementation.
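
To get numbers that are less sensitive to that noise, one usually repeats each measurement many times and keeps the best result, and also makes sure the destination buffer is actually used afterwards so the optimizer cannot drop the copy. Below is a minimal sketch of such a harness (not the code from the original benchmark; the function, buffer size, and run count are just placeholders):

---
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

enum size = 1024 * 1024;

// Stand-in for one of the benchmarked copy variants.
void memcpyNaive(ubyte[] dst, const(ubyte)[] src)
{
    foreach (i; 0 .. src.length)
        dst[i] = src[i];
}

void main()
{
    auto src = new ubyte[](size);
    auto dst = new ubyte[](size);
    src[] = 0x2a;  // fill the source so the copy has observable content

    enum runs = 1_000;
    Duration best = Duration.max;
    foreach (_; 0 .. runs)
    {
        auto sw = StopWatch(AutoStart.yes);
        memcpyNaive(dst, src);
        sw.stop();
        if (sw.peek() < best)
            best = sw.peek();
    }

    // Print a value from dst so the optimizer cannot remove the copies entirely.
    writefln("memcpyNaive, best of %s runs: %s (dst[$ - 1] = %s)",
             runs, best, dst[$ - 1]);
}
---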

In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.
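
If anyone wants to see that for themselves, a small reproduction along these lines should do; the function name, file name, and flags are just for illustration, and the exact IR of course depends on the LDC version:

---
// Compiled with bounds checks off, LLVM's loop idiom recognition typically
// turns the byte-wise loop into a single memcpy call:
//
//   ldc2 -O3 -release -boundscheck=off -output-ll naive.d
//
// naive.ll should then contain a call to llvm.memcpy instead of a loop.
void copyNaive(ubyte[] dst, const(ubyte)[] src)
{
    foreach (i; 0 .. src.length)
        dst[i] = src[i];  // recognized and lowered to memcpy by LLVM
}
---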

 — David