BTW, the way memcpy is (was?) implemented in the C runtime shipped with the Intel C++ compiler was really enlightening about the sheer difficulty of such a task.

First of all, there isn't one loop but many, selected by the source and destination alignment.

- If both source and destination are aligned on 16-byte boundaries, both would be accessed with MOVAPS/MOVDQA, nothing special.
- If only the source or the destination was misaligned, the function would dispatch to a variant whose core loop loads 16 bytes aligned and writes 16 bytes unaligned, using the PALIGNR instruction. However, since PALIGNR takes its shift count only as an immediate, not as a runtime value, this variant was _replicated 16 times_.
- I don't remember what happened when both source and destination were misaligned, but you can reduce this case to the one above.
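To make the "replicated 16 times" point concrete, here is a rough sketch of the idea using intrinsics. It is my own reconstruction, not Intel's code: `copy_blocks`, `copy_shift_N` and the macros are made-up names, and it assumes GCC or Clang on x86 (for the `target` attribute), a length that is a whole number of 16-byte blocks, and a source buffer padded so that the aligned 16-byte blocks around it can be read safely. `_mm_alignr_epi8` is the intrinsic for PALIGNR, and its last argument has to be a compile-time constant, hence one loop per shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h> /* SSSE3: _mm_alignr_epi8 compiles to PALIGNR */

/* GCC/Clang extension: enable SSSE3 per function, so no -mssse3 is needed. */
#define SSSE3_FN __attribute__((target("ssse3")))

/* Generates one copy loop per shift. SHIFT must be a literal because
   PALIGNR encodes its byte count as an immediate operand. */
#define DEFINE_COPY_LOOP(SHIFT)                                           \
    SSSE3_FN static void copy_shift_##SHIFT(uint8_t *dst,                 \
                                            const uint8_t *src_al,        \
                                            size_t blocks)                \
    {                                                                     \
        assert(((uintptr_t)src_al & 15) == 0);                            \
        __m128i lo = _mm_load_si128((const __m128i *)src_al);             \
        for (size_t i = 0; i < blocks; i++) {                             \
            const uint8_t *next = src_al + 16 * (i + 1);                  \
            __m128i hi = _mm_load_si128((const __m128i *)next);           \
            /* Take bytes SHIFT..SHIFT+15 of the 32-byte pair hi:lo. */   \
            __m128i out = _mm_alignr_epi8(hi, lo, SHIFT);                 \
            _mm_storeu_si128((__m128i *)(dst + 16 * i), out);             \
            lo = hi;                                                      \
        }                                                                 \
    }

#define FOR_EACH_SHIFT(X)                   \
    X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) \
    X(9) X(10) X(11) X(12) X(13) X(14) X(15)

/* Fifteen shifted loops; with the aligned loop below that makes sixteen. */
FOR_EACH_SHIFT(DEFINE_COPY_LOOP)

#define DISPATCH_CASE(N) \
    case N: copy_shift_##N(dst, src_al, blocks); break;

/* Copies `blocks` 16-byte blocks from src (any alignment) to dst.
   Assumption: the aligned blocks enclosing [src, src + 16 * blocks)
   are readable, because the loops load whole aligned blocks. */
SSSE3_FN static void copy_blocks(uint8_t *dst, const uint8_t *src,
                                 size_t blocks)
{
    size_t shift = (uintptr_t)src & 15;  /* source misalignment, 0..15 */
    const uint8_t *src_al = src - shift; /* source rounded down to 16 */

    if (blocks == 0)
        return;

    switch (shift) {
    case 0: /* aligned source: plain aligned loads, no PALIGNR needed */
        for (size_t i = 0; i < blocks; i++) {
            __m128i v = _mm_load_si128((const __m128i *)(src + 16 * i));
            _mm_storeu_si128((__m128i *)(dst + 16 * i), v);
        }
        break;
    FOR_EACH_SHIFT(DISPATCH_CASE)
    }
}
```

Calling `copy_blocks(dst, src + 5, 2)` ends up in `copy_shift_5`. The real routine also has to deal with the head and tail bytes and with buffers that are not padded, which this sketch simply assumes away.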

Each of these loops had a complicated prelude that handles the first iteration, and those are very hard to write by hand.
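For a feel of what a prelude has to work out, here is a small sketch in plain C of the usual head/body/tail split around a 16-byte boundary. This is the textbook approach, not necessarily what Intel's code did, and `copy_split` and `split_for_alignment` are names I made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How an n-byte copy divides around 16-byte alignment of pointer p. */
typedef struct {
    size_t head; /* bytes handled before p reaches a 16-byte boundary */
    size_t body; /* bytes handled by the 16-byte main loop */
    size_t tail; /* leftover bytes after the last full block */
} copy_split;

static copy_split split_for_alignment(const void *p, size_t n)
{
    size_t mis = (uintptr_t)p & 15;     /* misalignment of p, 0..15 */
    size_t head = mis ? 16 - mis : 0;   /* distance to the next boundary */
    if (head > n)
        head = n;                       /* short copy: everything is head */

    size_t rest = n - head;
    copy_split s;
    s.head = head;
    s.body = rest & ~(size_t)15;        /* whole 16-byte blocks */
    s.tail = rest & 15;                 /* remainder */
    return s;
}
```

For a pointer 5 bytes past a 16-byte boundary and n = 40, this gives head = 11, body = 16, tail = 13. Doing the same bookkeeping in assembly, for every alignment variant, is where the pain comes from.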

It was also the only piece of assembly I've seen that (apparently) successfully used the "prefetch" instructions.

This was just the SSE version; AVX was different.

I don't know if someone really wrote this code by hand, or if it was all generated from intrinsics.
