Denys Vlasenko wrote:
> I tend to doubt that odd-byte aligned large memcpys are anywhere
> near typical. malloc and mmap both return well-aligned buffers
> (say, 8 byte aligned). Static and on-stack objects are also
> at least word-aligned 99% of the time.
>
> memcpy can just use "relatively simple" code for copies in which
> either src or dst is not word aligned. This cuts possibilities down
> from 16 to 4 (or even 2?).
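A minimal C sketch of the dispatch idea above. It assumes that the vector copy loop is specialised on the relative misalignment `(src - dst) mod 16`, which gives 16 possible cases. It also assumes that any copy where either pointer is not 4-byte aligned goes to a simple fallback. The name `copy_class` is hypothetical and not taken from any real libc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classifier for an SSE-style memcpy.
 * Returns -1 if either pointer is not 4-byte (word) aligned, meaning the
 * "relatively simple" fallback path should be used. Otherwise returns the
 * relative misalignment (src - dst) mod 16. */
static int copy_class(const void *dst, const void *src)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if ((d | s) & 3u)
        return -1;                      /* odd alignment: simple path */

    int rel = (int)((s - d) & 15u);     /* unsigned wrap keeps this in 0..15 */
    assert(rel % 4 == 0);               /* word-aligned: only 0, 4, 8, 12 remain */
    return rel;
}
```

With both pointers word aligned, only 4 of the 16 relative offsets can occur. If both are 8-byte aligned, as malloc results typically are, only 0 and 8 remain, which is the "or even 2" case.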
The XMM code is still more than three times faster than rep movsl when the data are 4- or 8-byte aligned but not 16-byte aligned. Even if odd addresses are rare, they must be supported, but we can put the most common cases first.

strcpy and strcat can be implemented efficiently simply by calling strlen and memcpy, since both strlen and memcpy can be optimized very well. This can give unaligned addresses: strcat, for example, copies to dst + strlen(dst), which may fall on any byte boundary.
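A minimal sketch of the strlen + memcpy approach, assuming ordinary C semantics (non-overlapping buffers, destination large enough). The function names are hypothetical, chosen so they don't clash with the real libc functions.

```c
#include <assert.h>
#include <string.h>

/* strcpy built from strlen + memcpy: copying len + 1 bytes
 * also copies the terminating NUL. */
static char *strcpy_via_memcpy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);
    memcpy(dst, src, len + 1);   /* dst and src may have any alignment */
    return dst;
}

/* strcat appends at dst + strlen(dst). That address depends on the
 * current string length, so the memcpy destination is usually not
 * word aligned even when dst itself is. */
static char *strcat_via_memcpy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    strcpy_via_memcpy(dst + strlen(dst), src);
    return dst;
}
```

The speed of both functions then rests entirely on strlen and memcpy. That is why memcpy has to handle unaligned destinations reasonably well even if they are rare in direct calls.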

Dennis Clarke wrote:
> You forgot to look at PowerPC:
>
> http://cvs.opensolaris.org/source/xref/ppc-dev/ppc-dev/usr/src/lib/libc/ppc/gen/memcpy.s
>
> Is that nice and small?
... and slow. Why doesn't it use AltiVec?
