On Sunday, 17 June 2018 at 17:00:00 UTC, David Nadlinger wrote:
> On Wednesday, 13 June 2018 at 06:46:43 UTC, Mike Franklin wrote:
>> https://github.com/JinShil/memcpyD
>> […]
>> Feedback, advice, and pull requests to improve the
>> implementation are most welcome.
>
> The memcpyD implementation is buggy; it assumes that all
> arguments are aligned to their size. This isn't necessarily
> true. For example, `ubyte[1024].alignof == 1`, and struct
> alignment can also be set explicitly using align(N).
Yes, I'm already aware of that. My plan is to create optimized
implementations for aligned data, and then handle unaligned data
as compositions of the various aligned implementations. For
example, a 3-byte copy would be a 2-byte (short) copy plus a
1-byte copy. That may not be appropriate for all cases; I'll
have to measure and adapt.
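
To illustrate the composition idea (a sketch only, not memcpyD's
actual code):

void copy3(void* dst, const(void)* src)
{
    // 2-byte (short) copy of the first two bytes
    *cast(ushort*)dst = *cast(const(ushort)*)src;
    // 1-byte copy of the remaining byte
    (cast(ubyte*)dst)[2] = (cast(const(ubyte)*)src)[2];
}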
> On x86, you can get away with this in a lot of cases even
> though it's undefined behaviour [1], but this is not
> necessarily the case for SSE/AVX instructions. In fact, that's
> probably a pretty good guess as to where those weird crashes
> you mentioned come from.
Thanks! I think you're right.
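
For anyone following along, here's a contrived sketch of the
failure mode, assuming DMD emits an aligned load (movdqa) for the
void16 dereference, as in Exhibit A below:

import core.simd;

void main()
{
    // ubyte arrays only guarantee 1-byte alignment, as noted above.
    ubyte[32] buf;
    // Misaligned relative to 16 for almost all stack layouts.
    auto p = cast(void16*)(buf.ptr + 1);
    // The dereference compiles to an aligned load (movdqa), which
    // can fault at runtime on a misaligned address.
    void16 v = *p;
}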
> For loading into vector registers, you can use
> core.simd.loadUnaligned instead (ldc.simd.loadUnaligned for LDC
> – LDC's druntime has not been updated yet after
> {load, store}Unaligned were added upstream as well).
Unfortunately, the code gen is quite a bit worse.

Exhibit A:
https://run.dlang.io/is/jIuHRG
*(cast(void16*)(&s2)) = *(cast(const void16*)(&s1));
_Dmain:
push RBP
mov RBP,RSP
sub RSP,020h
lea RAX,-020h[RBP]
xor ECX,ECX
mov [RAX],RCX
mov 8[RAX],RCX
lea RDX,-010h[RBP]
mov [RDX],RCX
mov 8[RDX],RCX
movdqa XMM0,-020h[RBP]
movdqa -010h[RBP],XMM0
xor EAX,EAX
leave
ret
add [RAX],AL
.text._Dmain ends
Exhibit B:
https://run.dlang.io/is/PLRfhW
storeUnaligned(cast(void16*)(&s2), loadUnaligned(cast(const void16*)(&s1)));
_Dmain:
push RBP
mov RBP,RSP
sub RSP,050h
lea RAX,-050h[RBP]
xor ECX,ECX
mov [RAX],RCX
mov 8[RAX],RCX
lea RDX,-040h[RBP]
mov [RDX],RCX
mov 8[RDX],RCX
mov -030h[RBP],RDX
mov -010h[RBP],RAX
movdqu XMM0,[RAX]
movdqa -020h[RBP],XMM0
movdqa XMM1,-020h[RBP]
movdqu [RDX],XMM1
xor EAX,EAX
leave
ret
add [RAX],AL
.text._Dmain ends
If the code gen were better, that would definitely be the way to
go: unaligned and aligned copies could share the same
implementation. Maybe I can fix the DMD code gen, or implement a
`copyUnaligned` intrinsic.
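
Something like this, though the name and signature are
hypothetical, and it's written in terms of the existing
{load, store}Unaligned primitives; the point of making it an
intrinsic would be for the backend to lower it to a single
movdqu load/store pair instead of the stack round-trip above:

import core.simd;

// Hypothetical intrinsic: copy one 16-byte vector between
// possibly unaligned addresses. A real intrinsic would be
// recognized by the compiler and lowered directly.
void copyUnaligned(void16* dst, const(void16)* src)
{
    storeUnaligned(dst, loadUnaligned(src));
}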
Also, there don't seem to be any equivalent 32-byte
implementations in `core.simd`. Is that just because no one has
bothered to implement them yet? And with AVX-512, we should
probably have 64-byte implementations as well.
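
For illustration, a 32-byte analogue could be composed from the
existing 16-byte primitives (the name is hypothetical, and this
assumes core.simd's void32 type is available for the target):

import core.simd;

// Hypothetical 32-byte unaligned load built from two 16-byte
// halves; on AVX hardware a real implementation could use a
// single vmovdqu instead.
void32 loadUnaligned32(const(void32)* p)
{
    void32 result;
    auto src = cast(const(void16)*)p;
    auto dst = cast(void16*)&result;
    dst[0] = loadUnaligned(src);
    dst[1] = loadUnaligned(src + 1);
    return result;
}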
Mike