On Sunday, 17 June 2018 at 17:00:00 UTC, David Nadlinger wrote:
On Wednesday, 13 June 2018 at 06:46:43 UTC, Mike Franklin wrote:
https://github.com/JinShil/memcpyD

[…]

Feedback, advice, and pull requests to improve the implementation are most welcome.

The memcpyD implementation is buggy; it assumes that all arguments are aligned to their size. This isn't necessarily true. For example, `ubyte[1024].alignof == 1`, and struct alignment can also be set explicitly using align(N).

Yes, I'm already aware of that. My plan is to create optimized implementations for aligned data, and then handle unaligned data as compositions of the various aligned implementations. For example, a 3-byte copy would be a short copy plus a byte copy. That may not be appropriate for all cases; I'll have to measure and adapt.
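
Roughly what I have in mind, as a sketch only (copy3 is a made-up helper name, not something in memcpyD, and it relies on x86 tolerating the unaligned 2-byte store):

// Sketch: a 3-byte copy composed of a 2-byte copy plus a 1-byte copy.
// copy3 is a hypothetical name for illustration only.
void copy3(ubyte* dst, const(ubyte)* src)
{
    *cast(ushort*)dst = *cast(const(ushort)*)src;  // bytes 0..1 as one short copy
    dst[2] = src[2];                               // remaining byte
}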

On x86, you can get away with this in a lot of cases even though it's undefined behaviour [1], but this is not necessarily the case for SSE/AVX instructions. In fact, that's probably a pretty good guess as to where those weird crashes you mentioned come from.

Thanks! I think you're right.

For loading into vector registers, you can use core.simd.loadUnaligned instead (ldc.simd.loadUnaligned for LDC – LDC's druntime has not been updated yet after {load, store}Unaligned were added upstream as well).

Unfortunately the code gen is quite a bit worse:

Exhibit A:
https://run.dlang.io/is/jIuHRG
*(cast(void16*)(&s2)) = *(cast(const void16*)(&s1));

_Dmain:
                push    RBP
                mov     RBP,RSP
                sub     RSP,020h
                lea     RAX,-020h[RBP]
                xor     ECX,ECX
                mov     [RAX],RCX
                mov     8[RAX],RCX
                lea     RDX,-010h[RBP]
                mov     [RDX],RCX
                mov     8[RDX],RCX
                movdqa  XMM0,-020h[RBP]
                movdqa  -010h[RBP],XMM0
                xor     EAX,EAX
                leave
                ret
                add     [RAX],AL
.text._Dmain    ends


Exhibit B:
https://run.dlang.io/is/PLRfhW
storeUnaligned(cast(void16*)(&s2), loadUnaligned(cast(const void16*)(&s1)));

_Dmain:
                push    RBP
                mov     RBP,RSP
                sub     RSP,050h
                lea     RAX,-050h[RBP]
                xor     ECX,ECX
                mov     [RAX],RCX
                mov     8[RAX],RCX
                lea     RDX,-040h[RBP]
                mov     [RDX],RCX
                mov     8[RDX],RCX
                mov     -030h[RBP],RDX
                mov     -010h[RBP],RAX
                movdqu  XMM0,[RAX]
                movdqa  -020h[RBP],XMM0
                movdqa  XMM1,-020h[RBP]
                movdqu  [RDX],XMM1
                xor     EAX,EAX
                leave
                ret
                add     [RAX],AL
.text._Dmain    ends


If the code gen were better, that would definitely be the way to go: have unaligned and aligned copies share the same implementation. Maybe I can fix the DMD code gen, or implement a `copyUnaligned` intrinsic.
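
The intent would be roughly the following (copyUnaligned is a made-up name; only {load, store}Unaligned actually exist in core.simd, and this assumes a druntime recent enough to have them):

import core.simd : void16, loadUnaligned, storeUnaligned;

// Hypothetical copyUnaligned helper; the hope is that the compiler would
// lower it to a single movdqu load/store pair instead of the spills above.
void copyUnaligned(void16* dst, const(void16)* src)
{
    storeUnaligned(dst, loadUnaligned(src));
}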

Also, there don't seem to be any equivalent 32-byte implementations in `core.simd`. Is that just because no one has bothered to implement them yet? And with AVX-512, we should probably have 64-byte implementations as well.
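
In the meantime, I suppose a 32-byte copy could be composed from two 16-byte unaligned operations, something along these lines (copy32 is a made-up name and I haven't measured it):

import core.simd : void16, loadUnaligned, storeUnaligned;

// Stopgap sketch: a 32-byte unaligned copy as two 16-byte unaligned copies,
// pending 32-byte (and eventually 64-byte) overloads in core.simd.
void copy32(void* dst, const(void)* src)
{
    auto d = cast(void16*)dst;
    auto s = cast(const(void16)*)src;
    storeUnaligned(d,     loadUnaligned(s));
    storeUnaligned(d + 1, loadUnaligned(s + 1));  // next 16 bytes
}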

Mike
