On Sunday, 12 August 2018 at 07:00:30 UTC, Eugene Wissner wrote:

>> Also, what about other implementations like memset or memcmp? Have they been implemented, or do they require work from scratch?

> These functions are still mostly implemented in asm, so I'm not sure there is an "idiomatic D" way. I would only wrap them into a function working with slices and checking the length. Mike?

Inline ASM is a feature of D, so "idiomatic D" includes assembly implementations. Where D shines here is in its metaprogramming capabilities and the ability to select an implementation at compile time or compose implementations. See https://github.com/JinShil/memcpyD/blob/master/memcpyd.d There you will see that the compiler selects an implementation based on size in powers of 2. Sizes that aren't powers of 2 can be composed from the power-of-2 implementations. For example, a 6-byte implementation would be composed of the 4-byte implementation plus the 2-byte implementation.
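To make the idea concrete, here is a minimal sketch (my own illustration, not the actual memcpyD code) of how `static if` can select a copy implementation at compile time based on the size of the type, with a fallback for sizes not handled explicitly:

```d
import core.stdc.string : memcpy; // fallback for sizes with no dedicated path

// Hypothetical sketch: pick a copy strategy at compile time from T.sizeof.
void copy(T)(ref T dst, const ref T src)
{
    static if (T.sizeof == 1)
        *cast(ubyte*)&dst = *cast(const(ubyte)*)&src;
    else static if (T.sizeof == 2)
        *cast(ushort*)&dst = *cast(const(ushort)*)&src;
    else static if (T.sizeof == 4)
        *cast(uint*)&dst = *cast(const(uint)*)&src;
    else static if (T.sizeof == 8)
        *cast(ulong*)&dst = *cast(const(ulong)*)&src;
    else
        memcpy(&dst, &src, T.sizeof); // sizes not specialized above
}
```

Because the branch is resolved during compilation, each instantiation compiles down to just the one move it needs, with no runtime size dispatch.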

I have more code than what's currently checked into that repository, but I stalled because I need AVX512 to do an optimized implementation, and that doesn't seem to be a feature of DMD at the moment. That was one reason I posted https://wiki.dlang.org/SAOC_2018_ideas#Implement_AVX2_and_AVX-512_in_DMD.2FDRuntime on the wiki.

Agner Fog had this to say in https://agner.org/optimize/optimizing_assembly.pdf:

---
There are several ways of moving large blocks of data. The most common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest available register size.
3. If size is constant: inline move instructions.
4. If data are misaligned: First move as many bytes as required to make the destination aligned. Then read unaligned and write aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for misalignment, and write aligned.
6. If the data size is too big for caching, use non-temporal writes to bypass the cache. Shift to compensate for misalignment, if necessary.

As you can see, it can be very difficult to choose the optimal method in a given situation. The best advice I can give for a universal memcpy function, based on my testing, is as follows:
* On Intel Wolfdale and earlier, Intel Atom, AMD K8 and earlier, and VIA Nano, use the aligned read - shift - aligned write method (5).
* On Intel Nehalem and later, method (4) is up to 33% faster than method (5).
* On AMD K10 and later and Bobcat, use the unaligned read - aligned write method (4).
* The non-temporal write method (6) can be used for data blocks bigger than half the size of the largest-level cache.
---
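As a rough illustration of method (4) above, here is a hedged D sketch (mine, not Agner's or memcpyD's) that first moves single bytes until the destination is aligned, then copies machine words with potentially unaligned reads and aligned writes, then handles the tail:

```d
// Sketch of method (4): align the destination, then copy word-sized chunks.
// Assumes a target (e.g. x86) where unaligned reads through a pointer cast
// are permitted; a real implementation would use SIMD registers instead.
void copyAlignDst(void* dst, const(void)* src, size_t n)
{
    auto d = cast(ubyte*) dst;
    auto s = cast(const(ubyte)*) src;

    // Byte-at-a-time until the destination is word-aligned.
    while (n && (cast(size_t) d & (size_t.sizeof - 1)))
    {
        *d++ = *s++;
        --n;
    }

    // Unaligned reads, aligned writes, one machine word per iteration.
    for (; n >= size_t.sizeof; n -= size_t.sizeof)
    {
        *cast(size_t*) d = *cast(const(size_t)*) s;
        d += size_t.sizeof;
        s += size_t.sizeof;
    }

    // Remaining tail bytes.
    while (n--)
        *d++ = *s++;
}
```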

I think D is well suited for an implementation that takes all these things into consideration, selecting an appropriate implementation and composing complex implementations from simpler ones.
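For instance, the 6-byte case mentioned earlier can be composed from the 4-byte and 2-byte primitives. A small sketch (again my own illustration, assuming a target that tolerates unaligned loads through pointer casts):

```d
// Hypothetical 6-byte struct and a copy composed of 4-byte + 2-byte moves.
struct S6 { ubyte[6] data; }

void copy6(ref S6 dst, const ref S6 src)
{
    // First 4 bytes as one 32-bit move.
    *cast(uint*) dst.data.ptr = *cast(const(uint)*) src.data.ptr;
    // Last 2 bytes as one 16-bit move.
    *cast(ushort*)(dst.data.ptr + 4) = *cast(const(ushort)*)(src.data.ptr + 4);
}
```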

Mike
