On Sunday, 12 August 2018 at 07:00:30 UTC, Eugene Wissner wrote:
Also, what about other implementations like memset or memcmp?
Have they been implemented, or do they require work from scratch?
These functions are still mostly implemented in asm, so I'm not
sure there is an "idiomatic D" way. I would only wrap them into
a function working with slices and checking the length. Mike?
Inline ASM is a feature of D, so "idiomatic D" includes assembly
implementations. Where D shines here is with its
metaprogramming capabilities and the ability to select an
implementation at compile time or compose implementations. See
https://github.com/JinShil/memcpyD/blob/master/memcpyd.d There
you will see that the compiler selects an implementation based on
size in powers of 2. Sizes that aren't powers of 2 can be
composed of the power-of-2 implementations. For example, a 6-byte
implementation would be composed of the 4-byte implementation
plus the 2-byte implementation.
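To illustrate the idea, here is a simplified sketch of that dispatch scheme (this is not the actual memcpyD code; the helper `largestPow2` and the function bodies are my own stand-ins): the size is a compile-time parameter, so `static if` statically selects a power-of-2 copy or composes one from smaller ones.

```d
// Largest power of 2 <= n, evaluated at compile time via CTFE.
private size_t largestPow2(size_t n)
{
    size_t p = 1;
    while (p * 2 <= n)
        p *= 2;
    return p;
}

void memcpyD(size_t n)(void* dst, const(void)* src)
{
    static if (n == 0) { /* nothing to copy */ }
    else static if (n == 1)
        *cast(ubyte*) dst = *cast(const(ubyte)*) src;
    else static if (n == 2)
        *cast(ushort*) dst = *cast(const(ushort)*) src;
    else static if (n == 4)
        *cast(uint*) dst = *cast(const(uint)*) src;
    else static if (n == 8)
        *cast(ulong*) dst = *cast(const(ulong)*) src;
    // A real implementation would continue with SIMD registers for
    // 16, 32, and (with AVX-512) 64 bytes, possibly in inline asm.
    else
    {
        // Compose other sizes from the cases above, e.g.
        // 6 bytes = the 4-byte copy plus the 2-byte copy.
        // For powers of 2 we don't handle directly (16, 32, ...),
        // split in half to avoid infinite recursion.
        enum size_t q = largestPow2(n);
        enum size_t p = (q == n) ? n / 2 : q;
        memcpyD!p(dst, src);
        memcpyD!(n - p)(cast(ubyte*) dst + p,
                        cast(const(ubyte)*) src + p);
    }
}
```

Usage would look like `memcpyD!6(b.ptr, a.ptr)`; because the size is a template parameter, the 4+2 composition is decided entirely at compile time.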
I have more code than what's currently checked into that
repository, but I stalled because I need AVX512 to do an
optimized implementation, and that doesn't seem to be a feature
of DMD at the moment. That was one reason I posted
https://wiki.dlang.org/SAOC_2018_ideas#Implement_AVX2_and_AVX-512_in_DMD.2FDRuntime on the wiki.
Agner Fog had this to say in
https://agner.org/optimize/optimizing_assembly.pdf
---
There are several ways of moving large blocks of data. The most
common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest
available register size.
3. If size is constant: inline move instructions.
4. If data are misaligned: First move as many bytes as required
to make the destination aligned. Then read unaligned and write
aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for
misalignment and write aligned.
6. If the data size is too big for caching, use non-temporal
writes to bypass the cache. Shift to compensate for misalignment,
if necessary.
As you can see, it can be very difficult to choose the optimal
method in a given situation. The best advice I can give for a
universal memcpy function, based on my testing, is as follows:
* On Intel Wolfdale and earlier, Intel Atom, AMD K8 and earlier,
and VIA Nano, use the aligned read - shift - aligned write method
(5).
* On Intel Nehalem and later, method (4) is up to 33% faster than
method (5).
* On AMD K10 and later and Bobcat, use the unaligned read -
aligned write method (4).
* The non-temporal write method (6) can be used for data blocks
bigger than half the size of the largest-level cache.
---
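Method (4) above can be sketched portably: copy single bytes until the destination is aligned, then read unaligned and write aligned with the widest integer register. This is only a scalar illustration of the structure (the function name is mine, and a tuned version would use SIMD registers and inline asm rather than `ulong`):

```d
void copyUnalignedReadAlignedWrite(ubyte* dst, const(ubyte)* src, size_t n)
{
    // First move as many bytes as required to align the destination.
    while (n != 0 && (cast(size_t) dst & (ulong.sizeof - 1)) != 0)
    {
        *dst++ = *src++;
        --n;
    }

    // Read unaligned, write aligned, one register at a time.
    while (n >= ulong.sizeof)
    {
        ulong v = *cast(const(ulong)*) src; // unaligned load (OK on x86)
        *cast(ulong*) dst = v;              // aligned store
        src += ulong.sizeof;
        dst += ulong.sizeof;
        n   -= ulong.sizeof;
    }

    // Tail bytes.
    while (n != 0)
    {
        *dst++ = *src++;
        --n;
    }
}
```

Methods (5) and (6) would follow the same skeleton, with shifts to compensate for misalignment or non-temporal stores for blocks larger than the cache.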
I think D is well suited for an implementation that takes all of
these things into consideration, selecting an appropriate
implementation at compile time and composing complex
implementations from simpler ones.
Mike