On Sunday, 12 August 2018 at 07:00:30 UTC, Eugene Wissner wrote:

>> Also, what about other implementations like memset or memcmp? Have they been implemented, or do they require work from scratch?

> These functions are still mostly implemented in asm, so I'm not sure there is an "idiomatic D" way. I would only wrap them into a function working with slices and checking the length. Mike?

Inline ASM is a feature of D, so "idiomatic D" includes assembly implementations. Where D shines here is in its metaprogramming capabilities and the ability to select an implementation at compile time or compose implementations. See https://github.com/JinShil/memcpyD/blob/master/memcpyd.d There you will see that the compiler selects an implementation based on size in powers of 2. Sizes that aren't powers of 2 can be composed from the power-of-2 implementations. For example, a 6-byte implementation would be composed of the 4-byte implementation plus the 2-byte implementation.
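To make the idea concrete, here is a minimal sketch (my own illustration, not the actual memcpyD code) of how `static if` can select a copy implementation at compile time based on the size of the type, with a fallback for sizes not handled explicitly:

```d
import core.stdc.string : memcpy; // fallback for sizes with no dedicated path

// Hypothetical sketch: pick a copy strategy at compile time from T.sizeof.
void copy(T)(ref T dst, const ref T src)
{
    static if (T.sizeof == 1)
        *cast(ubyte*)&dst = *cast(const(ubyte)*)&src;
    else static if (T.sizeof == 2)
        *cast(ushort*)&dst = *cast(const(ushort)*)&src;
    else static if (T.sizeof == 4)
        *cast(uint*)&dst = *cast(const(uint)*)&src;
    else static if (T.sizeof == 8)
        *cast(ulong*)&dst = *cast(const(ulong)*)&src;
    else
        memcpy(&dst, &src, T.sizeof); // sizes not specialized above
}
```

Because the branch is resolved during compilation, each instantiation compiles down to just the one move it needs, with no runtime size dispatch.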

I have more code than what's currently checked into that repository, but I stalled because I need AVX512 to do an optimized implementation, and that doesn't seem to be a feature of DMD at the moment. That was one reason I posted https://wiki.dlang.org/SAOC_2018_ideas#Implement_AVX2_and_AVX-512_in_DMD.2FDRuntime on the wiki.

Agner Fog had this to say in https://agner.org/optimize/optimizing_assembly.pdf:

---
There are several ways of moving large blocks of data. The most common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest available register size.
3. If size is constant: inline move instructions.
4. If data are misaligned: First move as many bytes as required to make the destination aligned. Then read unaligned and write aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for misalignment, and write aligned.
6. If the data size is too big for caching, use non-temporal writes to bypass the cache. Shift to compensate for misalignment, if necessary.

As you can see, it can be very difficult to choose the optimal method in a given situation. The best advice I can give for a universal memcpy function, based on my testing, is as follows:
* On Intel Wolfdale and earlier, Intel Atom, AMD K8 and earlier, and VIA Nano, use the aligned read - shift - aligned write method (5).
* On Intel Nehalem and later, method (4) is up to 33% faster than method (5).
* On AMD K10 and later and Bobcat, use the unaligned read - aligned write method (4).
* The non-temporal write method (6) can be used for data blocks bigger than half the size of the largest-level cache.
---
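As a rough illustration of method (4) above, here is a hedged D sketch (mine, not Agner's or memcpyD's) that first moves single bytes until the destination is aligned, then copies machine words with potentially unaligned reads and aligned writes, then handles the tail:

```d
// Sketch of method (4): align the destination, then copy word-sized chunks.
// Assumes a target (e.g. x86) where unaligned reads through a pointer cast
// are permitted; a real implementation would use SIMD registers instead.
void copyAlignDst(void* dst, const(void)* src, size_t n)
{
    auto d = cast(ubyte*) dst;
    auto s = cast(const(ubyte)*) src;

    // Byte-at-a-time until the destination is word-aligned.
    while (n && (cast(size_t) d & (size_t.sizeof - 1)))
    {
        *d++ = *s++;
        --n;
    }

    // Unaligned reads, aligned writes, one machine word per iteration.
    for (; n >= size_t.sizeof; n -= size_t.sizeof)
    {
        *cast(size_t*) d = *cast(const(size_t)*) s;
        d += size_t.sizeof;
        s += size_t.sizeof;
    }

    // Remaining tail bytes.
    while (n--)
        *d++ = *s++;
}
```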

I think D is well suited for an implementation that takes all these things into consideration, selecting an appropriate implementation and composing complex implementations from simpler ones.
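For instance, the 6-byte case mentioned earlier can be composed from the 4-byte and 2-byte primitives. A small sketch (again my own illustration, assuming a target that tolerates unaligned loads through pointer casts):

```d
// Hypothetical 6-byte struct and a copy composed of 4-byte + 2-byte moves.
struct S6 { ubyte[6] data; }

void copy6(ref S6 dst, const ref S6 src)
{
    // First 4 bytes as one 32-bit move.
    *cast(uint*) dst.data.ptr = *cast(const(uint)*) src.data.ptr;
    // Last 2 bytes as one 16-bit move.
    *cast(ushort*)(dst.data.ptr + 4) = *cast(const(ushort)*)(src.data.ptr + 4);
}
```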

Mike
