Florian Klämpfl via fpc-devel <fpc-devel@lists.freepascal.org> wrote on Sat, 16 Apr 2022, 21:00:
>
> > On 16.04.2022 at 01:26, J. Gareth Moreton via fpc-devel
> > <fpc-devel@lists.freepascal.org> wrote:
> >
> > Hi everyone,
> >
> > This is something that sprang to mind when thinking about code speed
> > and the like, and one thing that cropped up is the initialisation of
> > large variables such as arrays or records. A common means of doing
> > this is, say:
> >
> >   FillChar(MyVar, SizeOf(MyVar), 0);
> >
> > To keep things as general-purpose as possible, this usually results in
> > a function call that decides the best course of action, and for very
> > large blocks of data whose size may not be deterministic (e.g. a file
> > buffer), this is the best approach - the overhead is relatively small
> > and it quickly uses fast block-move instructions.
> >
> > However, for small-to-mid-sized variables of known size, this can lead
> > to some inefficiencies, first by not taking into account that the size
> > of the variable is known, but also because the initialisation value is
> > zero, more often than not, and the variable is probably aligned on the
> > stack (so the checks to make sure a pointer is aligned are
> > unnecessary).
> >
> > I did a proof of concept on x86_64-win64 with the following record:
> >
> >   type
> >     TTestRecord = record
> >       Field1: Byte;
> >       Field2, Field3, Field4: Integer;
> >     end;
> >
> > SizeOf(TTestRecord) is 16 and all the fields are on 4-byte boundaries.
> > Nothing particularly special.
> >
> > I then declared a variable of this type and filled the fields with
> > random values, and then ran two different methods to clear their
> > memory. To get a good speed average, I ran each method 1,000,000,000
> > times in a for-loop.
> > The first method was:
> >
> >   FillChar(TestRecord, SizeOf(TestRecord), 0);
> >
> > The second method was inline assembly language (which I've called 'the
> > intrinsic'):
> >
> >   asm
> >     PXOR   XMM0, XMM0
> >     MOVDQU [RIP+TestRecord], XMM0
> >   end;
> >
> > It's not perfect because the presence of inline assembly prevents the
> > use of register variables (although TestRecord is always on the stack
> > regardless), but the performance hit is barely noticeable in this
> > case, and if the assembly language were inserted by the compiler, the
> > register variable problem wouldn't arise.
> >
> > These are my results:
> >
> >   FillChar time: 2.398 ns
> >
> >     Field1 = 0
> >     Field2 = 0
> >     Field3 = 0
> >     Field4 = 0
> >
> >   Intrinsic time: 1.336 ns
> >
> >     Field1 = 0
> >     Field2 = 0
> >     Field3 = 0
> >     Field4 = 0
> >
> > Sure, it's on the order of nanoseconds, but the intrinsic is almost
> > twice as fast.
> >
> > In terms of size - FillChar call = 20 bytes:
> >
> >   488d0d22080200    lea    0x20822(%rip),%rcx    # 0x100022010
> >   4531c0            xor    %r8d,%r8d
> >   ba10000000        mov    $0x10,%edx
> >   e8150a0000        callq  0x100002210 <SYSTEM_$$_FILLCHAR$formal$INT64$BYTE>
> >
> > The intrinsic = 12 bytes:
> >
> >   660fefc0          pxor   %xmm0,%xmm0
> >   f30f7f05bd050200  movdqu %xmm0,0x205bd(%rip)   # 0x100022010
> >
> > For a 32-byte record instead, an extra 8-byte MOVDQU instruction would
> > be required, so the two would be equal in size, but with the bonus
> > that the intrinsic doesn't have a function call and will probably help
> > optimisation in the rest of the procedure by freeing up the registers
> > used to pass parameters (%rcx, %rdx and %r8 in this case; although the
> > intrinsic will require an MM register in this x86_64 example, they
> > tend to not be used as often). Also, the peephole optimizer can remove
> > redundant PXOR XMM0, XMM0 instructions, which will help as well if
> > there are multiple FillChar calls.
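
The benchmark described above could be sketched roughly as follows. This is only a minimal sketch and not the original test program: the program name, the Iterations constant (reduced from the 1,000,000,000 runs in the post) and the use of GetTickCount64 for timing are assumptions, since the post does not say how the times were measured. The assembly part is only compiled for x86_64 targets.

```pascal
program FillCharBench;
{$mode objfpc}
{$asmmode intel}

uses
  SysUtils;

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

const
  // Assumption: fewer iterations than the 1,000,000,000 used in the post.
  Iterations = 100000000;

var
  TestRecord: TTestRecord; // global, so it can be addressed RIP-relative
  i: Integer;
  StartTick: QWord;

begin
  // Method 1: the regular call into the System unit.
  StartTick := GetTickCount64;
  for i := 1 to Iterations do
    FillChar(TestRecord, SizeOf(TestRecord), 0);
  WriteLn('FillChar:  ', GetTickCount64 - StartTick, ' ms');

{$ifdef CPUX86_64}
  // Method 2: the hand-written 'intrinsic' from the post,
  // one 16-byte unaligned store of a zeroed XMM register.
  StartTick := GetTickCount64;
  for i := 1 to Iterations do
    asm
      PXOR   XMM0, XMM0
      MOVDQU [RIP+TestRecord], XMM0
    end;
  WriteLn('Intrinsic: ', GetTickCount64 - StartTick, ' ms');
{$endif}
end.
```

Dividing each total by Iterations gives a per-call figure comparable to the nanosecond values quoted above; absolute numbers will differ by machine and compiler settings.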
> >
> > I'm not proposing a total rewrite, and I would say that in the default
> > case, it should just fall back to the in-built System functions, but
> > the relevant compiler nodes could be overridden on specific platforms
> > to generate smaller, more optimised code when the sizes and values are
> > known at compile time.
> >
> > Now, in this example, it is still faster to simply set the fields
> > manually one-by-one (clocks in at around 1.2 ns), possibly due to the
> > unaligned write (MOVDQU) and internal SSE state switching adding some
> > overhead, but there's nothing to stop the compiler from inserting code
> > in place of the FillChar call to do just that if it thinks it's the
> > fastest method. Then again, one has to be a little bit careful because
> > FillChar and the intrinsic will also set the filler bytes between
> > Field1 and Field2 to 0, whereas manually assigning 0 to the fields
> > won't (so they aren't strictly equivalent and might only be allowed if
> > there are no filler bytes or when compiling under -O4, but the latter
> > may still be dangerous where typecasting is concerned), and extra care
> > would have to be taken where unions are concerned (sorry, 'union' is a
> > C term - what's the official Pascal term again?).
> >
> > Actual Pascal calls to FillChar would not change in any way and so
> > theoretically it won't break existing code. The only drawback is that
> > the intrinsic and the internal System functions would have to be named
> > the same so constructs such as "FuncPtr := @FillChar;" as well as
> > calling FillChar from assembler routines still work, and the compiler
> > would have to know how to differentiate between the two.
> >
> > Just on the surface, what are your thoughts?
>
> Inlining FillChar is for sure useful (same for move). The FillChar in
> the system unit could stay, the compiler could just replace a call to
> System.FillChar by some compiler generated assembler doing the FillChar.
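
The caveat about filler bytes raised in the quoted proposal can be checked with a small test along these lines. It is a sketch under assumptions: the program name and the AllBytesZero helper are hypothetical, and it relies on the default record layout described in the post (SizeOf(TTestRecord) = 16, with three filler bytes after Field1).

```pascal
program PaddingDemo;
{$mode objfpc}

type
  TTestRecord = record
    Field1: Byte;
    Field2, Field3, Field4: Integer;
  end;

// Hypothetical helper: True when every byte of the record,
// filler bytes included, is zero.
function AllBytesZero(const R: TTestRecord): Boolean;
var
  P: PByte;
  i: Integer;
begin
  P := @R;
  for i := 0 to SizeOf(R) - 1 do
    if P[i] <> 0 then
      Exit(False);
  Result := True;
end;

var
  Rec: TTestRecord;

begin
  // Poison every byte, filler included, then clear field by field.
  FillChar(Rec, SizeOf(Rec), $FF);
  Rec.Field1 := 0;
  Rec.Field2 := 0;
  Rec.Field3 := 0;
  Rec.Field4 := 0;
  WriteLn('Field-by-field, all bytes zero: ', AllBytesZero(Rec));

  // Poison again, then clear the whole block.
  FillChar(Rec, SizeOf(Rec), $FF);
  FillChar(Rec, SizeOf(Rec), 0);
  WriteLn('FillChar, all bytes zero:       ', AllBytesZero(Rec));
end.
```

With the layout from the post, the field-by-field variant would be expected to leave the filler bytes at $FF, which is why the two approaches are not interchangeable for code that compares or hashes the raw bytes of a record.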
>

But we should have a general mechanism for that, not something that just
handles FillChar.

Regards,
Sven
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel