So I've been doing some playing around recently, and noticed that while FillChar has some very fast internal code for initialising a block of memory, making use of non-temporal hints and memory fences, the versions for the larger types (FillWord, FillDWord and FillQWord) fall back to slow Pascal code. To showcase this, I ran a test on my 6-year-old laptop that compared a small and fairly basic assembler routine against each of the internal functions (times are averaged over 100 iterations):

FillWord - initialise 16,777,216 words to 0
 - Internal:  8177.209 µs
 - Assembler: 4234.131 µs

FillWord - initialise 1,048,576 words to $AAAA
 - Internal:  153.221 µs
 - Assembler: 86.496 µs

FillWord - initialise 1,229 words to $5555
 - Internal:  0.267 µs
 - Assembler: 0.135 µs

FillDWord - initialise 16,777,216 DWords to 0
 - Internal:  15552.032 µs
 - Assembler: 10945.809 µs

FillDWord - initialise 1,048,576 DWords to $AAAAAAAA
 - Internal:  902.060 µs
 - Assembler: 470.788 µs

FillDWord - initialise 1,229 DWords to $55555555
 - Internal:  0.357 µs
 - Assembler: 0.275 µs

FillQWord - initialise 16,777,216 QWords to 0
 - Internal:  33397.248 µs
 - Assembler: 17488.901 µs

FillQWord - initialise 1,048,576 QWords to $AAAAAAAAAAAAAAAA
 - Internal:  2130.116 µs
 - Assembler: 1258.130 µs

FillQWord - initialise 1,229 QWords to $5555555555555555
 - Internal:  0.739 µs
 - Assembler: 0.402 µs

The assembler functions were as follows:

    {$ASMMODE INTEL}

    procedure SizeOptimisedFillWord(var x; count: SizeInt; Value: Word); assembler; nostackframe;
    asm
      {
        RCX = Pointer to x
        RDX = Count
        R8W = Value
      }
      PUSH RDI
      MOV  AX, R8W
      MOV  RDI, RCX
      MOV  RCX, RDX
      REP STOSW
      POP  RDI
    end;

    procedure SizeOptimisedFillDWord(var x; count: SizeInt; Value: DWord); assembler; nostackframe;
    asm
      {
        RCX = Pointer to x
        RDX = Count
        R8D = Value
      }
      PUSH RDI
      MOV  EAX, R8D
      MOV  RDI, RCX
      MOV  RCX, RDX
      REP STOSD
      POP  RDI
    end;

    procedure SizeOptimisedFillQWord(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
    asm
      {
        RCX = Pointer to x
        RDX = Count
        R8  = Value
      }
      PUSH RDI
      MOV  RAX, R8
      MOV  RDI, RCX
      MOV  RCX, RDX
      REP STOSQ
      POP  RDI
    end;

I also made versions that use memory fences and other checks, such as memory alignment, in order to gain speed. I've converted them to use the System V ABI of Linux as well, but those are currently completely untested, as I don't yet have the facilities to compile on Linux. They are also even smaller in code size, because you don't need to push and pop RDI and the destination (var x) is already stored in RDI, thereby collapsing each routine to just 3 instructions (not including the REP prefix).

Would it be worth opening up a bug report for this, with the attached assembler routines as suggestions? I haven't worked out how to implement internal functions into the compiler yet, and I'd rather clear it with you guys first before I make such an addition. I had a thought that the simple routines above could be used when compiling for small code size, while larger, more advanced ones are used when compiling for speed.

Yours faithfully,

J. Gareth "Kit" Moreton

_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
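
For illustration, here is a rough sketch of what the 3-instruction System V form of the QWord routine described above might look like. It is not the code from the message: the name SizeOptimisedFillQWordSysV and the small check program around it are hypothetical, and it assumes the standard System V AMD64 register assignment (RDI = pointer to x, RSI = count, RDX = Value), with RDI and RCX free to be overwritten and the direction flag clear on entry. Like the Win64 routines, it does not guard against a negative count.

```pascal
program FillQWordSysVSketch;

{$MODE OBJFPC}
{$ASMMODE INTEL}

{ Assumption: x86-64 target using the System V ABI (e.g. Linux), not Win64 }
{$IFNDEF CPUX86_64}{$FATAL This sketch is x86-64 only}{$ENDIF}
{$IFDEF WINDOWS}{$FATAL This sketch assumes the System V ABI, not Win64}{$ENDIF}

{ Hypothetical name; count must be non-negative }
procedure SizeOptimisedFillQWordSysV(var x; count: SizeInt; Value: QWord); assembler; nostackframe;
asm
  {
    RDI = Pointer to x
    RSI = Count
    RDX = Value
  }
  MOV RAX, RDX
  MOV RCX, RSI
  REP STOSQ
end;

const
  Pattern = QWord($5555555555555555);

var
  Buf: array[0..1228] of QWord;
  i: SizeInt;

begin
  SizeOptimisedFillQWordSysV(Buf, Length(Buf), Pattern);

  { Check every element; exit with a non-zero code on the first mismatch }
  for i := Low(Buf) to High(Buf) do
    if Buf[i] <> Pattern then
      Halt(1);

  WriteLn('ok');
end.
```

The Word and DWord variants would follow the same pattern, loading AX from DX or EAX from EDX and using REP STOSW or REP STOSD respectively.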