Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
New unit test, with Martin's integrated. If I play with godbolt, Ryzen zen3 (ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, so I would like to see some benchmarks from various processors. Also from very old ones (P4 and Clawhammers) to test instruction sets. Project utf8lentest raised exception class 'External: SIGSEGV'. In file 'utf8lentest.lpr' at line 89: movdqu xmm0, [rcx] OS: Linux x64. CPU: vendor_id = "GenuineIntel" (simple synth) = Intel Core (unknown type) (Sandy Bridge D2/J1/Q0) {Sandy Bridge}, 32nm -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 30-12-2021 01:29, Marco van de Voort via lazarus wrote: (P4 and Clawhammers) to test instruction sets. 64-bit supporting Pentium 4's of course, since this is a 64-bit only test. Claw/Sledgehammer is the first iteration of Athlon64 and of the x86_64 architecture as a whole and misses some instructions (like SSE3) that all other 64-bit capable CPUs have. -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 29-12-2021 00:00, Bart via lazarus wrote: On Tue, Dec 28, 2021 at 11:35 PM Martin Frb via lazarus wrote: I have a core I7-8600 The diff between the old code and popcnt is less significant. old: 715 pop: 695 But there is a 3rd way, that is faster. add: 610 Not surprising that you should come up with a faster solution. IIRC you won both speed contests I had on the forum ;-) Feel free to implement it in LazUtf8. New unit test, with Martin's integrated. If I play with godbolt, Ryzen zen3 (ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, so I would like to see some benchmarks from various processors. Also from very old ones (P4 and Clawhammers) to test instruction sets. I use unaligned loads which afaik on older ( pre Core 1st or 2nd generation) CPUs are costly. (because it loads two caches lines per time) // // (C) 2021 Martin Friebe and Marco van de Voort. // attempt to accelerate utf8lengthfast which is a length(s) in utf8 codepoints without integrity checking // // 4 versions. // - Original, // - with popcount and // - the "add" variant that accumulates 127 iterations of ptrints and only adds // the intermeidates outside that loop // - a SSE2 version loosely inspired by the add variant combined with //the core of an existing (branchless) binarization routine for the main loop. {$mode objfpc}{$H+} {$asmmode intel} {define asmdebug} uses SysUtils,StrUtils; const mask3 : array[0..15] of byte = ( $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0); mask4 : array[0..15] of byte = ( $80,$80,$80,$80, $80,$80,$80,$80, $80,$80,$80,$80, $80,$80,$80,$80); mask2 : array[0..15] of byte = ( $1,$1,$1,$1, $1,$1,$1,$1, $1,$1,$1,$1, $1,$1,$1,$1); // Integer arguments are passed in registers RCX, RDX, R8, and R9. // Floating point arguments are passed in XMM0L, XMM1L, XMM2L, and XMM3L. // volatile: RAX, RCX, RDX, R8, R9, R10, R11 // nonvolatile RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered nonvolatile // volatile xmm0-xmm3 (params) en xmm4,5 // https://msdn.microsoft.com/en-us/library/ms235286.aspx {$ifdef asmdebug} function asmutf8length(const s : pchar;res:pbyte;len:integer):int64; {$else} function asmutf8length(const s : pchar;len:integer):int64; {$endif} begin asm {$ifndef asmdebug} mov r8,rdx {$endif} // using broadcast etc raises requirements? mov rax,r8 mov r9,r8 // tuning for short strings: // -- // test rax,rax // je @theend // cmp r9,128 // difference between long and short // jl @restbytes and r9,15 shr r8,4 pxor xmm5,xmm5 // always zero pxor xmm6,xmm6 // dword counts movdqu xmm1,[rip+mask3] movdqu xmm2,[rip+mask4] movdqu xmm3,[rip+mask2] test r8,r8 je @restbytes @outer: mov r10,127// max iterations cmp r10,r8// more or less left? jl @last // more mov r10,r8 // less @last: sub r8,r10 // iterations left - iterations to do pxor xmm4,xmm4 @inner: movdqu xmm0, [rcx] pand xmm0,xmm1 // mask out top 2 bits pcmpeqb xmm0,xmm2// compare with $80. pand xmm0,xmm3 // change to $1 per byte. paddb xmm4,xmm0 // add to cumulative add rcx,16 dec r10 jne @inner // process 127 iterations movdqa xmm0,xmm4 PUNPCKLBW xmm0,xmm5 // zero extend to words PUNPCKHBW xmm4,xmm5 paddw xmm0,xmm4 // add, now 8 16-bit words. movdqa xmm4,xmm0 PUNPCKLWD xmm0,xmm5 // zero extend to dwords paddd xmm6,xmm0 PUNPCKHWD xmm4,xmm5 paddd xmm6,xmm4 // add both to cumulative test r8,r8 jne @outer MOVHLPS xmm4,xmm6// move high 8 bytes to low (float->int penalty?) paddd xmm6,xmm4 // add both 2*dwords (high doesn't matter) pshufd xmm4,xmm6,1 // mov 2nd dword in xmm6 to first in xmm4 paddd xmm6,xmm4 // add movd r8d,xmm6// to int alu reg sub rax,r8 // subtract from length in bytes. @restbytes: test r9,r9 je @theend // Done! @restloop: movzx r8d, byte [rcx]// unaligned bytes after sse loop mov r10,r8 shr r10,7 not r8 shr r8,6 and r10,r8 sub rax,r10 inc rcx dec r9 jne @restloop @theend: end['xmm5','xmm6']; // volatile registers used. end; function countmask(nx:int64):integer; begin nx := (nx and $00FF00FF00FF00FF) + ((nx >> 8) and $00FF00FF00FF00FF); nx := (nx and
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 29-12-2021 16:30, Martin Frb via lazarus wrote: Could you post full source if you haven't already? For a bit of benchmarking. I just wrote it from the top of my head, and I assumed 5 instructions for 16-byte would win any time, but haven't verified anything yet. I had it attached on my last mail. Attached it again here. (3rd procedure / "Utf8LengthAdd") It is only 64bit for now. (And not cleaned up in any way). Also changing "bc >> 7" and "bc and 127" to "moddiv(bc, 255, full, remain)" might save a few more ms. But probably needs larger data to benchmark. If you do work on this, feel free to integrate my code as the baseline for cpu without SSE. Otherwise, it might be a bit until I get to it. First results: (on an ageing i7-3770, trunk FPC -O4 -Cpcoreavx) fst 781 fst 781 fst 797 fst 766 pop 656 pop 641 pop 640 pop 641 add 562 add 578 add 563 add 594 asm 297 asm 296 asm 297 asm 297 Asm is nearly fully functional and working, more importantly the remaining issues are constant time and single instruction work, shouldn't influence benchmarking for anything than the shortest sequences. I'll finish up and post the whole shebang, since more eyes could help, I'm an asm amateur in some regards. -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 29/12/2021 13:42, Marco van de Voort via lazarus wrote: On 29-12-2021 10:16, Martin Frb via lazarus wrote: // Martin's routine that should be replaced by some punpkl magic, but it is too late now. Why too late? See datetime stamp. 02:10 AM. I don't know how it is with you Lazarus devels, but we FPC devels need our beauty sleep from time to time. Oh, too late for the day. I thought. Too late, that ship has sailed. Could you post full source if you haven't already? For a bit of benchmarking. I just wrote it from the top of my head, and I assumed 5 instructions for 16-byte would win any time, but haven't verified anything yet. I had it attached on my last mail. Attached it again here. (3rd procedure / "Utf8LengthAdd") It is only 64bit for now. (And not cleaned up in any way). Also changing "bc >> 7" and "bc and 127" to "moddiv(bc, 255, full, remain)" might save a few more ms. But probably needs larger data to benchmark. If you do work on this, feel free to integrate my code as the baseline for cpu without SSE. Otherwise, it might be a bit until I get to it. Project1.pas Description: application/unknown-content-type -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
Am 29.12.2021 um 13:42 schrieb Marco van de Voort via lazarus: p.s. is there a workaround for git worktree to work on the same branch? E.g. trunk for 32-bit and trunk for 64-bit ? :-) No. You cannot checkout the same branch in two worktrees. But you can do the following: create a new branch in both worktrees and check them out (main_32bit and main_64bit), do in both directories: work, commit, pull, checkout main, merge the branch, push: cd git/fpc/main1 git checkout -b main_32bit git worktree add ../main2 -b main_64bit cd main1 git checkout main git pull git merge main_32bit git push cd ../main2 git checkout main git pull git merge main_64bit git push -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 29-12-2021 10:16, Martin Frb via lazarus wrote: // Martin's routine that should be replaced by some punpkl magic, but it is too late now. Why too late? See datetime stamp. 02:10 AM. I don't know how it is with you Lazarus devels, but we FPC devels need our beauty sleep from time to time. So I'll be working on it some more today, possible things to do are: - fold countmask into the asm, and - maybe do another outer loop to process all bytes in one go. - special case for only a few bytes - maybe align to 16-byte (easier/cheaper for old machines than all the movdqu). - Maybe test the code on godbolt for performance At the very least I hope the punpkl code for countmask will be more compact since it doesn't require literals. There is a place for both. My routine works fine for cpu. (soon as a 32 bit (and maybe 16 bit) variant are added). Then for known cpu, special handling can be added. Could you post full source if you haven't already? For a bit of benchmarking. I just wrote it from the top of my head, and I assumed 5 instructions for 16-byte would win any time, but haven't verified anything yet. For 32-bit this can also be done, but you'd need a procedure variable to test for SSE support, and only run this for long strings (>200-500 or so) p.s. is there a workaround for git worktree to work on the same branch? E.g. trunk for 32-bit and trunk for 64-bit ? :-) -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 29/12/2021 02:10, Marco van de Voort via lazarus wrote: On 28-12-2021 23:35, Martin Frb via lazarus wrote: "nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit). If we regard each of this bytes as an entity of its own, then we can keep adding those "1". I also was thinking in that direction, but more about how to optimize that loop using SSE2 good idea... // Martin's routine that should be replaced by some punpkl magic, but it is too late now. Why too late? There is a place for both. My routine works fine for cpu. (soon as a 32 bit (and maybe 16 bit) variant are added). Then for known cpu, special handling can be added. -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus