Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Martin Frb via lazarus

On 30/12/2021 14:43, Marco van de Voort via lazarus wrote:
Compile with -O4 -Cpcoreavx2 , the others (non asm) will become 
faster, my guess is  "add" will be about double of asm.


Core I7 8700K

3.3.1 from Dec 10th
3.2.3 from Dec 9th

With fpc 3.3.1:
- fst is worse?
- add gets better

-O4 -Cpcoreavx2

fpc 3.2.3 /   fpc 3.3.1

fst 594   fst 688
fst 578   fst 703
fst 578   fst 687
fst 562   fst 688

pop 485   pop 485
pop 500   pop 500
pop 500   pop 484
pop 484   pop 500

add 594   add 422
add 578   add 438
add 578   add 437
add 594   add 453

asm 250   asm 250
asm 250   asm 250
asm 250   asm 250
asm 250   asm 266



fpc 3.2.3
-O4 -Cpcoreavx   -O4 -CpCOREI

fst 594  fst 593
fst 578  fst 579
fst 578  fst 562
fst 594  fst 578

pop 500  pop 500
pop 515  pop 500
pop 500  pop 500
pop 485  pop 485

add 593  add 593
add 579  add 578
add 578  add 594
add 593  add 594

asm 250  asm 250
asm 250  asm 250
asm 235  asm 250
asm 250  asm 250


--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Marco van de Voort via lazarus


On 30-12-2021 14:17, John Landmesser via lazarus wrote:

Perhaps usefui test information from my PC: 77

Compile with -O4 -Cpcoreavx2 , the others (non asm) will become faster, 
my guess is  "add" will be about double of asm.


Also, on windows "high performance" as power scheme.  On non windows try 
to disable power saving for a short while another way. If the first and 
second runs deviate it is probably the CPU throttling up.



--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread John Landmesser via lazarus

Perhaps usefui test information from my PC:

**
[john1@manjaro sdb2]$ ./utf8lentest
234526968
fst:128406168
pop:128406168
add:128406168
asm:128406168

29315871

fst 1365
fst 1367
fst 1366
fst 1366
pop 9990
pop 9990


pop 9997
pop 9981
add 1386
add 1382
add 1386
add 1390
asm 346
asm 346
asm 346
asm 349
fst 1357
fst 1368
fst 1372
fst 1371
pop 10681
pop 6886
pop 6895
pop 6916
add 1247
add 1248
add 1250
add 1248
asm 295
asm 291
asm 291
asm 293
[john1@manjaro sdb2]$
[john1@manjaro sdb2]$ inxi -F
System:
  Host: manjaro Kernel: 5.10.84-1-MANJARO x86_64 bits: 64
    Desktop: Xfce 4.16.0 Distro: Manjaro Linux
Machine:
  Type: Laptop System: LENOVO product: 81RS v: Lenovo Yoga S740-14IIL
    serial: 
  Mobo: LENOVO model: LNVNB161216 v: SDK0J40709 WIN
    serial:  UEFI: LENOVO v: BYCN39WW date: 05/28/2021
Battery:
  ID-1: BAT0 charge: 62.4 Wh (95.6%) condition: 65.3/62.0 Wh (105.3%)
CPU:
  Info: quad core model: Intel Core i7-1065G7 bits: 64 type: MT MCP cache:
    L2: 2 MiB
  Speed (MHz): avg: 3520 min/max: 400/3900 cores: 1: 3543 2: 3890 3: 2319
    4: 3513 5: 3709 6: 3650 7: 3792 8: 3749
Graphics:
  Device-1: Intel Iris Plus Graphics G7 driver: i915 v: kernel
  Device-2: NVIDIA GP108M [GeForce MX250] driver: nvidia v: 495.44
  Device-3: Chicony Integrated Camera type: USB driver: uvcvideo
  Display: x11 server: X.Org 1.21.1.2 driver: loaded: modesetting,nvidia
    unloaded: nouveau resolution: 1: 1920x1080~60Hz 2: 1920x1080~60Hz
  Message: Unable to show advanced data. Required tool glxinfo missing.
Audio:
  Device-1: Intel Ice Lake-LP Smart Sound Audio driver: sof-audio-pci
  Sound Server-1: ALSA v: k5.10.84-1-MANJARO running: yes
  Sound Server-2: PipeWire v: 0.3.40 running: yes
Network:
  Device-1: Intel Ice Lake-LP PCH CNVi WiFi driver: iwlwifi
  IF: wlp0s20f3 state: up mac: 04:33:c2:02:de:51
  Device-2: Realtek RTL8153 Gigabit Ethernet Adapter type: USB
    driver: r8152
  IF: enp0s13f0u1u4 state: up speed: 1000 Mbps duplex: full
    mac: 4c:e1:73:42:1f:6b
  IF-ID-1: pan1 state: down mac: 7a:5c:6a:f4:06:56
Bluetooth:
  Device-1: Intel AX201 Bluetooth type: USB driver: btusb
  Report: rfkill ID: hci0 state: up address: see --recommends
Drives:
  Local Storage: total: 1.86 TiB used: 317.16 GiB (16.7%)
  ID-1: /dev/nvme0n1 vendor: Micron model: MTFDHBA1T0TCK size: 953.87 GiB
  ID-2: /dev/sda type: USB vendor: Western Digital model: WD10EARX-00N0YB0
    size: 931.51 GiB
  ID-3: /dev/sdb type: USB vendor: Kingston model: DataTraveler 2.0
    size: 14.54 GiB
Partition:
  ID-1: / size: 57.9 GiB used: 35.88 GiB (62.0%) fs: ext4 dev:
/dev/nvme0n1p8
  ID-2: /boot/efi size: 259.5 MiB used: 114.1 MiB (44.0%) fs: vfat
    dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: partition size: 16.67 GiB used: 0 KiB (0.0%)
    dev: /dev/nvme0n1p9
Sensors:
  System Temperatures: cpu: 58.0 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 289 Uptime: 9m Memory: 15.2 GiB used: 2.19 GiB (14.4%)
  Shell: Bash inxi: 3.3.11



*






Am 30.12.21 um 13:58 schrieb Marco van de Voort via lazarus:


On 30-12-2021 10:15, Florian Klämpfl via lazarus wrote:


Linux uses different calling conventions, please check with the patch
below.


Linux is quite generous with the volatile registers, so luckily it
matches quite closely.

I first tried the approach of your patch, but [s] has problems on
windows, so would require ifdef on every "s"use, so I simply move [s]
to rcx

  {$ifndef Windows}
  // we can't use [s] as an alias for the pointer parameter, because
the non assembler procedure on Windows
 // changes that into a stack reference. FPC doesn't support non
volatile frame management for assembler procs like Delphi does.
  mov rcx,s // rdi
  mov edx,len   // rsi
  {$endif}

and the ifdeffing of the assembler procedure on linux vs inline asm
block on Windows. Then it works on Linux x86_64.

Funnily, our server AMD Athlon 200GE (Zen1, 3.2GHz?) nearly the exact
same timings as my i7-3770 3.4GHz

I did some other minor work after last post, so here is now the entire
program:



--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Marco van de Voort via lazarus


On 30-12-2021 10:15, Florian Klämpfl via lazarus wrote:


Linux uses different calling conventions, please check with the patch 
below.


Linux is quite generous with the volatile registers, so luckily it 
matches quite closely.


I first tried the approach of your patch, but [s] has problems on 
windows, so would require ifdef on every "s"use, so I simply move [s] to rcx


  {$ifndef Windows}
  // we can't use [s] as an alias for the pointer parameter, because 
the non assembler procedure on Windows
 // changes that into a stack reference. FPC doesn't support non 
volatile frame management for assembler procs like Delphi does.

  mov rcx,s // rdi
  mov edx,len   // rsi
  {$endif}

and the ifdeffing of the assembler procedure on linux vs inline asm 
block on Windows. Then it works on Linux x86_64.


Funnily, our server AMD Athlon 200GE (Zen1, 3.2GHz?) nearly the exact 
same timings as my i7-3770 3.4GHz


I did some other minor work after last post, so here is now the entire 
program://
// (C) 2021 Martin Friebe and Marco van de Voort.
// attempt to accelerate utf8lengthfast which is a length(s) in utf8 codepoints 
without integrity checking
//
// 4 versions.
// - Original,
// - with popcount and
// - the "add" variant that accumulates 127 iterations of ptrints and only adds
// the intermeidates outside that loop
// - a SSE2 version loosely inspired by the add variant combined with
//the core of an existing (branchless) binarization routine for the 
main loop.

{$mode objfpc}{$H+}
{$asmmode intel}
{$coperators on}
{define asmdebug}

uses SysUtils,StrUtils;

const   
  mask3   :  array[0..15] of byte  = (   $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0);

  mask4   :  array[0..15] of byte  = (   $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80);

  
  mask2   :  array[0..15] of byte  = (   $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1);

// Integer arguments are passed in registers RCX, RDX, R8, and R9.
// Floating point arguments are passed in XMM0L, XMM1L, XMM2L, and XMM3L.
// volatile: RAX, RCX, RDX, R8, R9, R10, R11
// nonvolatile RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered 
nonvolatile
// volatile xmm0-xmm3 (params) en xmm4,5
// https://msdn.microsoft.com/en-us/library/ms235286.aspx

{$ifdef asmdebug}
function asmutf8length(const s : pchar;len:integer,res:pbyte):int64;
{$else}
function asmutf8length(const s : pchar;len:integer):int64;{$ifndef 
Windows}assembler; nostackframe;{$endif}
{$endif}

{$ifdef Windows}
begin
{$endif}
 asm
  // tuning for short strings:
  // --
  {$ifndef Windows}
  // we can't use [s] as an alias for the pointer parameter, because the non 
assembler procedure on Windows
  // changes that into a stack reference. FPC doesn't support non volatile 
frame management for assembler procs like Delphi does.
  mov rcx,s // rdi
  mov edx,len   // rsi
  {$endif}
  test rax,rax
  je @theend
  cmp rdx,128 // threshold between long and short.
  jl @restbytes

  mov rax,rdx
  mov r10,rcx
  and r10,15
  mov r9,16
  sub r9,r10
  and r9,15
  test r9,r9
  je @nopreloop
  sub rdx,r9
@preloop:  // roughly 2 cycles per iteration on ivy bridge
  movzx r11d, byte [rcx]// unaligned bytes after sse loop
  mov r10,r11
  shr r10,7
  not r11
  shr r11,6
  and r10,r11
  sub rax,r10
  inc rcx
  dec r9
  jne @preloop
@nopreloop:

  mov r9,rdx
  and r9,15
  shr rdx,4
  pxor xmm5,xmm5   // always zero
  pxor xmm6,xmm6   // dword counts

  // using broadcast etc raises requirements? -> use constant loads.

  movdqu xmm1,[rip+mask3]
  movdqu xmm2,[rip+mask4]
  movdqu xmm3,[rip+mask2]

  test rdx,rdx
  je @restbytes

@outer:
  mov r10,127   // max iterations per inner loop
  cmp r10,rdx   // more or less left?
  jl @last  // more
  mov r10,rdx   // less
  @last:
  sub rdx,r10// iterations left - iterations to do

  pxor xmm4,xmm4

// process 127 iterations (limit of signed int8)

@inner:// +/- 2.2 cycles per iteration for 16 bytes on ivy 
bridge
  movdqu xmm0, [rcx]
  pand  xmm0,xmm1  // mask out top 2 bits
  pcmpeqb xmm0,xmm2// compare with $80. 
  pand  xmm0,xmm3  // change to $1 per byte.
  paddb  xmm4,xmm0 // add to cumulative

  add rcx,16
  dec r10
  jne @inner


  // SSSE3 vertical adds might help this, but increase CPU reqs.

  movdqa xmm0,xmm4

  PUNPCKLBW xmm0,xmm5   // zero extend to words
  PUNPCKHBW xmm4,xmm5
  paddw xmm0,xmm4

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Florian Klämpfl via lazarus

Am 30.12.21 um 08:23 schrieb Alexey Tor. via lazarus:


New unit test, with Martin's integrated. If I play with godbolt, Ryzen 
zen3 (ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, 
so I would like to see some benchmarks from various processors. Also 
from very old ones (P4 and Clawhammers) to test instruction sets. 

Project utf8lentest raised exception class 'External: SIGSEGV'.

  In file 'utf8lentest.lpr' at line 89:

movdqu xmm0, [rcx]

OS: Linux x64. CPU:


Linux uses different calling conventions, please check with the patch below.



    vendor_id = "GenuineIntel"
   (simple synth)  = Intel Core (unknown type) (Sandy Bridge 
D2/J1/Q0) {Sandy Bridge}, 32nm





15c15
< {define asmdebug}
---
> { $define asmdebug}
46c46
< function asmutf8length(const s : pchar;len:integer):int64;
---
> function asmutf8length(const s : 
pchar;len:int64):int64;assembler;nostackframe;

49d48
< begin
52c51
< mov r8,rdx
---
> mov r8,len
89c88
<   movdqu xmm0, [rcx]
---
>   movdqu xmm0, oword ptr [s]
95c94
<   add rcx,16
---
>   add s,16
128c127
<   movzx r8d, byte [rcx]// unaligned bytes after sse loop
---
>   movzx r8d, byte [s]  // unaligned bytes after sse loop
135c134
<   inc rcx
---
>   inc s
140c139
< end['xmm5','xmm6']; // volatile registers used.
---
>   ret

--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Alexey Tor. via lazarus


New unit test, with Martin's integrated. If I play with godbolt, Ryzen 
zen3 (ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, 
so I would like to see some benchmarks from various processors. Also 
from very old ones (P4 and Clawhammers) to test instruction sets. 

Project utf8lentest raised exception class 'External: SIGSEGV'.

 In file 'utf8lentest.lpr' at line 89:

movdqu xmm0, [rcx]

OS: Linux x64. CPU:

   vendor_id = "GenuineIntel"
  (simple synth)  = Intel Core (unknown type) (Sandy Bridge 
D2/J1/Q0) {Sandy Bridge}, 32nm


--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus


On 30-12-2021 01:29, Marco van de Voort via lazarus wrote:



 (P4 and Clawhammers) to test instruction sets.


64-bit supporting Pentium 4's of course, since this is a 64-bit only test.

Claw/Sledgehammer is the first iteration of Athlon64 and of the x86_64 
architecture as a whole and misses some instructions (like SSE3) that 
all other 64-bit capable CPUs have.



--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus


On 29-12-2021 00:00, Bart via lazarus wrote:

On Tue, Dec 28, 2021 at 11:35 PM Martin Frb via lazarus
 wrote:


I have a core I7-8600
The diff between the old code and popcnt is less significant.

old: 715
pop: 695

But there is a 3rd way, that is faster.
add: 610

Not surprising that you should come up with a faster solution.
IIRC you won both speed contests I had on the forum ;-)

Feel free to implement it in LazUtf8.


New unit test, with Martin's integrated. If I play with godbolt, Ryzen 
zen3 (ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, 
so I would like to see some benchmarks from various processors. Also 
from very old ones (P4 and Clawhammers) to test instruction sets.


I use unaligned loads which afaik on older ( pre Core 1st or 2nd 
generation) CPUs are costly. (because it loads two caches lines per time)



//
// (C) 2021 Martin Friebe and Marco van de Voort.
// attempt to accelerate utf8lengthfast which is a length(s) in utf8 codepoints 
without integrity checking
//
// 4 versions.
// - Original,
// - with popcount and
// - the "add" variant that accumulates 127 iterations of ptrints and only adds
// the intermeidates outside that loop
// - a SSE2 version loosely inspired by the add variant combined with
//the core of an existing (branchless) binarization routine for the 
main loop.

{$mode objfpc}{$H+}
{$asmmode intel}
{define asmdebug}

uses SysUtils,StrUtils;

const   
  mask3   :  array[0..15] of byte  = (   $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0);

  mask4   :  array[0..15] of byte  = (   $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80);

  
  mask2   :  array[0..15] of byte  = (   $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1);

// Integer arguments are passed in registers RCX, RDX, R8, and R9.
// Floating point arguments are passed in XMM0L, XMM1L, XMM2L, and XMM3L.
// volatile: RAX, RCX, RDX, R8, R9, R10, R11
// nonvolatile RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered 
nonvolatile
// volatile xmm0-xmm3 (params) en xmm4,5
// https://msdn.microsoft.com/en-us/library/ms235286.aspx

{$ifdef asmdebug}
function asmutf8length(const s : pchar;res:pbyte;len:integer):int64;
{$else}
function asmutf8length(const s : pchar;len:integer):int64;
{$endif}

begin
 asm
  {$ifndef asmdebug}
mov r8,rdx
  {$endif}

  // using broadcast etc raises requirements?
  mov rax,r8
  mov r9,r8
  // tuning for short strings:
  // --
  // test rax,rax
  // je @theend
  // cmp r9,128 // difference between long and short
  // jl @restbytes

  and r9,15
  shr r8,4
  pxor xmm5,xmm5   // always zero
  pxor xmm6,xmm6   // dword counts

  movdqu xmm1,[rip+mask3]
  movdqu xmm2,[rip+mask4]
  movdqu xmm3,[rip+mask2]

  test r8,r8
  je @restbytes

@outer:
  mov r10,127// max iterations
  cmp r10,r8// more or less left?
  jl @last   // more
  mov r10,r8 // less
  @last:
  sub r8,r10 // iterations left - iterations to do

  pxor xmm4,xmm4


@inner:
  movdqu xmm0, [rcx]
  pand  xmm0,xmm1  // mask out top 2 bits   
  pcmpeqb xmm0,xmm2// compare with $80. 
  pand  xmm0,xmm3  // change to $1 per byte.
  paddb  xmm4,xmm0 // add to cumulative

  add rcx,16
  dec r10
  jne @inner

  // process 127 iterations

  movdqa xmm0,xmm4

  PUNPCKLBW xmm0,xmm5   // zero extend to words
  PUNPCKHBW xmm4,xmm5
  paddw xmm0,xmm4   // add, now 8 16-bit words.

  movdqa xmm4,xmm0
  PUNPCKLWD xmm0,xmm5   // zero extend to dwords
  paddd xmm6,xmm0
  PUNPCKHWD  xmm4,xmm5
  paddd xmm6,xmm4   // add both to cumulative
  test r8,r8
  jne @outer

  MOVHLPS  xmm4,xmm6// move high 8 bytes to low (float->int 
penalty?)

  paddd xmm6,xmm4   // add both 2*dwords (high doesn't matter)

  pshufd xmm4,xmm6,1   // mov 2nd dword in xmm6 to first in xmm4
  paddd xmm6,xmm4  // add

  movd r8d,xmm6// to int alu reg
  sub  rax,r8  // subtract from length in bytes.
@restbytes:
  test r9,r9
  je @theend   // Done!
@restloop:
  movzx r8d, byte [rcx]// unaligned bytes after sse loop
  mov r10,r8
  shr r10,7
  not r8
  shr r8,6
  and r10,r8
  sub rax,r10
  inc rcx
  dec r9
  jne @restloop

@theend:
end['xmm5','xmm6']; // volatile registers used.
end;

function countmask(nx:int64):integer;

begin
   nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
   nx := (nx and 

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus



On 29-12-2021 16:30, Martin Frb via lazarus wrote:





Could you post full source if you haven't already? For a bit of 
benchmarking. I just wrote it from the top of my head, and I assumed 
5 instructions for 16-byte would win any time, but haven't verified 
anything yet.
I had it attached on my last mail. Attached it again here. (3rd 
procedure / "Utf8LengthAdd")


It is only 64bit for now. (And not cleaned up in any way).

Also changing "bc >> 7" and "bc and 127"
to "moddiv(bc, 255, full, remain)" might save a few more ms. But 
probably needs larger data to benchmark.


If you do work on this, feel free to integrate my code as the baseline 
for cpu without SSE.

Otherwise, it might be a bit until I get to it.


First results: (on an ageing i7-3770, trunk FPC -O4 -Cpcoreavx)

fst 781
fst 781
fst 797
fst 766
pop 656
pop 641
pop 640
pop 641
add 562
add 578
add 563
add 594
asm 297
asm 296
asm 297
asm 297

Asm is nearly fully functional and working, more importantly the 
remaining issues are constant time and single instruction work, 
shouldn't influence benchmarking for anything than the shortest sequences.


I'll finish up and post the whole shebang, since more eyes could help, 
I'm an asm amateur in some regards.

--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Martin Frb via lazarus

On 29/12/2021 13:42, Marco van de Voort via lazarus wrote:


On 29-12-2021 10:16, Martin Frb via lazarus wrote:


// Martin's routine that should be replaced by some punpkl magic, 
but it is too late now.


Why too late?


See datetime stamp.  02:10 AM. I don't know how it is with you Lazarus 
devels, but we FPC devels need our beauty sleep from time to time.

Oh, too late for the day.
I thought. Too late, that ship has sailed.




Could you post full source if you haven't already? For a bit of 
benchmarking. I just wrote it from the top of my head, and I assumed 5 
instructions for 16-byte would win any time, but haven't verified 
anything yet.
I had it attached on my last mail. Attached it again here. (3rd 
procedure / "Utf8LengthAdd")


It is only 64bit for now. (And not cleaned up in any way).

Also changing "bc >> 7" and "bc and 127"
to "moddiv(bc, 255, full, remain)" might save a few more ms. But 
probably needs larger data to benchmark.



If you do work on this, feel free to integrate my code as the baseline 
for cpu without SSE.

Otherwise, it might be a bit until I get to it.


Project1.pas
Description: application/unknown-content-type
-- 
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Florian Klämpfl via lazarus

Am 29.12.2021 um 13:42 schrieb Marco van de Voort via lazarus:


p.s. is there a workaround for git worktree to work on the same branch? E.g. 
trunk for 32-bit and trunk for 64-bit ? :-)



No. You cannot checkout the same branch in two worktrees. But you can do the following: create a new branch in both 
worktrees and check them out (main_32bit and main_64bit), do in both directories: work, commit, pull, checkout main, 
merge the branch, push:


cd git/fpc/main1
git checkout -b main_32bit
git worktree add ../main2 -b main_64bit





cd main1
git checkout main
git pull
git merge main_32bit
git push

cd ../main2
git checkout main
git pull
git merge main_64bit
git push
--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus


On 29-12-2021 10:16, Martin Frb via lazarus wrote:


// Martin's routine that should be replaced by some punpkl magic, but 
it is too late now.


Why too late?


See datetime stamp.  02:10 AM. I don't know how it is with you Lazarus 
devels, but we FPC devels need our beauty sleep from time to time.


So I'll be working on it some more today, possible things to do are:

-  fold countmask into the asm, and

-  maybe do another outer loop to process all bytes in one go.

-  special case for only a few bytes

- maybe align to 16-byte (easier/cheaper for old machines than all the 
movdqu).


- Maybe test the code on godbolt for performance

At the very least I hope the punpkl code for countmask will be more 
compact since it doesn't require literals.


There is a place for both. My routine works fine for cpu. (soon as a 
32 bit (and maybe 16 bit) variant are added).


Then for known cpu, special handling can be added.


Could you post full source if you haven't already? For a bit of 
benchmarking. I just wrote it from the top of my head, and I assumed 5 
instructions for 16-byte would win any time, but haven't verified 
anything yet.


For 32-bit this can also be done, but you'd need a procedure variable to 
test for SSE support, and only run this for long strings (>200-500 or so)


p.s. is there a workaround for git worktree to work on the same branch? 
E.g. trunk for 32-bit and trunk for 64-bit ? :-)


--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Martin Frb via lazarus

On 29/12/2021 02:10, Marco van de Voort via lazarus wrote:


On 28-12-2021 23:35, Martin Frb via lazarus wrote:



"nx" has a single "1" in each of the 8 bytes in a Qword (based on 
64bit).
If we regard each of this bytes as an entity of its own, then we can 
keep adding those "1".


I also was thinking in that direction, but more about how to optimize 
that loop using SSE2

good idea...



// Martin's routine that should be replaced by some punpkl magic, but 
it is too late now.


Why too late?

There is a place for both. My routine works fine for cpu. (soon as a 32 
bit (and maybe 16 bit) variant are added).


Then for known cpu, special handling can be added.
--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Marco van de Voort via lazarus


On 28-12-2021 23:35, Martin Frb via lazarus wrote:



"nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit).
If we regard each of this bytes as an entity of its own, then we can 
keep adding those "1".


I also was thinking in that direction, but more about how to optimize 
that loop using SSE2


Some simple masking achieves the same (an 1 for each byte that starts 
with %10 bits) in 5 instructions, the load inclusive.


Since 64-bit always supports SSE2, this could work:


{$mode objfpc}{$H+}
{$asmmode intel}

uses sysutils,strutils;

Type int128 = array[0..1] of int64;

const
 mask3   :  array[0..15] of byte  = ( $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0);

  mask4   :  array[0..15] of byte  = (   $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80);


  mask2   :  array[0..15] of byte  = ( $1,$1,$1,$1,
                         $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1);

function utf8length(const s : pchar;var res:int128;len:integer):integer;
// len is number of 16-byte counts to accumulate, max 255 I think
// stores 16 bytes worth of counts in "res"
begin
 asm
  movdqu xmm1,[rip+mask3] // unaligned is SSE3, doesn't work on 
original X86_64 clawhammer?

  movdqu xmm2,[rip+mask4]
  movdqu xmm3,[rip+mask2]
  pxor xmm4,xmm4

@lbl:
  movdqu xmm0, [rcx]
  pand  xmm0,xmm1  // mask out top 2 bits  ($C0)
  pcmpeqb xmm0,xmm2    // compare with $80. sets byte to  or 


  pand  xmm0,xmm3  // change to lsb (1/0) per byte only.
  paddb  xmm4,xmm0 // add to cumulative

  add rcx,16
  dec r8
  jne @lbl

  movdqu [rdx],xmm4

end; // no volatile registers used.
end;

function countmask(nx:int64):integer;
// Martin's routine that should be replaced by some punpkl magic, but it 
is too late now.

begin
   nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
   nx := (nx and $) + ((nx >> 16) and $);
   result := (nx and $) + ((nx >> 32) and 
$);

end;


// one of each pattern.
const pattern : array[0..3] of char = (chr(%11001001),chr(%10001001),
chr(%1001),chr(%01001001));

const testblocks = 5;

var s : string;
    i,j,cnt : integer;
    r : int128;

begin
  randomize;
  setlength(s,testblocks*16);
  // random string but keep a count of bytes with high value %10
  cnt:=0;
  for i:=0 to testblocks*16-1 do
    begin
  j:=random(4);
  if j=1 then inc(cnt);
  s[i+1]:=pattern[j];
    end;

  utf8length(pchar(s),r,testblocks+1);

  writeln(cnt,' = ',countmask(r[0])+countmask(r[1]));
//  writeln(inttohex(r[0],16));
//  writeln(inttohex(r[1],16));

end.



--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 11:35 PM Martin Frb via lazarus
 wrote:

> I have a core I7-8600
> The diff between the old code and popcnt is less significant.
>
> old: 715
> pop: 695
>
> But there is a 3rd way, that is faster.
> add: 610

Not surprising that you should come up with a faster solution.
IIRC you won both speed contests I had on the forum ;-)

Feel free to implement it in LazUtf8.
-- 
Bart
-- 
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


[Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Martin Frb via lazarus

On 28/12/2021 15:50, Bart via lazarus wrote:

On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus
 wrote:


On what machine did you test? The settings if for the generated code,
but the actual processor determines the effective speed.

I have a Intel i5 7th generation on my Win10-64 laptop from approx.
2017 (so, it's really old for more modern folks than me).

Compiled for 32-bit:
With -CpCOREI
Unsigned version with multiplication: 1359
Unsigned version with PopCnt: 1282



I have a core I7-8600
The diff between the old code and popcnt is less significant.

old: 715
pop: 695

But there is a 3rd way, that is faster.
add: 610

"nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit).
If we regard each of this bytes as an entity of its own, then we can 
keep adding those "1".


We could add the 1 of up to 255 iteration, before an overflow can happen.
The example only does 128, as this avoids the "div" and "mod" operations.

The full routine / incl benchmark for all 3 versions is attached

For 64 bit:

  bc := (ByteCount-cnt) div sizeof(PtrInt);
  for j := 1 to bc >> 7 do begin
    nx := 0;
    for i := 0 to 127 do
    begin
  // Count bytes which are NOT the first byte of a character.
  nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
  inc(pnx);
    end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
    nx := (nx and $) + ((nx >> 16) and $);
    nx := (nx and $) + ((nx >> 32) and $);
    Result := Result + nx;
  end;


  if (bc and 127) > 0 then begin
  nx := 0;
  for i := 1 to bc and 127 do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    inc(pnx);
  end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
    nx := (nx and $) + ((nx >> 16) and $);
    nx := (nx and $) + ((nx >> 32) and $);
    Result := Result + nx;
  end;

program Project1;
{$mode objfpc}{$H+}

uses SysUtils;

function UTF8LengthFast(p: PChar; ByteCount: PtrInt): PtrInt;
const
{$ifdef CPU32}
  ONEMASK   =$01010101;
  EIGHTYMASK=$80808080;
{$endif}
{$ifdef CPU64}
  ONEMASK   =$0101010101010101;
  EIGHTYMASK=$8080808080808080;
{$endif}
var
  pnx: PPtrInt absolute p; // To get contents of text in PtrInt blocks. x 
refers to 32 or 64 bits
  pn8: pint8 absolute pnx; // To read text as Int8 in the initial and final 
loops
  ix: PtrInt absolute pnx; // To read text as PtrInt in the block loop
  nx: PtrInt;  // values processed in block loop
  i,cnt,e: PtrInt;
begin
  Result := 0;
  e := ix+ByteCount; // End marker
  // Handle any initial misaligned bytes.
  cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
  if cnt>ByteCount then
cnt := ByteCount;
  for i := 1 to cnt do
  begin
// Is this byte NOT the first byte of a character?
Result += (pn8^ shr 7) and ((not pn8^) shr 6);
inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  begin
// Count bytes which are NOT the first byte of a character.
nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
{$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic 
overflow.
Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
{$pop}
inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ixByteCount then
cnt := ByteCount;
  for i := 1 to cnt do
  begin
// Is this byte NOT the first byte of a character?
Result += (pn8^ shr 7) and ((not pn8^) shr 6);
inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  begin
// Count bytes which are NOT the first byte of a character.
nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
{$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic 
overflow.
//Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
Result += PopCnt(qword(nx));
{$pop}
inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ixByteCount then
cnt := ByteCount;
  for i := 1 to cnt do
  begin
// Is this byte NOT the first byte of a character?
Result += (pn8^ shr 7) and ((not pn8^) shr 6);
inc(pn8);
  end;
  // Handle complete blocks

  bc := (ByteCount-cnt) div sizeof(PtrInt);
  for j := 1 to bc >> 7 do begin
nx := 0;
for i := 0 to 127 do
begin
  // Count bytes which are NOT the first byte of a character.
  nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
  inc(pnx);
end;
nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
nx := (nx and $) + ((nx >> 16) and $);
nx := (nx and $) + ((nx >> 32) and $);
Result := Result + nx;
  end;


  if (bc and 127) > 0 then begin
  nx := 0;
  for i := 1 to bc