Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 28-12-2021 23:35, Martin Frb via lazarus wrote:
> "nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit).
> If we regard each of this bytes as an entity of its own, then we can
> keep adding those "1".

I was also thinking in that direction, but more about how to optimize that loop using SSE2.

Some simple masking achieves the same (a 1 for each byte that starts with the bits %10) in 5 instructions, the load included. Since 64-bit always supports SSE2, this could work:

{$mode objfpc}{$H+}
{$asmmode intel}
uses sysutils,strutils;

Type int128 = array[0..1] of int64;

const
  mask3 : array[0..15] of byte = ( $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0,
                                   $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0);
  mask4 : array[0..15] of byte = ( $80,$80,$80,$80, $80,$80,$80,$80,
                                   $80,$80,$80,$80, $80,$80,$80,$80);
  mask2 : array[0..15] of byte = ( $1,$1,$1,$1, $1,$1,$1,$1,
                                   $1,$1,$1,$1, $1,$1,$1,$1);

function utf8length(const s : pchar;var res:int128;len:integer):integer;
// len is number of 16-byte counts to accumulate, max 255 I think
// stores 16 bytes worth of counts in "res"
begin
  asm
    movdqu xmm1,[rip+mask3]   // unaligned is SSE3, doesn't work on original X86_64 clawhammer?
    movdqu xmm2,[rip+mask4]
    movdqu xmm3,[rip+mask2]
    pxor   xmm4,xmm4
  @lbl:
    movdqu xmm0, [rcx]
    pand   xmm0,xmm1          // mask out top 2 bits ($C0)
    pcmpeqb xmm0,xmm2         // compare with $80. sets byte to $FF or $00
    pand   xmm0,xmm3          // change to lsb (1/0) per byte only.
    paddb  xmm4,xmm0          // add to cumulative
    add    rcx,16
    dec    r8
    jne    @lbl
    movdqu [rdx],xmm4
  end; // no volatile registers used.
end;

function countmask(nx:int64):integer;
// Martin's routine that should be replaced by some punpkl magic, but it is too late now.
begin
  nx := (nx and $00FF00FF00FF00FF) + ((nx >> 8) and $00FF00FF00FF00FF);
  nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
  result := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);
end;

// one of each pattern.
const pattern : array[0..3] of char = (chr(%11001001),chr(%10001001),
                                       chr(%1001),chr(%01001001));
const testblocks = 5;

var s : string;
    i,j,cnt : integer;
    r : int128;
begin
  randomize;
  setlength(s,testblocks*16);
  // random string but keep a count of bytes with high value %10
  cnt:=0;
  for i:=0 to testblocks*16-1 do
    begin
      j:=random(4);
      if j=1 then
        inc(cnt);
      s[i+1]:=pattern[j];
    end;
  utf8length(pchar(s),r,testblocks);
  writeln(cnt,' = ',countmask(r[0])+countmask(r[1]));
  // writeln(inttohex(r[0],16));
  // writeln(inttohex(r[1],16));
end.

--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus
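For comparison, the test that the SSE2 loop applies to 16 bytes at a time can also be written as a plain scalar loop. This is only a reference sketch, not code from the thread: the program name, `CountContinuationBytes` and the sample bytes are made up for illustration, and it assumes the input is valid UTF-8.

```pascal
program ContinuationCount;
{$mode objfpc}{$H+}

// Counts bytes of the form %10xxxxxx (UTF-8 continuation bytes) one at a
// time, using the same "and $C0, compare with $80" test as the SSE2 loop.
function CountContinuationBytes(p: PByte; len: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 0 to len - 1 do
    if (p[i] and $C0) = $80 then  // keep the top two bits, compare with %10
      Inc(Result);
end;

const
  // Hypothetical sample: 'a', U+00E9 (2 bytes), U+20AC (3 bytes).
  Sample: array[0..5] of Byte = ($61, $C3, $A9, $E2, $82, $AC);
var
  cont: SizeInt;
begin
  cont := CountContinuationBytes(@Sample[0], Length(Sample));
  WriteLn('continuation bytes: ', cont);
  // Code points = total bytes minus continuation bytes.
  WriteLn('code points: ', Length(Sample) - cont);
end.
```

A routine like this is slow, but it is handy as the "known good" answer when checking a vectorized or word-at-a-time version against random input.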
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On Tue, Dec 28, 2021 at 11:35 PM Martin Frb via lazarus wrote:
> I have a core I7-8600
> The diff between the old code and popcnt is less significant.
>
> old: 715
> pop: 695
>
> But there is a 3rd way, that is faster.
> add: 610

Not surprising that you should come up with a faster solution. IIRC you won both speed contests I had on the forum ;-)

Feel free to implement it in LazUtf8.

--
Bart
Re: [Lazarus] fpc bug with M1 [[was: Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 28/12/2021 23:18, Noel Duffy via lazarus wrote:
> The assembler produced by 3.2.2 looks like this:
>
> # [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
>     ldr   x0,[sp]
>     ldrsb w0,[x0]
>     mvn   w0,w0

mvn => bitwise not. And that applies to the whole register.

So I guess "eor w0,w0,#255" is meant to be some optimization, but it comes with a bug.
[Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 28/12/2021 15:50, Bart via lazarus wrote:
> On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus wrote:
>> On what machine did you test? The settings are for the generated code,
>> but the actual processor determines the effective speed.
>
> I have an Intel i5 7th generation on my Win10-64 laptop from approx.
> 2017 (so, it's really old for more modern folks than me).
>
> Compiled for 32-bit, with -CpCOREI:
> Unsigned version with multiplication: 1359
> Unsigned version with PopCnt: 1282

I have a core I7-8600.
The diff between the old code and popcnt is less significant.

old: 715
pop: 695

But there is a 3rd way, that is faster.

add: 610

"nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit).
If we regard each of these bytes as an entity of its own, then we can keep adding those "1".
We could add the 1s of up to 255 iterations before an overflow can happen.
The example only does 128, as this avoids the "div" and "mod" operations.

The full routine, incl. benchmark for all 3 versions, is attached.

For 64 bit:

  bc := (ByteCount-cnt) div sizeof(PtrInt);
  for j := 1 to bc >> 7 do
  begin
    nx := 0;
    for i := 0 to 127 do
    begin
      // Count bytes which are NOT the first byte of a character.
      nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
      inc(pnx);
    end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >> 8) and $00FF00FF00FF00FF);
    nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
    nx := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);
    Result := Result + nx;
  end;
  if (bc and 127) > 0 then
  begin
    nx := 0;
    for i := 1 to bc and 127 do
    begin
      // Count bytes which are NOT the first byte of a character.
      nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
      inc(pnx);
    end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >> 8) and $00FF00FF00FF00FF);
    nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
    nx := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);
    Result := Result + nx;
  end;

program Project1;
{$mode objfpc}{$H+}
uses
  SysUtils;

function UTF8LengthFast(p: PChar; ByteCount: PtrInt): PtrInt;
const
{$ifdef CPU32}
  ONEMASK   =$01010101;
  EIGHTYMASK=$80808080;
{$endif}
{$ifdef CPU64}
  ONEMASK   =$0101010101010101;
  EIGHTYMASK=$8080808080808080;
{$endif}
var
  pnx: PPtrInt absolute p; // To get contents of text in PtrInt blocks. x refers to 32 or 64 bits
  pn8: pint8 absolute pnx; // To read text as Int8 in the initial and final loops
  ix: PtrInt absolute pnx; // To read text as PtrInt in the block loop
  nx: PtrInt;              // values processed in block loop
  i,cnt,e: PtrInt;
begin
  Result := 0;
  e := ix+ByteCount; // End marker
  // Handle any initial misaligned bytes.
  cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
[...]
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    //Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    Result += PopCnt(qword(nx));
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
[...]
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  bc := (ByteCount-cnt) div sizeof(PtrInt);
  for j := 1 to bc >> 7 do
  begin
    nx := 0;
    for i := 0 to 127 do
    begin
      // Count bytes which are NOT the first byte of a character.
      nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
      inc(pnx);
    end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >> 8) and $00FF00FF00FF00FF);
    nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
    nx := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);
    Result := Result + nx;
  end;
  if (bc and 127) > 0 then
  begin
    nx := 0;
    for i := 1 to bc and
[...]
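The "keep adding, fold once at the end" idea can be checked in isolation. The following is a minimal sketch, not the attached benchmark: `FoldLanes` and the program name are hypothetical, and it assumes a 64-bit word with eight byte-wide counters, each receiving at most 1 per iteration.

```pascal
program LaneAdd;
{$mode objfpc}{$H+}

const
  ONEMASK = QWord($0101010101010101);

// Folds the eight byte-wide counters packed in a QWord into one sum.
function FoldLanes(nx: QWord): QWord;
begin
  // 8 x 8-bit counters -> 4 x 16-bit sums
  nx := (nx and $00FF00FF00FF00FF) + ((nx shr 8) and $00FF00FF00FF00FF);
  // 4 x 16-bit sums -> 2 x 32-bit sums
  nx := (nx and $0000FFFF0000FFFF) + ((nx shr 16) and $0000FFFF0000FFFF);
  // 2 x 32-bit sums -> final total
  Result := (nx and $00000000FFFFFFFF) + (nx shr 32);
end;

var
  acc: QWord;
  i: Integer;
begin
  acc := 0;
  // Worst case: every lane gets a 1 on every iteration.
  // 255 iterations still fit into one byte per lane.
  for i := 1 to 255 do
    acc := acc + ONEMASK;
  WriteLn(FoldLanes(acc));  // 8 lanes * 255 = 2040
end.
```

Running the inner loop 128 times instead of 255, as in the message above, stays well inside that limit and lets the block count be split with a shift and a mask rather than "div" and "mod".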
Re: [Lazarus] fpc bug with M1 [[was: Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 29/12/21 10:47, Martin Frb via lazarus wrote:
> On 28/12/2021 22:05, Noel Duffy via lazarus wrote:
>> On 29/12/21 01:26, Bart via lazarus wrote:
>>> fpc -al ulen.pas
>>> This will produce the file ulen.s
>>> You can attach or copy that here.
>>
>> File is attached.
>
> Thanks.
>
> And I think there is a bug in FPC.
>
> This is the signed version:
>
> # [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
>     ldr   x0,[sp]
>     ldrsb w0,[x0]      # < sign extend to a 32bit value (32bit register).
>     eor   w0,w0,#255   # < But only "not" the lowest 8 bit. That is wrong. The calculation uses 32 bit at this point
>     lsr   w0,w0,#6
>     ldr   x2,[sp]
>     ldrsb w2,[x2]
>     lsr   w2,w2,#7
>     and   w0,w2,w0     # << here the full 32 bit are used.
>
> Had the full 32 bits been "not"ed, then the upper 24 bits would be 0
> (because they had been sign extended to 1).
> This would mask all the 1s that were sign extended in w2.
>
> Had the char been < 128 (high bit = 0), then w0 would have the upper
> 24 bits = 1, but w2 would have them 0.
>
> And that is why the code worked, even with signed values.
> (signed values were still a bad idea)

Interesting. As I noted in my message with the ulen.s attached, I tested and found the problem is not present in fpc 3.2.2. That's the version of fpc packaged by Lazarus, and I used that to compile fpc 3.3.1.

The assembler produced by 3.2.2 looks like this:

# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    ldr   x0,[sp]
    ldrsb w0,[x0]
    mvn   w0,w0
    sxtb  w0,w0
    lsr   w1,w0,#6
    ldr   x0,[sp]
    ldrsb w0,[x0]
    lsr   w0,w0,#7
    and   w0,w0,w1
    sxtw  x0,w0
    ldr   x1,[sp, #16]
    add   x0,x1,x0
    str   x0,[sp, #16]
[Lazarus] fpc bug with M1 [[was: Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 28/12/2021 22:05, Noel Duffy via lazarus wrote:
> On 29/12/21 01:26, Bart via lazarus wrote:
>> fpc -al ulen.pas
>> This will produce the file ulen.s
>> You can attach or copy that here.
>
> File is attached.

Thanks.

And I think there is a bug in FPC.

This is the signed version:

# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    ldr   x0,[sp]
    ldrsb w0,[x0]      # < sign extend to a 32bit value (32bit register).
    eor   w0,w0,#255   # < But only "not" the lowest 8 bit. That is wrong. The calculation uses 32 bit at this point
    lsr   w0,w0,#6
    ldr   x2,[sp]
    ldrsb w2,[x2]
    lsr   w2,w2,#7
    and   w0,w2,w0     # << here the full 32 bit are used.

Had the full 32 bits been "not"ed, then the upper 24 bits would be 0 (because they had been sign extended to 1).
This would mask all the 1s that were sign extended in w2.

Had the char been < 128 (high bit = 0), then w0 would have the upper 24 bits = 1, but w2 would have them 0.

And that is why the code worked, even with signed values. (signed values were still a bad idea)
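The difference between the two instruction sequences can be reproduced with ordinary 32-bit arithmetic. The sketch below is a model, not compiler output: `LoadSignExtended`, `WithEor` and `WithMvn` are hypothetical helpers that imitate `ldrsb`, the `eor #255` sequence from the 3.3.1 listing, and the `mvn` + `sxtb` sequence from the 3.2.2 listing on a `DWord`.

```pascal
program SignExtendNot;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Models "ldrsb w0,[x0]": load a byte and sign-extend it to 32 bits.
function LoadSignExtended(b: Byte): DWord;
begin
  if (b and $80) <> 0 then
    Result := DWord($FFFFFF00) or b
  else
    Result := b;
end;

// Models the 3.3.1 sequence: eor w0,w0,#255 / lsr #6, and-ed with (w2 lsr #7).
function WithEor(b: Byte): DWord;
var
  w: DWord;
begin
  w := LoadSignExtended(b);
  Result := (w shr 7) and ((w xor $FF) shr 6);  // only the low 8 bits are inverted
end;

// Models the 3.2.2 sequence: mvn w0,w0 / sxtb w0,w0 / lsr #6, and-ed with (w0 lsr #7).
function WithMvn(b: Byte): DWord;
var
  w: DWord;
begin
  w := LoadSignExtended(b);
  // Full 32-bit "not", then sign-extend the low byte again (sxtb).
  Result := (w shr 7) and (LoadSignExtended(Byte((not w) and $FF)) shr 6);
end;

const
  // ASCII byte, lead byte, continuation byte
  Samples: array[0..2] of Byte = ($61, $C3, $A9);
var
  b: Byte;
begin
  for b in Samples do
    WriteLn('$', IntToHex(Int64(b), 2),
            '  eor: $', IntToHex(Int64(WithEor(b)), 8),
            '  mvn: $', IntToHex(Int64(WithMvn(b)), 8));
end.
```

In this model the `mvn` variant yields 1 only for the continuation byte `$A9`, while the `eor` variant leaves the sign-extended upper bits set and produces large values for both `$C3` and `$A9`, which is the kind of wildly wrong total reported in the thread.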
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On 29/12/21 01:26, Bart via lazarus wrote:
> fpc -al ulen.pas
> This will produce the file ulen.s
> You can attach or copy that here.

File is attached.

The output from running this program is:

% ./ulen
Signed version
Len = -100663283

Unsigned version
Len = 1

To add another wrinkle to this, the signed and unsigned versions of the code both work if compiled with fpc 3.2.2. It's only when compiled with 3.3.1 that I get an incorrect result.

# Begin asmlist al_procedures

.text
    .align 4
.globl _P$ULEN_$$_UTF8LENGTHFAST_SIGNED$PCHAR$INT64$$INT64
_P$ULEN_$$_UTF8LENGTHFAST_SIGNED$PCHAR$INT64$$INT64:
# [ulen.pas]
# [28] begin
    stp   x29,x30,[sp, #-16]!
    mov   x29,sp
    sub   sp,sp,#64
# Var p located at sp+0, size=OS_64
# Var ByteCount located at sp+8, size=OS_S64
# Var $result located at sp+16, size=OS_S64
# Var nx located at sp+24, size=OS_S64
# Var i located at sp+32, size=OS_S64
# Var cnt located at sp+40, size=OS_S64
# Var e located at sp+48, size=OS_S64
    str   x0,[sp]
    str   x1,[sp, #8]
# [29] Result := 0;
    str   xzr,[sp, #16]
# [30] e := ix+ByteCount; // End marker
    ldr   x0,[sp]
    ldr   x1,[sp, #8]
    add   x0,x1,x0
    str   x0,[sp, #48]
# [32] cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
    ldr   x0,[sp]
    sub   x0,x0,#1
    mvn   x0,x0
    and   x0,x0,#7
    str   x0,[sp, #40]
# [33] if cnt>ByteCount then
    ldr   x0,[sp, #40]
    ldr   x1,[sp, #8]
    cmp   x0,x1
    b.gt  Lj7
    b     Lj8
Lj7:
# [34] cnt := ByteCount;
    ldr   x0,[sp, #8]
    str   x0,[sp, #40]
Lj8:
# [35] for i := 1 to cnt do
    ldr   x1,[sp, #40]
    cmp   x1,#1
    b.ge  Lj9
    b     Lj10
Lj9:
    str   xzr,[sp, #32]
Lj11:
    ldr   x0,[sp, #32]
    add   x0,x0,#1
    str   x0,[sp, #32]
# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    ldr   x0,[sp]
    ldrsb w0,[x0]
    eor   w0,w0,#255
    lsr   w0,w0,#6
    ldr   x2,[sp]
    ldrsb w2,[x2]
    lsr   w2,w2,#7
    and   w0,w2,w0
    sxtw  x0,w0
    ldr   x2,[sp, #16]
    add   x0,x2,x0
    str   x0,[sp, #16]
# [44] inc(pn8);
    ldr   x0,[sp]
    add   x0,x0,#1
    str   x0,[sp]
    ldr   x0,[sp, #32]
    cmp   x1,x0
    b.le  Lj13
    b     Lj11
Lj13:
Lj10:
# [47] for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
    ldr   x0,[sp, #8]
    ldr   x1,[sp, #40]
    sub   x0,x0,x1
    asr   x1,x0,#63
    add   x0,x0,x1,lsr #61
    asr   x0,x0,#3
    cmp   x0,#1
    b.ge  Lj14
    b     Lj15
Lj14:
    str   xzr,[sp, #32]
Lj16:
    ldr   x1,[sp, #32]
    add   x1,x1,#1
    str   x1,[sp, #32]
# [50] nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    ldr   x1,[sp]
    ldr   x1,[x1]
    and   x1,x1,#0x8080808080808080
    lsr   x1,x1,#7
    ldr   x2,[sp]
    ldr   x2,[x2]
    mvn   x2,x2
    lsr   x2,x2,#6
    and   x1,x2,x1
    str   x1,[sp, #24]
# [52] Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    ldr   x1,[sp, #24]
    mov   x2,#72340172838076673
    mul   x1,x1,x2
    lsr   x1,x1,#56
    ldr   x2,[sp, #16]
    add   x1,x2,x1
    str   x1,[sp, #16]
# [54] inc(pnx);
    ldr   x1,[sp]
    add   x1,x1,#8
    str   x1,[sp]
    ldr   x1,[sp, #32]
    cmp   x0,x1
    b.le  Lj18
    b     Lj16
Lj18:
Lj15:
# [57] while ix<e do
[...]
# [...] if cnt>ByteCount then
    ldr   x0,[sp, #40]
    ldr   x1,[sp, #8]
    cmp   x0,x1
    b.gt  Lj24
    b     Lj25
Lj24:
# [94] cnt := ByteCount;
    ldr   x0,[sp, #8]
    str   x0,[sp, #40]
Lj25:
# [95] for i := 1 to cnt do
    ldr   x1,[sp, #40]
    cmp   x1,#1
    b.ge  Lj26
    b     Lj27
Lj26:
    str   xzr,[sp, #32]
Lj28:
    ldr   x0,[sp, #32]
    add   x0,x0,#1
    str   x0,[sp, #32]
# [103] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    ldr   x0,[sp]
    ldrb  w0,[x0]
    eor   w0,w0,#255
    lsr   w0,w0,#6
    ldr   x2,[sp]
    ldrb  w2,[x2]
    lsr   w2,w2,#7
    and   w0,w2,w0
    ldr   x2,[sp, #16]
    add   x0,x2,x0
    str   x0,[sp, #16]
# [104] inc(pn8);
    ldr   x0,[sp]
    add   x0,x0,#1
    str   x0,[sp]
    ldr   x0,[sp, #32]
    cmp   x1,x0
    b.le  Lj30
    b     Lj28
Lj30:
Lj27:
# [107] for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
    ldr   x0,[sp, #8]
    ldr   x1,[sp, #40]
    sub   x0,x0,x1
    asr   x1,x0,#63
    add   x0,x0,x1,lsr #61
    asr   x0,x0,#3
    cmp   x0,#1
    b.ge  Lj31
    b     Lj32
Lj31:
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On 27-12-2021 22:10, Noel Duffy via lazarus wrote:
> On 28/12/21 01:47, Juha Manninen via lazarus wrote:
>> On Mon, Dec 27, 2021 at 1:44 AM Noel Duffy via lazarus <lazarus@lists.lazarus-ide.org> wrote:
>>> I need some help getting to the root of a problem with incorrect
>>> results on Apple hardware (M1, aarch64) for the function
>>> UTF8LengthFast in lazutf8. On MacOS, when given a string containing
>>> one or more UTF8 characters, UTF8LengthFast returns wildly incorrect
>>> results. On Fedora, the function returns the correct answer.
>>
>> You mean both MacOS and Fedora run on the same aarch64 CPU?
>
> Oh no, Fedora runs on an Intel x86_64. I don't think there's a Fedora
> that runs on aarch64. I included the results from Fedora just to show
> that the code does work in some places.

I've tried on pine64 aarch64 linux little-endian fpc 3.2.0 and it showed the correct results using your example code in another mail.

Marc

>> It must be a Big endian / Little endian issue. IIRC it can be adjusted
>> in ARM CPUs. Why do MacOS and Linux use a different setting there?
>
> I have no idea. Well, as Florian said above, the M1 is little-endian. So
> it doesn't appear to be an endian issue.
>
> https://developer.apple.com/documentation/apple-silicon/porting-your-macos-apps-to-apple-silicon
[Lazarus] Improved FPC JSON-RPC support
Hello,

Thanks to the magic of RTTI and Invoke(), creating a JSON-RPC server has just become significantly easier!

Given an interface definition:

  IMyOtherInterface = interface ['{4D52BEE3-F709-44AC-BD31-870CBFF44632}']
    Function SayHello : string;
    function Echo(args : TStringArray) : String;
  end;

Creating a JSON-RPC server for this is now as simple as:

// Create a class that implements the interface
Type
  TIntf2Impl = class(TInterfacedObject, IMyOtherInterface)
  public
    function Echo(args: TStringArray): String;
    function SayHello: string;
  end;

function TIntf2Impl.Echo(args: TStringArray): String;

var
  S : String;

begin
  Result:='';
  For S in Args do
    begin
    if Result<>'' then
      Result:=Result+' ';
    Result:=Result+S;
    end
end;

function TIntf2Impl.SayHello: string;
begin
  Result:='Hello, World!';
end;

// Register the class using an interface factory:

Function GetMyOtherInterface(Const aName : string) : IInterface;

begin
  Result:=TIntf2Impl.Create as IInterface;
end;

initialization
  RTTIJSONRPCRegistry.Add(TypeInfo(IMyOtherInterface),@GetMyOtherInterface,'Service2');
end.

And calling it from a client program is even simpler:

var
  client: IMyOtherInterface;
  aRPCClient: TFPRPCClient;

begin
  // Register interface with name 'Service2'
  RPCServiceRegistry.Add(TypeInfo(IMyOtherInterface),'Service2');
  // Create client
  aRPCClient:=TFPRPCClient.Create(Nil);
  aRPCClient.BaseURL:='http://localhost:8080/RPC';
  // Just typecast the client to the desired interface
  Client:=aRPCClient as IMyotherInterface;
  // or explicitly create using the registered service name
  // Client:=RPC.Specialize CreateService<IMyOtherInterface>('Service2');
  Writeln('Sayhello: ',client.SayHello);
  Writeln('Sayhello: ',client.Echo(['This','is','Sparta']));
end.

The service can also be consumed by pas2js.

The support for various argument types is still limited, but this will improve soon. Currently simple types and arrays of simple types are supported. Proper support for records and other structured types needs extended RTTI...

Example server and client programs have been committed to the FPC repo (packages/fcl-web/examples/jsonrpc/rtti).

With many thanks to Sven Barth for putting me on the right track...

Enjoy!

Michael.
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 3:56 PM Florian Klämpfl via lazarus wrote:
> Crash at run time with sigill. Popcnt was introduced with Nehalem,
> so >10 years ago.

Thanks.
Do any other CPUs support something like this?

--
Bart
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On 28.12.2021 at 15:50, Bart via lazarus wrote:
> On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus wrote:
>> On what machine did you test? The settings are for the generated code,
>> but the actual processor determines the effective speed.
>
> I have an Intel i5 7th generation on my Win10-64 laptop from approx.
> 2017 (so, it's really old for more modern folks than me).
>
> Compiled for 32-bit, with -CpCOREI:
> Unsigned version with multiplication: 1359
> Unsigned version with PopCnt: 1282
>
> Compiled for 32-bit, with -CpCOREAVX2:
> Unsigned version with multiplication: 1312
> Unsigned version with PopCnt: 1297
>
> Compiled for 32-bit, no -Cp switch:
> Unsigned version with multiplication: 1329
> Unsigned version with PopCnt: 3546
>
> B.t.w. what happens if I compile for e.g. CoreAVX2 but my processor
> does not support that instruction set? Will the compilation/build fail,
> or will the executable just error out?

Crash at run time with sigill. Popcnt was introduced with Nehalem, so >10 years ago.
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus wrote:
> On what machine did you test? The settings are for the generated code,
> but the actual processor determines the effective speed.

I have an Intel i5 7th generation on my Win10-64 laptop from approx. 2017 (so, it's really old for more modern folks than me).

Compiled for 32-bit, with -CpCOREI:
Unsigned version with multiplication: 1359
Unsigned version with PopCnt: 1282

Compiled for 32-bit, with -CpCOREAVX2:
Unsigned version with multiplication: 1312
Unsigned version with PopCnt: 1297

Compiled for 32-bit, no -Cp switch:
Unsigned version with multiplication: 1329
Unsigned version with PopCnt: 3546

B.t.w. what happens if I compile for e.g. CoreAVX2 but my processor does not support that instruction set? Will the compilation/build fail, or will the executable just error out?

--
Bart
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 3:31 PM Florian Klämpfl via lazarus wrote:
> For X86, check for the define CPUX86_HAS_POPCNT (compile time!).

Thanks.

--
Bart
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On 12/28/2021 at 3:01 PM, Bart via lazarus wrote:
>> -Cpcoreavx for core 3000 series and higher
>
> Thanks for that.
>
> Up to PENTIUMM: PopCnt slower
> COREI: approximately equally fast
> COREAVX: PopCnt slightly faster
> COREAVX2: PopCnt slightly faster

On what machine did you test? The settings are for the generated code, but the actual processor determines the effective speed.

> Most likely not worth bothering.
> In code can we check (at compile time) for which instruction set the
> code was compiled?

Not that I know. Moreover, with AMD etc. there would be no simple "x or greater" kind of criterion other than having popcnt.
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On 28.12.2021 at 15:01, Bart via lazarus wrote:
> On Tue, Dec 28, 2021 at 2:46 PM Marco van de Voort via lazarus wrote:
>> You need an appropriate minimal CPU with -Cp
>>
>> Try e.g. -Cpcoreavx for core 3000 series and higher
>
> Thanks for that.
>
> Up to PENTIUMM: PopCnt slower
> COREI: approximately equally fast
> COREAVX: PopCnt slightly faster
> COREAVX2: PopCnt slightly faster
>
> Most likely not worth bothering.
> In code can we check (at compile time) for which instruction set the
> code was compiled?

For X86, check for the define CPUX86_HAS_POPCNT (compile time!).
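A compile-time switch along these lines could look like the sketch below. It is an illustration only, not the LazUtf8 code: `CountFlags` and the program name are hypothetical, and it relies on the `CPUX86_HAS_POPCNT` define mentioned above being set when the selected `-Cp` target has the instruction. The fallback is the multiplication with ONEMASK used elsewhere in the thread, shown for 64-bit words.

```pascal
program PopCntSelect;
{$mode objfpc}{$H+}

const
  ONEMASK = QWord($0101010101010101);

// nx must hold at most one set bit per byte, in the lowest bit position,
// as produced by the "(x and EIGHTYMASK) shr 7 and (not x) shr 6" step.
function CountFlags(nx: QWord): QWord;
begin
{$ifdef CPUX86_HAS_POPCNT}
  // The target CPU is known to have the instruction.
  Result := PopCnt(nx);
{$else}
  // Portable fallback: the multiplication sums all bytes into the top byte.
  {$push}{$Q-}{$R-}  // the product wraps around on purpose
  Result := (nx * ONEMASK) shr 56;
  {$pop}
{$endif}
end;

begin
  // Four of the eight bytes carry a flag.
  WriteLn(CountFlags(QWord($0001010000010001)));  // 4
end.
```

Because the choice is made when compiling, a binary built without a suitable `-Cp` setting simply keeps the multiplication and never reaches the slow generic PopCnt path measured earlier in the thread.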
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 2:46 PM Marco van de Voort via lazarus wrote:
> You need an appropriate minimal CPU with -Cp
>
> Try e.g. -Cpcoreavx for core 3000 series and higher

Thanks for that.

Up to PENTIUMM: PopCnt slower
COREI: approximately equally fast
COREAVX: PopCnt slightly faster
COREAVX2: PopCnt slightly faster

Most likely not worth bothering.
In code can we check (at compile time) for which instruction set the code was compiled?

--
Bart
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On 12/28/2021 at 1:53 PM, Bart via lazarus wrote:
> I just tested PopCnt vs Multiplication on win32 and win64.
> The version with PopCnt is appr. 3 times slower on both 32 and 64 bit!

You need an appropriate minimal CPU with -Cp

Try e.g. -Cpcoreavx for core 3000 series and higher
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 1:09 PM Juha Manninen via lazarus wrote:
>> I will patch the function using unsigned types where applicable.
>> I will keep the loop variables unsigned though.
>
> Yes, thank you.

Done.
Should that be merged to fixes?

--
Bart
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 1:09 PM Juha Manninen via lazarus wrote:
> I confess I didn't remember what PopCnt does. I checked from the net.
> FPC implements it as internproc.
> function PopCnt(Const AValue : QWord): QWord;[internproc:fpc_in_popcnt_x];
> I guess it translates to one x86_64 instruction.
> Is it implemented for all CPUs? I found this:
> https://gitlab.com/freepascal.org/fpc/source/-/issues/38729

I just tested PopCnt vs Multiplication on win32 and win64.
The version with PopCnt is appr. 3 times slower on both 32 and 64 bit!

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>fpc ulen.lpr
Free Pascal Compiler version 3.3.1 [2021/12/08] for i386
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Win32 for i386

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>ulen
Unsigned version with multiplication: 1344
Unsigned version with PopCnt: 3563

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>fpc ulen.lpr -Px86_64
Free Pascal Compiler version 3.3.1 [2021/12/08] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Win64 for x64

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>ulen
Unsigned version with multiplication: 656
Unsigned version with PopCnt: 3797

It looks like PopCnt on these platforms at least calls the generic version (/rtl/inc/generic.inc).

--
Bart
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 12:08 PM Martin Frb via lazarus wrote:
> I would like to see the generated assembler on M1, if that is possible? (for
> code with optimization off, as well as code with whatever optimization was
> used so far)

@Noel:
Here's example code (standalone) you can use to test both signed and unsigned versions.
Save this code as ulen.pas

In order to get assembler output compile with:
fpc -al ulen.pas
This will produce the file ulen.s
You can attach or copy that here.

program ulen;
{$mode objfpc}{$H+}
{$optimization off}
{$codepage utf8}
uses
  SysUtils;

function UTF8LengthFast_Signed(p: PChar; ByteCount: PtrInt): PtrInt;
const
{$ifdef CPU32}
  ONEMASK   =$01010101;
  EIGHTYMASK=$80808080;
{$endif}
{$ifdef CPU64}
  ONEMASK   =$0101010101010101;
  EIGHTYMASK=$8080808080808080;
{$endif}
var
  pnx: PPtrInt absolute p; // To get contents of text in PtrInt blocks. x refers to 32 or 64 bits
  pn8: pint8 absolute pnx; // To read text as Int8 in the initial and final loops
  ix: PtrInt absolute pnx; // To read text as PtrInt in the block loop
  nx: PtrInt;              // values processed in block loop
  i,cnt,e: PtrInt;
begin
  Result := 0;
  e := ix+ByteCount; // End marker
  // Handle any initial misaligned bytes.
  cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    //writeln('pn8^ = ',byte(pn8^).ToBinString);
    //writeln('pn8^ shr 7 = ',Byte(Byte(pn8^) shr 7).ToBinString);
    //writeln('not pn8^ = ',Byte(not pn8^).ToBinString);
    //writeln('(not pn8^) shr 6 = ',Byte((not pn8^) shr 6).ToBinString);
    //writeln;
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
[...]
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    //writeln('pn8^ = ',byte(pn8^).ToBinString);
    //writeln('pn8^ shr 7 = ',Byte(Byte(pn8^) shr 7).ToBinString);
    //writeln('not pn8^ = ',Byte(not pn8^).ToBinString);
    //writeln('(not pn8^) shr 6 = ',Byte((not pn8^) shr 6).ToBinString);
    //writeln;
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
[...]
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 1:45 PM Bart via lazarus <lazarus@lists.lazarus-ide.org> wrote:
> @Juha: can you please comment on my possible improvement using PopCnt
> instead of a multiplication with ONEMASK.

I confess I didn't remember what PopCnt does. I checked from the net.
FPC implements it as internproc.
function PopCnt(Const AValue : QWord): QWord;[internproc:fpc_in_popcnt_x];
I guess it translates to one x86_64 instruction.
Is it implemented for all CPUs? I found this:
https://gitlab.com/freepascal.org/fpc/source/-/issues/38729
If it works everywhere, good. It looks like another good optimization for this highly optimized function.

> I will patch the function using unsigned types where applicable.
> I will keep the loop variables unsigned though.

Yes, thank you.

Regards,
Juha
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 11:52 AM Juha Manninen via lazarus wrote:
> Can you please create a patch for UTF8LengthFast. You can upload it here or
> create a merge request in GitLab or anything.

@Juha: can you please comment on my possible improvement using PopCnt instead of a multiplication with ONEMASK.

I will patch the function using unsigned types where applicable.
I will keep the loop variables unsigned though.

--
Bart
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On 28/12/2021 11:52, Juha Manninen via lazarus wrote:
> On Tue, Dec 28, 2021 at 3:29 AM Noel Duffy via lazarus wrote:
>> So it appears to me that an unsigned pointer type is required in
>> UTF8LengthFast.
>
> Can you please create a patch for UTF8LengthFast. You can upload it here
> or create a merge request in GitLab or anything.

I would like to see the generated assembler on M1, if that is possible? (for code with optimization off, as well as code with whatever optimization was used so far)

Thanks

Nonetheless, switching to "puint8" might be a good idea.
Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)
On Tue, Dec 28, 2021 at 3:29 AM Noel Duffy via lazarus <lazarus@lists.lazarus-ide.org> wrote:
> So it appears to me that an unsigned pointer type is required in
> UTF8LengthFast.

Can you please create a patch for UTF8LengthFast. You can upload it here or create a merge request in GitLab or anything.

Juha
Re: [Lazarus] Lazarus server back online
On Tue, 28 Dec 2021, Marc Weustink via lazarus wrote:
> Hi all,
>
> It took a bit longer than expected, but I'm happy to inform you that the
> Lazarus services are back online.
>
> For those interested in why it took longer, I'll explain at the end of
> the message.

[snip]

> Meanwhile it was 24:00 and I decided to continue using MySQL and called
> it a day. This morning I reverted my TP changes and put the MySQL
> database back online.

And that is why I no longer wish to maintain something like Mantis.
Similar problems, every time you upgrade. But upgrade you must.

Michael.
Re: [Lazarus] Lazarus server back online
On Tue, 28 Dec 2021 09:41:03 +0100 Marc Weustink via lazarus wrote:

>[...]
> To be continued...

Oh dear.

Thanks for all the work!

Mattias
[Lazarus] Lazarus server back online
Hi all,

It took a bit longer than expected, but I'm happy to inform you that the Lazarus services are back online.

For those interested in why it took longer, I'll explain at the end of the message.

Marc

On 24-12-2021 08:30, Marc Weustink wrote:
> Hi,
>
> On Monday 27 December 9.00 CET (8.00 GMT) the Lazarus server will be down for maintenance.
>
> This affects the following services:
> * Lazarus website
> * Lazarus mailinglists
> * Lazarus online package manager
> * Lazarus and FreePascal forum
>
> Thanks,
> Marc

The story

The server was running Ubuntu 16 LTS, so its support ended in April this year. An upgrade attempt at that time failed, since Ubuntu decided to switch to systemd-resolved, which resulted in a server not even able to resolve its own name, let alone other hosts. When Googling about it, you learn that it doesn't work, it's a half-baked solution, not a full DNS, etc. At that time it became clear that I wouldn't be able to solve that in an evening. Luckily I could try this on a cloned server (which we had to rent for another issue), so I parked the upgrade till I could spend a full day on it.

Yesterday I found the "correct" solution to this, which appeared to work. So I continued to upgrade to Ubuntu 18 and finally to Ubuntu 20 LTS. Everything seemed to work until I enabled the mailserver. It couldn't resolve any mailserver for a given domain. What the f.. 'host -t mx freepascal.org' resolves, so why can't postfix resolve it?

Again after some Googling: postfix needs an /etc/resolv.conf. However, one step of the DNS solution was to remove the /etc/resolv.conf symlink, so I tried to restore the original link to a systemd-resolved generated one. This one pointed to their internal resolver. Still no luck, since I hadn't configured which DNS server systemd-resolved itself should use. After doing so, the generated resolv.conf became empty? After more Googling I found that systemd-resolved generates another conf file that you can also link to. No clue why another version has to exist, but this one works.

So this part of the server upgrade got finished around 14:00. The Lazarus mailing list and main website were back online.

Another wish we had was to change the database backend of the forum. It appeared over time that when doing a search on the database, MySQL blocks updates, so browsing the forum becomes unresponsive. The current version of SMF supports different databases, so we decided to go for PostgreSQL (I've been using it at work for years now).

Migrating MySQL data to PostgreSQL seemed easy with pgloader. The documentation about it was initially a bit sparse, but I could start a conversion with some command-line options. Unfortunately it got killed after 15 minutes of import. After two more attempts it became clear: out of memory :(

Reducing the memory requirements was a build-time option, so I didn't want to go that way. Another solution was to convert only a few tables at a time. That, however, required a configuration file, which has more options than the command line. Most of the examples I found on the web failed, since they lack the semicolon at the end of the configuration. So the parser barfs with some abracadabra, initially not giving a clue.

Fast forward: at 18:00 all but the messages table were converted. At 22:00 the messages table was converted, in 9 parts.

Then I realized that I didn't have a php-pgsql driver installed. After installing it, I discovered that the Lazarus main site also showed the forum maintenance message??? Those sites are running on two different virtual servers. How can the contents of an index.php of one site have influence on the index.php of another site? What went wrong when I installed the driver???

After an hour of investigating I decided to enable the forum first and investigate the issue later. Poof, the forum results in a bunch of errors. What we didn't think of when switching backends is that we use SMF (which is PostgreSQL capable) and TinyPortal (TP) to have the menus at the sides. And TP is full of MySQL-only statements.

Luckily there is someone who created all those missing functions for SMF, and I created them in the database. After adjusting some TP php files (PostgreSQL requires a true boolean and not some integer <> 0), the forum started without errors. But it didn't show any boards. So there is still something wrong under the hood.

Meanwhile it was 24:00 and I decided to continue using MySQL and called it a day. This morning I reverted my TP changes and put the MySQL database back online.

To be continued...

Marc