Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Marco van de Voort via lazarus


On 28-12-2021 23:35, Martin Frb via lazarus wrote:



"nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit).
If we regard each of these bytes as an entity of its own, then we can
keep adding those "1".


I also was thinking in that direction, but more about how to optimize 
that loop using SSE2


Some simple masking achieves the same (a 1 for each byte that starts
with the bits %10) in 5 instructions, including the load.


Since 64-bit always supports SSE2, this could work:


{$mode objfpc}{$H+}
{$asmmode intel}

uses sysutils,strutils;

Type int128 = array[0..1] of int64;

const
 mask3   :  array[0..15] of byte  = ( $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0,
 $C0,$C0,$C0,$C0);

  mask4   :  array[0..15] of byte  = (   $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80,
 $80,$80,$80,$80);


  mask2   :  array[0..15] of byte  = ( $1,$1,$1,$1,
                         $1,$1,$1,$1,
 $1,$1,$1,$1,
 $1,$1,$1,$1);

function utf8length(const s : pchar;var res:int128;len:integer):integer;
// len is number of 16-byte counts to accumulate, max 255 I think
// stores 16 bytes worth of counts in "res"
begin
 asm
  movdqu xmm1,[rip+mask3] // unaligned is SSE3, doesn't work on original X86_64 clawhammer?

  movdqu xmm2,[rip+mask4]
  movdqu xmm3,[rip+mask2]
  pxor xmm4,xmm4

@lbl:
  movdqu xmm0, [rcx]
  pand  xmm0,xmm1  // mask out top 2 bits  ($C0)
  pcmpeqb xmm0,xmm2    // compare with $80. sets byte to $FF or $00

  pand  xmm0,xmm3  // change to lsb (1/0) per byte only.
  paddb  xmm4,xmm0 // add to cumulative

  add rcx,16
  dec r8
  jne @lbl

  movdqu [rdx],xmm4

end; // no volatile registers used.
end;

function countmask(nx:int64):integer;
// Martin's routine that should be replaced by some punpkl magic, but it is too late now.

begin
   nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
   nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
   result := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);

end;


// one of each pattern.
const pattern : array[0..3] of char = (chr(%11001001),chr(%10001001),
chr(%1001),chr(%01001001));

const testblocks = 5;

var s : string;
    i,j,cnt : integer;
    r : int128;

begin
  randomize;
  setlength(s,testblocks*16);
  // random string but keep a count of bytes with high value %10
  cnt:=0;
  for i:=0 to testblocks*16-1 do
    begin
  j:=random(4);
  if j=1 then inc(cnt);
  s[i+1]:=pattern[j];
    end;

  utf8length(pchar(s),r,testblocks); // the loop runs "len" times; s holds exactly testblocks 16-byte blocks

  writeln(cnt,' = ',countmask(r[0])+countmask(r[1]));
//  writeln(inttohex(r[0],16));
//  writeln(inttohex(r[1],16));

end.
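
For cross-checking the SSE2 routine on any platform, a plain scalar reference can help. The following is only a minimal sketch, not part of the original post: the names CountContinuationBytes and Sample are made up for illustration. It applies the same test as the pand/pcmpeqb pair above (mask with $C0, compare with $80), one byte at a time.

```pascal
program refcount;
{$mode objfpc}{$H+}

// Count bytes of the form %10xxxxxx (UTF-8 continuation bytes).
// Hypothetical helper, used only to illustrate the mask-and-compare test.
function CountContinuationBytes(p: PByte; len: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 0 to len - 1 do
    if (p[i] and $C0) = $80 then
      Inc(Result);
end;

const
  // 'a' (1 byte), U+00E4 (2 bytes), U+20AC (3 bytes)
  Sample: array[0..5] of Byte = ($61, $C3, $A4, $E2, $82, $AC);

begin
  // 0 + 1 + 2 continuation bytes
  writeln(CountContinuationBytes(@Sample[0], Length(Sample)));  // prints 3
end.
```

The number of code points is then the byte count minus this result (6 - 3 = 3 for the sample).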



--
___
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus


Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 11:35 PM Martin Frb via lazarus
 wrote:

> I have a core I7-8600
> The diff between the old code and popcnt is less significant.
>
> old: 715
> pop: 695
>
> But there is a 3rd way, that is faster.
> add: 610

Not surprising that you should come up with a faster solution.
IIRC you won both speed contests I had on the forum ;-)

Feel free to implement it in LazUtf8.
-- 
Bart


Re: [Lazarus] fpc bug with M1 [[was: Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Martin Frb via lazarus

On 28/12/2021 23:18, Noel Duffy via lazarus wrote:


The assembler produced by 3.2.2 looks like this:

# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
ldr    x0,[sp]
ldrsb    w0,[x0]
mvn    w0,w0


mvn => bitwise not. And that applies to the whole register.

So I guess "eor    w0,w0,#255 " is meant to be some optimization, but 
comes with a bug.



[Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Martin Frb via lazarus

On 28/12/2021 15:50, Bart via lazarus wrote:

On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus
 wrote:


On what machine did you test? The settings are for the generated code,
but the actual processor determines the effective speed.

I have an Intel i5 7th generation on my Win10-64 laptop from approx.
2017 (so, it's really old for more modern folks than me).

Compiled for 32-bit:
With -CpCOREI
Unsigned version with multiplication: 1359
Unsigned version with PopCnt: 1282



I have a core I7-8600
The diff between the old code and popcnt is less significant.

old: 715
pop: 695

But there is a 3rd way, that is faster.
add: 610

"nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit).
If we regard each of these bytes as an entity of its own, then we can
keep adding those "1".


We can add the "1"s of up to 255 iterations before an overflow can happen.
The example only does 128, as this avoids the "div" and "mod" operations.

The full routine / incl benchmark for all 3 versions is attached

For 64 bit:

  bc := (ByteCount-cnt) div sizeof(PtrInt);
  for j := 1 to bc >> 7 do begin
    nx := 0;
    for i := 0 to 127 do
    begin
  // Count bytes which are NOT the first byte of a character.
  nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
  inc(pnx);
    end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
    nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
    nx := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);
    Result := Result + nx;
  end;


  if (bc and 127) > 0 then begin
  nx := 0;
  for i := 1 to bc and 127 do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    inc(pnx);
  end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
    nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
    nx := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);
    Result := Result + nx;
  end;
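
The folding step used above can be tried in isolation. This is just a sketch under stated assumptions: FoldByteLanes is a hypothetical name, and the loop simply adds a word with a "1" in every byte 255 times, which is the most a byte lane can hold without carrying into its neighbour.

```pascal
program lanefold;
{$mode objfpc}{$H+}
{$Q-}{$R-}

// Fold eight byte-sized counters packed in a QWord into one total.
// Hypothetical helper name; the three steps mirror the code in the mail.
function FoldByteLanes(nx: QWord): QWord;
begin
  // 8 byte lanes -> 4 word lanes
  nx := (nx and $00FF00FF00FF00FF) + ((nx shr 8) and $00FF00FF00FF00FF);
  // 4 word lanes -> 2 dword lanes
  nx := (nx and $0000FFFF0000FFFF) + ((nx shr 16) and $0000FFFF0000FFFF);
  // 2 dword lanes -> 1 total
  Result := (nx and $00000000FFFFFFFF) + ((nx shr 32) and $00000000FFFFFFFF);
end;

var
  acc: QWord;
  i: Integer;
begin
  acc := 0;
  // Each lane counts up to 255; one more addition would overflow a lane.
  for i := 1 to 255 do
    acc := acc + QWord($0101010101010101);
  writeln(FoldByteLanes(acc));  // prints 2040 (8 lanes * 255)
end.
```

Stopping at 128 iterations, as the routine above does, keeps the block count a power of two, so the outer loop can use a shift and a mask instead of "div" and "mod".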

program Project1;
{$mode objfpc}{$H+}

uses SysUtils;

function UTF8LengthFast(p: PChar; ByteCount: PtrInt): PtrInt;
const
{$ifdef CPU32}
  ONEMASK   =$01010101;
  EIGHTYMASK=$80808080;
{$endif}
{$ifdef CPU64}
  ONEMASK   =$0101010101010101;
  EIGHTYMASK=$8080808080808080;
{$endif}
var
  pnx: PPtrInt absolute p; // To get contents of text in PtrInt blocks. x refers to 32 or 64 bits
  pn8: pint8 absolute pnx; // To read text as Int8 in the initial and final loops
  ix: PtrInt absolute pnx; // To read text as PtrInt in the block loop
  nx: PtrInt;  // values processed in block loop
  i,cnt,e: PtrInt;
begin
  Result := 0;
  e := ix+ByteCount; // End marker
  // Handle any initial misaligned bytes.
  cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
  [...]
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    //Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    Result += PopCnt(qword(nx));
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
  [...]
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks

  bc := (ByteCount-cnt) div sizeof(PtrInt);
  for j := 1 to bc >> 7 do begin
    nx := 0;
    for i := 0 to 127 do
    begin
      // Count bytes which are NOT the first byte of a character.
      nx += ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
      inc(pnx);
    end;
    nx := (nx and $00FF00FF00FF00FF) + ((nx >>  8) and $00FF00FF00FF00FF);
    nx := (nx and $0000FFFF0000FFFF) + ((nx >> 16) and $0000FFFF0000FFFF);
    nx := (nx and $00000000FFFFFFFF) + ((nx >> 32) and $00000000FFFFFFFF);
    Result := Result + nx;
  end;


  if (bc and 127) > 0 then begin
  nx := 0;
  for i := 1 to bc and

Re: [Lazarus] fpc bug with M1 [[was: Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Noel Duffy via lazarus

On 29/12/21 10:47, Martin Frb via lazarus wrote:

On 28/12/2021 22:05, Noel Duffy via lazarus wrote:

On 29/12/21 01:26, Bart via lazarus wrote:

fpc -al ulen.pas


> This will produce the file ulen.s
> You can attach or copy that here.

File is attached.


Thanks.
And I think there is a bug in FPC

This is the signed version

# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
     ldr    x0,[sp]
     ldrsb    w0,[x0] # < sign extend to a 32bit value (32bit register).
     eor    w0,w0,#255   # < But only "not" the lowest 8 bit. That is wrong. The calculation uses 32 bit at this point

     lsr    w0,w0,#6
     ldr    x2,[sp]
     ldrsb    w2,[x2]
     lsr    w2,w2,#7
     and    w0,w2,w0    # << here the full 32 bit are used.


Had the full 32 bits been "not"ed, then the upper 24 bits would have been 0
(because they had been sign extended to 1).

This would mask all the 1, that were sign extended in w2.

Had the char been < 128 (high bit = 0), then w0 would have the upper 24 
bit = 1 / but w2 would have them 0.


And that is why the code worked, even with signed values. (signed values 
were still a bad idea)




Interesting. As I noted in my message with the ulen.s attached, I tested 
and found the problem is not present in fpc 3.2.2. That's the version of 
fpc packaged by Lazarus, and I used that to compile fpc 3.3.1.


The assembler produced by 3.2.2 looks like this:

# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
ldr x0,[sp]
ldrsb   w0,[x0]
mvn w0,w0
sxtb    w0,w0
lsr w1,w0,#6
ldr x0,[sp]
ldrsb   w0,[x0]
lsr w0,w0,#7
and w0,w0,w1
sxtw    x0,w0
ldr x1,[sp, #16]
add x0,x1,x0
str x0,[sp, #16]




[Lazarus] fpc bug with M1 [[was: Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Martin Frb via lazarus

On 28/12/2021 22:05, Noel Duffy via lazarus wrote:

On 29/12/21 01:26, Bart via lazarus wrote:

fpc -al ulen.pas


> This will produce the file ulen.s
> You can attach or copy that here.

File is attached.


Thanks.
And I think there is a bug in FPC

This is the signed version

# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    ldr    x0,[sp]
    ldrsb    w0,[x0] # < sign extend to a 32bit value (32bit register).
    eor    w0,w0,#255   # < But only "not" the lowest 8 bit. That is wrong. The calculation uses 32 bit at this point

    lsr    w0,w0,#6
    ldr    x2,[sp]
    ldrsb    w2,[x2]
    lsr    w2,w2,#7
    and    w0,w2,w0    # << here the full 32 bit are used.


Had the full 32 bits been "not"ed, then the upper 24 bits would have been 0
(because they had been sign extended to 1).

This would mask all the 1, that were sign extended in w2.

Had the char been < 128 (high bit = 0), then w0 would have the upper 24 
bit = 1 / but w2 would have them 0.


And that is why the code worked, even with signed values. (signed values 
were still a bad idea)
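
The difference between the two instruction sequences can be replayed with plain 32-bit arithmetic. This is only a sketch: SignExtend8 is a made-up helper standing in for "ldrsb", and the xor with $FFFFFFFF stands in for a full-register "mvn".

```pascal
program signext;
{$mode objfpc}{$H+}
{$Q-}{$R-}

uses SysUtils;

// Model "ldrsb": sign-extend a byte into a 32-bit register.
// Hypothetical helper, for illustration only.
function SignExtend8(b: Byte): DWord;
begin
  if b >= $80 then
    Result := DWord($FFFFFF00) or b
  else
    Result := b;
end;

var
  w, good, bad: DWord;
begin
  w := SignExtend8($80);  // a continuation byte, becomes $FFFFFF80

  // Full 32-bit "not" (mvn): the sign-extended upper bits cancel out.
  good := (w shr 7) and ((w xor DWord($FFFFFFFF)) shr 6);

  // "not" of the low 8 bits only (eor w0,w0,#255): upper bits survive.
  bad := (w shr 7) and ((w xor DWord($000000FF)) shr 6);

  writeln('mvn: ', IntToHex(Int64(good), 8));  // prints mvn: 00000001
  writeln('eor: ', IntToHex(Int64(bad), 8));   // prints eor: 01FFFFFD
end.
```

As a plausibility check, and assuming the function ends by returning ByteCount minus the accumulated count (that part is not shown here), a 4-byte string with three continuation bytes would give 4 - 3 * $01FFFFFD = -100663283, which is the "Signed version" result reported in this thread.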






Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Noel Duffy via lazarus

On 29/12/21 01:26, Bart via lazarus wrote:

fpc -al ulen.pas


> This will produce the file ulen.s
> You can attach or copy that here.

File is attached.

The output from running this program is:

% ./ulen
Signed version
Len = -100663283
Unsigned version
Len = 1

To add another wrinkle to this, the signed and unsigned versions of the 
code both work if compiled with fpc 3.2.2. It's only when compiled with 
3.3.1 that I get an incorrect result.

# Begin asmlist al_procedures

.text
.align 4
.globl  _P$ULEN_$$_UTF8LENGTHFAST_SIGNED$PCHAR$INT64$$INT64
_P$ULEN_$$_UTF8LENGTHFAST_SIGNED$PCHAR$INT64$$INT64:
# [ulen.pas]
# [28] begin
stp x29,x30,[sp, #-16]!
mov x29,sp
sub sp,sp,#64
# Var p located at sp+0, size=OS_64
# Var ByteCount located at sp+8, size=OS_S64
# Var $result located at sp+16, size=OS_S64
# Var nx located at sp+24, size=OS_S64
# Var i located at sp+32, size=OS_S64
# Var cnt located at sp+40, size=OS_S64
# Var e located at sp+48, size=OS_S64
str x0,[sp]
str x1,[sp, #8]
# [29] Result := 0;
str xzr,[sp, #16]
# [30] e := ix+ByteCount; // End marker
ldr x0,[sp]
ldr x1,[sp, #8]
add x0,x1,x0
str x0,[sp, #48]
# [32] cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
ldr x0,[sp]
sub x0,x0,#1
mvn x0,x0
and x0,x0,#7
str x0,[sp, #40]
# [33] if cnt>ByteCount then
ldr x0,[sp, #40]
ldr x1,[sp, #8]
cmp x0,x1
b.gt    Lj7
b   Lj8
Lj7:
# [34] cnt := ByteCount;
ldr x0,[sp, #8]
str x0,[sp, #40]
Lj8:
# [35] for i := 1 to cnt do
ldr x1,[sp, #40]
cmp x1,#1
b.ge    Lj9
b   Lj10
Lj9:
str xzr,[sp, #32]
Lj11:
ldr x0,[sp, #32]
add x0,x0,#1
str x0,[sp, #32]
# [43] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
ldr x0,[sp]
ldrsb   w0,[x0]
eor w0,w0,#255
lsr w0,w0,#6
ldr x2,[sp]
ldrsb   w2,[x2]
lsr w2,w2,#7
and w0,w2,w0
sxtw    x0,w0
ldr x2,[sp, #16]
add x0,x2,x0
str x0,[sp, #16]
# [44] inc(pn8);
ldr x0,[sp]
add x0,x0,#1
str x0,[sp]
ldr x0,[sp, #32]
cmp x1,x0
b.le    Lj13
b   Lj11
Lj13:
Lj10:
# [47] for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
ldr x0,[sp, #8]
ldr x1,[sp, #40]
sub x0,x0,x1
asr x1,x0,#63
add x0,x0,x1,lsr #61
asr x0,x0,#3
cmp x0,#1
b.ge    Lj14
b   Lj15
Lj14:
str xzr,[sp, #32]
Lj16:
ldr x1,[sp, #32]
add x1,x1,#1
str x1,[sp, #32]
# [50] nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
ldr x1,[sp]
ldr x1,[x1]
and x1,x1,#0x8080808080808080
lsr x1,x1,#7
ldr x2,[sp]
ldr x2,[x2]
mvn x2,x2
lsr x2,x2,#6
and x1,x2,x1
str x1,[sp, #24]
# [52] Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
ldr x1,[sp, #24]
mov x2,#72340172838076673
mul x1,x1,x2
lsr x1,x1,#56
ldr x2,[sp, #16]
add x1,x2,x1
str x1,[sp, #16]
# [54] inc(pnx);
ldr x1,[sp]
add x1,x1,#8
str x1,[sp]
ldr x1,[sp, #32]
cmp x0,x1
b.le    Lj18
b   Lj16
Lj18:
Lj15:
# [57] while ix<e do
[...]
# [...] if cnt>ByteCount then
ldr x0,[sp, #40]
ldr x1,[sp, #8]
cmp x0,x1
b.gt    Lj24
b   Lj25
Lj24:
# [94] cnt := ByteCount;
ldr x0,[sp, #8]
str x0,[sp, #40]
Lj25:
# [95] for i := 1 to cnt do
ldr x1,[sp, #40]
cmp x1,#1
b.ge    Lj26
b   Lj27
Lj26:
str xzr,[sp, #32]
Lj28:
ldr x0,[sp, #32]
add x0,x0,#1
str x0,[sp, #32]
# [103] Result += (pn8^ shr 7) and ((not pn8^) shr 6);
ldr x0,[sp]
ldrb    w0,[x0]
eor w0,w0,#255
lsr w0,w0,#6
ldr x2,[sp]
ldrb    w2,[x2]
lsr w2,w2,#7
and w0,w2,w0
ldr x2,[sp, #16]
add x0,x2,x0
str x0,[sp, #16]
# [104] inc(pn8);
ldr x0,[sp]
add x0,x0,#1
str x0,[sp]
ldr x0,[sp, #32]
cmp x1,x0
b.le    Lj30
b   Lj28
Lj30:
Lj27:
# [107] for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
ldr x0,[sp, #8]
ldr x1,[sp, #40]
sub x0,x0,x1
asr x1,x0,#63
add x0,x0,x1,lsr #61
asr x0,x0,#3
cmp x0,#1
b.ge    Lj31
b   Lj32
Lj31:

Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Marc Weustink via lazarus

On 27-12-2021 22:10, Noel Duffy via lazarus wrote:

On 28/12/21 01:47, Juha Manninen via lazarus wrote:

On Mon, Dec 27, 2021 at 1:44 AM Noel Duffy via lazarus <
lazarus@lists.lazarus-ide.org> wrote:


I need some help getting to the root of a problem with incorrect results
on Apple hardware (M1, aarch64) for the function UTF8LengthFast in 
lazutf8.


On MacOS, when given a string containing one or more UTF8 characters,
UTF8LengthFast returns wildly incorrect results. On Fedora, the function
returns the correct answer.



You mean both MacOS and Fedora run on the same aarch64 CPU?


Oh no, Fedora runs on an Intel x86_64. I don't think there's a Fedora 
that runs on aarch64. I included the results from Fedora just to show 
that the code does work in some places.


I've tried on pine64 aarch64 linux little-endian fpc 3.2.0 and it showed 
the correct results using your example code in another mail


Marc



It must be a Big endian / Little endian issue. IIRC it can be adjusted in
ARM CPUs.
Why do MacOS and Linux use a different setting there? I have no idea.


Well, as Florian said above, the M1 is little-endian. So it doesn't 
appear to be an endian issue.


https://developer.apple.com/documentation/apple-silicon/porting-your-macos-apps-to-apple-silicon 








[Lazarus] Improved FPC JSON-RPC support

2021-12-28 Thread Michael Van Canneyt via lazarus



Hello,

Thanks to the magic of RTTI and Invoke(), creating a JSON-RPC server has
just become significantly easier !

Given an interface definition:

  IMyOtherInterface = interface ['{4D52BEE3-F709-44AC-BD31-870CBFF44632}']
    function SayHello : string;
    function Echo(args : TStringArray) : String;
  end;

Creating a JSON-RPC server for this is now as simple as:

// Create a class that implements the interface
Type
  TIntf2Impl = class(TInterfacedObject, IMyOtherInterface)
  public
    function Echo(args: TStringArray): String;
    function SayHello: string;
  end;

function TIntf2Impl.Echo(args: TStringArray): String;

var
  S : String;

begin
  Result:='';
  For S in Args do
    begin
    if Result<>'' then
      Result:=Result+' ';
    Result:=Result+S;
    end
end;

function TIntf2Impl.SayHello: string;
begin
  Result:='Hello, World!';
end;

// Register the class using an interface factory:

Function GetMyOtherInterface(Const aName : string) : IInterface;

begin
  Result:=TIntf2Impl.Create as IInterface;
end;

initialization
  RTTIJSONRPCRegistry.Add(TypeInfo(IMyOtherInterface),@GetMyOtherInterface,'Service2');
end.


And calling it from a client program is even simpler:

var
  client: IMyOtherInterface;
  aRPCClient: TFPRPCClient;

begin
  // Register interface with name 'Service2'
  RPCServiceRegistry.Add(TypeInfo(IMyOtherInterface),'Service2');
  // Create client
  aRPCClient:=TFPRPCClient.Create(Nil);
  aRPCClient.BaseURL:='http://localhost:8080/RPC';

  // Just typecast the client to the desired interface
  Client:=aRPCClient as IMyotherInterface;

  // or explicitly create using the registered service name
  // Client:=RPC.Specialize CreateService('Service2');

  Writeln('Sayhello: ',client.SayHello);
  Writeln('Sayhello: ',client.Echo(['This','is','Sparta']));
end.

The service can also be consumed by pas2js.

The support for various argument types is still limited, but this will improve
soon.
Currently simple types and arrays of simple types are supported. Proper
support for records and other structured types needs extended RTTI...

Example server and client programs have been committed to the FPC repo.
(packages/fcl-web/examples/jsonrpc/rtti)

With many thanks to Sven Barth for putting me on the right track...

Enjoy!

Michael.


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 3:56 PM Florian Klämpfl via lazarus
 wrote:

>
> Crash at run time with sigill. Popcnt was introduced with Nehalem, so >10 
> years ago.

Thanks.
Any other CPU's support something like this?


-- 
Bart


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Florian Klämpfl via lazarus

Am 28.12.2021 um 15:50 schrieb Bart via lazarus:

On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus
 wrote:


On what machine did you test? The settings are for the generated code,
but the actual processor determines the effective speed.


I have an Intel i5 7th generation on my Win10-64 laptop from approx.
2017 (so, it's really old for more modern folks than me).

Compiled for 32-bit:
With -CpCOREI
Unsigned version with multiplication: 1359
Unsigned version with PopCnt: 1282

Compiled for 32-bit:
With -CpCOREAVX2
Unsigned version with multiplication: 1312
Unsigned version with PopCnt: 1297

Compiled for 32-bit
No -Cp switch
Unsigned version with multiplication: 1329
Unsigned version with PopCnt: 3546

B.t.w. what happens if I compile for e.g. CoreAVX2 but my processor
does not support that instructionset.
Will the compilation/build fail, or will the executable just error out?



Crash at run time with sigill. Popcnt was introduced with Nehalem, so >10 years 
ago.


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus
 wrote:

> On what machine did you test? The settings are for the generated code,
> but the actual processor determines the effective speed.

I have an Intel i5 7th generation on my Win10-64 laptop from approx.
2017 (so, it's really old for more modern folks than me).

Compiled for 32-bit:
With -CpCOREI
Unsigned version with multiplication: 1359
Unsigned version with PopCnt: 1282

Compiled for 32-bit:
With -CpCOREAVX2
Unsigned version with multiplication: 1312
Unsigned version with PopCnt: 1297

Compiled for 32-bit
No -Cp switch
Unsigned version with multiplication: 1329
Unsigned version with PopCnt: 3546

B.t.w. what happens if I compile for e.g. CoreAVX2 but my processor
does not support that instructionset.
Will the compilation/build fail, or will the executable just error out?

-- 
Bart


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 3:31 PM Florian Klämpfl via lazarus
 wrote:


> For X86, check for the define CPUX86_HAS_POPCNT (compile time!).

Thanks.


-- 
Bart


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Marco van de Voort via lazarus


Op 12/28/2021 om 3:01 PM schreef Bart via lazarus:

>>   -Cpcoreavx  for core 3000 series and higher
> Thanks for that.
>
> Up to PENTIUMM: PopCnt slower
> COREI : approximately equally fast
> COREAVX PopCnt slightly faster
> COREAVX2 PopCnt slightly faster


On what machine did you test? The settings are for the generated code,
but the actual processor determines the effective speed.


> Most likely not worth bothering.
> In code can we check (at compile time) for which instructionset the
> code was compiled?

Not that I know. Moreover, with AMD etc. there would be no simple "x or
greater" kind of criterion other than having popcnt.



Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Florian Klämpfl via lazarus

Am 28.12.2021 um 15:01 schrieb Bart via lazarus:

On Tue, Dec 28, 2021 at 2:46 PM Marco van de Voort via lazarus
 wrote:



You need an appropriate minimal CPU with -Cp


Try e.g. -Cpcoreavx  for core 3000 series and higher


Thanks for that.

Up to PENTIUMM: PopCnt slower
COREI : approximately equally fast
COREAVX PopCnt slightly faster
COREAVX2 PopCnt slightly faster

Most likely not worth bothering.
In code can we check (at compile time) for which instructionset the
code was compiled?



For X86, check for the define CPUX86_HAS_POPCNT (compile time!).
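
A minimal sketch of how such a compile-time check could look, assuming the define is set whenever the selected -Cp target has popcnt. SumByteFlags is a hypothetical name; the fallback is the ONEMASK multiplication already discussed in this thread and is only valid when every byte of the input is 0 or 1.

```pascal
program popcntselect;
{$mode objfpc}{$H+}
{$Q-}{$R-}

// Sum eight byte lanes that each hold 0 or 1.
// Hypothetical helper; picks the implementation at compile time.
function SumByteFlags(nx: QWord): QWord;
begin
{$ifdef CPUX86_HAS_POPCNT}
  // The target CPU selected with -Cp has popcnt.
  Result := PopCnt(nx);
{$else}
  // Multiplication gathers the sum of all bytes in the top byte.
  Result := (nx * QWord($0101010101010101)) shr 56;
{$endif}
end;

begin
{$ifdef CPUX86_HAS_POPCNT}
  writeln('popcnt path');
{$else}
  writeln('multiplication path');
{$endif}
  // Three of the eight byte lanes are set.
  writeln(SumByteFlags(QWord($0001000100000001)));  // prints 3
end.
```

Compiling once without a -Cp switch and once with e.g. -Cpcoreavx should show which path was selected; both give the same sum.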


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 2:46 PM Marco van de Voort via lazarus
 wrote:


> You need an appropriate minimal CPU with -Cp
>
>
> Try e.g. -Cpcoreavx  for core 3000 series and higher

Thanks for that.

Up to PENTIUMM: PopCnt slower
COREI : approximately equally fast
COREAVX PopCnt slightly faster
COREAVX2 PopCnt slightly faster

Most likely not worth bothering.
In code can we check (at compile time) for which instructionset the
code was compiled?

-- 
Bart


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Marco van de Voort via lazarus


Op 12/28/2021 om 1:53 PM schreef Bart via lazarus:

I just tested PopCnt vs Multiplication on win32 and win64.
The version with PopCnt is appr. 3 times slower on both 32 and 64 bit!


You need an appropriate minimal CPU with -Cp


Try e.g. -Cpcoreavx  for core 3000 series and higher



Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 1:09 PM Juha Manninen via lazarus
 wrote:

>> I will patch the function using unsigned types where applicable.
>> I will keep the loop variables unsigned though.
>
>
> Yes, thank you.

Done.
Should that be merged to fixes?


-- 
Bart


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 1:09 PM Juha Manninen via lazarus
 wrote:

> I confess I didn't remember what PopCnt does. I checked from the net.
> FPC implements it as internproc.
>   function PopCnt(Const AValue : QWord): QWord;[internproc:fpc_in_popcnt_x];
> I guess it translates to one x86_64 instruction.
> Is it implemented for all CPUs? I found this:
>   https://gitlab.com/freepascal.org/fpc/source/-/issues/38729

I just tested PopCnt vs Multiplication on win32 and win64.
The version with PopCnt is appr. 3 times slower on both 32 and 64 bit!

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>fpc ulen.lpr
Free Pascal Compiler version 3.3.1 [2021/12/08] for i386
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Win32 for i386

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>ulen
Unsigned version with multiplication: 1344
Unsigned version with PopCnt: 3563

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>fpc ulen.lpr -Px86_64
Free Pascal Compiler version 3.3.1 [2021/12/08] for x86_64
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Win64 for x64

C:\Users\Bart\LazarusProjecten\bugs\Utf8\ulenfast>ulen
Unsigned version with multiplication:  656
Unsigned version with PopCnt: 3797

It looks like PopCnt on these platforms at least calls the generic
version (/rtl/inc/generic.inc).

-- 
Bart


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 12:08 PM Martin Frb via lazarus
 wrote:

> I would like to see the generated assembler on M1, if that is possible?  (for
> code with optimization off, as well as code with whatever optimization was
> used so far)

@Noel:

Here's example code (standalone) you can use to test both signed and
unsigned versions.
Save this code as ulen.pas
In order to get assembler output compile with:

fpc -al ulen.pas

This will produce the file ulen.s
You can attach or copy that here.

=

program ulen;

{$mode objfpc}{$H+}
{$optimization off}
{$codepage utf8}

uses
  SysUtils;



function UTF8LengthFast_Signed(p: PChar; ByteCount: PtrInt): PtrInt;
const
{$ifdef CPU32}
  ONEMASK   =$01010101;
  EIGHTYMASK=$80808080;
{$endif}
{$ifdef CPU64}
  ONEMASK   =$0101010101010101;
  EIGHTYMASK=$8080808080808080;
{$endif}
var
  pnx: PPtrInt absolute p; // To get contents of text in PtrInt blocks. x refers to 32 or 64 bits
  pn8: pint8 absolute pnx; // To read text as Int8 in the initial and final loops
  ix: PtrInt absolute pnx; // To read text as PtrInt in the block loop
  nx: PtrInt;  // values processed in block loop
  i,cnt,e: PtrInt;
begin
  Result := 0;
  e := ix+ByteCount; // End marker
  // Handle any initial misaligned bytes.
  cnt := (not (ix-1)) and (sizeof(PtrInt)-1);
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    //writeln('pn8^ = ',byte(pn8^).ToBinString);
    //writeln('pn8^ shr 7   = ',Byte(Byte(pn8^) shr 7).ToBinString);
    //writeln('not pn8^ = ',Byte(not pn8^).ToBinString);
    //writeln('(not pn8^) shr 6 = ',Byte((not pn8^) shr 6).ToBinString);
    //writeln;
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
  [...]
  if cnt>ByteCount then
    cnt := ByteCount;
  for i := 1 to cnt do
  begin
    // Is this byte NOT the first byte of a character?
    //writeln('pn8^ = ',byte(pn8^).ToBinString);
    //writeln('pn8^ shr 7   = ',Byte(Byte(pn8^) shr 7).ToBinString);
    //writeln('not pn8^ = ',Byte(not pn8^).ToBinString);
    //writeln('(not pn8^) shr 6 = ',Byte((not pn8^) shr 6).ToBinString);
    //writeln;
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  // Handle complete blocks
  for i := 1 to (ByteCount-cnt) div sizeof(PtrUInt) do
  begin
    // Count bytes which are NOT the first byte of a character.
    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}
    inc(pnx);
  end;
  // Take care of any left-over bytes.
  while ix<e do
  [...]


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Juha Manninen via lazarus
On Tue, Dec 28, 2021 at 1:45 PM Bart via lazarus <
lazarus@lists.lazarus-ide.org> wrote:

> @Juha: can you please comment on my possible improvement using PopCnt
> instead of a multiplication with ONEMASK.
>

I confess I didn't remember what PopCnt does. I checked from the net.
FPC implements it as internproc.
  function PopCnt(Const AValue : QWord): QWord;[internproc:fpc_in_popcnt_x];
I guess it translates to one x86_64 instruction.
Is it implemented for all CPUs? I found this:
  https://gitlab.com/freepascal.org/fpc/source/-/issues/38729
If it works everywhere, good. It looks like another good optimization for
this highly optimized function.


> I will patch the function using unsigned types where applicable.
> I will keep the loop variables unsigned though.
>

Yes, thank you.

Regards,
Juha


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 11:52 AM Juha Manninen via lazarus
 wrote:

> Can you please create a patch for UTF8LengthFast. You can upload it here or
> create a merge request in GitLab or anything.

@Juha: can you please comment on my possible improvement using PopCnt
instead of a multiplication with ONEMASK.

I will patch the function using unsigned types where applicable.
I will keep the loop variables unsigned though.



-- 
Bart


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Martin Frb via lazarus

On 28/12/2021 11:52, Juha Manninen via lazarus wrote:
> On Tue, Dec 28, 2021 at 3:29 AM Noel Duffy via lazarus wrote:
>
>> So it appears to me that an unsigned pointer type is required in
>> UTF8LengthFast.
>
> Can you please create a patch for UTF8LengthFast. You can upload it
> here or create a merge request in GitLab or anything.

I would like to see the generated assembler on M1, if that is possible
(for code with optimization off, as well as code with whatever
optimization was used so far).

Thanks

Nonetheless, switching to "puint8" might be a good idea.
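
As a point of reference for such a patch, a byte-at-a-time count that only ever reads through an unsigned pointer could look like the sketch below. It is an assumption-laden illustration, not the proposed fix: the function name and sample data are hypothetical, and it uses the plain "mask with $C0, compare with $80" test instead of the shift expression from UTF8LengthFast, so no signed value is shifted at all.

```pascal
program ContinuationBytes;
{$mode objfpc}{$H+}

const
  // Illustrative sample: 'A', C3 A4, E2 82 AC, 'B', 'C' = 5 code points.
  Sample: array[0..7] of Byte = ($41, $C3, $A4, $E2, $82, $AC, $42, $43);

// Counts UTF-8 continuation bytes (%10xxxxxx), reading through PByte
// so that every byte is treated as an unsigned value.
function CountContinuationBytes(p: PByte; ByteCount: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to ByteCount do
  begin
    if (p^ and $C0) = $80 then
      Inc(Result);
    Inc(p);
  end;
end;

begin
  // Code points = total bytes minus continuation bytes.
  WriteLn(Length(Sample) - CountContinuationBytes(@Sample[0], Length(Sample)));
end.
```

For the sample bytes this prints 5. A slow reference like this is mainly useful for cross-checking the optimized routine on a new target such as AArch64.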


Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

2021-12-28 Thread Juha Manninen via lazarus
On Tue, Dec 28, 2021 at 3:29 AM Noel Duffy via lazarus <
lazarus@lists.lazarus-ide.org> wrote:

> So it appears to me that an unsigned pointer type is required in
> UTF8LengthFast.
>

Can you please create a patch for UTF8LengthFast. You can upload it here or
create a merge request in GitLab or anything.

Juha


Re: [Lazarus] Lazarus server back online

2021-12-28 Thread Michael Van Canneyt via lazarus




On Tue, 28 Dec 2021, Marc Weustink via lazarus wrote:


> Hi all,
>
> It took a bit longer than expected, but I'm happy to inform you that the
> Lazarus services are back online.
> For those interested in why it took longer, I'll explain at the end of the
> message.

[snip]

> Meanwhile it was 24:00 and I decided to continue using MySQL and called it a
> day.
>
> This morning I reverted my TP changes and put the MySQL database back online.

And that is why I no longer wish to maintain something like Mantis.
Similar problems, every time you upgrade. But upgrade you must.

Michael.


Re: [Lazarus] Lazarus server back online

2021-12-28 Thread Mattias Gaertner via lazarus
On Tue, 28 Dec 2021 09:41:03 +0100
Marc Weustink via lazarus wrote:

>[...]
> To be continued...

Oh dear. Thanks for all the work!

Mattias


[Lazarus] Lazarus server back online

2021-12-28 Thread Marc Weustink via lazarus

Hi all,

It took a bit longer than expected, but I'm happy to inform you that the 
Lazarus services are back online.
For those interested in why it took longer, I'll explain at the end of 
the message.


Marc

On 24-12-2021 08:30, Marc Weustink wrote:

> Hi,
>
> On Monday 27 December 9.00 CET (8.00 GMT) the Lazarus server will be
> down for maintenance. This affects the following services:
>
> * Lazarus website
> * Lazarus mailing lists
> * Lazarus online package manager
> * Lazarus and FreePascal forum
>
> Thanks,
> Marc


The story

The server was running Ubuntu 16.04 LTS, so its support ended in April
this year. An upgrade attempt at that time failed, since Ubuntu had
decided to switch to systemd-resolved, which resulted in a server not
even able to resolve its own name, let alone other hosts. When Googling
about it, you learn that it doesn't work, it's a half-baked solution,
not a full DNS, etc. At that time it became clear that I wouldn't be
able to solve that in an evening. Luckily I could try this on a cloned
server (which we had to rent for another issue), so I parked the upgrade
until I could spend a full day on it.


Yesterday I found the "correct" solution to this, which appeared to work.
So I continued the upgrade to Ubuntu 18.04 and finally to Ubuntu 20.04 LTS.
Everything seemed to work until I enabled the mail server. It couldn't
resolve any mail server for a given domain.
What the f.. 'host -t mx freepascal.org' resolves, so why can't postfix
resolve it? Again after some Googling: postfix needs an /etc/resolv.conf.
However, one step of the DNS solution was to remove the /etc/resolv.conf
symlink, so I tried to restore the original link to a file generated by
systemd-resolved. This one pointed to its internal resolver. Still no
luck, since I hadn't configured which DNS server systemd-resolved itself
should use. After doing so, the generated resolv.conf became empty ?
After more Googling I found that systemd-resolved generates another conf
file that you can also link to. No clue why there has to be another
version, but this one works.
So this part of the server upgrade was finished around 14:00. The
Lazarus mailing list and main website were back online.


Another wish we had was to change the database backend of the forum. It
appeared over time that when doing a search on the database, MySQL
blocks updates, so browsing the forum becomes unresponsive. The current
version of SMF supports different databases, so we decided to go for
PostgreSQL (I've been using it at work for years now).
Migrating the MySQL data to PostgreSQL seemed easy with pgloader. The
documentation about it seemed a bit sparse at first, but I could start a
conversion with some command-line options. Unfortunately it got killed
after 15 mins of import. After two more attempts the cause became clear:
out of memory :(
Reducing the memory requirements was a build-time option, so I didn't
want to go that way. Another solution was to convert only a few tables
at a time. That, however, required a configuration file, which has more
options than the command line. Most of the examples I found on the web
failed, since they lack the semicolon at the end of the configuration.
So the parser barfs with some abracadabra, initially not giving a clue.
Fast forward: at 18:00 all but the messages table were converted. At
22:00 the messages table was converted in 9 parts. Then I realized
that I didn't have a php-pgsql driver installed. After installing it, I
discovered that the Lazarus main site also showed the forum maintenance
message ???
Those sites are running on two different virtual servers. How can the
contents of an index.php of one site have influence on the index.php of
another site? What went wrong when I installed the driver ???
After an hour of investigating I decided to enable the forum first and
investigate the issue later.
Poof, the forum resulted in a bunch of errors. What we didn't think of
when switching backends is that we use SMF (which is PostgreSQL capable)
and TinyPortal (TP) to have the menus at the sides. And TP is full of
MySQL-only statements. Luckily someone has written all those missing
functions for SMF, and I created them in the database. After
adjusting some TP php files (PostgreSQL requires a true boolean and not
some integer <> 0), the forum started without errors. But it didn't show
any boards. So there is still something wrong under the hood.
Meanwhile it was 24:00 and I decided to continue using MySQL and called
it a day.
This morning I reverted my TP changes and put the MySQL database back
online.


To be continued...


Marc
