Re: [Lazarus] UTF8LengthFast returning incorrect results on AARCH64 (MacOS)

Noel Duffy via lazarus Mon, 27 Dec 2021 17:29:34 -0800

On 28/12/21 07:21, Bart via lazarus wrote:

On Mon, Dec 27, 2021 at 6:35 PM Marco van de Voort via lazarus
<lazarus@lists.lazarus-ide.org> wrote:

The expression seems to be 1 when the top bits are 10, i.e. when it is a
follow byte of UTF-8. That is what the comment says, and as far as I
can see the signedness doesn't matter.

Basically to me that seems to be a branchless version of

if (p[i] and %11000000)=%10000000 then
     inc(result);
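The claimed equivalence can be checked exhaustively for unsigned bytes. The following is only a sketch: FollowBranchless and FollowWithIf are hypothetical helper names, not part of LazUTF8, and the check deliberately uses Byte, so it says nothing about what happens with a signed Int8.

```pascal
program FollowByteCheck;
{$mode objfpc}{$H+}

// Branchless form from the thread, applied to an unsigned byte:
// bit 7 set AND bit 6 clear, i.e. a byte of the form %10xxxxxx.
function FollowBranchless(b: Byte): Integer;
begin
  Result := (b shr 7) and ((not b) shr 6);
end;

// The same test written with a comparison.
function FollowWithIf(b: Byte): Integer;
begin
  if (b and %11000000) = %10000000 then
    Result := 1
  else
    Result := 0;
end;

var
  i, Mismatches: Integer;
begin
  Mismatches := 0;
  // Compare both forms for every possible byte value.
  for i := 0 to 255 do
    if FollowBranchless(Byte(i)) <> FollowWithIf(Byte(i)) then
      Inc(Mismatches);
  WriteLn('mismatches: ', Mismatches);
end.
```

For unsigned input the two forms should agree on all 256 values, so any mismatch would point at the compiler or target rather than at the expression.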


This is how I understood that part of the code as well.

Just a side note: all this assumes that the UTF8 is correct (in the
strict sense).

Now for the part that tries to do calculations on blocks of 32 or 64 bits.
It uses a multiplication in that part.
That seems a bit odd as this code tries to do everything with ands,
nots, and shifts.

Would this approach (in that part of the code) also work?
X = 32 or 64 bit block

1: AND X with EIGHTYMASK
2: SHR 7: gives (A)

3: NOT X
4: SHR 6: gives (B)

5: (A) AND (B): gives (C)

Basically we do the same as for a 1-byte block.
Now we have a pattern where each 1 bit tells us that byte was a following byte.

A 32-bit example:
(Invalid sequence but that does not matter here)

X = %11xxxxxx %01xxxxxx %00xxxxxx %10xxxxxx
(Leading-ASCII-ASCII-Following, so count of following bytes must be 1)

AND with EIGHTYMASK gives: %10000000 %00000000 %00000000 %10000000
SHR 7 gives: %00000001 %00000000 %00000000 %00000001 (A)

NOT %11xxxxxx %01xxxxxx %00xxxxxx %10xxxxxx gives:
    %00yyyyyy %10yyyyyy %11yyyyyy %01yyyyyy
SHR 6 gives: %00000000 %yyyyyy10 %yyyyyy11 %yyyyyy01 (B)

(A) and (B) gives: %00000000 %00000000 %00000000 %00000001 (C)
All non-following bytes turn into %00000000

The count of following bytes is PopCnt(C)

As long as PopCnt is available as a machine instruction, this will be
faster than the (nx * ONEMASK) calculation I think.
(I would hope this is the case for all platforms Lazarus supports, if
not the call to PopCnt could be ifdef-ed.)
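The steps above can be sketched as a small Free Pascal program that compares the PopCnt variant with the multiplication variant. This is an illustration under assumptions, not the LazUTF8 code: FollowMask, CountByPopCnt and CountByMul are hypothetical helper names, and the block is always a QWord here rather than a PtrInt, so the 32-bit example is simply padded with zero bytes.

```pascal
program PopCntBlockDemo;
{$mode objfpc}{$H+}
{$Q-}{$R-} // the multiplication variant relies on wrap-around

const
  ONEMASK    = QWord($0101010101010101);
  EIGHTYMASK = QWord($8080808080808080);

// Steps 1 to 5: the lowest bit of each byte is 1 exactly for
// bytes of the form %10xxxxxx, all other bits are 0.
function FollowMask(x: QWord): QWord;
begin
  Result := ((x and EIGHTYMASK) shr 7) and ((not x) shr 6);
end;

// Proposed variant: count the 1 bits of the mask.
function CountByPopCnt(x: QWord): Integer;
begin
  Result := Integer(PopCnt(FollowMask(x)));
end;

// Existing variant: the multiplication sums the per-byte flags
// into the top byte, which is then shifted down.
function CountByMul(x: QWord): Integer;
begin
  Result := Integer((FollowMask(x) * ONEMASK) shr 56);
end;

begin
  // Bytes $C0 $40 $00 $80 from the 32-bit example (upper bytes zero).
  WriteLn(CountByPopCnt($C0400080), ' ', CountByMul($C0400080));
  // Eight following bytes in one block.
  WriteLn(CountByPopCnt(EIGHTYMASK), ' ', CountByMul(EIGHTYMASK));
end.
```

Both functions should return the same count for any block. Whether PopCnt is actually faster depends on the target having a population-count instruction, which this sketch does not measure.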

So basically change this:

     nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
     {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
     Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
     {$pop}

into

     nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
     Result := Result + PopCnt(PtrUInt(nx));  // PopCnt only accepts an unsigned parameter


Since bit manipulation is not my strong point, please comment.

I ran my code that calls UTF8LengthFast with a euro symbol through the debugger to see what path through UTF8LengthFast was followed. The debugger skipped over all the loops until the very last one:


  // Take care of any left-over bytes.
  while ix<e do
  begin
    // Is this byte NOT the first byte of a character?
    Result += (pn8^ shr 7) and ((not pn8^) shr 6);
    inc(pn8);
  end;
  Result := ByteCount - Result;

The euro symbol encoded in UTF8 has three bytes, 226, 130, 172.

The loop has 3 iterations, one for each byte:

0: Result=33554428
1: Result=67108857
2: Result=100663286

The last line subtracts Result from ByteCount, and gets us -100663283.

OK, time to simplify. I tried writing a short Pascal program that did the bit shifting with constants:


   c := (226 shr 7) and ((not 226) shr 6);
   writeln('c='+inttostr(c)+' 226.');
   c := (130 shr 7) and ((not 130) shr 6);
   writeln('c='+inttostr(c)+' 130.');
   c := (172 shr 7) and ((not 172) shr 6);
   writeln('c='+inttostr(c)+' 172.');

This produces the correct result:

c=0 226.
c=1 130.
c=1 172.

Next I tried using a pint8:

const
   s = '€';

var
  p: PChar;
  c: Byte;
  pn8: pint8 absolute p;
begin
   p := PChar(s);
   c := (pn8^ shr 7) and ((not pn8^) shr 6);
   writeln('c='+inttostr(c)+' '+IntToStr(Byte(pn8^)));
   Inc(pn8);

   c := (pn8^ shr 7) and ((not pn8^) shr 6);
   writeln('c='+inttostr(c)+' '+IntToStr(Byte(pn8^)));
   Inc(pn8);

   c := (pn8^ shr 7) and ((not pn8^) shr 6);
   writeln('c='+inttostr(c)+' '+IntToStr(Byte(pn8^)));
end;

This produces:

c=252 226
c=253 130
c=253 172

Clearly wrong. Someone mentioned signed vs unsigned, so that seemed a logical next step. I tried using a pbyte.


const
   s = '€';

var
  p: PChar;
  pb: PByte absolute p;
  c: Byte;
begin
   p := PChar(s);
   c := (pb^ shr 7) and ((not pb^) shr 6);
   writeln('c='+inttostr(c)+' '+IntToStr(pb^));
   Inc(pb);
   c := (pb^ shr 7) and ((not pb^) shr 6);
   writeln('c='+inttostr(c)+' '+IntToStr(pb^));
   Inc(pb);
   c := (pb^ shr 7) and ((not pb^) shr 6);
   writeln('c='+inttostr(c)+' '+IntToStr(pb^));
end;

This produces:

c=0 226
c=1 130
c=1 172

Correct result. So it appears that using a pint8 produces the wrong result on aarch64, but not on x86_64. It's not clear why, though.

So it appears to me that an unsigned pointer type is required in UTF8LengthFast.
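As a rough illustration of that conclusion, here is a byte-at-a-time sketch that reads the text through a PByte. It is not a patch for UTF8LengthFast: Utf8LengthSimple is a hypothetical name, the alignment and block loops are left out, and like the original it assumes valid UTF-8.

```pascal
program Utf8LenByteDemo;
{$mode objfpc}{$H+}

// Counts codepoints by subtracting the number of follow bytes
// (%10xxxxxx) from the byte count, reading through an unsigned pointer.
function Utf8LengthSimple(p: PChar; ByteCount: PtrInt): PtrInt;
var
  pb: PByte;
  i: PtrInt;
begin
  Result := 0;
  pb := PByte(p);
  for i := 1 to ByteCount do
  begin
    // Is this byte NOT the first byte of a character?
    Result := Result + ((pb^ shr 7) and ((not pb^) shr 6));
    Inc(pb);
  end;
  Result := ByteCount - Result;
end;

const
  // The euro symbol in UTF-8, written as bytes so the result does
  // not depend on the source file encoding.
  Euro: array[0..2] of Byte = (226, 130, 172);
begin
  WriteLn(Utf8LengthSimple(PChar(@Euro[0]), 3));
end.
```

With the three euro bytes the intermediate count is 2 follow bytes, matching the c=0, c=1, c=1 output of the PByte test above, so the length comes out as 3 - 2 = 1.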


--
_______________________________________________
lazarus mailing list
lazarus@lists.lazarus-ide.org
https://lists.lazarus-ide.org/listinfo/lazarus
