On Mon, Dec 27, 2021 at 6:35 PM Marco van de Voort via lazarus <lazarus@lists.lazarus-ide.org> wrote:
> The expression seems to be 1 when the top bits are 10, iow when it is a
> follow byte of utf8. That is what the comment says, and as far as I
> can see the signedness doesn't matter.
>
> Basically to me that seems to be a branchless version of
>
>   if (p[i] and %11000000)=%10000000 then
>     inc(result);

This is how I understood that part of the code as well.
Just a side note: all this assumes that the UTF8 is correct (in the strict sense).

Now for the part that does the calculation on blocks of 32 or 64 bits.
It uses a multiplication there. That seems a bit odd, as the rest of this code tries to do everything with ands, nots, and shifts.

Would this approach (in that part of the code) also work?

X = 32 or 64 bit block
1: AND X with EIGHTYMASK
2: SHR 7: gives (A)
3: NOT X
4: SHR 6: gives (B)
5: (A) AND (B): gives (C)

Basically we do the same as for a 1-byte block.
Now we have a pattern where any 1 tells us this byte was a following byte.

A 32-bit example (an invalid sequence, but that does not matter here):

X = %11xxxxxx %01xxxxxx %00xxxxxx %10xxxxxx
(Leading-ASCII-ASCII-Following, so the count of following bytes must be 1)

AND with EIGHTYMASK gives:
  %10000000 %00000000 %00000000 %10000000
SHR 7 gives:
  %00000001 %00000000 %00000000 %00000001   (A)

NOT %11xxxxxx %01xxxxxx %00xxxxxx %10xxxxxx =
  %00yyyyyy %10yyyyyy %11yyyyyy %01yyyyyy
SHR 6 gives:
  %00000000 %yyyyyy10 %yyyyyy11 %yyyyyy01   (B)

(A) and (B) gives:
  %00000000 %00000000 %00000000 %00000001   (C)

All non-following bytes turn into %00000000.
The count of following bytes is PopCnt(C).

As long as PopCnt is available as a machine instruction, this will be faster than the (nx * ONEMASK) calculation, I think.
(I would hope this is the case for all platforms Lazarus supports; if not, the call to PopCnt could be ifdef-ed.)

So basically change this:

    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    {$push}{$overflowchecks off} // "nx * ONEMASK" causes an arithmetic overflow.
    Result += (nx * ONEMASK) >> ((sizeof(PtrInt) - 1) * 8);
    {$pop}

into

    nx := ((pnx^ and EIGHTYMASK) shr 7) and ((not pnx^) shr 6);
    Result := Result + PopCnt(PtrUInt(nx)); // PopCnt only accepts unsigned parameters

Since bit manipulation is not my strong point, please comment.

-- 
Bart
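
For anyone who wants to check the idea, here is a minimal sketch that runs both variants on a few 32-bit blocks and prints the counts side by side. It assumes a fixed 32-bit block and FPC's built-in PopCnt. The names EIGHTYMASK32, ONEMASK32, FollowBits, CountMul and CountPopCnt are made up for this illustration and are not taken from the Lazarus sources. It says nothing about speed, only whether the two calculations agree.

```pascal
program PopCntSketch;
{$mode objfpc}{$H+}

const
  // Hypothetical 32-bit versions of the masks named in the mail.
  EIGHTYMASK32 = DWord($80808080);
  ONEMASK32    = DWord($01010101);

// Sets bit 0 of every byte that has the form %10xxxxxx (a following byte).
function FollowBits(X: DWord): DWord;
begin
  Result := ((X and EIGHTYMASK32) shr 7) and (DWord(not X) shr 6);
end;

// Variant 1: add up the per-byte flags with a multiplication;
// the sum ends up in the top byte of the 32-bit product.
function CountMul(X: DWord): DWord;
begin
  {$push}{$Q-}{$R-} // the product is allowed to wrap around
  Result := DWord(FollowBits(X) * ONEMASK32) shr 24;
  {$pop}
end;

// Variant 2: count the flags with PopCnt.
function CountPopCnt(X: DWord): DWord;
begin
  Result := PopCnt(FollowBits(X));
end;

const
  // Leading-ASCII-ASCII-Following, four following bytes,
  // four ASCII bytes, and E2 82 AC followed by 'A'.
  Tests: array[0..3] of DWord =
    ($C3410080, $80808080, $41424344, $E282AC41);
var
  i: Integer;
begin
  for i := Low(Tests) to High(Tests) do
    WriteLn(HexStr(Int64(Tests[i]), 8),
            ': mul=', CountMul(Tests[i]),
            ' popcnt=', CountPopCnt(Tests[i]));
end.
```

The first test value is the Leading-ASCII-ASCII-Following pattern from the example, with the x bits filled in arbitrarily. Both columns should show the same number on every line.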