I just had a closer look at the function UTF8CharacterLength in unit LazUTF8. To me it looks as if it can be improved (made faster) because it checks too many things.
According to https://de.wikipedia.org/wiki/UTF-8, the number of bytes of a UTF-8 character can be computed from the first byte alone. So it does not seem necessary to check any of the following bytes (which also bears the danger of accessing bytes outside the range of the string). Isn't it enough to just do it like this:

------------------------------
if p=nil then exit(0);
if (ord(p^) and %10000000)=%00000000 then  // first bit not set ==> 1 byte
  exit(1);
if (ord(p^) and %11100000)=%11000000 then  // first 2 (of first 3) bits set ==> 2 bytes
  exit(2);
if (ord(p^) and %11110000)=%11100000 then  // first 3 (of first 4) bits set ==> 3 bytes
  exit(3);
if (ord(p^) and %11111000)=%11110000 then  // first 4 (of first 5) bits set ==> 4 bytes
  exit(4);
exit(0);  // invalid UTF-8 character
------------------------------

Currently, the following bytes are checked even when the first byte already determines the number of bytes. If the following bytes were not as expected, the sequence would not be a valid UTF-8 character, but should that be checked by UTF8CharacterLength at all? There is no error condition in the result of the function anyway. I think errors should be checked when accessing the character as a whole. Or is there any reason for handling invalid UTF-8 bytes more fault-tolerantly?

-- 
_______________________________________________
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
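P.S. For anyone who wants to try the idea outside of Lazarus: here is a rough C translation of the first-byte-only dispatch sketched above (the function name `utf8_char_length` is mine, not from LazUTF8; the masks correspond to the Pascal binary literals, e.g. %11100000 = 0xE0):

```c
#include <stddef.h>

/* Sketch of the proposed logic: derive the encoded length from the
   lead byte only. Returns 0 for NULL or an invalid lead byte
   (a continuation byte 10xxxxxx, or a prefix longer than 4 bytes). */
static int utf8_char_length(const unsigned char *p)
{
    if (p == NULL)
        return 0;
    if ((*p & 0x80) == 0x00)   /* 0xxxxxxx ==> 1 byte  */
        return 1;
    if ((*p & 0xE0) == 0xC0)   /* 110xxxxx ==> 2 bytes */
        return 2;
    if ((*p & 0xF0) == 0xE0)   /* 1110xxxx ==> 3 bytes */
        return 3;
    if ((*p & 0xF8) == 0xF0)   /* 11110xxx ==> 4 bytes */
        return 4;
    return 0;                  /* continuation or invalid lead byte */
}
```

As in the Pascal version, this never reads past the lead byte, so it cannot run off the end of the string; validating the continuation bytes is left to whoever decodes the character.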