On Sun, 9 Aug 2015 14:31:44 +0200 Jürgen Hestermann <juergen.hesterm...@gmx.de> wrote:
> I just had a closer look at the function UTF8CharacterLength in unit LazUTF8.
> To me it looks as if it can be improved (made faster) because it checks too
> many things.
>
> According to https://de.wikipedia.org/wiki/UTF-8 the number of bytes of a
> UTF-8 character should be computable from the first byte only.

True.

> So it seems not to be necessary to check any following bytes (which also
> bears the danger of accessing bytes out of the range of the string).

A string always ends with a #0, and #0 is never a valid continuation byte, so
checking byte by byte makes sure you stay within range.
If you only read the first byte of a codepoint to determine its length, you
must check the length of the string.

The UTF8CharacterLength function handles invalid UTF-8 gracefully.
If you know that you have a valid UTF-8 string, you can simply use the first
byte of each codepoint (as you pointed out).
So for that case a faster function could be added, maybe UTF8QuickCharLen or
something like that.

> Isn't it enough to just do it like this:
>
> ------------------------------
> if p=nil then
>   exit(0);
> if (ord(p^) and %10000000)=%00000000 then // First bit is not set ==> 1 byte
>   exit(1);
> if (ord(p^) and %11100000)=%11000000 then // First 2 (of first 3) bits are set ==> 2 bytes
>   exit(2);
> if (ord(p^) and %11110000)=%11100000 then // First 3 (of first 4) bits are set ==> 3 bytes
>   exit(3);
> if (ord(p^) and %11111000)=%11110000 then // First 4 (of first 5) bits are set ==> 4 bytes
>   exit(4);
> exit(0); // invalid UTF-8 character
> -------------------------------

Yes, although afaik the compiler can optimize a CASE better than a series of
IFs.

  case p^ of
  #0..#127: exit(1);
  #192..#223: exit(2);
  #224..#239: exit(3);
  #240..#247: exit(4);
  else exit(0); // invalid UTF-8 character, should never happen
  end;

Note: because it is an optimized version, the check for p=nil can be omitted.

> Currently, further bytes are checked even when
> the first byte already determines the number of bytes.
> But if the following bytes would not be as expected
> it would not be a valid UTF-8 character.
> But should this be checked by the UTF8CharacterLength function?
> There is no error condition in the result of the function anyway.
> I think errors should be checked when accessing the character as a whole.
> Or is there any reason for handling invalid UTF-8 bytes more fault-tolerantly?

The latter: it is more fault-tolerant. This allows the function to be used
like this:

  while p^<>#0 do begin
    CharLen:=UTF8CharacterLength(p);
    // ...
    inc(p,CharLen);
  end;

This works even with invalid UTF-8.

Mattias

--
_______________________________________________
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
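
For illustration, the fast path discussed above could look roughly like the
sketch below. It is only a sketch under assumptions: UTF8QuickCharLen is just
the name floated in this thread, not an existing LazUTF8 function, and
CountCodepoints is a hypothetical helper added to show the explicit length
check that becomes necessary once only the first byte is inspected.

```pascal
program QuickCharLenDemo;
{$mode objfpc}{$H+}

// Hypothetical fast path: trusts the lead byte and assumes valid UTF-8.
function UTF8QuickCharLen(p: PChar): Integer;
begin
  case p^ of
    #0..#127:   Result := 1;  // ASCII, including the #0 terminator
    #192..#223: Result := 2;  // lead byte 110xxxxx
    #224..#239: Result := 3;  // lead byte 1110xxxx
    #240..#247: Result := 4;  // lead byte 11110xxx
  else
    Result := 0;              // continuation byte or invalid lead byte
  end;
end;

// Hypothetical helper: counts codepoints and returns -1 on invalid or
// truncated input, so a 0 result can never cause an endless loop.
function CountCodepoints(const s: string): Integer;
var
  p, pEnd: PChar;
  CharLen: Integer;
begin
  Result := 0;
  if s = '' then
    exit;
  p := PChar(s);
  pEnd := p + Length(s);
  while p < pEnd do
  begin
    CharLen := UTF8QuickCharLen(p);
    // The explicit range check replaces the byte-by-byte check
    // that the fault-tolerant function performs.
    if (CharLen = 0) or (CharLen > pEnd - p) then
      exit(-1);
    inc(p, CharLen);
    inc(Result);
  end;
end;

begin
  // 'a', U+00E4 and U+20AC encoded as raw UTF-8 bytes
  WriteLn(CountCodepoints('a'#$C3#$A4#$E2#$82#$AC));
end.
```

Unlike the loop with UTF8CharacterLength, this variant has to bail out when
the result is 0, because inc(p,0) would never advance.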