I just had a closer look at the function UTF8CharacterLength in unit LazUTF8. To me it looks as if it can be improved (made faster) because it checks too many things.
According to https://de.wikipedia.org/wiki/UTF-8, the number of bytes of a UTF-8 character can be computed from the first byte alone. So it does not seem necessary to check any of the following bytes (which also bears the danger of accessing bytes outside the range of the string). Isn't it enough to just do it like this:

------------------------------
if p=nil then exit(0);
if (ord(p^) and %10000000)=%00000000 then  // first bit not set ==> 1 byte
  exit(1);
if (ord(p^) and %11100000)=%11000000 then  // first 2 (of first 3) bits set ==> 2 bytes
  exit(2);
if (ord(p^) and %11110000)=%11100000 then  // first 3 (of first 4) bits set ==> 3 bytes
  exit(3);
if (ord(p^) and %11111000)=%11110000 then  // first 4 (of first 5) bits set ==> 4 bytes
  exit(4);
exit(0);  // invalid UTF-8 character
------------------------------

Currently, the following bytes are checked even when the first byte already determines the number of bytes. If the following bytes were not as expected, the sequence would not be a valid UTF-8 character, but should that be checked by UTF8CharacterLength at all? There is no error condition in the result of the function anyway. I think errors should be checked when accessing the character as a whole. Or is there any reason for handling invalid UTF-8 bytes more fault-tolerantly?

-- 
_______________________________________________
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
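P.S. For anyone who wants to try the idea outside of Lazarus: here is a rough C translation of the first-byte-only dispatch sketched above (the function name `utf8_char_length` is mine, not from LazUTF8; the masks correspond to the Pascal binary literals, e.g. %11100000 = 0xE0):

```c
#include <stddef.h>

/* Sketch of the proposed logic: derive the encoded length from the
   lead byte only. Returns 0 for NULL or an invalid lead byte
   (a continuation byte 10xxxxxx, or a prefix longer than 4 bytes). */
static int utf8_char_length(const unsigned char *p)
{
    if (p == NULL)
        return 0;
    if ((*p & 0x80) == 0x00)   /* 0xxxxxxx ==> 1 byte  */
        return 1;
    if ((*p & 0xE0) == 0xC0)   /* 110xxxxx ==> 2 bytes */
        return 2;
    if ((*p & 0xF0) == 0xE0)   /* 1110xxxx ==> 3 bytes */
        return 3;
    if ((*p & 0xF8) == 0xF0)   /* 11110xxx ==> 4 bytes */
        return 4;
    return 0;                  /* continuation or invalid lead byte */
}
```

As in the Pascal version, this never reads past the lead byte, so it cannot run off the end of the string; validating the continuation bytes is left to whoever decodes the character.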