Aleksa Todorovic wrote:
On Tue, Aug 21, 2012 at 10:16 AM, Ivanko B <ivankob4m...@gmail.com> wrote:
Handling 1..4(6) bytes is less efficient than handling surrogate
*pairs*.
===============
But surrogate pairs break array-like fast char access anyway, don't they?
It's also "broken" in UTF8 in the same way - so none of them gets +1
on this. UCS4 is the only real winner here (one dword for each
character).
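
To make the code-unit issue concrete, here is a minimal Free Pascal sketch
(assuming FPC with {$mode objfpc}; UnicodeStringToUCS4String is assumed to be
available and Delphi-compatible, i.e. its result carries a terminating zero
element). A codepoint outside the BMP occupies two UTF-16 code units and four
UTF-8 bytes, so s[i] yields only a fragment in either encoding, while the
UCS-4 array holds one dword per codepoint:

program IndexSketch;
{$mode objfpc}
var
  u16: UnicodeString;
  u8:  UTF8String;
  u32: UCS4String;
begin
  { U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP }
  SetLength(u16, 2);
  u16[1] := WideChar($D834);            { high surrogate }
  u16[2] := WideChar($DD1E);            { low surrogate  }
  u8  := UTF8Encode(u16);
  u32 := UnicodeStringToUCS4String(u16);
  WriteLn('UTF-16 code units: ', Length(u16));      { 2 - u16[1] is half a char }
  WriteLn('UTF-8  code units: ', Length(u8));       { 4 - u8[1] is half a char  }
  WriteLn('UCS-4  codepoints:  ', Length(u32) - 1); { 1 - u32[0] is the char    }
end.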
Depending on the language, ligatures etc. can still span multiple
codepoints. IMO everybody should decide whether they want to do text
processing for full Unicode, or whether simple string handling (as used
until now) is sufficient.
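
A hedged sketch of that point, under the same FPC assumptions: 'e' followed
by U+0301 COMBINING ACUTE ACCENT displays as one character but counts as two
codepoints, so even UCS-4 indexing does not reach user-perceived characters:

program GraphemeSketch;
{$mode objfpc}
var
  s: UnicodeString;
begin
  s := 'e';
  s := s + WideChar($0301);             { e + combining acute accent }
  WriteLn('Codepoints: ', Length(s));   { 2, yet one visible character }
  { Truncating or reversing s per element would tear the accent off the 'e';
    full Unicode text processing has to treat such clusters as one unit,
    simple string handling does not. }
end.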
I have never heard of non-canonical text causing problems in character
sets with accents or umlauts, except in (MacOS, Linux) filenames. Since
file searches have to use the platform API, all the required special
handling can be encapsulated in the RTL.
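
For illustration only (same FPC assumptions): the precomposed U+00E9 and the
decomposed 'e' + U+0301 render identically, yet a plain string comparison
treats them as different, which is exactly the filename situation described
above:

program CanonSketch;
{$mode objfpc}
var
  nfc, nfd: UnicodeString;
begin
  nfc := WideChar($00E9);               { precomposed e-acute (NFC) }
  nfd := 'e';
  nfd := nfd + WideChar($0301);         { e + combining acute (NFD) }
  WriteLn(nfc = nfd);                   { FALSE, although both display as "é" }
end.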
Breaking strings into substrings can be done on specific delimiters
(spaces etc.), which are all ASCII, so again there is no complication with
UTF. Comparison or searching for given patterns is also insensitive to the
encoding. Where would one really need indexed access to single characters?
DoDi
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel