Aleksa Todorovic wrote:
On Tue, Aug 21, 2012 at 10:16 AM, Ivanko B <ivankob4m...@gmail.com> wrote:
Handling 1..4(6) bytes is less efficient than handling surrogate *pairs*.
===============
But surrogate pairs break array-like fast char access anyway, don't they?

It's also "broken" in UTF-8 in the same way - so neither of them gets +1
on this. UCS4 is the only real winner here (one dword per character).
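For illustration, a minimal FPC sketch of that difference (assuming the System unit's UCS4String and UnicodeStringToUCS4String; note that the conversion appends a terminating zero element):

program CodeUnitDemo;
{$mode objfpc}{$H+}
var
  U16: UnicodeString;
  U32: UCS4String;
begin
  { U+1D11E (musical symbol G clef) needs a surrogate pair in UTF-16 }
  U16 := WideChar($D834);
  U16 := U16 + WideChar($DD1E);
  WriteLn('UTF-16 code units: ', Length(U16));      { 2 code units, 1 character }
  U32 := UnicodeStringToUCS4String(U16);
  WriteLn('UCS4 codepoints:   ', Length(U32) - 1);  { 1 - indexed access works }
end.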

Depending on the language, ligatures and combining sequences can still span multiple codepoints. IMO everybody should decide whether he wants to do text processing for full Unicode, or whether simple string handling (as used till now) is sufficient.
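A small sketch of what that means in practice - even with one dword per element, a combining sequence still occupies two slots, so "one element = one character" does not hold:

program CombiningDemo;
{$mode objfpc}{$H+}
var
  U16: UnicodeString;
  U32: UCS4String;
begin
  { 'e' (U+0065) followed by a combining acute accent (U+0301)
    renders as a single glyph but consists of two codepoints }
  U16 := 'e';
  U16 := U16 + WideChar($0301);
  U32 := UnicodeStringToUCS4String(U16);
  WriteLn('UCS4 codepoints: ', Length(U32) - 1);  { 2, for one visible character }
end.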

I never heard that non-canonical text has caused problems in character sets with accents or umlauts - except in (MacOS, Linux) filenames. Since file searches have to use the platform API anyway, all the required special handling can be encapsulated in the RTL.
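The filename issue is easy to reproduce in isolation: precomposed and decomposed forms look identical on screen, but a plain comparison (no normalization) reports them as different. A rough sketch:

program NormalizeDemo;
{$mode objfpc}{$H+}
var
  NFC, NFD: UnicodeString;
begin
  NFC := WideChar($00E9);         { precomposed 'é' }
  NFD := 'e';
  NFD := NFD + WideChar($0301);   { decomposed form, as HFS+ stores filenames }
  WriteLn(NFC = NFD);             { FALSE - no normalization is applied }
end.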

Breaking strings into substrings can be done on specific delimiters (spaces etc.), which are all ASCII, so again there is no complication with UTF-8. A comparison or search for a given pattern is also insensitive to the encoding. Where would one really need indexed access to single characters?
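The reason splitting stays safe is that in UTF-8 every byte of a multi-byte sequence has the high bit set, so an ASCII delimiter byte can never occur inside one. A quick sketch (the UTF-8 bytes of 'ü' written out explicitly):

program Utf8SplitDemo;
{$mode objfpc}{$H+}
var
  S: AnsiString;
  i, Start: Integer;
begin
  { 'ü' is $C3 $BC in UTF-8; both bytes are >= $80, so the byte-wise
    scan for the ASCII space below can never produce a false match }
  S := 'gr'#$C3#$BC'n blau rot';
  Start := 1;
  for i := 1 to Length(S) do
    if S[i] = ' ' then
    begin
      WriteLn(Copy(S, Start, i - Start));
      Start := i + 1;
    end;
  WriteLn(Copy(S, Start, Length(S) - Start + 1));
end.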

DoDi

