On 11/18/2010 12:33 AM, Hans-Peter Diettrich wrote:
Separator characters can be assumed to be ASCII, so that they can be found by a dumb byte/char scan; only a few encodings have to be recognized and handled, based on the char size: MBCS (UTF-8...), WideChars (UTF-16/UCS2) and UTF-32.

In fact, I suppose that for UTF-8 ("pure UTF-8" without surrogates) pos() works for all strings, and a UTF-8 "character" is simply a string. The only restriction is that the result of pos() may be used solely as the position argument of copy() or delete(), and that the length argument for copy() or delete() must be calculated as a difference between pos() results or Length(String) values. This makes it hard to extract a single Unicode character from a UTF-8 string, but of course it is easy to create a library function that takes a pos() result, decodes the UTF-8 sequence found there, and returns a UTF-8 string containing the next Unicode character. (UTF-8-coded surrogate pairs may need additional attention.)
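
A rough sketch of such a library function follows, written in Python over byte strings purely for illustration. The names utf8_char_len and next_char are made up here, bytes.find() stands in for pos(), and slicing stands in for copy(). Note that Python indices are 0-based, whereas pos() is 1-based. The byte-wise search is safe because UTF-8 lead bytes and continuation bytes never overlap, so a match for an ASCII separator or for any valid UTF-8 substring can only start on a character boundary.

```python
def utf8_char_len(lead: int) -> int:
    """Return the byte length of the UTF-8 sequence starting with `lead`."""
    if lead < 0x80:
        return 1                      # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # 0x80..0xBF are continuation bytes; 0xC0, 0xC1, 0xF5..0xFF are invalid.
    raise ValueError("not a UTF-8 lead byte: 0x%02X" % lead)


def next_char(s: bytes, index: int) -> bytes:
    """Return the UTF-8 bytes of the character that starts at `index`.

    `index` must come from a byte-wise search (the pos() analogue) or from
    adding lengths of previously extracted characters.
    """
    n = utf8_char_len(s[index])
    chunk = s[index:index + n]
    # Every byte after the lead byte must look like 10xxxxxx.
    # (Overlong forms and encoded surrogates are not rejected here.)
    if len(chunk) != n or any(b & 0xC0 != 0x80 for b in chunk[1:]):
        raise ValueError("truncated or malformed UTF-8 sequence")
    return chunk


# "naive" with a diaeresis, '=', a euro sign + "uro", ';', a musical G clef
text = "na\u00efve=\u20acuro;\U0001d11e".encode("utf-8")

eq = text.find(b"=")        # dumb byte scan for an ASCII separator
semi = text.find(b";")
value = text[eq + 1:semi]   # length computed as a difference of positions

first = next_char(text, eq + 1)     # the 3-byte euro sign
clef = next_char(text, semi + 1)    # a 4-byte character outside the BMP
```

The positions are only ever passed back into slicing or into next_char, never interpreted as character counts, which is the discipline described above.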

-Michael
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
