On 11/18/2010 12:33 AM, Hans-Peter Diettrich wrote:
> Separator characters can be assumed as ASCII, so that they can be
> found by a dumb byte/char scan; only few encodings have to be
> recognized and handled, based on the char size: MBCS (UTF-8...),
> WideChars (UTF-16/UCS2) and UTF-32.

In fact I suppose that for UTF-8 ("pure UTF-8", without encoded
surrogates) pos() works for all strings, and a UTF-8 "character" is
simply a string. The only restriction is that the result of pos() must
be treated as a byte index: it may be used as the position argument of
copy() or delete(), or to calculate the length argument of copy() or
delete() as a difference between pos() results or Length(String)
values, and for nothing else. This makes it hard to extract a single
Unicode character from a UTF-8 string, but of course it's easy to
create a library function that takes a pos() result and, by decoding
the UTF-8 lead byte, returns a UTF-8 string containing the next Unicode
character. (UTF-8 coded surrogate pairs may need additional attention.)
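
A minimal sketch of such a library function might look like the
following. Utf8SeqLen and Utf8CharAt are hypothetical names made up for
this illustration, not existing RTL routines. The sketch assumes the
string holds valid UTF-8 in an AnsiString and does not validate
continuation bytes; a UTF-8 coded surrogate pair would come back as two
separate 3-byte strings.

```pascal
program Utf8NextChar;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence whose lead
// byte is B. Returns 1 for ASCII and also for continuation or invalid
// bytes, so that a scan over the string always advances.
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1                   // 0xxxxxxx: ASCII
  else if (B and $E0) = $C0 then Result := 2    // 110xxxxx
  else if (B and $F0) = $E0 then Result := 3    // 1110xxxx
  else if (B and $F8) = $F0 then Result := 4    // 11110xxx
  else Result := 1;                             // continuation/invalid
end;

// Hypothetical helper: returns the character that starts at byte index
// P (for example a Pos() result) as a UTF-8 string. Returns an empty
// string if P is outside the string.
function Utf8CharAt(const S: AnsiString; P: SizeInt): AnsiString;
var
  N: SizeInt;
begin
  Result := '';
  if (P < 1) or (P > Length(S)) then
    Exit;
  N := Utf8SeqLen(Ord(S[P]));
  if P + N - 1 > Length(S) then
    N := Length(S) - P + 1;   // truncated sequence: return what is left
  Result := Copy(S, P, N);
end;

var
  S, Ch: AnsiString;
  P: SizeInt;
begin
  // 'a', U+00E4, U+20AC, 'b', written as explicit UTF-8 bytes so the
  // example does not depend on the source file encoding.
  S := 'a' + #$C3#$A4 + #$E2#$82#$AC + 'b';
  P := Pos(#$E2#$82#$AC, S);   // byte index 4, not character index 3
  Ch := Utf8CharAt(S, P);      // the three bytes of U+20AC
  WriteLn(P, ' ', Length(Ch)); // prints: 4 3
end.
```

Note that P is only ever passed on to Copy() and compared against
Length(S), which is exactly the restricted use of pos() results
described above.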
-Michael
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel