Michael Schnell wrote:
On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:
length('à') returns 2
utf8length('à') returns 1
I think that, according to the definition of UTF8String, it's correct that
Length(s) provides the byte count. I do hope that with "NewStrings" this
might change some day, as it's quite confusing for anybody who does not
want to be bothered with the Unicode internals.
Length() is bound to the physical (array) size, a redefinition would
break this established rule.
MBCS users have always had to live with this problem, and UTF-8 is an
MBCS. I'm not sure whether the difference between the number of characters
(glyphs) and the number of codepoints can be eliminated by any approved
convention.
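A minimal sketch of the byte count versus codepoint count difference, in plain Free Pascal without the LCL. CountCodepoints is a hypothetical helper written for this illustration, not an RTL routine, and the string is given as explicit UTF-8 bytes on the assumption that no source codepage directive is in effect:

```pascal
program ByteVsCodepointCount;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL or LCL): counts UTF-8 codepoints
// by counting every byte that is not a continuation byte (10xxxxxx).
// Assumes S holds well-formed UTF-8.
function CountCodepoints(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // 'à' (U+00E0) spelled out as its two UTF-8 bytes, so the result does
  // not depend on how the source file is encoded.
  S := #$C3#$A0;
  WriteLn('bytes:      ', Length(S));          // physical size: 2
  WriteLn('codepoints: ', CountCodepoints(S)); // 1
end.
```

Length() stays tied to the array size; the codepoint count needs a separate pass over the data.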
IMO it's a good idea to forget about "char" in dealing with Unicode/UTF
strings, and only use (sub)strings. This is not a major problem, since
Pascal does not distinguish between char and string literals.
Obviously this code will fail with UTF-8 encoding:
var a: char = 'à'; //or '`a'?
and even UTF-32 may fail with ligatures or other character combinations.
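As a rough illustration of the character combination problem, the same glyph 'à' can be stored precomposed (U+00E0) or as 'a' followed by the combining grave accent (U+0300). The sketch below assumes UTF-8 data given as explicit bytes; CountCodepoints is again a hypothetical helper, not an RTL routine:

```pascal
program CombiningSequence;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 codepoints by skipping continuation
// bytes (10xxxxxx). Assumes well-formed UTF-8.
function CountCodepoints(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$A0;     // U+00E0, one codepoint
  Decomposed  := 'a'#$CC#$80;  // U+0061 followed by U+0300, two codepoints

  // Neither value fits into a single-byte char; both are held as strings.
  WriteLn(Length(Precomposed), ' bytes, ',
          CountCodepoints(Precomposed), ' codepoint(s)'); // 2 bytes, 1
  WriteLn(Length(Decomposed), ' bytes, ',
          CountCodepoints(Decomposed), ' codepoint(s)');  // 3 bytes, 2

  // A plain byte comparison treats the two spellings as different strings.
  WriteLn(Precomposed = Decomposed); // FALSE
end.
```

The decomposed form still takes two elements in UTF-32, so a fixed-width encoding alone does not make "one char = one glyph" hold.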
Some "NewStrings" model IMO should at least distinguish between ASCII,
ANSI and UTF strings:
ASCII: never convert, codes above #$7F are undefined (maybe raw data).
ANSI: SBCS according to a specific codepage.
UCS2: a possible Unicode subset (BMP) of 2-byte (WideChar) characters.
UTF: anything else, with unrelated character and byte counts.
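A small sketch of the ASCII end of this distinction: a check whether a byte string stays within #$00..#$7F, in which case no conversion is needed. IsPureAscii is a hypothetical name used only for this illustration and is not part of any existing or proposed "NewStrings" API:

```pascal
program AsciiCheck;
{$mode objfpc}{$H+}

// Hypothetical helper: True when every byte is in #$00..#$7F, i.e. the
// string contains no codes whose meaning depends on a codepage or on UTF-8.
function IsPureAscii(const S: AnsiString): Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 1 to Length(S) do
    if Ord(S[I]) > $7F then
    begin
      Result := False;
      Exit;
    end;
end;

begin
  WriteLn(IsPureAscii('Lazarus')); // TRUE
  WriteLn(IsPureAscii(#$C3#$A0));  // FALSE: bytes above #$7F
end.
```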
This would at least make those coders happy who are used to dealing with
SBCS and who write applications for local/national use. All coders, in
particular the English (ASCII) speakers, have to learn about UTF and MBCS
when dealing with UTF strings (apart from assignment and display).
DoDi