Michael Schnell wrote:
On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:

length('à') returns 2
utf8length('à') returns 1

I think that, according to the definition of UTF8String, it's correct that Length(s) returns the byte count. I do hope that with "NewStrings" this might change some day, as it's quite confusing for anybody who does not want to be bothered with the Unicode internals.

Length() is bound to the physical (array) size; a redefinition would break this established rule.

MBCS users have always had to live with this problem, and UTF-8 is an MBCS. I'm not sure whether the difference between the number of characters (glyphs) and the number of codepoints can be eliminated by any approved convention.
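
To make the byte count vs. codepoint count difference concrete, here is a minimal sketch in plain Free Pascal. CountCodePoints is a hypothetical helper written for this illustration (it is not the LCL's UTF8Length), and it assumes the string holds well-formed UTF-8. The bytes of 'à' are spelled out explicitly so the result does not depend on the encoding of the source file.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts codepoints in a UTF-8 string by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
function CountCodePoints(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  // U+00E0 (a with grave accent) as its two UTF-8 bytes.
  S := #$C3#$A0;
  WriteLn('Length:     ', Length(S));          // physical size in bytes
  WriteLn('Codepoints: ', CountCodePoints(S)); // lead bytes only
end.
```

Length() reports the two stored bytes, while the helper reports one codepoint, which matches the length/utf8length results quoted above.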

IMO it's a good idea to forget about "char" when dealing with Unicode/UTF strings, and to use only (sub)strings. This is not a major problem, since Pascal does not distinguish between char and string literals.

Obviously this code will fail with UTF-8 encoding:
  var a: char = 'à'; //or '`a'?
and even UTF-32 may fail with ligatures or other character combinations.
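
A rough sketch of why a single "char" is not enough, assuming a default Free Pascal mode in which Char is a one-byte AnsiChar: the same visible glyph can be stored as one precomposed codepoint or as a base letter plus a combining accent. CountCodePoints is again a hypothetical helper, and the byte sequences are written out by hand.

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts UTF-8 lead bytes, i.e. codepoints,
// assuming well-formed input.
function CountCodePoints(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$A0;      // U+00E0
  Decomposed  := 'a'#$CC#$80;   // U+0061 followed by U+0300 (combining grave)

  // Neither form fits into a variable of SizeOf(Char) bytes.
  WriteLn('SizeOf(Char): ', SizeOf(Char));
  WriteLn('Precomposed: ', Length(Precomposed), ' bytes, ',
          CountCodePoints(Precomposed), ' codepoint(s)');
  WriteLn('Decomposed:  ', Length(Decomposed), ' bytes, ',
          CountCodePoints(Decomposed), ' codepoint(s)');
end.
```

The decomposed form needs two codepoints for one glyph, so even a 4-byte character type could not hold it in a single element; a (sub)string can.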

IMO any "NewStrings" model should at least distinguish between ASCII, ANSI and UTF strings:

ASCII: never convert, codes above #$7F are undefined (maybe raw data).
ANSI: SBCS according to a specific codepage.
UCS2: a possible Unicode subset (BMP) of 2-byte (WideChar) characters.
UTF: anything else, with unrelated character and byte counts.

This would at least make those coders happy who are used to dealing with SBCS and who write applications for local/national use. All coders, in particular the English (ASCII) speakers, have to learn about UTF and MBCS when doing anything with UTF strings beyond assignment and display.

DoDi


--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus