Re: [Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich Fri, 11 Feb 2011 04:08:29 -0800

Michael Schnell wrote:

On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:
length('à') returns 2
utf8length('à') returns 1
I think that, according to the definition of UTF8String, it is correct that Length(s) provides the byte count. I do hope that this might change some day with "NewStrings", as it is quite confusing for anybody who does not want to be bothered with the Unicode internals.

Length() is bound to the physical (array) size; a redefinition would break this established rule.
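
To make the two counts concrete, here is a small sketch. It is written in Python rather than Pascal only so that it can be run anywhere, and the helper names byte_length and codepoint_length are invented for this illustration. They stand in for what Length() and utf8length() report for a UTF-8 encoded string.

```python
def byte_length(s: str) -> int:
    """Physical size of the UTF-8 encoded string, in bytes."""
    return len(s.encode("utf-8"))


def codepoint_length(s: str) -> int:
    """Number of Unicode codepoints in the string."""
    return len(s)


if __name__ == "__main__":
    s = "\u00e0"  # 'à' as a single precomposed codepoint
    print(byte_length(s), codepoint_length(s))
```

For plain ASCII text both counts agree, which is why the difference goes unnoticed until the first accented character shows up.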

MBCS users have always had to live with this problem, and UTF-8 is an MBCS. I'm not sure whether the difference between the number of characters (glyphs) and the number of codepoints can be eliminated by any approved convention.
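
The glyph versus codepoint difference can be shown with the same 'à', again as a Python sketch under the assumption that the text arrives either precomposed or as a base letter plus a combining accent. The function name counts is made up for this example; unicodedata.normalize is part of the Python standard library.

```python
import unicodedata

PRECOMPOSED = "\u00e0"   # LATIN SMALL LETTER A WITH GRAVE
DECOMPOSED = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT


def counts(s: str) -> tuple:
    """Return (codepoints, UTF-8 bytes) for the given string."""
    return (len(s), len(s.encode("utf-8")))


if __name__ == "__main__":
    # Both strings display as one glyph, yet the counts differ.
    print(counts(PRECOMPOSED))
    print(counts(DECOMPOSED))
    # Normalization maps one form onto the other.
    print(unicodedata.normalize("NFC", DECOMPOSED) == PRECOMPOSED)
```

Normalization helps for this particular pair, but not every combination has a precomposed form, so it does not remove the difference in general.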

IMO it's a good idea to forget about "char" when dealing with Unicode/UTF strings, and to use only (sub)strings. This is not a major problem, since Pascal does not distinguish between char and string literals.
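
A sketch of why byte-indexed substrings go wrong with UTF-8, which is the symptom this thread started with. It is Python, not Pascal, and copy_bytes, copy_codepoints and is_valid_utf8 are hypothetical helpers written for this illustration; the 1-based index is only meant to resemble Pascal's Copy.

```python
def copy_bytes(data: bytes, index: int, count: int) -> bytes:
    """Cut 'count' bytes starting at 1-based byte position 'index'."""
    return data[index - 1:index - 1 + count]


def copy_codepoints(data: bytes, index: int, count: int) -> bytes:
    """Cut 'count' codepoints starting at 1-based codepoint position 'index'."""
    text = data.decode("utf-8")
    return text[index - 1:index - 1 + count].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form a complete, well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    data = "\u00e0b".encode("utf-8")  # 'àb' is three bytes
    print(is_valid_utf8(copy_bytes(data, 1, 1)))       # first half of 'à' only
    print(is_valid_utf8(copy_codepoints(data, 1, 1)))  # the whole 'à'
```

The codepoint-based cut always returns a valid substring, at the price of scanning the string instead of indexing into it.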


Obviously this code will fail with UTF-8 encoding:
  var a: char = 'à'; //or '`a'?
and even UTF-32 may fail with ligatures or other character combinations.

Some "NewStrings" model IMO should at least distinguish between ASCII, ANSI and UTF strings:


ASCII: never convert, codes above #$7F are undefined (maybe raw data).
ANSI: SBCS according to a specific codepage.
UCS2: a possible Unicode subset (BMP) of 2-byte (WideChar) characters.
UTF: anything else, with unrelated character and byte counts.
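
As a rough illustration of these four categories, the following Python sketch picks the narrowest one that can hold a given text. This is an assumption made for the example: in the model above the category would be a declared string type, not something detected at run time. The function name classify and the default codepage cp1252 are likewise chosen only for illustration.

```python
def classify(s: str, codepage: str = "cp1252") -> str:
    """Return the narrowest category from the list above that can hold s."""
    if all(ord(c) <= 0x7F for c in s):
        return "ASCII"
    try:
        s.encode(codepage)  # fits into the given single-byte codepage
        return "ANSI"
    except UnicodeEncodeError:
        pass
    if all(ord(c) <= 0xFFFF for c in s):
        return "UCS2"       # every codepoint lies in the BMP
    return "UTF"


if __name__ == "__main__":
    for sample in ("abc", "\u00e0", "\u03a9", "\U0001F600"):
        print(repr(sample), classify(sample))
```

Only in the last category do character and byte counts become unrelated in the sense described above; the first three all have a fixed size per character.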

This would at least make those coders happy who are used to dealing with SBCS and who write applications for local/national use. All coders, in particular the English (ASCII) speakers, have to learn about UTF and MBCS when dealing with UTF strings (apart from assignment and display).


DoDi


--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
