Martin Schreiber wrote:

but I fear we can not use that information for development with Free Pascal because:
"
The string is represented internally as a Unicode string encoded as UTF-16. Characters in the Basic Multilingual Plane (BMP) take 2 bytes, and characters not in the BMP require 4 bytes.
"
and
"
A control string is a sequence of one or more control characters, each of which consists of the # symbol followed by an unsigned integer constant from 0 to 65,535 (decimal) or from $0 to $FFFF (hexadecimal) in UTF-16 encoding, and denotes the character corresponding to a specified code value. Each integer is represented internally by 2 bytes in the string. This is useful for representing control characters and multibyte characters.
"
which seems to be different from Free Pascal.

Where do you see a difference? The strings are stored as UTF-16, which is the same in every implementation, no matter how differently or verbosely each one describes it.
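
To make the "2 bytes vs. 4 bytes" statement concrete, here is a minimal sketch. It assumes an FPC version that has the UnicodeString type; the program name and the sample code points are only my choice for illustration. The string is filled by index so that nothing depends on how the compiler interprets # constants.

```pascal
program Utf16Units;
{$mode objfpc}

var
  S: UnicodeString;
begin
  // U+00E4 (a with diaeresis) is in the BMP: one 16-bit code unit.
  SetLength(S, 1);
  S[1] := WideChar($00E4);
  WriteLn('BMP char:     ', Length(S), ' code unit(s), ',
          Length(S) * SizeOf(WideChar), ' bytes');

  // U+1D11E (musical G clef) is outside the BMP: it is stored as the
  // surrogate pair $D834 $DD1E, i.e. two 16-bit code units.
  SetLength(S, 2);
  S[1] := WideChar($D834);
  S[2] := WideChar($DD1E);
  WriteLn('non-BMP char: ', Length(S), ' code unit(s), ',
          Length(S) * SizeOf(WideChar), ' bytes');
end.
```

Note that Length counts code units, not characters, so the second string has length 2 although it holds a single character.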

The new AnsiStrings are safe against misinterpretation, because they carry their encoding (codepage). Every char in an AnsiString can now be converted to one and only one Unicode char, when needed. This is not true for single AnsiChars, which still have no codepage information stored with them (in both Delphi and FPC). I strongly discourage the use of Char variables in all flavours (Char, AnsiChar, WideChar), because none of them can hold every possible Unicode character: a WideChar is a single UTF-16 code unit, so a character outside the BMP does not fit into it. Only a 32-bit type such as UCS4Char (where available) can hold all possible character codes, without possible codepage misinterpretation.
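
A small sketch of why the Char flavours are too narrow, assuming the UCS4Char type from the FPC system unit; the program and variable names are hypothetical. It prints the sizes of the character types and combines a surrogate pair into the single code point that only the 32-bit type can hold.

```pascal
program CharSizes;
{$mode objfpc}

var
  Hi, Lo: WideChar;
  C: UCS4Char;
begin
  WriteLn('AnsiChar: ', SizeOf(AnsiChar), ' byte(s)');
  WriteLn('WideChar: ', SizeOf(WideChar), ' byte(s)');
  WriteLn('UCS4Char: ', SizeOf(UCS4Char), ' byte(s)');

  // The surrogate pair for U+1D11E needs two WideChars ...
  Hi := WideChar($D834);
  Lo := WideChar($DD1E);

  // ... but decodes to one code point that fits into a UCS4Char.
  C := UCS4Char($10000 + ((Ord(Hi) - $D800) shl 10) + (Ord(Lo) - $DC00));
  WriteLn('code point: U+', HexStr(LongInt(C), 5));
end.
```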

The discussion mostly covers the compilation of string *literals*, like 'äöü' or #123, for which every compiler tries to find the best interpretation and internal representation. FPC has a $codepage directive, which tells the compiler that *all* literals in this unit shall be treated as strings of that codepage. This is essential for files stored as Ansi, which carry no information about the codepage of the single-byte characters they contain. Files stored with UTF-8 encoding, and a UTF-8 BOM at the beginning, are safe against misinterpretation.
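
As an illustration of the directive, a minimal sketch under two assumptions: the source file really is saved as UTF-8, and the compiler version supports UnicodeString. The program name is made up.

```pascal
program LiteralDemo;
{$mode objfpc}
{$codepage utf8}  // all literals in this file are read as UTF-8

var
  U: UnicodeString;
begin
  // With the directive in place the compiler knows that each of the
  // three umlauts is a two-byte UTF-8 sequence for one character.
  U := 'äöü';

  // Should report 3 (UTF-16 code units), not the 6 bytes in the file.
  WriteLn(Length(U));
end.
```

Without the directive (and without a BOM) the compiler has to guess what the bytes in the literal mean, which is exactly the misinterpretation described above.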

When the compiler translates the source code string literals, it can store them either as Unicode (UTF-16) or as AnsiString of the given $codepage, depending on the *use* of the literal (type of the string variable in an assignment). This will reduce the number of implicit string conversions at runtime.

[Please correct me if I'm wrong]
DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
