Re: [fpc-devel] utf8 in 2.6.0

Hans-Peter Diettrich Sat, 05 Jan 2013 05:37:45 -0800

Martin Schreiber wrote:

but I fear we cannot use that information for development with Free Pascal because:
"
The string is represented internally as a Unicode string encoded as UTF-16. Characters in the Basic Multilingual Plane (BMP) take 2 bytes, and characters not in the BMP require 4 bytes.
"
and
"
A control string is a sequence of one or more control characters, each of which consists of the # symbol followed by an unsigned integer constant from 0 to 65,535 (decimal) or from $0 to $FFFF (hexadecimal) in UTF-16 encoding, and denotes the character corresponding to a specified code value. Each integer is represented internally by 2 bytes in the string. This is useful for representing control characters and multibyte characters.
"
which seems to be different from Free Pascal.

Where do you see a difference? The strings are stored in UTF-16, which is the same in every implementation, regardless of (possibly) different, more verbose descriptions.
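The "2 bytes in the BMP, 4 bytes outside it" rule is a property of UTF-16 itself, not of any one compiler. A minimal sketch follows, in Python only because it makes encodings easy to inspect; it is not FPC or Delphi code, and the helper name `utf16_byte_length` is invented for this illustration.

```python
def utf16_byte_length(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-16."""
    # 'utf-16-le' emits no BOM, so only the character itself is counted.
    return len(ch.encode("utf-16-le"))


bmp_char = "\u00e4"          # LATIN SMALL LETTER A WITH DIAERESIS, inside the BMP
non_bmp_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_byte_length(bmp_char))      # 2
print(utf16_byte_length(non_bmp_char))  # 4 (a surrogate pair)
```

Any conforming UTF-16 implementation yields the same byte counts, which is why differently worded descriptions need not imply different storage.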

The new AnsiStrings are safe against misinterpretation, because they contain their encoding (codepage). Every char in an AnsiString can now be converted to one and only one Unicode char, when needed. This is not true for single AnsiChars, which still have no codepage information stored with them (in both Delphi and FPC). I strongly discourage the use of Char variables in all flavours (Char, AnsiChar, WideChar), because these are incapable of holding all possible Unicode characters. Only UnicodeChar or UCS4Char (if these exist) can hold all possible character codes, without possible codepage misinterpretation.
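To illustrate why a byte without a codepage is ambiguous while a string that carries its codepage is not, here is a small sketch. It is written in Python purely for demonstration; `TaggedString` is a hypothetical stand-in for a codepage-aware AnsiString, not an actual FPC or Delphi type.

```python
from typing import NamedTuple

# One single-byte "char" with no codepage attached.
raw = bytes([0xE4])

# The same byte maps to different Unicode characters per codepage.
as_cp1252 = raw.decode("cp1252")  # Western European
as_cp1251 = raw.decode("cp1251")  # Cyrillic

print(f"U+{ord(as_cp1252):04X}")  # U+00E4
print(f"U+{ord(as_cp1251):04X}")  # U+0434


class TaggedString(NamedTuple):
    """Hypothetical model: the bytes travel together with their codepage."""
    data: bytes
    codepage: str

    def to_unicode(self) -> str:
        # With the codepage stored alongside, the conversion is unambiguous.
        return self.data.decode(self.codepage)


s = TaggedString(raw, "cp1252")
print(f"U+{ord(s.to_unicode()):04X}")  # U+00E4
```

The bare byte needs outside knowledge to be decoded; the tagged string does not, which is the safety property described above.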

The discussion mostly covers the compilation of string *literals*, like 'äöü' or #123, for which every compiler tries to find the best interpretation and internal representation. FPC has a $codepage directive, which tells the compiler that *all* literals in this unit shall be treated as strings of that codepage. This is essential for files stored as Ansi, which have no information about the codepage of the contained single-byte characters. Files stored with UTF-8 encoding, and a UTF-8 BOM at their beginning, are safe against misinterpretation.
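The decision described above can be sketched roughly as follows. This is only a model in Python, not the FPC implementation: the function `decode_source` and its `declared_codepage` parameter are assumptions made for the example, with the parameter playing the role of a $codepage setting.

```python
import codecs


def decode_source(raw: bytes, declared_codepage: str = "cp1252") -> str:
    """Decode source-file bytes, preferring a UTF-8 BOM over the declared codepage."""
    if raw.startswith(codecs.BOM_UTF8):
        # The BOM identifies the encoding, so no guessing is needed.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # Without a BOM the bytes carry no encoding information;
    # the declared codepage is the only thing to go by.
    return raw.decode(declared_codepage)


literal = "'\u00e4\u00f6\u00fc'"  # the literal 'äöü' from the text

utf8_file = codecs.BOM_UTF8 + literal.encode("utf-8")
ansi_file = literal.encode("cp1252")

print(decode_source(utf8_file) == literal)            # True
print(decode_source(ansi_file, "cp1252") == literal)  # True
print(decode_source(ansi_file, "cp1251") == literal)  # False
```

The last line shows the failure mode for Ansi files: the bytes decode without error under the wrong codepage, but to different characters.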

When the compiler translates the source code string literals, it can store them either as Unicode (UTF-16) or as AnsiString of the given $codepage, depending on the *use* of the literal (type of the string variable in an assignment). This will reduce the number of implicit string conversions at runtime.
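As a rough model of that idea (Python for illustration only; `store_literal` and its `target` values are hypothetical names, not compiler internals), the literal is encoded once, up front, into whatever representation its destination needs:

```python
def store_literal(text: str, target: str, codepage: str = "cp1252") -> bytes:
    """Pre-encode a literal for its destination type, as a compiler might do once."""
    if target == "unicode":
        # Destination is a UTF-16 string: store UTF-16 code units directly.
        return text.encode("utf-16-le")
    if target == "ansi":
        # Destination is a single-byte string in the given codepage.
        return text.encode(codepage)
    raise ValueError(f"unknown target: {target!r}")


literal = "\u00e4\u00f6\u00fc"  # äöü

print(len(store_literal(literal, "unicode")))  # 6
print(len(store_literal(literal, "ansi")))     # 3
```

Because the stored bytes already match the destination type, the assignment itself needs no conversion at runtime.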


[Please correct me if I'm wrong]
DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
