[fpc-pascal] Proposed tidy-up of the FPC Manual section on Character Types and the FPC Wiki

Tony Whyman Fri, 18 Aug 2017 02:56:08 -0700

There has been some heated discussion on the Lazarus lists on the subject of character encodings etc. This has exposed several issues with the FPC Manual that I wanted to record.


1. The char type

The manual says: "A Char is exactly 1 byte in size, and contains one ASCII character."

This was probably true when Pascal was first defined, but char is now often used for any one-byte character set, e.g. ISO 8859-1. Replace ASCII with ANSI.
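To illustrate the point, here is a minimal sketch, assuming FPC 3.x in objfpc mode (where Char is the one-byte AnsiChar). The program name is made up for illustration. It only shows that a Char holds any byte value from 0 to 255, and that the meaning of values above 127 depends on the character set used to interpret them.

```pascal
program CharRange;
{$mode objfpc}{$H+}

var
  C: Char;
begin
  // Assumption: a mode in which Char = AnsiChar (true for objfpc mode)
  WriteLn('SizeOf(Char)    = ', SizeOf(Char));     // 1
  WriteLn('Ord(High(Char)) = ', Ord(High(Char)));  // 255, beyond 7-bit ASCII
  // Byte value $E9: an e-acute only if the byte is read as ISO 8859-1
  C := #$E9;
  WriteLn('Ord(C)          = ', Ord(C));           // 233
end.
```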


2. WideChar

The Manual says: "A WideChar is exactly 2 bytes in size, and contains one UNICODE character in UTF-16 encoding."

This seems to be wrong, as UTF-16 is not limited to code points encoded as a single 16-bit code unit, but also permits code points comprising two 16-bit code units. The definition should be updated to indicate that WideChar was really created for the obsolescent UCS-2 and is limited to a UTF-16 subset (Unicode characters that can be expressed as a single 16-bit code unit).

Proposed replacement text: "A WideChar is exactly 2 bytes in size, and contains one UNICODE character in UCS-2 encoding or UTF-16 encoding limited to the Basic Multilingual Plane. Note that Unicode characters whose UTF-16 encoding requires two 16-bit code units cannot be contained in a single WideChar variable."
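The limitation can be sketched as follows, assuming FPC 3.x in objfpc mode. IsSurrogate is a hypothetical helper written for this example, not an RTL function, and U+1F600 is just an arbitrary code point outside the Basic Multilingual Plane.

```pascal
program WideCharDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: True if C is one half of a UTF-16 surrogate pair
function IsSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DFFF);
end;

var
  Bmp: WideChar;
  Pair: array[0..1] of WideChar;
begin
  WriteLn('SizeOf(WideChar) = ', SizeOf(WideChar));  // 2

  // U+00E9 is in the BMP and fits in a single WideChar
  Bmp := WideChar($00E9);
  WriteLn('U+00E9 needs a surrogate: ', IsSurrogate(Bmp));

  // U+1F600 is outside the BMP: UTF-16 encodes it as the pair D83D DE00,
  // so two WideChar variables are needed to hold it
  Pair[0] := WideChar($D83D);
  Pair[1] := WideChar($DE00);
  WriteLn('Pair[0] is a surrogate: ', IsSurrogate(Pair[0]));
  WriteLn('Pair[1] is a surrogate: ', IsSurrogate(Pair[1]));
end.
```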


3. UnicodeStrings

The Manual says: "For multi-byte string types, the basic character has a size of at least 2."


Proposed improvement:

"Multi-byte string types are used to represent Unicode characters encoded as code points requiring two or four bytes".


As with UTF8String, the following caveat should also be added:

"Since a Unicode character may require two or four bytes to be represented in the UTF-16 encoding, there are 2 points to take care of when using UnicodeString/WideString:

1. The character index – which retrieves a WideChar at a certain position – must be used with care: the expression S[i] will not necessarily be a valid character for a string S of type UnicodeString/WideString.

2. The character length of the string is not necessarily equal to the number of elements in the array. The standard function length cannot be used to get the character length of the string; it will always return the array length."
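Both caveats can be seen in a short sketch, assuming FPC 3.x in objfpc mode. CodePointCount is a hypothetical helper written for this example, not an RTL function; it counts code points only, so a user-perceived character built from several code points would still count more than once.

```pascal
program UnicodeLengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points, treating each well-formed
// surrogate pair as one code point
function CodePointCount(const S: UnicodeString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // A high surrogate followed by a low surrogate forms one code point
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  // 'A' followed by U+1F600, built from code unit values so that the
  // source file encoding does not matter
  SetLength(S, 3);
  S[1] := WideChar($0041);
  S[2] := WideChar($D83D);
  S[3] := WideChar($DE00);

  WriteLn('Length(S)         = ', Length(S));          // 3 code units
  WriteLn('CodePointCount(S) = ', CodePointCount(S));  // 2 code points
  // S[2] is only the first half of a pair, not a character on its own
  WriteLn('S[2]              = $', HexStr(LongInt(Ord(S[2])), 4));
end.
```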


------------------------------------------------------

Wiki Page on "Character and string Type"

1. This needs to start with a Health Warning on the use of the word Unicode. Proposed Text (borrowing from Wikipedia):

"Free Pascal supports several character and string types. They range from single ANSI characters to unicode strings and also include pointer types. Differences also apply to encodings and reference counting. ANSI is typically used to refer to single byte character encodings - although FPC also uses AnsiStrings to hold Unicode UTF-8 encoded strings.

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. Developed in conjunction with the Universal Coded Character Set (UCS) standard and published as The Unicode Standard, the latest version of Unicode contains a repertoire of 136,755 characters covering 139 modern and historic scripts, as well as multiple symbol sets.

Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16 and UCS-2, a precursor of UTF-16.

The original idea behind Unicode was to replace the typical 256-character encodings requiring 1 byte per character with an encoding using 2^16 = 65,536 values requiring 2 bytes per character. The early 2-byte encoding was usually called "Unicode", but is now called "UCS-2". UCS-2 differs from UTF-16 by being a constant length encoding and only capable of encoding characters of the Basic Multilingual Plane (BMP). It is supported by many programs. However, "UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard."

Unfortunately, the term Unicode, in common usage, is still often used to refer to the UCS-2 two byte encoding and this can give rise to much confusion e.g. when Unicode is used when referring to the UTF-8 encoding."
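The difference between the encodings can be made concrete with a small sketch, assuming FPC 3.x in objfpc mode. Utf8Length is a hypothetical helper written for this example; UTF8Encode is the System unit conversion routine. By definition, U+1F600 takes two code units in UTF-16 and four bytes in UTF-8, which is what the second line should report if the conversion handles the surrogate pair.

```pascal
program EncodingSizes;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes S occupies once converted to UTF-8
function Utf8Length(const S: UnicodeString): SizeInt;
begin
  Result := Length(UTF8Encode(S));
end;

var
  U16: UnicodeString;
begin
  // U+1F600 as a UTF-16 surrogate pair, built from code unit values so
  // that the source file encoding does not matter
  SetLength(U16, 2);
  U16[1] := WideChar($D83D);
  U16[2] := WideChar($DE00);

  WriteLn('UTF-16 code units: ', Length(U16));      // 2
  WriteLn('UTF-8 bytes:       ', Utf8Length(U16));
end.
```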

2. The text on WideChar is too terse and needs to be expanded. Proposed text:

"A variable of type WideChar, also referred to as UnicodeChar (which derives from the archaic use of Unicode to mean UCS-2), is exactly 2 bytes in size, and usually contains either:


(a) a single UCS-2 code point, or

(b) a single UTF-16 code unit.

In case (b), this is sufficient for Unicode characters whose UTF-16 encoding comprises a single 16-bit code unit, i.e. characters in the Basic Multilingual Plane. However, all other characters have a UTF-16 encoding that comprises two 16-bit code units. FPC provides no specific support for such characters, which require e.g. a WideChar pair to encode them."

Note that the byte order used to store a WideChar can vary between platforms.
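The byte order can be inspected with a sketch like the following, assuming FPC 3.x in objfpc mode. It overlays a byte array on a WideChar variable; which byte comes first depends on the target, so no fixed output is given here.

```pascal
program WideCharBytes;
{$mode objfpc}{$H+}

var
  W: WideChar;
  // Overlay the two bytes of W so they can be read in memory order
  B: array[0..1] of Byte absolute W;
begin
  W := WideChar($20AC);  // U+20AC EURO SIGN
  WriteLn('First byte in memory:  $', HexStr(LongInt(B[0]), 2));
  WriteLn('Second byte in memory: $', HexStr(LongInt(B[1]), 2));
  {$ifdef ENDIAN_LITTLE}
  WriteLn('Little-endian target: the low byte ($AC) is stored first');
  {$else}
  WriteLn('Big-endian target: the high byte ($20) is stored first');
  {$endif}
end.
```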


3. PChar

This should be identified as a synonym for PAnsiChar in FPC. It can also be used as a C-style pointer to any AnsiString, including UTF-8.

It may also be useful to add a note that in later versions of Delphi, PChar is a synonym for PWideChar.
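A minimal sketch of the FPC behaviour, assuming FPC 3.x in objfpc mode, where PChar and PAnsiChar are the same type. The string holds UTF-8 bytes written as byte values, so the source file encoding does not matter.

```pascal
program PCharDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;  // for StrLen

var
  S: AnsiString;
  P: PChar;
  PA: PAnsiChar;
begin
  // UTF-8 for U+00E9 (two bytes, C3 A9) followed by 'A'
  S := #$C3#$A9 + 'A';

  P := PChar(S);  // C-style pointer to the zero-terminated bytes of S
  PA := P;        // no cast needed: PChar is PAnsiChar in this mode

  WriteLn('SizeOf(P^) = ', SizeOf(P^));  // 1: P points at single bytes
  WriteLn('StrLen(PA) = ', StrLen(PA));  // 3: bytes, not characters
end.
```

StrLen counts bytes up to the terminating zero, so for UTF-8 content it reports the encoded size rather than the number of characters.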



