Michael Schnell schrieb:
On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:

An AnsiString consists of AnsiChar's. The *meaning* of these char's (bytes) depends on their encoding, regardless of whether the used encoding is or is not stored with the string.
I understand that the implementation (in Delphi) seems to be driven more by the wording ("ANSI") than by the logical paradigm the language syntax suggests. The language syntax and the string header fields suggest that both the element size and the code-ID number need to be adhered to (be it statically or dynamically, depending on the usage instance). E.g. there are at least two "code pages" for UTF-16 ("LE" and "BE") that would be worth supporting.

You are confusing codepages and encodings :-(

UTF-7, UTF-8, UTF-16 and UTF-16BE describe different representations of the same values (Unicode codepoints). And I agree, all commonly used encodings should be implemented, at least for data import/export.
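To make the distinction concrete (a Python sketch, since the behaviour is language-independent): the same codepoint serializes to a different byte sequence under each encoding, while the logical value stays the same.

```python
# One codepoint (U+00E9, "é"), serialized under several Unicode encodings.
# The *value* is identical; only the byte representation differs.
ch = "\u00e9"
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-7"):
    print(codec, ch.encode(codec).hex())
# utf-8     -> c3a9
# utf-16-le -> e900
# utf-16-be -> 00e9
```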


It's essential to distinguish between low-level (physical) AnsiChar values, and *logical* characters possibly consisting of multiple AnsiChars.
I now do see that the implementation follows this concept. But the language syntax and the string header fields suggest a more versatile paradigm, providing a universal reference-counted "element string" type.

See it as a multi-level protocol for text processing. The bottom (physical) level deals with physical storage items (AnsiChar, WideChar...) and how they are stored in memory or files. Just as it doesn't make sense to deal with individual bytes of real numbers in computations, it doesn't make sense to deal with individual bytes (AnsiChars) of logical characters - except in type/encoding conversions.

Higher levels deal with logical values, which can consist of multiple physical items and may need different interpretations (in the case of Ansi codepages). This level is partially covered now by AnsiString encodings and UTF-16 surrogate pairs, which allow mapping the values into full Unicode (UCS-4) codepoints. But these codepoints are still not sufficient for a correct interpretation and manipulation of logical characters, which again can consist of multiple codepoints (decomposed umlauts, ligatures...).

At the next level another (mostly language-specific) interpretation may be required, like which logical characters have to be treated together (ligatures, non-breaking characters...). Some natural languages (Hebrew, Arabic...) require further special handling of (mixed) LTR/RTL reading, and of "paths", influencing the graphical representation of character sequences; but that's nothing an application or library writer should have to deal with - such functionality should be provided by the target platform.
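The "logical character vs. codepoint" level can be illustrated with a small Python sketch (using the standard unicodedata module; the concept is independent of Pascal):

```python
import unicodedata

# One logical character "ä" can be a single precomposed codepoint,
# or two codepoints: base letter plus combining diaeresis.
precomposed = "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"    # 'a' + COMBINING DIAERESIS

print(len(precomposed), len(decomposed))  # 1 vs. 2 codepoints
print(precomposed == decomposed)          # False at the codepoint level
# Unicode normalization (NFC) maps both to the same representation:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```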

There must be a boundary between the standard (RTL) handling of the physical items and encodings, and higher text processing levels, up to language-specific processing (how to break words, when to apply capitalization, syntax checks...), so that such special handling can be implemented in dedicated extensions (libraries, classes) by developers familiar with the rules and conventions of the natural languages.

For now we are talking only about the handling up to individual Unicode codepoints, and related string manipulation. For this, at least one string representation must exist that covers the full Unicode range of codepoints (UTF-8 or UTF-16 for now). When such an implementation claims "undefined" behaviour, this can only mean implementation flaws, resulting in something different from what can be expected of proper Unicode handling. This includes invalid parameter values in subroutine calls, which should result in proper (defined) runtime error reporting (AV, error result...).

WRT AnsiString encodings, the only acceptable (expected) differences can result from lossy conversions, i.e. when converting proper Unicode into a non-UTF encoding. Even then the results should be consistent, even if the concrete results depend on some external (platform...) convention or setting.
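For example (a Python sketch; Latin-1 standing in for an arbitrary non-UTF codepage): a lossy conversion should substitute unmappable characters under a defined replacement policy, consistently, rather than behave unpredictably.

```python
# Lossy conversion from Unicode into a single-byte codepage.
# "ü" and "ß" exist in Latin-1; the dash (U+2013) and "☺" (U+263A) do not,
# so a defined replacement policy substitutes them consistently.
text = "Gr\u00fc\u00dfe \u2013 \u263a"
print(text.encode("latin-1", errors="replace"))   # b'Gr\xfc\xdfe ? ?'
```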

IMO.


That's why I wonder *when* exactly the result of such an expression *is* converted (implicitly) into the static encoding of the target variable, and when *not*.
I understand that the idea is to use the static encoding information provided by the type definition whenever possible.

Right, but here "whenever possible" depends on the correspondence of static and dynamic encoding. When the dynamic encoding can *ever* differ from the static encoding, except for RawByteString, I consider it NOT possible to derive the need for a conversion from the static encoding. In the handling of floating-point values we may have to expect invalid operations (division by zero, overflow...) or values (NaN...), but NOT that a Double variable ever contains two Integer values - unless forced by dirty hacks out of compiler control. Why should this be different, and acceptable, with string types?
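The hazard can be sketched language-neutrally (a hypothetical Python model, not FPC internals): when the payload's dynamic encoding disagrees with the statically declared one, a conversion decided from the static side alone silently produces mojibake instead of a defined error.

```python
# Hypothetical model: the byte payload actually holds UTF-8 data,
# while the "static" declaration claims Latin-1.
payload = "Gr\u00fcn".encode("utf-8")   # dynamic encoding: UTF-8
declared = "latin-1"                    # static declaration disagrees

# Decoding by the declared encoding raises no error - it just corrupts:
print(payload.decode(declared))   # 'GrÃ¼n'  (mojibake)
print(payload.decode("utf-8"))    # 'Grün'   (correct)
```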


In Delphi the use of the dynamic encoding information seems to be very rare (and the implementation does not make much sense to me).

It's known that the Delphi AnsiString implementation is flawed, with possibly different results when the same expression is based on AnsiString vs. UnicodeString operands. The same IMO is unacceptable in FPC, *unless* the user has a choice between a proper and safe (maybe slower) and another error-prone and dangerous (maybe faster) string expression evaluation.



My hope was that FPC might be able to correct this error of the Delphi compiler coders. But of course, for Delphi compatibility the type name RawByteString and the code-ID number $FFFF can't be used any more; a new name and ID number would need to be invented. IMHO this in fact is possible and viable (see the wiki page for details).

I see no problem in using the same names and values. Delphi documents clearly state:

>>
RawByteString should only be used as a parameter type, and only in routines which otherwise would need multiple overloads for AnsiStrings with different codepages. Such routines need to be written with care for the actual codepage of the string at run time.

In general, it is recommended that string processing routines should simply use "string" as the string type. Declaring variables or fields of type RawByteString should rarely, if ever, be done, because this practice can lead to undefined behavior and potential data loss.
<<

Where is it specified that no conversion occurs when a RawByteString is assigned *to* a variable of a different encoding?

DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel