Michael Schnell schrieb:
On 11/26/2014 06:37 PM, Hans-Peter Diettrich wrote:

An AnsiString consists of AnsiChar's. The *meaning* of these char's (bytes) depends on their encoding, regardless of whether the used encoding is or is not stored with the string.
I understand that the implementation (in Delphi) seems to be driven more by the wording ("ANSI") than by the logical paradigm the language syntax suggests. The language syntax and the string header fields suggest that both the element size and the code-ID number need to be adhered to (be it statically or dynamically, depending on the usage instance). E.g. there are at least two "code pages" for UTF-16 ("LE" and "BE") that would be worth supporting.

You are confusing codepages and encodings :-(

UTF-7, UTF-8, UTF-16 and UTF-16BE describe different representations of the same values (Unicode codepoints). And I agree, all commonly used encodings should be implemented, at least for data import/export.
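To make the distinction concrete (a Python sketch, since the behaviour is language-independent): the same codepoint serializes to a different byte sequence under each encoding, while the logical value stays the same.

```python
# One codepoint (U+00E9, "é"), serialized under several Unicode encodings.
# The *value* is identical; only the byte representation differs.
ch = "\u00e9"
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-7"):
    print(codec, ch.encode(codec).hex())
# utf-8     -> c3a9
# utf-16-le -> e900
# utf-16-be -> 00e9
```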


It's essential to distinguish between low-level (physical) AnsiChar values, and *logical* characters possibly consisting of multiple AnsiChars.
I now do see that the implementation follows this concept. But the language syntax and the string header fields suggest a more versatile paradigm, providing a universal reference-counted "element string" type.

See it as a multi-level protocol for text processing. The bottom (physical) level deals with physical storage items (AnsiChar, WideChar...) and how they are stored in memory or files. Just as it doesn't make sense to deal with individual bytes of real numbers in computations, it doesn't make sense to deal with individual bytes (AnsiChars) of logical characters - except in type/encoding conversions.

Higher levels deal with logical values, which can consist of multiple physical items and may need different interpretations (in the case of Ansi codepages). This level is partially covered now by AnsiString encodings and UTF-16 surrogate pairs, which allow mapping the values into full Unicode (UCS-4) codepoints. But these codepoints are still not sufficient for a correct interpretation and manipulation of logical characters, which again can consist of multiple codepoints (decomposed umlauts, ligatures...).

At the next level another (mostly language-specific) interpretation may be required, like which logical characters have to be treated together (ligatures, non-breaking characters...). Some natural languages (Hebrew, Arabic...) require further special handling of (mixed) LTR/RTL reading, and of "paths", influencing the graphical representation of character sequences; but that's nothing an application or library writer should have to deal with - such functionality should be provided by the target platform.
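The "logical character vs. codepoint" level can be illustrated with a small Python sketch (using the standard unicodedata module; the concept is independent of Pascal):

```python
import unicodedata

# One logical character "ä" can be a single precomposed codepoint,
# or two codepoints: base letter plus combining diaeresis.
precomposed = "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"    # 'a' + COMBINING DIAERESIS

print(len(precomposed), len(decomposed))  # 1 vs. 2 codepoints
print(precomposed == decomposed)          # False at the codepoint level
# Unicode normalization (NFC) maps both to the same representation:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```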

There must be a boundary between the standard (RTL) handling of the physical items and encodings, and higher text processing levels, up to language-specific processing (how to break words, when to apply capitalization, syntax checks...), so that such special handling can be implemented in dedicated extensions (libraries, classes) by developers familiar with the rules and conventions of the natural languages.

For now we are talking only about the handling up to individual Unicode codepoints, and related string manipulation. For this, at least one string representation must exist that covers the full Unicode range of codepoints (UTF-8 or UTF-16 for now). When such an implementation claims "undefined" behaviour, this can only mean implementation flaws, resulting in something different from what can be expected of proper Unicode handling. This includes invalid parameter values in subroutine calls, which should result in proper (defined) runtime error reporting (AV, error result...).

WRT AnsiString encodings, the only acceptable (expected) differences can result from lossy conversions, i.e. when converting proper Unicode into a non-UTF encoding. Even then the results should be consistent, even if the concrete results depend on some external (platform...) convention or setting.
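For example (a Python sketch; Latin-1 standing in for an arbitrary non-UTF codepage): a lossy conversion should substitute unmappable characters under a defined replacement policy, consistently, rather than behave unpredictably.

```python
# Lossy conversion from Unicode into a single-byte codepage.
# "ü" and "ß" exist in Latin-1; the dash (U+2013) and "☺" (U+263A) do not,
# so a defined replacement policy substitutes them consistently.
text = "Gr\u00fc\u00dfe \u2013 \u263a"
print(text.encode("latin-1", errors="replace"))   # b'Gr\xfc\xdfe ? ?'
```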

IMO.


That's why I wonder *when* exactly the result of such an expression *is* converted (implicitly) into the static encoding of the target variable, and when *not*.
I understand that the idea is to use the static encoding information provided by the type definition whenever possible.

Right, but here "whenever possible" depends on the correspondence of static and dynamic encoding. When the dynamic encoding can *ever* differ from the static encoding, except for RawByteString, I consider it NOT possible to derive the need for a conversion from the static encoding. In the handling of floating-point values we may have to expect invalid operations (division by zero, overflow...) or values (NaN...), but NOT that a Double variable ever contains two Integer values - unless forced by dirty hacks out of compiler control. Why should this be different, and acceptable, with string types?
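The hazard can be sketched language-neutrally (a hypothetical Python model, not FPC internals): when the payload's dynamic encoding disagrees with the statically declared one, a conversion decided from the static side alone silently produces mojibake instead of a defined error.

```python
# Hypothetical model: the byte payload actually holds UTF-8 data,
# while the "static" declaration claims Latin-1.
payload = "Gr\u00fcn".encode("utf-8")   # dynamic encoding: UTF-8
declared = "latin-1"                    # static declaration disagrees

# Decoding by the declared encoding raises no error - it just corrupts:
print(payload.decode(declared))   # 'GrÃ¼n'  (mojibake)
print(payload.decode("utf-8"))    # 'Grün'   (correct)
```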


In Delphi the use of the dynamic encoding information seems to be very rare (and the implementation does not make much sense to me).

It's known that the Delphi AnsiString implementation is flawed, with possibly different results when the same expression is based on AnsiString vs. UnicodeString operands. The same IMO is unacceptable in FPC, *unless* the user has a choice between a proper and safe (maybe slower) and another error-prone and dangerous (maybe faster) string expression evaluation.



My hope was that FPC might be able to correct this error of the Delphi compiler coders. But of course, for Delphi compatibility the type name RawByteString and the code-ID number $FFFF can't be used any more; a new name and ID number would need to be invented. IMHO this in fact is possible and viable (see the wiki page for details).

I see no problem in using the same names and values. Delphi documents clearly state:

>>
RawByteString should only be used as a parameter type, and only in routines which otherwise would need multiple overloads for AnsiStrings with different codepages. Such routines need to be written with care for the actual codepage of the string at run time.

In general, it is recommended that string processing routines should simply use "string" as the string type. Declaring variables or fields of type RawByteString should rarely, if ever, be done, because this practice can lead to undefined behavior and potential data loss.
<<

Where is it specified that no conversion occurs when a RawByteString is assigned *to* a variable of a different encoding?

DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel