Michael Schnell wrote:
I fail to understand some of the text.

It seems to be unavoidable to use the name "ANSIString", even though I always throw up when I see a thing called "ANSI" containing Unicode (e.g. "UTF8String = type AnsiString(CP_UTF8)").


Seemingly the "bytes per character" setting is implicitly thought of here as a part of the "code page" definition. Correct?

An AnsiString consists of AnsiChars. The *meaning* of these chars (bytes) depends on their encoding, regardless of whether or not the encoding in use is stored with the string.

It's essential to distinguish between low-level (physical) AnsiChar values, and *logical* characters possibly consisting of multiple AnsiChars.
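
A minimal sketch of that distinction, assuming UTF-8 data and taking "logical character" to mean a Unicode code point. The helper CountUtf8CodePoints is hypothetical (it is not an RTL routine); it only skips UTF-8 continuation bytes:

```pascal
program PhysicalVsLogical;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of the RTL: counts UTF-8 code points by
// skipping continuation bytes (bit pattern 10xxxxxx).
function CountUtf8CodePoints(const S: RawByteString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  U: UTF8String;
begin
  // The two bytes C3 A4 are the UTF-8 encoding of a single character (U+00E4).
  SetLength(U, 2);
  U[1] := AnsiChar($C3);
  U[2] := AnsiChar($A4);
  WriteLn('AnsiChars (bytes): ', Length(U));               // physical count: 2
  WriteLn('Logical characters: ', CountUtf8CodePoints(U)); // code points: 1
end.
```

Length always reports the physical AnsiChar count; anything that works on logical characters has to interpret the bytes according to the encoding.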


In section "Dynamic code page":

"When assigning a string to a plain AnsiString (= AnsiString(CP_ACP)) or ShortString, the string data will however be converted to DefaultSystemCodePage. The dynamic code page of that AnsiString(CP_ACP) will then be the current value of DefaultSystemCodePage (e.g. 1250 for the Windows-1250 code page), even though its static code page is CP_ACP (which is a constant <> 1250). This is one example of how the static code page can differ from the dynamic code page. Subsequent sections will describe more such scenarios."

1) A ShortString does not carry a code page indicator, so "static code page can differ from the dynamic code page" does not seem to make much sense for it.

The text correctly states "dynamic code page of that AnsiString". ShortString (and AnsiChar) have no encoding indicator; they are assumed to be encoded in CP_ACP.
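
The quoted paragraph can be checked with a small sketch like the following. It assumes FPC 3.0 or later and, on Unix, a widestring manager such as cwstring so that conversions are available; the printed numbers depend on the system, so none are given here:

```pascal
program DynamicCodePage;
{$mode objfpc}{$H+}
{$ifdef unix}uses cwstring;{$endif}

var
  U: UTF8String;
  A: AnsiString;   // static code page CP_ACP
begin
  U := 'abc';
  A := U;          // per the quoted text: converted to DefaultSystemCodePage
  WriteLn('CP_ACP constant:       ', CP_ACP);
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('StringCodePage(A):     ', StringCodePage(A));
end.
```

If the quoted description holds, the last two lines print the same value while CP_ACP stays a placeholder constant, which is the static/dynamic difference the section describes.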


2) This explanation seems to force automatic conversion for assignments between types with different "code page" settings (CP_ACP included), so I fail to understand how the "static code page can differ from the dynamic code page" can happen.

Continue reading until you have understood the special handling of string literals and RawByteString.

In fact this disaster seems to be able to happen (see section "RawByteString") if assigning a string with a static code page X1 to a RawByteString (hence no conversion) and then assigning that RawByteString to a string with a static code page X2 (no conversion again). In fact I assume that without abusing RawByteString such "intersexual" strings can't be produced, otherwise this would be rather disastrous for normal users.
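
A sketch of that round trip, assuming the no-conversion behaviour quoted from the "RawByteString" section. TLatin1String is a hypothetical type name, and 28591 (ISO-8859-1) is chosen only for illustration:

```pascal
program RawRoundTrip;
{$mode objfpc}{$H+}

type
  // Hypothetical type for illustration: static code page X2 = ISO-8859-1.
  TLatin1String = type AnsiString(28591);

var
  U: UTF8String;      // static code page X1 = CP_UTF8
  R: RawByteString;   // static code page CP_NONE
  L: TLatin1String;
begin
  U := 'abc';
  R := U;   // assignment to a RawByteString: no conversion
  L := R;   // assignment from a RawByteString: no conversion either,
            // according to the quoted documentation
  WriteLn('StringCodePage(U): ', StringCodePage(U));
  WriteLn('StringCodePage(R): ', StringCodePage(R));
  WriteLn('StringCodePage(L): ', StringCodePage(L));
end.
```

If L reports the code page of U rather than 28591, its dynamic code page differs from its static one, without any explicit SetCodePage call.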

*All* intermediate strings, generated during the evaluation of string expressions, only have a dynamic encoding, thus can be considered as being RawByteStrings.

That's why I wonder *when* exactly the result of such an expression *is* converted (implicitly) into the static encoding of the target variable, and when *not*.

Obviously the compiler inserts a conversion request for the *direct* assignment of one string variable to another one with a different *static* encoding. But what happens when a string expression doesn't have such a known static encoding???


In section "RawByteString":

"the results of conversions from/to the CP_NONE code page are undefined."

In effect the behavior is exactly defined in this section "As a first approximation".

Right, the result *is* well defined, but has no *predetermined* dynamic encoding.

The entire mess results from the bad interpretation of RawByteString assignments, which IMO was well thought out by the Delphi language architects, but not understood by the Delphi compiler coders. This interpretation also found its way into FPC:

"Less intuitive is probably that when a RawByteString is assigned to an AnsiString(X), the same happens: no code page conversion[...]"

It's clear that a conversion *can* be omitted for every assignment *to* a RawByteString. That's one of the purposes of that type: to avoid excess conversions into CP_ACP or UnicodeString.

But it's unclear why the heck the conversion should be omitted for an assignment to any *other* AnsiString type, just because the source string is a RawByteString???

Therefore I'd suggest a compiler switch that implements the lame Delphi-compatible behaviour only on *demand*, while the FPC default would force any required conversion with *every* assignment to any other (non-CP_NONE) AnsiString type. This simple change would safely prevent strings whose static and dynamic encodings differ, so that the corresponding tests could be removed from library *and* user code.


The proper use of RawByteStrings deserves further documentation, for users who want/need their own (generic) string handling routines. Topics should be:
- how to determine the dynamic encoding of strings (StringCodePage)
- how to force required conversions (SetCodePage)
- how to deal with strings of different encodings
- how to minimize the number of string conversions
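
As a starting point for such documentation, a minimal sketch that combines the first two topics: it reads the dynamic encoding with StringCodePage and converts with SetCodePage only when needed. The helper EnsureUtf8 is hypothetical, and on Unix a widestring manager such as cwstring is assumed:

```pascal
program EnsureCodePage;
{$mode objfpc}{$H+}
{$ifdef unix}uses cwstring;{$endif}

// Hypothetical helper: brings S to UTF-8, converting only if its dynamic
// code page is something else. Empty strings carry no payload to convert.
procedure EnsureUtf8(var S: RawByteString);
begin
  if (Length(S) > 0) and (StringCodePage(S) <> CP_UTF8) then
    SetCodePage(S, CP_UTF8, True);  // True: convert the data, not just relabel
end;

var
  R: RawByteString;
begin
  R := 'abc';
  WriteLn('before: ', StringCodePage(R));
  EnsureUtf8(R);
  WriteLn('after:  ', StringCodePage(R));
end.
```

Testing the dynamic code page first is what keeps the number of conversions down: a string that already has the wanted encoding is passed through untouched.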

DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
