Jonas Maebe wrote:

> The code page of ansistring concatenations is the code page of the
> result to which this concatenation is assigned/converted. For
> rawbytestring, this code page is CP_ACP per Delphi compatibility.

This does not match my experience with Delphi XE :-(

Can you give a Delphi example, so that I can verify this behaviour?


> I'm inclined to add a global boolean variable to the system unit that
> allows changing this behaviour so that it uses CP_UTF8 instead in
> such cases (defaulting to false, for Delphi compatibility). In
> practice, setting it to true shouldn't cause problems even with
> virtually all Delphi code, as routines that work with rawbytestring
> should be able to handle any code page anyway.

The Result of a function f(...): RawByteString should be a string in whatever encoding results from its construction.


My view on RawByteString:

1) This type serves as a collector for AnsiStrings of any encoding, where otherwise a conversion into UTF-16 (string) or CP_ACP (AnsiString) would be required.

2) Variables of type RawByteString are intended only as *local* variables, inside subroutines dealing with RawByteStrings.

3) Functions accepting RawByteStrings can produce results quickly when all string arguments have the same encoding; otherwise they have to use Unicode (UTF-8/16) for intermediate results.
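
A minimal sketch of point 3, assuming FPC 3.x (the Delphi-mode directive is skipped by other compilers). ConcatRaw and Cp1252String are hypothetical names made up for illustration, not RTL routines; StringCodePage and SetCodePage are the System routines. When both arguments carry the same dynamic code page, the bytes are copied directly; otherwise UTF-8 serves as the intermediate encoding, and the result keeps that code page:

```pascal
program RawConcatSketch;
{$ifdef FPC}{$mode delphi}{$endif}

type
  // Hypothetical fixed-encoding string type, for demonstration only.
  Cp1252String = type AnsiString(1252);

// Hypothetical helper (not part of the RTL): concatenates two strings of
// arbitrary encodings and tags the result with the code page actually used.
function ConcatRaw(const a, b: RawByteString): RawByteString;
var
  ta, tb: RawByteString;
  cp: Word;
begin
  // Trivial cases: return the other operand unchanged, code page included.
  if Length(a) = 0 then begin Result := b; Exit; end;
  if Length(b) = 0 then begin Result := a; Exit; end;

  ta := a;
  tb := b;
  // Slow path: dynamic code pages differ, so convert both to UTF-8.
  // (CP_ACP versus an explicit code page also counts as "different" here,
  // which costs time but stays correct.)
  if StringCodePage(ta) <> StringCodePage(tb) then
  begin
    SetCodePage(ta, CP_UTF8, True);
    SetCodePage(tb, CP_UTF8, True);
  end;
  cp := StringCodePage(ta);

  // Fast path (and tail of the slow path): plain byte copy.
  SetLength(Result, Length(ta) + Length(tb));
  Move(ta[1], Result[1], Length(ta));
  Move(tb[1], Result[Length(ta) + 1], Length(tb));
  // Tag the bytes with their code page without converting them.
  SetCodePage(Result, cp, False);
end;

var
  u1, u2: UTF8String;
  w: Cp1252String;
  r: RawByteString;
begin
  // ASCII-only sample data: on Unix targets a widestring manager
  // (e.g. unit cwstring) is typically needed for non-ASCII conversions.
  u1 := 'abc';
  u2 := 'def';
  w := 'xyz';

  r := ConcatRaw(u1, u2);   // same encoding: no conversion
  WriteLn('same encodings:  code page ', StringCodePage(r), ', length ', Length(r));

  r := ConcatRaw(w, u1);    // mixed encodings: UTF-8 intermediate
  WriteLn('mixed encodings: code page ', StringCodePage(r), ', length ', Length(r));
end.
```

The result is converted to a fixed encoding only when the caller finally assigns it to a variable or parameter of such a type; until then no CP_ACP round trip is involved.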


Rationale/observations:

[1] Delphi: Only UTF-16 and CP_ACP are explicitly supported in the overloaded string-handling functions. This would require converting all string arguments other than AnsiString(0) into UTF-16. A RawByteString overload (instead of AnsiString(0)) allows an AnsiString(x) to be processed without UTF-16 conversion, when the function code and the argument encodings do not require such a conversion. Otherwise the RawByteString overloads convert all strings into UTF-16 internally, and back again into a RawByteString Result. Since UTF-8 is not a specifically supported encoding, UTF-16 must be converted back to CP_ACP instead, with possible losses.

In fact the AnsiString(0) overloads in AnsiStrings.pas are another optimization, one that does not check the encoding of the string arguments; any necessary conversions are assumed to have been performed beforehand. This leads to errors when the declared (static) string type of a parameter does not match its actual (dynamic) encoding. Such irregular strings can be constructed by wrong or unexpected use of RawByteString. Example (XE):

var a: AnsiString; u: UTF8String;

function cpy(s: RawByteString): RawByteString;
begin Result := s; end;

begin
  u := 'abc';
  a := cpy(u); //now a has encoding UTF-8!
end.

Here the XE compiler omits the conversion of the RawByteString result to the declared encoding of the target. Dunno about newer versions.


[3] Delphi: since the only explicitly supported lossless encoding is UTF-16, RawByteString string-handling functions must convert arguments of mixed encodings to UTF-16, and finally back to AnsiString. Here a conversion to CP_ACP may occur, because the further use of a RawByteString result is unknown. Delphi does not provide UTF-8 overloads, so this encoding cannot be used when a UnicodeString has to be converted into a RawByteString.

FPC: when UTF-8 is used inside RawByteString routines instead of UTF-16, the RawByteString result can have exactly this encoding, for lossless handling in further calls, until the result is finally assigned to a variable/parameter of a fixed encoding. In particular, no conversion to CP_ACP is required when UTF-8 is supported by overloads, or as a special case of RawByteString arguments.


So IMO there is no *requirement* that intermediate Unicode strings be converted to CP_ACP when returned as RawByteString Results. That conversion is merely an unfortunate consequence of Delphi's crippled handling of encodings (disregarding UTF-8), with possible conversion losses. When UTF-8 is used for intermediate Unicode strings, the RawByteString results can preserve the lossless UTF-8 encoding.

DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel