The following console program demonstrates various problems with the new (encoded) AnsiStrings (FPC trunk):

program litTest2;
{.$codepage UTF8} //off for now
uses Classes,SysUtils;
var A: AnsiString;
begin
  a := 'äöü';
  //a := a+' '; //uncomment later
  WriteLn(a,'äöü');
  WriteLn(AnsiUpperCase(a),AnsiUpperCase('äöü'));
end.

The output varies depending on (at least) the file encoding and target platform (tested only on Windows, using Lazarus).

With an ANSI-encoded source file the last line shows 'ÄÖÜÄÖÜ', as expected. In the first line the variable also shows as 'äöü', but the literal does not (it shows as 3 graphical characters). In all other tested cases something different is shown, with no uppercase letters at all.

With a UTF-8 source file (with BOM) both the variable and the literal show as 'äöü', but unfortunately never in upper case.

Adding {$codepage UTF8} requires a UTF-8 source file. That is compatible with the Lazarus defaults, so the further tests here use this combination. Please note that Lazarus currently leaves DefaultSystemCodePage at the value given by the actual OS, i.e. 1252 for my installation, regardless of $codepage.

Now all items are shown as 'äöü', but again never in uppercase. How come?


On Windows, AnsiUpperCase ends up calling Win32AnsiUpperCase, declared as
  function Win32AnsiUpperCase(const s: string): string;
which in turn calls CharUpperBuffA.
This explains why no uppercase conversion is performed when S has a dynamic encoding different from the WinAPI CP_ACP that CharUpperBuffA expects. In fact I found the *dynamic* encoding of both A and S to be CP_UTF8, even though their static encoding is CP_ACP (or 1252).
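The dynamic encoding mentioned above can be inspected with StringCodePage. The following is only a minimal sketch, assuming a UTF-8 source file and an FPC version that provides StringCodePage and DefaultSystemCodePage in the System unit; the program name is made up, and the values printed depend on compiler version, source encoding and platform.

```pascal
program cpInspect;
{$mode objfpc}{$H+}
{$codepage UTF8} // assumption: this file is saved as UTF-8

var
  a: AnsiString;
begin
  a := 'äöü';
  // StringCodePage reports the dynamic code page stored with the string data.
  WriteLn('dynamic code page of a: ', StringCodePage(a));
  // The code page the RTL currently substitutes for CP_ACP.
  WriteLn('DefaultSystemCodePage : ', DefaultSystemCodePage);
end.
```

If the two numbers differ, a routine like CharUpperBuffA that assumes the system ANSI code page will misinterpret the bytes of A.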

Consequently AnsiUpperCase should convert S to the WinAPI CP_ACP (GetACP) before passing it to CharUpperBuffA. The same applies, on all platforms, to every other function that takes AnsiString arguments and calls external (OS API...) routines expecting a specific encoding. It also applies to user code that relies on every string having its declared encoding, as in:
  str1[1]:=str2[1]; //both strings of same type

IMO such additional checks and conversions should be avoided: they bloat the library code and cost runtime. Note that SetCodePage requires a RawByteString (var parameter), and thus cannot be applied directly to adjust the dynamic codepage of an AnsiString.
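To make the cost concrete, here is a rough sketch of the kind of check-and-convert step each such routine would need. ToCodePage is a hypothetical helper, not part of the RTL, and the target code page 1252 is only an example value. The sketch works on a RawByteString copy because SetCodePage takes a RawByteString var parameter; on non-Windows targets a widestring manager (e.g. the cwstring unit) is assumed for real conversions.

```pascal
program cpConvert;
{$mode objfpc}{$H+}

// Hypothetical helper (not in the RTL): returns a copy of s whose
// dynamic code page is cp, converting the character data if needed.
function ToCodePage(const s: RawByteString; cp: TSystemCodePage): RawByteString;
begin
  Result := s;
  if StringCodePage(Result) <> cp then
    SetCodePage(Result, cp, True); // True = convert the bytes, not just relabel
end;

var
  u, w: RawByteString;
begin
  u := 'abc';
  SetCodePage(u, CP_UTF8, False); // only relabel: pure ASCII is valid UTF-8
  w := ToCodePage(u, 1252);       // example target code page
  WriteLn(StringCodePage(u), ' -> ', StringCodePage(w));
end.
```

Every API wrapper would have to repeat such a step for each string argument, which is exactly the overhead argued against here.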


Now let's add (uncomment) the line
  a := a+' ';
and voilà, AnsiUpperCase works, because the string now has the expected CP_ACP encoding instead of UTF-8. The same effect occurs when A is assigned from a UnicodeString variable.
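The change of encoding caused by the concatenation can be made visible by printing the dynamic code page before and after it. Again this is just a sketch, assuming a UTF-8 source file; the program name is made up and the numbers printed depend on the compiler version and platform.

```pascal
program cpConcat;
{$mode objfpc}{$H+}
{$codepage UTF8} // assumption: this file is saved as UTF-8

var
  a: AnsiString;
begin
  a := 'äöü';
  WriteLn('after literal assignment: ', StringCodePage(a));
  a := a + ' '; // the concatenation result is stored in the declared type
  WriteLn('after concatenation     : ', StringCodePage(a));
end.
```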

Is it really intended that AnsiString behaviour depends on such details?


The simplest solution would be to disallow differing static and dynamic encodings of AnsiStrings, except for RawByteString. Then no additional checks and conversions are required, except the one in the assignment of a RawByteString to an AnsiString of a different type, and everything else can be determined by the compiler from the known static=dynamic encoding of strings.

More checks and conversions can be avoided when the dynamic encoding of a string literal is the actual encoding used by the compiler for the stored literal, not a Delphi-incompatible placeholder like CP_ACP. Then TranslatePlaceholderCP is required only for explicitly given encoding values, and no longer for the dynamic encoding of strings.

DoDi
