[fpc-devel] AnsiUpperCase problems

Hans-Peter Diettrich Thu, 04 Dec 2014 06:30:21 -0800

The following console program demonstrates various problems with the new (encoded) AnsiStrings (FPC trunk):


program litTest2;
{.$codepage UTF8} //off for now
uses Classes,SysUtils;
var A: AnsiString;
begin
  a := 'äöü';
  //a := a+' '; //uncomment later
  WriteLn(a,'äöü');
  WriteLn(AnsiUpperCase(a),AnsiUpperCase('äöü'));
end.

The output varies depending on (at least) the file encoding and target platform (tested only on Windows, using Lazarus).

With an Ansi source file the last line shows as 'ÄÖÜÄÖÜ', as expected. The variable also shows as 'äöü', but the literal does not (3 graphical characters instead). In all other (tested) cases something different is shown, with no uppercase letters at all.

With a UTF-8 source file (with BOM) both the variable and the literal show as 'äöü', but unfortunately never in upper case.

Adding {$codepage UTF8} requires a UTF-8 source file. That's compatible with the Lazarus defaults, so further tests (here) will use this combination. Please note that (currently) Lazarus sets or leaves DefaultSystemCodePage according to the actual OS, i.e. 1252 for my installation, regardless of $codepage.


Now all items are shown as 'äöü', but again never in uppercase. Why is that?


AnsiUpperCase finally calls Win32AnsiUpperCase (on Windows), declared as
  function Win32AnsiUpperCase(const s: string): string;
which in turn calls CharUpperBuffA.

This explains why no uppercase conversion is performed when S has a dynamic encoding different from the (WinAPI) CP_ACP, which is what CharUpperBuffA expects. Actually I found the *dynamic* encoding of A and S to be CP_UTF8, even though their static encoding is CP_ACP (or 1252).

Consequently AnsiUpperCase should convert S to the WinAPI CP_ACP (GetACP) before passing it to CharUpperBuffA. The same holds for all other functions with AnsiString arguments that call external (OS API...) routines expecting a specific encoding, on all platforms. It also holds for user code that relies on the encoding of all strings being the declared one, as in:

  str1[1]:=str2[1]; //both strings of same type

IMO such additional checks and conversions should be avoided: they bloat the library code and consume runtime. Note that SetCodePage requires a RawByteString (var parameter), and thus cannot be applied directly to adjust the dynamic codepage of an AnsiString.
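
For illustration only, the kind of conversion discussed above could look like the following sketch. UpperCaseViaACP is a hypothetical helper, not an RTL function and not a proposed patch; it assumes Windows, a UTF-8 source file, and that SetCodePage, UniqueString, GetACP and CharUpperBuffA are available as declared in the System and Windows units. The detour through a RawByteString temporary is the extra code and runtime cost mentioned above.

```pascal
program UpperViaAcp;
{$mode objfpc}{$H+}
{$codepage UTF8} // assumption: this source file is saved as UTF-8

uses
  Windows;

// Hypothetical helper (not part of the RTL): converts S to the WinAPI
// ANSI code page before handing the buffer to CharUpperBuffA.
function UpperCaseViaACP(const S: AnsiString): AnsiString;
var
  Tmp: RawByteString;
begin
  // SetCodePage takes a var RawByteString, so a temporary is needed.
  Tmp := S;
  // Convert = True re-encodes the bytes instead of only relabeling them.
  SetCodePage(Tmp, TSystemCodePage(GetACP), True);
  // CharUpperBuffA modifies the buffer in place; make sure Tmp does not
  // share its data with S (or with a constant) before doing that.
  UniqueString(Tmp);
  if Length(Tmp) > 0 then
    CharUpperBuffA(PAnsiChar(Tmp), DWORD(Length(Tmp)));
  Result := Tmp;
end;

var
  A: AnsiString;
begin
  A := 'äöü';
  WriteLn(UpperCaseViaACP(A));
end.
```

What appears on the console still depends on the console code page, so no expected output is given here.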



Now let's add (uncomment) the line
  a := a+' ';

and voila, AnsiUpperCase works, because now the string has the expected CP_ACP instead of UTF-8. The same effect occurs when A is assigned from a UnicodeString variable.
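
The dynamic code pages mentioned so far can be inspected with StringCodePage. The following is a minimal sketch, assuming a UTF-8 source file and an FPC version with encoded AnsiStrings; the program name and the variable U are made up for this example. The printed numbers depend on compiler version, platform and settings, so none are claimed here.

```pascal
program ShowCodePage;
{$mode objfpc}{$H+}
{$codepage UTF8} // assumption: this source file is saved as UTF-8

var
  A: AnsiString;
  U: UnicodeString;
begin
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);

  // StringCodePage reports the dynamic code page stored with the string data.
  A := 'äöü';
  WriteLn('after literal assignment: ', StringCodePage(A));

  // Concatenation builds a new string.
  A := A + ' ';
  WriteLn('after concatenation: ', StringCodePage(A));

  // Implicit UnicodeString -> AnsiString conversion.
  U := 'äöü';
  A := U;
  WriteLn('after assignment from UnicodeString: ', StringCodePage(A));
end.
```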


Is it really intended that AnsiString behaviour depends on such details?

The simplest solution would be to disallow differing static and dynamic encodings of AnsiStrings, except for RawByteString. Then no additional checks and conversions are required, except the one in the assignment of a RawByteString to an AnsiString of a different type, and everything else can be determined by the compiler from the known static=dynamic encoding of strings.

More checks and conversions can be avoided when the dynamic encoding of string literals is the actual encoding used by the compiler for the stored literal, not Delphi-incompatible placeholders like CP_ACP. Then TranslatePlaceholderCP is required only for explicitly given encoding values, and no longer for the dynamic encoding of strings.


DoDi

