Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Paul Ishenin

On 30.12.2013 9:07, Hans-Peter Diettrich wrote:
> Do you think that FPC should really reproduce all this inconsistent 
> behaviour? Who would test or even specify the compatible behaviour, 
> when every new variation will result in more unexpected results? IMO 
> it's much easier to do it right, and fix the Delphi flaws in FPC.


The work has already been done by the FPC team. AnsiString(codepage) works, 
and works compatibly with Delphi (whether one likes it or not), and the 
behavior is covered by tests. The trunk version is very close to the 2.8 
release. The only related thing which we thought of touching before the 
release was resourcestring handling. If I have some free time during the 
New Year holidays I will look at it.


So here is how one can help at this stage:

1. Check the related FPC tests and write new ones for the missing cases.
2. Compare the FPC and Delphi RTL classes which were adjusted in Delphi 
during the unicodestring move and check whether something minor can be 
added to FPC.


All major changes, like the new TStringList class based on UnicodeString, 
should wait for the 2.8 release.


Best regards,
Paul Ishenin
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Hans-Peter Diettrich

Jonas Maebe wrote:

> I'm inclined to add a global boolean variable to the system unit that
> allows changing this behaviour so that it uses CP_UTF8 instead in
> such cases (defaulting to false, for Delphi compatibility). In
> practice, setting it to true shouldn't cause problems even with
> virtually all Delphi code, as routines that work with rawbytestring
> should be able to handle any code page anyway.


Sounds good, but I fear complications because such a global variable 
will also affect library behaviour. When UTF-8 is used in Lazarus or for 
filenames, and this encoding doesn't work in combination with string 
literals (CP_ACP?), then the Delphi default is not acceptable. When 
string literals are assumed as UTF-8, they won't work with strings of 
CP_ACP or other encodings, for the same reason.


I'd restrict strict Delphi compatibility to string=UnicodeString, if 
ever, and leave the UTF-8 RTL and LCL unaffected by the Delphi flaw. But 
what's Delphi compatibility worth without a UTF-16 LCL?


DoDi



Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Hans-Peter Diettrich

Sven Barth wrote:

> [...]
> This was tested using Delphi XE (it might not compile though as I've 
> just rewritten the code from memory as the original is on a different 
> computer)


Thanks, the code is okay, and it produces the expected results. The 
compiler also warns about a downcast from 'string' to 'RawByteString'.


Then I tested some concatenations:

When s1+s2 is assigned to a UTF-8 variable, the compiler also warns 
about a downcast from 'string' to 'UTF8String', and that's ridiculous. 
Either the conversion from UTF-16 to UTF-8 is flawed, or the warning is 
wrong.


When I write a

  function conc(s1, s2: RawByteString): RawByteString;
  begin
    Result := s1 + s2;
  end;

then

  test(conc(s1, s2));

detects CP 65001 (UTF-8)!

The same happens when conc is redeclared as

  function conca(s1, s2: RawByteString): AnsiString;
  begin
    Result := s1 + s2;
  end;

This also detects CP 65001 (UTF-8), where the function should return a 
CP_ACP (1251) result.


BTW both conc and conca do not produce compiler warnings.
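
For anyone who wants to repeat the test, here is a minimal sketch that combines the two functions above with the Test procedure and the variable declarations from Sven's program. The program name and the {$mode delphi} line for FPC are my additions, and assigning DefaultSystemCodePage directly is simply copied from Sven's code. No output is listed on purpose, because the result is exactly what is in dispute and may differ between compilers and versions.

```pascal
program tconc;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

type
  CP1252String = type AnsiString(1252);

// Prints the dynamic code page stored in the string header.
procedure Test(aArg: RawByteString);
begin
  Writeln(StringCodePage(aArg));
end;

// Concatenation returned as RawByteString.
function conc(s1, s2: RawByteString): RawByteString;
begin
  Result := s1 + s2;
end;

// Same concatenation, but declared to return a CP_ACP AnsiString.
function conca(s1, s2: RawByteString): AnsiString;
begin
  Result := s1 + s2;
end;

var
  s1: UTF8String;
  s2: CP1252String;
begin
  s1 := 'Test';
  s2 := 'Test';
  DefaultSystemCodePage := 1251;
  Test(conc(s1, s2));   // code page chosen for the RawByteString result
  Test(conca(s1, s2));  // declared result type is AnsiString (CP_ACP)
end.
```

Comparing the two printed values with DefaultSystemCodePage shows whether the declared result type has any influence on the encoding that is actually stored.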


Do you think that FPC should really reproduce all this inconsistent 
behaviour? Who would test or even specify the compatible behaviour, when 
every new variation will result in more unexpected results? IMO it's 
much easier to do it right, and fix the Delphi flaws in FPC.


DoDi



Re: [fpc-devel] Improving i8086 performance..

2013-12-29 Thread Kostas Michalopoulos
> That emulator is not cycle-exact, so it doesn't have the same
> characteristics as the real hardware.  PCem comes closer, but is also not
> exact.

Well, it is better than nothing when you have no access to a real 808x machine.


Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Sven Barth

On 29.12.2013 19:26, Hans-Peter Diettrich wrote:
> Jonas Maebe wrote:
>
>> The code page of ansistring concatenations is the code page of the
>> result to which this concatenation is assigned/converted. For
>> rawbytestring, this code page is CP_ACP per Delphi compatibility.
>
> This does not match my experience with Delphi XE :-(
>
> Can you give a Delphi example, so that I can verify this behaviour?


=== code begin ===

program tstrtest;

{$apptype console}

procedure Test(aArg: RawByteString);
begin
  Writeln(StringCodePage(aArg));
end;

type
  CP1252String = type AnsiString(1252);

var
  s1: UTF8String;
  s2: CP1252String;
begin
  s1 := 'Test';
  s2 := 'Test';
  DefaultSystemCodePage := 1251;
  Test(s1);
  Test(s2);
  Test(s1 + s2);
end.

=== code end ===

This will print

=== output begin ===

65001
1252
1251

=== output end ===

This was tested using Delphi XE (it might not compile though as I've 
just rewritten the code from memory as the original is on a different 
computer)


Regards,
Sven


Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Sven Barth

On 29.12.2013 17:53, Hans-Peter Diettrich wrote:

Michael Van Canneyt wrote:



On Sun, 29 Dec 2013, Hans-Peter Diettrich wrote:



This will be combined with the dotted unit filenames, to be Delphi
2010+ compatible.



How do I create source files for use with both versions?


What do you mean by this statement ?


I'm not familiar with dotted unit names, they seem not to be used in XE.
So I only can imagine something like conditionals around the different
items in un/dotted environment, to keep "Classes" separate from
"System.Classes"?



Dotted unit names have been supported since at least Delphi 2007 (maybe 
also from 2005 on), but they weren't extensively used back then (look at 
Generics.Collections for an example). Only XE2 used them extensively, 
which was also when support for default namespaces was added. With XE2 
the RTL units were put into a System namespace, the VCL units were put 
into a VCL namespace and the newly added FireMonkey units were added to 
an FMX namespace. The usage of default namespaces (System and VCL in case 
of a VCL application) allows the usage of backwards compatible single 
identifier unit names.



Are directories involved? If so, does the Delphi structure match the FPC
tree structure?


No directories are involved, it's just the naming.
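
To make the "conditionals around the different items" idea concrete, here is a minimal sketch of a uses clause that compiles with dotted and undotted unit names alike. It assumes that FPC keeps the undotted names and that Delphi XE2 corresponds to CompilerVersion 23.0; the program name is made up. With default namespaces configured the conditional is not needed at all, so this only matters where they are not set up.

```pascal
program dotteduses;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$apptype console}

uses
{$IFDEF FPC}
  // Assumption: FPC uses the undotted unit names.
  Classes, SysUtils;
{$ELSE}
  {$IF CompilerVersion >= 23.0}
  // Delphi XE2 and later: RTL units live in the System namespace.
  System.Classes, System.SysUtils;
  {$ELSE}
  // Older Delphi versions: undotted names only.
  Classes, SysUtils;
  {$IFEND}
{$ENDIF}

var
  list: TStringList;
begin
  // The rest of the source is identical for all three branches.
  list := TStringList.Create;
  try
    list.Add('same code for dotted and undotted unit names');
    Writeln(list[0]);
  finally
    list.Free;
  end;
end.
```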

Regards,
Sven


Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Hans-Peter Diettrich

Jonas Maebe wrote:

> The code page of ansistring concatenations is the code page of the
> result to which this concatenation is assigned/converted. For
> rawbytestring, this code page is CP_ACP per Delphi compatibility.

This does not match my experience with Delphi XE :-(

Can you give a Delphi example, so that I can verify this behaviour?

> I'm inclined to add a global boolean variable to the system unit that
> allows changing this behaviour so that it uses CP_UTF8 instead in
> such cases (defaulting to false, for Delphi compatibility). In
> practice, setting it to true shouldn't cause problems even with
> virtually all Delphi code, as routines that work with rawbytestring
> should be able to handle any code page anyway.


The Result of an f(...): RawByteString should be a string of whatever 
encoding results from its construction.



My view on RawByteString:

1) This type serves as a collector for AnsiStrings of any encoding, 
where otherwise a conversion into UTF-16 (string) or CP_ACP (AnsiString) 
would be required.


2) Variables of type RawByteString are intended only as *local* 
variables, inside subroutines dealing with RawByteStrings.


3) Functions accepting RawByteStrings can provide fast results, when the 
encoding of the string arguments is the same, otherwise they have to use 
Unicode (UTF-8/16) for intermediate results.



Rationale/observations:

[1] Delphi: Only UTF-16 and CP_ACP are explicitly supported in 
overloaded stringhandling functions. This would require converting all 
string arguments other than AnsiString(0) into UTF-16. A RawByteString 
overload (instead of AnsiString(0)) allows processing an AnsiString(x) 
without UTF-16 conversion, when the function code and argument encodings 
do not require such a conversion. Otherwise the RawByteString overloads 
convert all strings into UTF-16 internally, and back again into a 
RawByteString Result. Since UTF-8 is not a specifically supported 
encoding, UTF-16 must be converted back to CP_ACP instead, with possible 
losses.


In fact the AnsiString(0) overloads in AnsiStrings.pas are another 
optimization that does not check the encoding of the string arguments; 
any conversions are assumed to have been performed beforehand. This 
leads to errors when the declared (static) string type of a parameter 
does not match its actual (dynamic) encoding. Such irregular strings can 
be constructed by wrong/unexpected use of RawByteString. Example (XE):


var a: AnsiString; u: UTF8String;

function cpy(s: RawByteString): RawByteString;
begin
  Result := s;
end;
...
a := cpy(u); // now a has encoding UTF-8!

Here the XE compiler omits the conversion of the RawByteString result to 
the declared encoding of the target. Dunno about newer versions.
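
A minimal sketch of how such an irregular string can be detected and repaired, assuming the RTL routines StringCodePage and SetCodePage and the variable typecast to RawByteString behave the same way in XE and in FPC trunk. The program name and the repair step are my additions. Which code page the first Writeln reports depends on the compiler, so no output is given here.

```pascal
program tcpy;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

function cpy(s: RawByteString): RawByteString;
begin
  Result := s;
end;

var
  a: AnsiString;
  u: UTF8String;
begin
  u := 'Test';
  a := cpy(u);

  // Dynamic code page actually stored in the string header of a.
  Writeln('dynamic code page of a: ', StringCodePage(a));
  // Code page that a plain AnsiString is declared to carry (CP_ACP).
  Writeln('DefaultSystemCodePage:  ', DefaultSystemCodePage);

  // If the two differ, convert the payload so that the dynamic
  // encoding matches the declared type again.
  if StringCodePage(a) <> DefaultSystemCodePage then
    SetCodePage(RawByteString(a), DefaultSystemCodePage, True);

  Writeln('after SetCodePage:      ', StringCodePage(a));
end.
```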



[3] Delphi: since the only explicitly supported lossless encoding is 
UTF-16, RawByteString stringhandling functions must convert arguments of 
mixed encodings to UTF-16, and finally back to AnsiString. Here 
a conversion to CP_ACP may occur, because the further use of a 
RawByteString result is unknown. Delphi does not provide UTF-8 
overloads, so this encoding cannot be used when a UnicodeString 
has to be converted into a RawByteString.


FPC: when UTF-8 is used inside RawByteString routines, instead of 
UTF-16, the RawByteString result can have exactly this encoding, for 
lossless handling in further calls, until the result finally is assigned 
to a variable/parameter of a fixed encoding. In particular, no conversion 
to CP_ACP is required when UTF-8 is supported by overloads, or as a 
special case of RawByteString arguments.



So IMO there is no *requirement* that intermediate Unicode strings 
have to be converted to CP_ACP as RawByteString Results. This is only an 
unfortunate consequence of the crippled Delphi handling of encodings 
(disregarding UTF-8), with possible conversion losses. When UTF-8 is 
used for intermediate Unicode strings, the RawByteString results can 
preserve lossless UTF-8 encoding.


DoDi



Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Hans-Peter Diettrich

Michael Van Canneyt wrote:



On Sun, 29 Dec 2013, Hans-Peter Diettrich wrote:


This will be combined with the dotted unit filenames, to be Delphi 
2010+ compatible.



How do I create source files for use with both versions?


What do you mean by this statement ?


I'm not familiar with dotted unit names, they seem not to be used in XE.
So I only can imagine something like conditionals around the different 
items in un/dotted environment, to keep "Classes" separate from 
"System.Classes"?


Are directories involved? If so, does the Delphi structure match the FPC 
tree structure?




Where can I jump in?


When I'm done I will release a version for testing to the public.


Fine :-)



How can a user request a string of a specific allocation size?


You should not.


Okay.



Another one:

I've heard that a mix of encodings converts the (concatenated) output 
(RawByteString?) to CP_ACP, with possible losses. Is this correct?


Define "output" ?


s := SomeACPstr+SomeUTF8str+'äöü';

In XE I can concatenate ACP and UTF-8 strings and assign it to an OEM 
string without losses. Somebody said this will fail in FPC, on e.g.

  FindFirst(myPath+allfiles,faAnyFile,sr);
due to an (intermediate?) conversion of myPath+allfiles to CP_ACP.

Of course the string must be converted to CP_ACP if FindFirst expects 
exactly an AnsiString(0) argument, otherwise something is broken.


DoDi



Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Jonas Maebe

On 29 Dec 2013, at 16:25, Hans-Peter Diettrich wrote:

> I've heard that a mix of encodings converts the (concatenated) output 
> (RawByteString?) to CP_ACP, with possible losses. Is this correct?

The code page of ansistring concatenations is the code page of the result to 
which this concatenation is assigned/converted. For rawbytestring, this code 
page is CP_ACP per Delphi compatibility.

I'm inclined to add a global boolean variable to the system unit that allows 
changing this behaviour so that it uses CP_UTF8 instead in such cases 
(defaulting to false, for Delphi compatibility). In practice, setting it to 
true shouldn't cause problems even with virtually all Delphi code, as routines 
that work with rawbytestring should be able to handle any code page anyway.


Jonas


Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Michael Van Canneyt



On Sun, 29 Dec 2013, Hans-Peter Diettrich wrote:


Michael Van Canneyt wrote:



On Sun, 29 Dec 2013, Hans-Peter Diettrich wrote:

Inspired by the current Lazarus discussion I'd like to learn more about 
the current state of the implementation of the new AnsiStrings.


In case nothing has been done yet, I'd suggest extending TAnsiRec with the new 
codePage and elemSize fields (words). These can be zero for now, so that 
the remaining codebase is not affected. Then it will be possible to play 
around with encoded strings, using the codePage field.




All this was already done a long time ago in trunk.
We're way past that stage.


I'm very confused, I haven't used FPC for a long time. I have to refresh my 
memory of all related procedures...


How do I instruct fpcup to check out the trunk version? (Windows)
I tried to add a parameter fpcURL=trunk to the shortcut, is this correct?

How do I proceed (build, use in Lazarus...)?
Any links appreciated :-)


No idea.



Current stage is the creation of a unicode RTL, where all base file/string 
operations accept unicode strings. This is done too.


Next step is creation of the unicode RTL, where "string" = "widestring".
This will be combined with the dotted unit filenames, to be Delphi 2010+ 
compatible.



How do I create source files for use with both versions?


What do you mean by this statement ?

To allow people to choose, 2 RTLs will be created: one unicode 
(string=widestring), one non-unicode (string=ansistring).


This will result (probably) in 2 paths:
units/os-cpu
units/os-cpu-unicode
This is not decided yet.

I planned the work in February/March.


Thanks :-)

Where can I jump in?


When I'm done I will release a version for testing to the public.


A related question:
Why is the string length set to zero in NewAnsiString, when the allocated 
Length is already known?


Because the allocated memory length is not necessarily equal to the string 
length.
If you have a string of length 50, setting the length to 25 will not 
discard and reallocate the memory block, but merely set the character 
length to 25.


This means that the allocated length is stored somewhere else, in the memory 
block descriptor?


Yes.



How can a user request a string of a specific allocation size?


You should not. But if you absolutely want that: Look up TAnsiRec and do

SetLength(S,AllocationLength-SizeOf(TAnsiRec));

Don't rely on this. Messing with internals is always a bad idea.
That is why TAnsiRec is an internal type, not exposed. 
To prevent people (like you, seemingly) from messing with internals.





Another one:

I've heard that a mix of encodings converts the (concatenated) output 
(RawByteString?) to CP_ACP, with possible losses. Is this correct?


Define "output" ?

Michael.


Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Hans-Peter Diettrich

Michael Van Canneyt wrote:



On Sun, 29 Dec 2013, Hans-Peter Diettrich wrote:

Inspired by the current Lazarus discussion I'd like to learn more 
about the current state of the implementation of the new AnsiStrings.


In case nothing has been done yet, I'd suggest extending TAnsiRec with the 
new codePage and elemSize fields (words). These can be zero for now, 
so that the remaining codebase is not affected. Then it will be 
possible to play around with encoded strings, using the codePage field.




All this was already done a long time ago in trunk.
We're way past that stage.


I'm very confused, I haven't used FPC for a long time. I have to refresh 
my memory of all related procedures...


How do I instruct fpcup to check out the trunk version? (Windows)
I tried to add a parameter fpcURL=trunk to the shortcut, is this correct?

How do I proceed (build, use in Lazarus...)?
Any links appreciated :-)

Current stage is the creation of a unicode RTL, where all base 
file/string operations accept unicode strings. This is done too.


Next step is creation of the unicode RTL, where "string" = "widestring".
This will be combined with the dotted unit filenames, to be Delphi 2010+ 
compatible.



How do I create source files for use with both versions?

To allow people to choose, 2 RTLs will be created: one unicode 
(string=widestring), one non-unicode (string=ansistring).


This will result (probably) in 2 paths:
units/os-cpu
units/os-cpu-unicode
This is not decided yet.

I planned the work in February/March.


Thanks :-)

Where can I jump in?



A related question:
Why is the string length set to zero in NewAnsiString, when the 
allocated Length is already known?


Because the allocated memory length is not necessarily equal to the 
string length.
If you have a string of length 50, setting the length to 25 will not 
discard and reallocate the memory block, but merely set the character 
length to 25.


This means that the allocated length is stored somewhere else, in the 
memory block descriptor?


How can a user request a string of a specific allocation size?


Another one:

I've heard that a mix of encodings converts the (concatenated) output 
(RawByteString?) to CP_ACP, with possible losses. Is this correct?


DoDi



Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Marco van de Voort
In our previous episode, Michael Van Canneyt said:
> 
> Current stage is the creation of a unicode RTL, where all base file/string 
> operations accept unicode strings. This is done too.
> 
> Next step is creation of the unicode RTL, where "string" = "widestring".
> This will be combined with the dotted unit filenames, to be Delphi 2010+ 
> compatible.

If the dotted stuff is ready, as an exercise, the windows unit could be
switched to the scheme.

Btw, it is unicodestring, not widestring.
 


Re: [fpc-devel] Encoded AnsiString

2013-12-29 Thread Michael Van Canneyt



On Sun, 29 Dec 2013, Hans-Peter Diettrich wrote:

Inspired by the current Lazarus discussion I'd like to learn more about the 
current state of the implementation of the new AnsiStrings.


In case nothing has been done yet, I'd suggest extending TAnsiRec with the new 
codePage and elemSize fields (words). These can be zero for now, so that the 
remaining codebase is not affected. Then it will be possible to play around 
with encoded strings, using the codePage field.




All this was already done a long time ago in trunk.
We're way past that stage.

Current stage is the creation of a unicode RTL, where all base file/string 
operations accept unicode strings. This is done too.


Next step is creation of the unicode RTL, where "string" = "widestring".
This will be combined with the dotted unit filenames, to be Delphi 2010+ 
compatible.

To allow people to choose, 2 RTLs will be created: one unicode (string=widestring), 
one non-unicode (string=ansistring).


This will result (probably) in 2 paths:
units/os-cpu
units/os-cpu-unicode
This is not decided yet.

I planned the work in February/March.


A related question:
Why is the string length set to zero in NewAnsiString, when the allocated 
Length is already known?


Because the allocated memory length is not necessarily equal to the string 
length.
If you have a string of length 50, setting the length to 25 will not discard 
and reallocate the memory block, but merely set the character length to 25.
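
As a small illustration of the difference between the string length and the allocation, the sketch below shortens a 50-character string to 25. It only uses the public Length and SetLength routines, so it shows the character length changing. Whether the memory block behind it is kept or reallocated cannot be seen through these routines and is left to the RTL. The program name is made up.

```pascal
program lendemo;

{$ifdef FPC}{$mode objfpc}{$H+}{$endif}

var
  s: AnsiString;
begin
  s := StringOfChar('x', 50);
  Writeln(Length(s));  // 50

  // Shorten the string; only the character length is requested here,
  // the handling of the underlying memory block is up to the RTL.
  SetLength(s, 25);
  Writeln(Length(s));  // 25
  Writeln(s);          // the first 25 characters are preserved
end.
```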


Michael.