Re: [fpc-devel] String and UnicodeString and UTF8String

Jonas Maebe Mon, 10 Jan 2011 09:27:12 -0800

On 10 Jan 2011, at 16:27, Marco van de Voort wrote:

> In our previous episode, Jonas Maebe said:
> 
>> Why should a tstringlist force ansistring(0)?
> 
> I mean that if you locally (for your units) set string=utf8string,
> TStringList still would be ansistring(0) or whatever the default becomes.


I meant: why not use ansistring($ffff) instead? You could even add a property 
to tstringlist that causes it to force the encoding of added strings to a 
particular code page whenever a string is added.
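A minimal sketch of that idea, assuming the Delphi 2009 semantics of RawByteString (= ansistring($ffff)) and the StringCodePage/SetCodePage helpers as found in Delphi 2009+ and later FPC releases. The class TRawStringList and its ForcedCodePage property are hypothetical names for illustration; this is not the real TStringList.

```pascal
program ForcedListSketch;
{$mode objfpc}{$H+}

type
  { Hypothetical class for illustration only; this is not TStringList. }
  TRawStringList = class
  private
    FItems: array of RawByteString;
    FForcedCodePage: TSystemCodePage;
  public
    constructor Create;
    procedure Add(const s: RawByteString);
    function Get(Index: Integer): RawByteString;
    property ForcedCodePage: TSystemCodePage
      read FForcedCodePage write FForcedCodePage;
  end;

constructor TRawStringList.Create;
begin
  inherited Create;
  FForcedCodePage := CP_NONE; { default: keep whatever encoding comes in }
end;

procedure TRawStringList.Add(const s: RawByteString);
var
  tmp: RawByteString;
begin
  tmp := s; { reference-counted copy, no conversion }
  if FForcedCodePage <> CP_NONE then
    SetCodePage(tmp, FForcedCodePage, True); { convert only when asked to }
  SetLength(FItems, Length(FItems) + 1);
  FItems[High(FItems)] := tmp;
end;

function TRawStringList.Get(Index: Integer): RawByteString;
begin
  Result := FItems[Index];
end;

var
  list: TRawStringList;
  u: UTF8String;
begin
  u := 'abc';
  list := TRawStringList.Create;
  try
    list.Add(u);                 { stored with its own encoding }
    WriteLn(StringCodePage(list.Get(0)));
    list.ForcedCodePage := 1252; { from now on, force this code page }
    list.Add(u);                 { converted when added }
    WriteLn(StringCodePage(list.Get(1)));
  finally
    list.Free;
  end;
end.
```

With the property left at CP_NONE the list never touches the encoding; a conversion only happens for callers that explicitly ask for one.
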

>> Or does Delphi force it  to be that way?
> 
> In D2009+ it is unicodestring, period. Everything is unicodestring (UTF16),
> ansistring (+ variants) are for legacy only, and people try to forget
> shortstring as quickly as possible.

Then a unicodestring version is certainly required, and an ansistring($ffff) 
version would have to be called differently.

> I think in the planned Embarcadero cross-compile products, string will also
> be utf-16 on OS X and Linux.  If only because it is (1) easier, and windows
> remains dominant by far (including UTF16 assuming codebases) (2) they plan
> to target QT. 

I think it's a good decision to keep it the same everywhere, since 
string=unicodestring is not an opaque type in any way. As a result, choosing a 
different string type on other platforms would probably break lots of code 
again. And regardless of which toolkit you target on Mac OS X, conversions will 
probably happen anyway. The encoding used by Carbon and Cocoa is not specified 
anywhere afaik, and the CFString/NSString they are based on can use any 
encoding internally (I guess that's probably also UTF-16 for ease of 
processing).

> For me, having a mandatory UTF16 Unix is not an option, and a mandatory UTF8
> Windows neither.  (D2009+ incompatible)

I don't think UTF-16 everywhere would be a big problem.

>> Conversion may indeed be required for output (input would only pass on  
>> the encoding of the input if based on ansistring($ffff))
> 
> ansistring(0), system encoding would be more logical than $ffff. $ffff is
> used more internally in string conversion routines and for strings that are
> not strings.

The fact that the formal return type is $ffff does not mean, afaik, that you 
also have to return something whose internal encoding is set to "$ffff". It can 
still be an ansistring(0), ansistring(OEMSTRING) or whatever. It simply means 
that the encoding won't be forced to anything in particular when you assign a 
value to the function result. If you then assign this function result to 
another variable (which may have a forced encoding), then a conversion will 
happen if the forced encoding is different from the actual one. If you assign 
it to another ansistring($ffff), no encoding change will happen in any case, 
and the destination string will "inherit" the source's encoding.
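
To illustrate, here is a small sketch assuming the Delphi 2009 RawByteString semantics described above (and StringCodePage as provided by Delphi 2009+ and later FPC releases). PassThrough and TLatin1String are hypothetical names used only for this example.

```pascal
program RawByteResult;
{$mode objfpc}{$H+}

type
  { Hypothetical type: an ansistring with code page 1252 forced. }
  TLatin1String = type AnsiString(1252);

{ The formal result type is ansistring($ffff), so nothing is forced when
  assigning to Result: it keeps the actual encoding of the source. }
function PassThrough(const s: RawByteString): RawByteString;
begin
  Result := s;
end;

var
  u: UTF8String;
  r: RawByteString;
  l: TLatin1String;
begin
  u := 'abc';
  r := PassThrough(u); { no forced encoding: r should inherit u's code page }
  l := PassThrough(u); { forced encoding differs: a conversion takes place }
  WriteLn('u: ', StringCodePage(u));
  WriteLn('r: ', StringCodePage(r));
  WriteLn('l: ', StringCodePage(l));
end.
```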

> But what does that mean on Windows, where the console encoding is OEMSTRING
> and not ansistring(0) ?  

As I said: ansistring($ffff). 

>> but I think doing that only when necessary at the lowest level should be
>> no problem.  Many existing frameworks work that way.
> 
> It touches all places where you touch the OS. But indeed one could try to
> split this by doing the classes utf8 or tunicodestring depending on OS.

I'm not sure why you say "indeed", because I did not propose to do that. I only 
proposed keeping as many RTL interfaces as possible in ansistring($ffff) to 
have something that's
a) generic, and
b) with the least chance of resulting in encoding conversion

However...

> And we have to deal with Windows, where the default is UTF16.

... since Delphi 2009 uses (unicode)string everywhere, we need at least also 
unicode versions.

>> Why ansistring(0) for base classes? OS-level interfaces: yes, but why  
>> base classes?
> 
> This is the core problem. What solution will do for everybody
> (legacy,Lazarus,Delphi/unicode?) or (ansistring(0), ansistring(cp_utf8) or
> TUnicodestring) ?
> 
> And what do we do if e.g. Lazarus changes opinion and goes from utf8 to
> utf16 on Windows? (e.g. the Delphi/unicode becomes the dominant influx).
> 
> And do we really want Lazarus' direction to fixate this for everybody?
> 
> Or what if they bring in a new Kylix principle with utf16 base type?

A unicodestring version for Delphi compatibility and, if required, an 
ansistring($ffff) version for all other purposes. Afaik that would also work 
with legacy ansistring=ansistring(0), although it's not yet clear to me what 
happens if you pass an empty ansistring(0) to a rawbytestring var-parameter: 
is it still nil like with current ansistrings, or can you somehow extract its 
declared encoding?

>> I agree that the RTL should work regardless of the used string  
>> encoding, but I don't see why a particular encoding should be enforced  
>> throughout the entire RTL rather than just using ansistring($ffff)  
>> almost everywhere.
> 
> That only solves the 1-byte case.

It's true that you probably need a separate overloaded version for 
unicodestring (just like we currently also have separate overloads for 
ansistring and unicodestring).

> And while that solves some of the
> overloading problems deep in RTL and frameworks, it might not be applicable
> on largescale, since afaik you need to test in the routine for codepages
> manually ? IOW this can't be done in every routine with a string parameter
> in the entire classtree

Most routines don't process strings themselves: they store them, pass them on 
(to routines that may process them) or return them. In those cases, you don't 
have to look at the encoding.

> Btw, while looking up rawbytestring I saw this in the Delphi help:
> 
> "Declaring variables or fields of type RawByteString should rarely, if ever,
> be done, because this practice can lead to undefined behavior and potential
> data loss."

They are right if you mainly care about code maintainability. However, if you 
insist on supporting multiple encodings efficiently and transparently, there is 
no other option. The danger they are talking about mainly occurs when mixing 
rawbytestring and string literals. Even that could actually be solved by the 
compiler: it could insert a conversion of the string literal to the actual 
encoding of the rawbytestring at run time, just like we currently do when 
mixing widestring constants and ansistring. CodeGear chose not to do that, 
presumably for efficiency reasons.

> How will you deal with e.g. Windows? Legacy string=ansistring(0), D2009 is
> string=utf16 TUnicodestring?
> 
> These are not the same types, and inheritance and the other problems will
> kill you if you attempt to combine it. We need two separate targets for
> Windows anyway. Maybe three (if Lazarus persists in UTF8 in windows)

I think at most two are required for any target: unicodestring (D2009 
compatibility) and, if really necessary because the unicodestring version 
somehow causes too much overhead, an ansistring($ffff) version as well. That's 
only for the classes though; I think most of the base RTL can simply be 
ansistring($ffff).

>> Outside the RTL, the encoding mainly matters if you perform manual low-
>> level processing of a string (for i:=1 to length(s) do
>> something_with(s[i])). 
> 
> The RTL is not the only interface with the OS. Like e.g. a widget set that
> may be ansi,UTF8 or UTF16.

Changing the string type in your entire application and RTL only because a 
widgetset uses it does not make sense to me. Generally, you want to process 
everything in whatever format is most convenient, and only convert it to the 
required type once you actually communicate with the component.

>> It's not really clear to me which problem this would solve, but I may  
>> be missing something.
> 
> Mainly the question what the classtree will be. The main operating type used
> in applications.  You always need two RTLs for that, since it can be 1 or 2
> byte, and even if you fixated it on one byte encodings, rawbytestring would
> force you to write case statements in each and every procedure.

That last part is not true. It's only required in those cases where strings are 
directly manipulated and where the overhead is very important. And again: if 
you want to support compiling the RTL for different one-byte code pages that 
operate directly on those strings without any conversions, you have to write 
all that different code anyway. It's mainly a matter of replacing ifdef's with 
case statements.
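
A rough sketch of what such a case statement could look like, again assuming the Delphi 2009 style StringCodePage helper and CP_UTF8 constant. CharCount is a hypothetical routine, and treating every other code page as single-byte is a simplifying assumption.

```pascal
program CodePageCase;
{$mode objfpc}{$H+}

{ Hypothetical helper: counts characters based on the actual code page. }
function CharCount(const s: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  case StringCodePage(s) of
    CP_UTF8:
      begin
        { count every byte that is not a UTF-8 continuation byte (10xxxxxx) }
        Result := 0;
        for i := 1 to Length(s) do
          if (Ord(s[i]) and $C0) <> $80 then
            Inc(Result);
      end;
  else
    { assumption: anything else is treated as a single-byte code page }
    Result := Length(s);
  end;
end;

var
  u: UTF8String;
begin
  u := 'abc';
  WriteLn(CharCount(u));
end.
```

Only routines like this one, which index into the string data, need the case statement; routines that just store or forward the string do not.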


Jonas