Re: [fpc-devel] Unicode proceedings

Michael Schnell Thu, 17 Nov 2011 03:54:41 -0800

On 11/16/2011 05:24 PM, Marco van de Voort wrote:


The original proposal was like (A) but only for the base Unicode encodings
(UTF-8/16 and maybe 32), but it went down due to either excess conversions or
the need for overloading.  The amount of overloading for the current 3-4
string types is already a bit much.  (short/ansi/wide/unicodestring)

...

This is exactly what I meant to say. (It's a viable definition, but...)

(B) was a counter proposal floated by Florian. The cons were pretty much
that you had to guard every encoding sensitive routine (e.g.  every API/OS
call) to enforce the string contained the encoding you expected.  Combining
one and two byte types also cast doubt on the [] operator's performance.

...

This is exactly what I meant to say. (It's a viable definition, but...)

Then Yury proposed to combine A and B, in retrospect a bit like the current
Delphi implementation but with one and two byte encodings in one type.

...

Yep. But IMHO the wording I proposed with (C), such as "object-alike", leads to a more "understandable" definition. In effect it provides identical (or at least very similar) results and may even allow an identical implementation, as most of the differences might be considered "implementation dependent" (not defined by the pure, documented definition of the behavior, such as what happens with "intersexual" variables).

Note that the Delphi2009 definition is theoretically capable of combining one
and two bytes in one type (like Yury's).

As I don't have such a Delphi please help me to understand:

Is there a general type dedicated to holding any encoding (be it ANSIxyz, UTF-8 or UTF-16)?

Of course, when assigning something to a "strictly encoded" string (the type denotes the encoding), the definition of what is supposed to happen is clear and obvious. Whether the type name or the dynamic encoding of the target (even if Length=0) is used for deciding about a conversion is an "implementation detail".

Is there a clear definition of what happens if the "general" string type is the target? Here, IMHO, it would be very hard to understand if the history of the target variable (i.e. whether a string of some encoding has been assigned to it before) decided about a conversion. IMHO a General string type needs to be handled as fully dynamically encoded and thus, as a target, always needs to take on the source's encoding.

Such an "assignment" can happen with ":=" and with function calls. With function calls there are "value" and "var" parameters. All of these should behave identically; any other behavior would be very hard to understand.

And on top of this: what is the type "String"? Of course the general String type would be an obvious choice, but perhaps (depending on the implementation) this might result in worse performance in certain use cases, and thus some strict (specifically encoded) type could be chosen. (In fact I will never again use "String" in any project, but will use a proprietary type defined in some central unit, so that I can at any time make a central change to some specific string type.)

Embarcadero kept the two types separate,

Making a decently clear definition of the behavior (from a user's view) rather complicated.

- backwards compatibility (and thus the hurdle to upgrade)

This does not seem to have worked. Everybody I asked who migrated a large project to the new strings was very unhappy.


Explain parent-child explicitly for this context. This kind of stuff is
what I meant with self contained. Don't use terms that you don't fully
describe elsewhere.

Sorry that I seemingly failed in my intention to help you understand what I meant by stating the similarity to the objects' parent-child relationship.

I just meant that a "General" (or "Raw") string type needs to exist that can hold any encoding and needs no conversion when a strictly encoded variable is assigned to it (via ":=", value parameter or var parameter).

Similar to a parent object, it "is" any strictly encoded string type, so that when it is used as a formal parameter of a function, it can take any strict string type (and of course the general type, too) without conversion.

Similar to an object's runtime type (checked via "is" and "as"), the encoding of a General string can be detected and handled when appropriate (e.g. when combining it with another (strict or general) string, or when assigning it to a strict string variable, which might require a conversion).
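The idea above can be modelled in a few lines. This is only a sketch in Python, not FPC or Delphi code: `EncodedString`, `make`, `describe` and `to_strict` are hypothetical names invented for illustration, and the encoding tag is simply a Python codec name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedString:
    """Hypothetical stand-in: a byte payload plus its dynamic encoding tag."""
    data: bytes
    encoding: str  # a Python codec name, e.g. "utf-8", "utf-16-le", "cp1252"


def make(text: str, encoding: str) -> EncodedString:
    """Build a string in the given encoding (plays the role of a strict type)."""
    return EncodedString(text.encode(encoding), encoding)


def describe(s: EncodedString) -> str:
    """A routine with a 'General' parameter: accepts any encoding, converts nothing."""
    # Run-time detection of the encoding, comparable to an "is" check on an object.
    return f"{len(s.data)} bytes tagged {s.encoding}"


def to_strict(s: EncodedString, target: str) -> EncodedString:
    """Assignment to a strictly encoded variable: convert only if the tags differ."""
    if s.encoding == target:
        return s  # same encoding, the payload is passed through untouched
    return make(s.data.decode(s.encoding), target)
```

Passing a UTF-8 or a UTF-16 value to `describe` leaves the payload as it is, while `to_strict` is the only place where a conversion may happen.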

the RAW string type and the types supposed to hold a specific encoding.

Explain RAW.

See above. "General" or "not Strict" would be more appropriate. (I took the term "Raw" from other recent discussions on the issue.)

Yes, I never really considered (B) a workable solution. It would break
existing code, and the ways to deal with the other problems were hackish at
best.

Yep. But I was told by unhappy coders that the new Delphi way breaks a lot of existing code as well. So a new FPC way has a chance of being better. :) This might (or might not) be a way to do this.

I think the A-B hybrid is better than either A or B. And that is what is
being implemented.

Yep. Only the definition of its behavior is of course a lot more complex. In fact, with "C" I tried (and failed) to find a proper basic definition of exactly this.

Then describe how that should work. What should happen if I pass such marked
raw string to a function that wants encoding<y>?

I hope I did this some lines up. But better see below

So IMHO the Parent-Child (alike) relationship between RAW and any other
new string type is quite obvious.

No it is not. And you don't make the situation any clearer by writing yet
another message without a concrete description (either using specs or with
examples), and not defining RAW and exactly how the parent-child relation
works.

I hope I did this some lines up. But better see below


I've been doing OOP for 15 years now, but I've no idea whatsoever.

Obviously it was not as good an idea as I thought to compare the relation between a single "General" and multiple "strictly encoded" string types to that between a Parent object and multiple Child objects. But I am not at all against dropping this analogy and just using a "self-contained" definition. Moreover, I think we agree upon dropping the term "Raw" for the general string type.


So the wording could be similar to:

- There is a General String type that can hold any encoding (and any width of the code elements).
- There are lots of Strict String types that are supposed to hold strings in a predefined encoding (somebody else might describe in detail how these are defined).
- There are the appropriate single-character types corresponding to all of the above string types.
- A variable of the General string type or the General character type only has a defined encoding if it has previously been the target of an assignment of a non-empty string or a character with a defined encoding.
- If just using strict types, the conversion rules are obvious.
- If assigning a value to a General string or character (via ":=", value or var parameter), no conversion is done.
- There are means to detect the actual encoding of a General String or Character variable.
- If combining any Strings/Chars with General Strings or Chars, the encoding of the General ones is fetched from their embedded dynamic encoding definition to decide upon conversion.
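To make these rules concrete, here is a small Python model of them. It is a sketch under assumptions, not a description of any FPC or Delphi implementation: `GeneralString`, `assign` and `concat` are hypothetical names, encodings are Python codec names, and treating an empty assignment as "encoding undefined" is one possible reading of the fourth rule.

```python
class GeneralString:
    """Hypothetical model of a General string variable with a dynamic encoding."""

    def __init__(self):
        self.data = b""
        self.encoding = None  # undefined until a non-empty value is assigned

    def assign(self, data: bytes, encoding: str) -> None:
        """Assignment (':=', value or var parameter): never converts.

        The payload and its tag are taken over as they are; whatever encoding
        the variable held before plays no role.
        """
        self.data = data
        # Assumption: an empty source leaves the variable without a defined encoding.
        self.encoding = encoding if data else None


def concat(target_encoding: str, *parts: GeneralString) -> bytes:
    """Combine operands into a strictly encoded result.

    Each operand's embedded tag decides whether it has to be converted.
    """
    out = b""
    for part in parts:
        if part.encoding is None:
            continue  # empty or never assigned: contributes nothing
        if part.encoding == target_encoding:
            out += part.data  # already in the target encoding
        else:
            out += part.data.decode(part.encoding).encode(target_encoding)
    return out
```

In this model a variable that first received UTF-16 data and then UTF-8 data ends up tagged UTF-8, so the history of the target never triggers a conversion; only `concat` converts.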


I hope this is more like what you'd like to see.

Note that the recent discussion about how variable passing with RAW / non-RAW strings is implemented might be decided by such a definition.

Note that Delphi seemingly introduced the encoding values $0000 for "None"/"to be assigned" and $FFFF for "Raw". This allows a string variable to be "General" or "Raw" with different meanings. How can this be used to implement proper handling of General variables that hold a certain encoding but still are strictly General, so that they get a different encoding with the next assignment? I don't see whether this helps implement the above definition, or whether it contradicts it and/or introduces nasty ambiguity.

Of course you can try to create some object based stringtype like C++, but
then you will have to deal with all its problems, and the fact that Pascal's
object model is not the same as C++'s. Also stuff that we take for granted
(like copy-on-write) would be hard.

Of course I agree.

Thanks,
-Michael

