On 11/17/2011 02:55 PM, Sven Barth wrote:
On 17.11.2011 12:59, Michael Schnell wrote:
Note that the Delphi2009 definition is theoretically capable of
combining one and
two bytes in one type (like Yury's).
As I don't have such a Delphi please help me to understand:

Is there a general type dedicated for being able to hold any encoding ?
(be it ANSIxyz, UTF-8 or UTF-16) ?

In theory the AnsiString type (which is now the code page aware string type) should be capable of holding UTF-8 and UTF-16 data,
Why should a type that is capable of holding multiple different UTF encodings be called "ANSIString"? IMHO this is very counterintuitive. I think FPC should establish a better name (such as "GeneralString" or similar). This would not harm Delphi compatibility, as a type alias could provide the Delphi name.

but either the direct unconverted storage of 2 byte data (UTF-16) is forbidden or undefined (don't remember which one it is in Delphi).
What do you mean by "unconverted"? What I mean is a type that can simply "be" any of the "strict" types and thus provides fully dynamic encoding for applications (functions) that want to handle any encoding with the same code sequence (being aware that they take the corresponding conversion performance hit when combining differently encoded strings).
Such "assignment" can happen with ":=" and with function calls. With
function calls there are "value" and "var" parameters. All of these should
behave identically; any other behavior would be very hard to understand.
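
To make the idea concrete, here is a minimal sketch of such a dynamically encoded string. It is written in Python purely as a model, not as a proposal for the Pascal implementation. The class name GeneralString, the helper assign_by_value and the rule that the left operand decides the encoding of a concatenation are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralString:
    """Hypothetical model: raw bytes plus the name of their encoding."""
    data: bytes
    encoding: str

    @classmethod
    def from_text(cls, text: str, encoding: str) -> "GeneralString":
        return cls(text.encode(encoding), encoding)

    def converted(self, encoding: str) -> "GeneralString":
        # No work at all when the encoding already matches.
        if encoding == self.encoding:
            return self
        text = self.data.decode(self.encoding)
        return GeneralString(text.encode(encoding), encoding)

    def __add__(self, other: "GeneralString") -> "GeneralString":
        # Assumption: the left operand decides the result encoding, so only
        # the right operand may pay the conversion cost.
        right = other.converted(self.encoding)
        return GeneralString(self.data + right.data, self.encoding)

    def text(self) -> str:
        return self.data.decode(self.encoding)


def assign_by_value(s: GeneralString) -> GeneralString:
    # Models ":=" and a value parameter: payload and encoding travel
    # together, and nothing is converted.
    return s


latin = GeneralString.from_text("Gr\u00fc\u00dfe ", "latin-1")
wide = GeneralString.from_text("Welt", "utf-16-le")
combined = latin + wide  # conversion happens here, and only here
```

In this model an assignment never converts, whichever way it is spelled; the conversion cost is paid only where two differently encoded strings are combined.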

Don't forget about "out". Since it sets the string to empty, I don't know offhand what Delphi does here (e.g. which code page the string will contain).
Of course we need a decent definition for this case. As I have never intentionally used "out" parameters, I am not aware of the exact implications, but I am sure that there is a way to arrive at a decently compatible definition.

In Delphi the type "String" is an alias to "UnicodeString", thus a 2-byte string.
IMHO, predefining a type named UnicodeString to be encoded as UTF-16 is counterintuitive. I think FPC should establish better naming (such as UTF16String for something that is predefined to be encoded that way, if it in fact makes sense to define such a type in the language itself). For Delphi compatibility, type aliasing could be used.
In FPC there is no final decision yet and thus currently "String" is an "AnsiString" set to a specific codepage (though I honestly don't know which one it is...).
So I hope this discussion might help to promote a string type functionality and naming system that is better than what Delphi currently provides.


I feel that, given the current state of the discussion, the following types should be defined (I don't intend to fix the exact names by this, nor to make any assumption about how to implement them):

- GeneralString (fully dynamic encoding: can hold any encoding with 1-, 2- and 4-byte code words; no conversion when used as the target of an assignment; automatic conversion whenever necessary)

- RawByteString (one-byte code words, never doing a conversion, supposedly triggering an exception when combined with a variable that requires a dedicated encoding)

- RawWordString (two-byte code words, working like RawByteString)

- RawDWordString (four-byte code words, working like RawByteString)

- UTF8String (one-byte code words, behavior is obvious)

- UTF16String (two-byte code words, behavior is obvious)

- UTF32String (and/or UCS4String) (four-byte code words, behavior is obvious)

- ANSIString(n) (strictly encoded according to an ANSI code page, one-byte code words, behavior is obvious)

- ANSIString (and/or LocaleString) = ANSIString(n), with n defined by the current locale. This one should work very much like the plain old "String", even though the implementation is different.

and as a goody this could be implemented later:

- RawByteFIFOString (behaving exactly like RawByteString, but implemented in a way that makes deleting from position 1 much faster, while any other operation might be slower)
- RawWordFIFOString (obvious)
- RawDWordFIFOString (obvious)
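
One possible way to get cheap deletion from position 1 is to keep a head offset into the buffer and compact only occasionally. The sketch below models that in Python. The class name RawByteFifoString, its methods and the "compact when more than half is dead space" rule are assumptions chosen for the illustration, not a statement about how FPC would or should implement it.

```python
class RawByteFifoString:
    """Hypothetical model of a byte string with cheap deletion at the front."""

    def __init__(self, data: bytes = b"") -> None:
        self._buf = bytearray(data)
        self._head = 0  # index of the first live byte in _buf

    def __len__(self) -> int:
        return len(self._buf) - self._head

    def append(self, data: bytes) -> None:
        self._buf += data

    def delete_front(self, count: int) -> bytes:
        """Remove and return up to `count` bytes from the start."""
        count = max(0, min(count, len(self)))
        removed = bytes(self._buf[self._head:self._head + count])
        # Deleting only advances the head; no bytes are moved here.
        self._head += count
        # Compact only when more than half of the buffer is dead space,
        # so the copying cost is spread over many deletions.
        if self._head > len(self._buf) // 2:
            del self._buf[:self._head]
            self._head = 0
        return removed

    def to_bytes(self) -> bytes:
        return bytes(self._buf[self._head:])


fifo = RawByteFifoString(b"abcdef")
first = fifo.delete_front(2)   # b"ab", without shifting the remaining bytes
fifo.append(b"gh")             # fifo now holds b"cdefgh"
```

The trade-off matches the one described above: removal at the front gets cheaper, while indexed access has to add the head offset and an occasional compaction copies the live bytes.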

Moreover I feel that for some or all of these string types corresponding character types should be provided. Otherwise I don't see how consistent programming could be enabled. This obviously includes a dynamically typed character type.

Moreover, IMHO, the meaning of the position-aware functions (MyString[i], pos(), copy(), delete(), ...) should be reconsidered, to allow users to declare whether they want to work on code positions (fast) or on visual character positions (meaningful, user-friendly).
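
To show how far the two notions of "position" can drift apart, here is a small Python sketch that counts the same text in several ways. The function names are made up for the illustration, and visual_characters is only a rough approximation that skips combining marks; proper segmentation into user-perceived characters follows the Unicode text segmentation rules (UAX #29) and is more involved.

```python
import unicodedata


def utf8_code_units(text: str) -> int:
    return len(text.encode("utf-8"))


def utf16_code_units(text: str) -> int:
    # "utf-16-le" avoids the byte order mark that plain "utf-16" would add.
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    return len(text)


def visual_characters(text: str) -> int:
    # Approximation: a combining mark is counted as part of the
    # preceding character instead of as a character of its own.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "e" + COMBINING ACUTE ACCENT, followed by MUSICAL SYMBOL G CLEF (U+1D11E)
sample = "e\u0301\U0001D11E"

print(utf8_code_units(sample))    # 7
print(utf16_code_units(sample))   # 4
print(code_points(sample))        # 3
print(visual_characters(sample))  # 2
```

The same two visible characters occupy seven positions in UTF-8 and four in UTF-16, which is why an index into code words and an index into visual characters cannot mean the same thing.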

(I don't think this greatly contradicts what is already implemented in SVN.)

-Michael
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
