Felipe Monteiro de Carvalho wrote:
On Tue, Sep 13, 2011 at 9:23 PM, Michael Van Canneyt
<mich...@freepascal.org> wrote:
One with unicode string, one with ansistring. They will have the same code,
but will be compiled twice, each time with a different compiler define to
decide which version it must be.

Is this possible in UNIX? I can see that in Windows you can use the
trick to use W versions which are identical except for the string type
and drop Windows 9x support, but is this really possible for the UNIX
syscalls? They expect UTF-8 not UTF-16 which is what UnicodeString
uses.

A few topics:

The NT WinAPI (not 9x) *implements* everything in the Wide (UTF-16) routines; the Ansi versions only *convert* their string arguments before calling the Wide version. The Unix API, as far as I know, has no such dual interface with internal conversion.

The NT filesystems store names in UTF-16, while Unix filesystems store byte strings that are nowadays usually UTF-8. This means that access to an NTFS or FAT32 drive under Unix requires a string conversion in the filesystem handler.

On Windows, Ansi means any (byte-char) encoding, possibly with a different (national) codepage on every machine. This causes trouble for Ansi applications (those using Ansi strings) whenever filenames do not convert losslessly into that codepage. Unix IMO uses UTF-8 as its Ansi encoding, which eliminates such losses, and that's why FPC also prefers UTF-8 encoding.
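The lossy-codepage problem can be illustrated with a small sketch. It is written in Python only for brevity, and the filename and the helper `survives` are made up for the illustration; the codepages are just two common Windows examples.

```python
# Hypothetical filename mixing Latin, Cyrillic and Greek characters.
name = "Grüße-Привет-αβγ.txt"

def survives(text, encoding):
    """Return True if text converts to the encoding and back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # At least one character has no representation in this codepage.
        return False

for enc in ("cp1252", "cp1251", "utf-8"):
    print(enc, survives(name, enc))
```

Neither single-byte codepage covers all three scripts, so only the UTF-8 round trip keeps the name intact.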


But let's not forget the user!

Many users still want simple string handling, with a direct mapping between logical and physical chars (as in SBCS). This is not possible at all with UTF-8, while UTF-16 at least works fine within the BMP. This wish for simple string handling suggests the use of UTF-16 for Unicode strings in *user* code.
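A rough sketch of that mapping, counting code units per string (Python used for brevity; the helper names `utf8_units` and `utf16_units` and the sample strings are only for this illustration):

```python
def utf8_units(text):
    """Number of 8-bit code units (bytes) in the UTF-8 form."""
    return len(text.encode("utf-8"))

def utf16_units(text):
    """Number of 16-bit code units in the UTF-16 form."""
    return len(text.encode("utf-16-le")) // 2

# ASCII, Latin-1 umlauts, CJK (all inside the BMP), and one character
# outside the BMP (U+1D11E, musical G clef).
samples = ["abc", "äöü", "日本語", "\U0001D11E"]

for s in samples:
    print(len(s), "chars:", utf8_units(s), "UTF-8 units,",
          utf16_units(s), "UTF-16 units")
```

Inside the BMP every character is exactly one UTF-16 unit, while UTF-8 already needs two or three bytes for the non-ASCII samples. Only the last sample, outside the BMP, needs a surrogate pair in UTF-16.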

WRT the latter argument, FPC IMO should follow the Delphi implementation of Unicode strings as UTF-16. This choice is independent from the (platform dependent) RTL conventions, but it affects the standard components (string lists...) in the FCL, and the other components in the LCL. Here again the average user will prefer UTF-16 component libraries, compatible with his own code, while more experienced users may be happier with the current UTF-8 libraries.

English (ASCII) users also may prefer UTF-8, as long as they do not have to (or want to) deal with strings in foreign languages. Once they have to face the existence of non-ASCII strings in their applications, they will most probably prefer switching to UTF-16, with few changes to their existing codebase and coding habits(!). Really *processing* Unicode text, with all its bells and whistles, is so complicated that it should be left to dedicated software and libraries, while typical application code will ignore everything beyond char level.


IMO the number of required conversions is of little importance to the runtime behaviour of an application. File access is always expensive, so a single conversion into the platform-specific filename representation is not perceptible at all. The same holds for GUI components, which typically store all strings twice: once for their own (and the application's) use, and another copy in the widgets. Here again transfers of strings between widgets and components are rare, with negligible slowdown from any conversions during message handling.

More important IMO is the external storage of Unicode, where I see no reasonable way around UTF-8, considering codepage dependencies and UTF-16 byte-order problems.
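The byte-order problem in a few lines (a Python sketch; the sample text is arbitrary):

```python
text = "Aé"

le = text.encode("utf-16-le")   # little-endian: 41 00 E9 00
be = text.encode("utf-16-be")   # big-endian:    00 41 00 E9
utf8 = text.encode("utf-8")     # one byte sequence on every platform: 41 C3 A9

# A reader that assumes the wrong byte order gets different characters
# (here U+4100 and U+E900), without any error being reported.
wrong = le.decode("utf-16-be")

print(le.hex(), be.hex(), utf8.hex())
print(wrong == text)
```

UTF-16 files therefore need a BOM or an external convention to be read reliably, while UTF-8 has no byte order to get wrong.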

Another note: a "set of char" is quite incompatible with Unicode/UTF-16. This should be taken into account with *every* introduction of a Unicode string type.
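To show where the incompatibility comes from, here is a sketch that emulates a Pascal-style set as a bitmap, assuming the usual limit of 256 elements (ordinals 0..255). It is Python, not Pascal, and `SET_LIMIT`, `make_char_set` and `in_set` are names invented for the illustration.

```python
SET_LIMIT = 256  # assumption: set elements are limited to ordinals 0..255

def make_char_set(chars):
    """Build a bitmap with one bit per member, like a 'set of char'."""
    bits = 0
    for ch in chars:
        code = ord(ch)
        if code >= SET_LIMIT:
            raise ValueError(f"U+{code:04X} does not fit into a set of char")
        bits |= 1 << code
    return bits

def in_set(bits, ch):
    """Membership test; anything beyond the limit can never be a member."""
    code = ord(ch)
    return code < SET_LIMIT and bool((bits >> code) & 1)

digits = make_char_set("0123456789")
print(in_set(digits, "7"))        # ASCII digit
print(in_set(digits, "\u0663"))   # ARABIC-INDIC DIGIT THREE, outside the range
```

Code that tests characters with `in [...]` against such sets silently covers only the first 256 code points, so it needs another construct once the string type holds 16-bit chars.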

DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel