Re: [fpc-devel] Unicode support (yet again)

Hans-Peter Diettrich Wed, 14 Sep 2011 11:21:18 -0700

Felipe Monteiro de Carvalho wrote:

On Tue, Sep 13, 2011 at 9:23 PM, Michael Van Canneyt
<mich...@freepascal.org> wrote:

One with unicode string, one with ansistring. They will have the same code,
but will be compiled twice, each time with a different compiler define to
decide which version it must be.


Is this possible in UNIX? I can see that in Windows you can use the
trick to use W versions which are identical except for the string type
and drop Windows 9x support, but is this really possible for the UNIX
syscalls? They expect UTF-8 not UTF-16 which is what UnicodeString
uses.


A few topics:

The NT WinAPI (not 9x) *implements* everything in the Wide (UTF-16) routines, the Ansi versions do the string *conversion* before calling the Wide version. Unix API (most probably - dunno) has no such dual interface with internal conversion.

The NT filesystems store names in UTF-16, while Unix filesystems store UTF-8. This means that access to an NTFS or FAT32 drive under Unix will require a string conversion, in the filesystem handler.

On Windows, Ansi means any (byte-char) encoding, with different (national) codepages on every machine. This can cause trouble to Ansi applications (using Ansi strings), when filenames do not convert losslessly into that codepage. Unix IMO uses UTF-8 as the Ansi encoding, eliminating possible losses, and that's why FPC also prefers UTF-8 encoding.
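
To make the lossy-codepage point concrete, here is a minimal sketch. It is written in Python rather than Pascal only to keep it short and runnable; the file name is made up, and Windows-1252 merely stands in for "some national codepage".

```python
# Illustration only: a made-up file name with Greek and Cyrillic letters.
name = "\u03b1\u03b2\u03b3_\u0444\u0430\u0439\u043b.txt"

# A round trip through UTF-8 is lossless for any Unicode string.
utf8_round_trip = name.encode("utf-8").decode("utf-8")

# A round trip through a single codepage (Windows-1252 is an assumption
# here) replaces every character the codepage cannot represent with "?".
ansi_round_trip = name.encode("cp1252", errors="replace").decode("cp1252")

print(utf8_round_trip == name)  # True
print(ansi_round_trip)          # ???_????.txt
```

An Ansi application that receives the second form can no longer open the file, because the original name cannot be recovered from it.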



But let's not forget the user!

Many users still want simple string handling, with direct mapping between logical and physical chars (SBCS). This is not possible at all with UTF-8, while UTF-16 works fine with the BMP, at least. This "want of simple string handling" suggests the use of UTF-16 for Unicode strings in *user* code.
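
A small sketch of what "direct mapping" means in practice, again in Python purely for illustration (the helper names `utf8_units` and `utf16_units` are invented for this example). It counts how many code units one logical character occupies in each encoding.

```python
def utf8_units(s):
    """Number of 8-bit code units (bytes) the string needs in UTF-8."""
    return len(s.encode("utf-8"))


def utf16_units(s):
    """Number of 16-bit code units the string needs in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


# One logical character each: ASCII, Latin-1, another BMP char, non-BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

counts = [(utf8_units(s), utf16_units(s)) for s in samples]
print(counts)  # [(1, 1), (2, 1), (3, 1), (4, 2)]
```

Inside the BMP every character is exactly one UTF-16 code unit, so indexing by code unit behaves like the old SBCS strings; only the last sample, outside the BMP, needs a surrogate pair. In UTF-8 the mapping already breaks at the first non-ASCII character.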

WRT the latter argument, FPC IMO should follow the Delphi implementation of Unicode strings as UTF-16. This choice is independent from the (platform dependent) RTL conventions, but it affects the standard components (string lists...) in the FCL, and the other components in the LCL. Here again the average user will prefer UTF-16 component libraries, compatible with his own code, while more experienced users may be happier with the current UTF-8 libraries.

English (ASCII) users also may prefer UTF-8, as long as they do not have to (or want to) deal with strings in foreign languages. Once they have to face the existence of non-ASCII strings in their applications, they will most probably prefer switching to UTF-16, with few changes to their existing codebase and coding habits(!). Really *processing* Unicode text, with all its bells and whistles, is so complicated that it should be left to dedicated software and libraries, while typical application code will ignore everything beyond char level.

IMO the number of required conversions is of little importance to the runtime behaviour of an application. File access is always expensive, so that a single conversion into the platform specific filename representation is not perceptible at all. The same holds for GUI components, which typically store all strings twice: once for their own (and application) use, and another copy in the widgets. Here again transfers of strings between widgets and components are rare, with negligible slowdown from any conversions during message handling.

More important IMO is the external storage of Unicode, where I see no reasonable way around UTF-8, considering codepage dependencies and UTF-16 byte-order problems.
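
The byte-order problem can be shown with a single character. This is only a sketch in Python, with the Euro sign picked arbitrarily as the sample:

```python
text = "\u20ac"  # Euro sign, chosen arbitrarily as the sample character

# UTF-16 has two possible byte orders for the same code unit 0x20AC.
le = text.encode("utf-16-le")  # bytes AC 20
be = text.encode("utf-16-be")  # bytes 20 AC

# UTF-8 is a byte sequence with no byte-order variants.
utf8 = text.encode("utf-8")    # bytes E2 82 AC

# Reading little-endian data as big-endian silently yields another
# valid character (U+AC20) instead of raising an error.
misread = le.decode("utf-16-be")

print(le.hex(), be.hex(), utf8.hex())  # ac20 20ac e282ac
print(misread == text)                 # False
```

A file written as UTF-16 therefore needs a BOM or an external convention before it can be read back reliably, while a UTF-8 file reads the same on every machine.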

Another note: a "set of char" is quite incompatible with Unicode/UTF-16. This should be taken into account with *every* introduction of a Unicode string type.


DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
