Every time you reply to somebody, a new message is created. It's kind of difficult to follow this discussion when you need to look through more than 15 separate messages about the same issue. Please check your news client or something.

Yao G.

On Tue, 08 Jun 2010 10:11:34 -0500, Ruslan Nikolaev <nruslan_de...@yahoo.com> wrote:


Generally Linux systems use UTF-8 so I guess the "system
encoding" there will be UTF-8. But then if you start to use
QT you have to use UTF-16, but you might have to intermix
UTF-8 to work with other libraries in the backend (libraries
which are not necessarily D libraries, nor system
libraries). So you may have a UTF-8 backend (such as the
MySQL library), UTF-8 "system encoding" glue code, and
UTF-16 GUI code (QT). That might be a good or a bad choice,
depending on various factors, such as whether the glue code
sends more strings to the backend or the GUI.

Now try to port the thing to Windows where you define the
"system encoding" as UTF-16. Now you still have the same
UTF-8 backend, and the same UTF-16 GUI code, but for some
reason you're changing the glue code in the middle to
UTF-16? Sure, it can be made to work, but all the string
conversions will start to happen elsewhere, which may change
the performance characteristics and add some potential for
bugs, and all this for no real reason.

The problem is that what you call "system encoding" is only
the encoding used by the system frameworks. It is relevant
when working with the system frameworks, but when you're
working with any other API, you'll probably want to use the
same character type as that API does, not necessarily the
"system encoding". Not all programs are based on extensive
use of the system frameworks. In some situations you'll want
to use UTF-16 on Linux, or UTF-8 on Windows, because you're
dealing with libraries that expect that (QT, MySQL).


Agreed. True, the system encoding is not always that clear-cut. Yet UTF-8 is usually common on Linux (consider also Gtk, wxWidgets, system calls, etc.), while UTF-16 is more common on Windows (consider win32api, DFL, system calls, etc.). Some programs written in C even tend to define their own 'tchar' so that they can be compiled differently depending on the platform.

A compiler switch is a poor choice there, because you can't
mix libraries compiled with different compiler switches
when that switch changes the default character type.

A compiler switch is only necessary for the system programmer. For instance, gcc also has '-fshort-wchar', which changes the width of wchar_t to 16 bits. It DOES break code too, because libraries are normally compiled with a 32-bit wchar_t. Again, it's generally not for the application programmer.


In most cases, it's much better in my opinion if the
programmer just uses the same character type as one of the
libraries he uses, sticks to that, and is aware of what he's
doing. If someone really wants to deal with the complexity of

The programmer generally should not need to know what encoding he works with. For both UTF-8 and UTF-16, it's easy to determine the number of bytes (or words) in a multi-byte (multi-word) sequence just by looking at the first code unit. This could even be a built-in function (e.g. numberOfChars(tchar firstChar)). The size of each element can easily be determined by sizeof. Conversion to UTF-32 and back can be done very transparently.

The only problem it might cause is bindings with other libraries (but in that case you can just use fromUTFxx and toUTFxx; you do this conversion anyway). The same goes for transferring data over the network: again, you can just stick to a particular encoding (for network and files, UTF-8 is better since it's byte-order free).

supporting both character types depending on the environment
it runs on, it's easy to create a "tchar" and "tstring"
alias that depends on whether it's Windows or Linux, or on a
custom version flag from a compiler switch, but that'll be
his choice and his responsibility to make everything work.

If it's left to the programmer's choice, then almost all the advantages of tchar are lost. It's like the garbage collector: if it's used by everybody, you can expect to benefit from it. However, if it's optional, everybody will write libraries assuming no GC is available, and thus almost all the performance advantages are lost.

And after all, one of the goals of D (if I am not wrong) is to be flexible, so that performance gains are available for particular configurations where they can be achieved (it's a fully compiled language). It does not stick to something particular and say 'you must use UTF-8' or 'you must use UTF-16'.

michel.for...@michelf.com
http://michelf.com/







--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
