Bruno,
> wchar_t is a very wrong thing to normalize to, because it is OS and
> locale dependent. UTF-8 is a much better normalization for strings,
> both in-memory and on disk. UCS-4 is an alternative, good
> normalization for strings in memory.
>
I agree. Do not assume that wchar_t is Unicode. For example do not use
wchar_t on a Solaris system for Unicode. OTOH if you are using Windows
Unicode facilities, the UTF-8 support is minimal.
I recommend UTF-8 between systems for another big reason. With UTF-16 &
UTF-32 you have big endian and little endian problems. You can always add a
BOM but how many systems will actually check for it and will the BOM create
any problems?
But UTF-8 is not without its own problems. Take Oracle for example. They
designed their UTF8 charset to encode UCS-2, not UTF-16. Thus if I have a
character that is not in plane 0 (> U+FFFF) it will encode the two
surrogates as two 3-byte UTF-8 sequences instead of a single 4-byte UTF-8
character -- an invalid UTF-8 encoding. When they went to 9i they kept this
behavior for their UTF8 charset and added AL32UTF8 for the proper UTF-8
encoding. Why? So that UTF8 data would sort in UTF-16 binary sort order. My
xiua_strcmp, on the other hand, uses Unicode code point compares on UTF-16
data so that it compares the same as UTF-32 and UTF-8. This way, if you use
the AL32UTF8 database encoding, all Unicode will compare the same and you
are not inventing a non-standard UTF-8.
Technically you could also use GB18030, because it also encodes all of
Unicode 3.1. But if you have ever worked with it you will soon see the
advantages of UTF-8. UTF-8 is an MBCS too, but it is far easier to handle.
To check the length of a GB18030 character you have to check the first
byte, and depending on that byte you may also have to check the second
byte. And unlike UTF-8, if you want to back up a character you have to
start at the beginning of the string to find the start of the previous
character.
For internal processing there are often reasons to use UTF-32 or UTF-16. If
you do, you should support the current UTF-32/UTF-16 standards and not
UCS-4/UCS-2. UTF-16 is a bit of a pain because you are back to MBCS issues.
However there is a lot of UTF-16 Unicode out there.
I am sure that you know all this but I thought that it was important to this
discussion.
As you might have gathered, even though I think that ICU is probably the
best C/C++ cross-platform support for Unicode, I do not feel that
UTF-16-only support is enough.
I am not familiar with libiconv. I would like to know more because often
people do not want a large package like ICU. I look forward to looking it
over when you get your link back up.
ICU has an invalid character callback handler. I use it for example to
convert characters that are not in the code page to HTML/XML escape
sequences. I am concerned about things like Euro support. It looks like
most of the world is ignoring the 1/1/2002 date: many browsers and systems
still do not support ISO-8859-15. This routine lets you convert Unicode
data to ISO-8859-1 and escape any Euro symbols to &#8364; automatically. I
submitted the changes to ICU, so it should also work natively in ICU 2.0.
Looking at iconv() I did not see any provisions for special invalid
character handling. Do you have this kind of support in libiconv? If not,
since iconv supports algorithmic converters, I was wondering if you might
want to consider a Unicode to ISO-8859-1 converter with XML escapes. Maybe
call it iso-8859-1-xml. I can produce either decimal or hex escapes, but
decimal escapes are supported by more browsers. (I like hex escapes for
readability.)
If you look at xIUA, my XIUA_FROM_U_CALLBACK_ESCAPE routine is probably
not very portable. When converting to a code page it first produces the
escape sequence in UTF-16, then uses a second ICU converter to convert it
to the proper character set. This is a more flexible approach but overkill
for this type of routine. It also uses a length of 2 to indicate a
surrogate pair.
As an example of when UTF-32 is useful, I also have a converter that
converts digits from any script, all forms of *,-./ plus all fullwidth
Roman characters, to ASCII characters. This can be used for doing numeric
conversions on numeric strings. It was much easier to write using UTF-32.
Carl
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/