Bruno,
> wchar_t is a very wrong thing to normalize to, because it is OS and
> locale dependent. UTF-8 is a much better normalization for strings,
> both in-memory and on disk. UCS-4 is an alternative, good
> normalization for strings in memory.
>
I agree. Do not assume that wchar_t is Unicode. For example do not use
wchar_t on a Solaris system for Unicode. OTOH if you are using Windows
Unicode facilities, the UTF-8 support is minimal.
I recommend UTF-8 between systems for another big reason. With UTF-16 &
UTF-32 you have big endian and little endian problems. You can always add a
BOM but how many systems will actually check for it and will the BOM create
any problems?
But UTF-8 is not without its own problems. Take Oracle for example. They
designed their UTF8 charset to encode UCS-2, not UTF-16. Thus if I have a
character that is not in plane 0 (> U+FFFF) it will encode the two
surrogates as two 3-byte UTF-8 sequences instead of a single 4-byte UTF-8
character -- an invalid UTF-8 encoding. When they went to 9i they kept this
behavior for their UTF8 charset and added AL32UTF8 for the proper UTF-8
encoding. Why? So that UTF8 data would sort in UTF-16 binary sort order. My
xiua_strcmp, on the other hand, uses Unicode code point compares on UTF-16
data so that it compares the same as UTF-32 and UTF-8. This way, if you use
the AL32UTF8 database encoding, all Unicode will compare the same and you
are not inventing a non-standard UTF-8.
Technically you could also use GB18030, because it also encodes all of
Unicode 3.1. But if you have ever worked with it you will soon see the
advantages of UTF-8. UTF-8 is an MBCS too, but it is far easier to handle.
To check the length of a GB18030 character you have to check the first
byte, and depending on that byte you may also have to check the second
byte. And unlike UTF-8, if you want to back up a character you have to
start at the beginning of the string to find the start of the previous
character.
For internal processing there are often reasons to use UTF-32 or UTF-16. If
you do, you should support the current UTF-32/UTF-16 standards and not
UCS-4/UCS-2. UTF-16 is a bit of a pain because you are back to MBCS issues.
However there is a lot of UTF-16 Unicode out there.
I am sure that you know all this but I thought that it was important to this
discussion.
As you might have gathered, even though I think that ICU is probably the
best C/C++ cross-platform support for Unicode, I do not feel that
UTF-16-only support is enough.
I am not familiar with libiconv. I would like to know more because often
people do not want a large package like ICU. I look forward to looking it
over when you get your link back up.
ICU has an invalid character callback handler. I use it for example to
convert characters that are not in the code page to HTML/XML escape
sequences. I am concerned about things like Euro support. It looks like
most of the world is ignoring the 1/1/2002 date: many browsers and systems
still do not support ISO-8859-15. This routine lets you convert Unicode
data to ISO-8859-1 and escape any Euro symbols to &#8364; automatically. I
submitted the changes to ICU, so it should also work natively in ICU 2.0.
Looking at iconv() I did not see any provisions for special invalid
character handling. Do you have this kind of support in libiconv? If not,
since iconv supports algorithmic converters, I was wondering if you might
want to consider a Unicode to ISO-8859-1 converter with XML escapes. Maybe
call it iso-8859-1-xml. I can produce either decimal or hex escapes, but
decimal escapes are supported by more browsers. (I like hex escapes for
readability.)
If you look at xIUA, my XIUA_FROM_U_CALLBACK_ESCAPE routine is probably
not very portable. When converting to a code page it first produces the
escape sequence in UTF-16, then uses a second ICU converter to convert it
to the proper character set. This is a more flexible approach but overkill
for this type of routine. It also uses a length of 2 to indicate a
surrogate pair.
As an example of when UTF-32 is useful, I also have a converter that
converts digits from any script, all forms of *,-./ plus all fullwidth
Roman characters, to ASCII characters. This can be used for doing numeric
conversions on numeric strings. It was much easier to write using UTF-32.
Carl
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/