wchar_t, mbrtowc and iconv (was: Re: Updated UTF-8 decoder stress test file)

Bruno Haible Tue, 05 Sep 2000 10:36:53 -0700
Marcin 'Qrczak' Kowalczyk writes:

> Which endianness variants of UTF-16 make sense to provide? Only BE?
> BE, LE and native?

Those which are specified in RFC 2781: UTF-16BE and UTF-16LE (both
without byte order mark) and UTF-16 (where the first 16-bit word is
interpreted as a byte order mark if it is 0xFFFE or 0xFEFF).
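
To make the "UTF-16" case concrete, here is a minimal sketch of the
byte order mark check. The helper name utf16_detect_order and the enum
are hypothetical, made up for illustration. The fallback to big-endian
when no mark is present follows the SHOULD in RFC 2781.

```c
#include <stddef.h>

/* Byte order to use when decoding a stream labelled plain "UTF-16". */
enum utf16_order { UTF16_BE, UTF16_LE };

/*
 * Hypothetical helper: inspect the first two bytes of a "UTF-16" stream.
 * Sets *bom_len to the number of bytes to skip (2 if a byte order mark
 * was found, 0 otherwise) and returns the byte order to decode with.
 */
static enum utf16_order
utf16_detect_order(const unsigned char *buf, size_t len, size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {  /* word reads as 0xFEFF */
            *bom_len = 2;
            return UTF16_BE;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {  /* word reads as 0xFFFE */
            *bom_len = 2;
            return UTF16_LE;
        }
    }
    /* No mark: RFC 2781 says the text SHOULD be read as big-endian. */
    return UTF16_BE;
}
```

For UTF-16BE and UTF-16LE no such check is made: the byte order is
fixed by the label.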

> What about raw UCS-4?

UCS-4 is not used as an external format in practice.

> I am aware that wchar_t needs not to be Unicode but don't know what
> else can I

If your internal representation of text is Unicode, then why do you
bother with wchar_t[] at all? Just provide a converter between "char*"
(locale dependent encoding) and Unicode.

For the conversions you can use iconv() and a normalizing wrapper
around nl_langinfo(CODESET).
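
A rough sketch of such a converter, assuming UTF-8 as the internal
Unicode representation. The function names to_utf8 and locale_to_utf8
are hypothetical. The sketch passes the nl_langinfo(CODESET) result to
iconv_open() unchanged, so the normalizing step (mapping
platform-specific codeset names to ones iconv understands) is left out.
The caller is assumed to have called setlocale(LC_ALL, "") beforehand.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert in_len bytes of `in`, encoded as `from`,
 * to UTF-8 in `out` (out_size bytes, always NUL-terminated on success).
 * Returns the number of bytes written, not counting the NUL, or
 * (size_t)-1 on any failure (unknown encoding, invalid input, no room).
 */
static size_t
to_utf8(const char *from, const char *in, size_t in_len,
        char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t in_left = in_len;
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return (size_t)-1;
    out_left = out_size - 1;     /* keep one byte for the terminator */

    cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (rc != (size_t)-1)        /* flush any pending shift state */
        rc = iconv(cd, NULL, NULL, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *outp = '\0';
    return (size_t)(outp - out);
}

/* Same, with the source encoding taken from the current locale. */
static size_t
locale_to_utf8(const char *in, size_t in_len, char *out, size_t out_size)
{
    return to_utf8(nl_langinfo(CODESET), in, in_len, out, out_size);
}
```

The reverse direction is the same call with the two iconv_open()
arguments swapped.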

If you then still need wchar_t[] to Unicode conversion (which I think
should be rare - most data is passed around as char*) you can
distinguish two cases:
  - if __STDC_ISO_10646__ is defined, then you just copy the words from
    one array to the other (or apply UTF-8 conversion if that's your
    internal representation).
  - if __STDC_ISO_10646__ is not defined, then you convert wchar_t[]
    to char* and use the char* to Unicode conversion.
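
The two cases could be sketched as follows, assuming UCS-4 code points
in uint32_t as the internal representation. The name wcs_to_ucs4 is
hypothetical. The second branch assumes that the iconv in use accepts
"UCS-4BE" and the unnormalized nl_langinfo(CODESET) name, and it relies
on the POSIX behaviour of wcstombs() with a null destination.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the NUL-terminated wide string `in` to
 * UCS-4 code points in `out` (room for out_cap entries, no terminator).
 * Returns the number of code points written, or (size_t)-1 on failure.
 */
static size_t
wcs_to_ucs4(const wchar_t *in, uint32_t *out, size_t out_cap)
{
#ifdef __STDC_ISO_10646__
    /* Case 1: wchar_t values are ISO 10646 code points, so just copy. */
    size_t i;

    for (i = 0; in[i] != L'\0'; i++) {
        if (i == out_cap)
            return (size_t)-1;
        out[i] = (uint32_t)in[i];
    }
    return i;
#else
    /* Case 2: wchar_t[] -> char* (locale encoding) -> UCS-4 via iconv. */
    unsigned char *bytes = (unsigned char *)out;
    size_t mb_len, in_left, out_left, rc, count, i;
    char *mb, *inp, *outp;
    iconv_t cd;

    mb_len = wcstombs(NULL, in, 0);   /* POSIX: returns the size needed */
    if (mb_len == (size_t)-1)
        return (size_t)-1;
    mb = malloc(mb_len + 1);
    if (mb == NULL)
        return (size_t)-1;
    wcstombs(mb, in, mb_len + 1);

    cd = iconv_open("UCS-4BE", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1) {
        free(mb);
        return (size_t)-1;
    }
    inp = mb;
    in_left = mb_len;
    outp = (char *)out;
    out_left = out_cap * sizeof(uint32_t);
    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    free(mb);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Reassemble the big-endian bytes into native uint32_t values. */
    count = out_cap - out_left / sizeof(uint32_t);
    for (i = 0; i < count; i++)
        out[i] = ((uint32_t)bytes[4 * i] << 24)
               | ((uint32_t)bytes[4 * i + 1] << 16)
               | ((uint32_t)bytes[4 * i + 2] << 8)
               | (uint32_t)bytes[4 * i + 3];
    return count;
#endif
}
```

Only one of the two branches is compiled on any given platform, so the
copy case costs nothing where __STDC_ISO_10646__ is defined.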

> What is frustrating about glibc's implementation of iconv is that
> it uses the same internal format as one of my internal formats for
> interfacing to C implementations of conversions (an array of words in
> native endianness) but does not provide that format externally.

In glibc 2.1.93 it does: pass "wchar_t" as the encoding name to
iconv_open(). It also knows about the "UCS-4" and "UCS-4LE" encodings.
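
For illustration, a minimal sketch that decodes UTF-8 directly into a
wchar_t array this way. The function name utf8_to_wchars is
hypothetical, and the sketch assumes an iconv that accepts the name
"WCHAR_T" (glibc and libiconv match encoding names without regard to
case).

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode in_len bytes of UTF-8 into `out`, which
 * has room for out_cap wide characters. No terminator is written.
 * Returns the number of wide characters stored, or (size_t)-1 on
 * failure (name not supported, invalid input, or not enough room).
 */
static size_t
utf8_to_wchars(const char *in, size_t in_len, wchar_t *out, size_t out_cap)
{
    iconv_t cd;
    char *inp = (char *)in;       /* iconv() takes a non-const pointer */
    char *outp = (char *)out;     /* iconv() counts output in bytes */
    size_t in_left = in_len;
    size_t out_left = out_cap * sizeof(wchar_t);
    size_t rc;

    cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    return out_cap - out_left / sizeof(wchar_t);
}
```

The result is in whatever representation the platform uses for
wchar_t, so it is only known to be Unicode where __STDC_ISO_10646__ is
defined.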

> For the default encoding of files, it's not much better. Using iconv
> would have too much limitations :-(

The default encoding of files is "char*", i.e. nl_langinfo(CODESET).
Which limitations does the portable iconv substitute (libiconv) have?

> I wonder what Java implementations do.

Java's FileReader class (which implicitly converts char* to Unicode)
takes an encoding argument. The list of permitted encodings is again
platform and version dependent. It is best not to pass an explicit
encoding argument and to rely on the locale-dependent default instead.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/