Marcin 'Qrczak' Kowalczyk writes:
> > For the conversions you can use iconv() and a normalizing wrapper
> > around nl_langinfo(CODESET).
>
> I don't like the idea of finding and interpreting locale.aliases by
> applications themselves...
It's not the locale aliases, it's the charset aliases. That makes a
difference, because there are far fewer of them, and because many
OSes actually do support the standardized MIME names, with few
exceptions.
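Such a normalizing wrapper can be quite small. Something like this
(an untested sketch; the alias entries below are only examples, a
real table needs a handful of entries per OS):

    #include <langinfo.h>
    #include <locale.h>
    #include <string.h>

    /* Map a few OS-specific charset names to their MIME equivalents.
       These entries are illustrative; most systems already return the
       standard names, so the table stays short. */
    static const char * const aliases[][2] = {
      { "646",            "ASCII"      },  /* e.g. Solaris "C" locale */
      { "ANSI_X3.4-1968", "ASCII"      },  /* e.g. glibc "C" locale */
      { "ISO8859-1",      "ISO-8859-1" },
      { "eucJP",          "EUC-JP"     },
    };

    /* Call setlocale (LC_CTYPE, "") once at program start so that
       nl_langinfo reflects the user's locale. */
    const char *
    locale_charset (void)
    {
      const char *codeset = nl_langinfo (CODESET);
      size_t i;

      for (i = 0; i < sizeof aliases / sizeof aliases[0]; i++)
        if (strcmp (codeset, aliases[i][0]) == 0)
          return aliases[i][1];
      return codeset;
    }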
> > In glibc 2.1.93 it does: use iconv with "wchar_t" argument. It also
> > knows about "UCS-4" and "UCS-4LE" encodings.
>
> Good, so there is a chance that future iconv will be more usable?
It will. glibc now has a testsuite for iconv.
> (it's a development version of glibc, isn't it? so it's still future
> for me).
You can install it anyway, either by building it yourself (instructions
at http://clisp.cons.org/~haible/glibc22-HOWTO.html) or by getting the
beta of the next RedHat distribution.
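Once you have a glibc 2.2 style iconv, the wchar_t conversion looks
roughly like this (an untested sketch; the "wchar_t" encoding name is
the glibc-specific feature mentioned above, not a portable one):

    #include <iconv.h>
    #include <stddef.h>
    #include <wchar.h>

    /* Convert a wide string to UTF-8 through the "wchar_t"
       pseudo-encoding.  Returns the number of bytes written,
       or (size_t)(-1) on error. */
    size_t
    wcs_to_utf8 (const wchar_t *src, char *dst, size_t dstlen)
    {
      iconv_t cd = iconv_open ("UTF-8", "wchar_t");
      if (cd == (iconv_t)(-1))
        return (size_t)(-1);

      char *inptr = (char *) src;
      size_t inleft = (wcslen (src) + 1) * sizeof (wchar_t);
      char *outptr = dst;
      size_t outleft = dstlen;

      size_t r = iconv (cd, &inptr, &inleft, &outptr, &outleft);
      iconv_close (cd);
      return (r == (size_t)(-1)) ? (size_t)(-1) : dstlen - outleft;
    }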
> > Which limitations does the portable iconv substitute (libiconv) have?
>
> That it must be carried along with a package - it's not a tiny wrapper
> around what the OS+std.libc provide but the whole implementation
> from scratch.
You can distribute it as a separate file and let people on systems
with an insufficient iconv() install it before installing your package.
> And that there is no nice way to determine either the name of the
> default local encoding or a known encoding of Unicode (for iconv in
> general). It all looks like kludges and guessing...
Nice or not - nl_langinfo(CODESET) plus a bit of postprocessing works
on most modern systems. And the encoding name "UTF-8" is known
everywhere.
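In code that boils down to something like this (a minimal sketch,
using the locale_charset() wrapper sketched above; error handling
mostly omitted):

    #include <iconv.h>
    #include <locale.h>
    #include <stdio.h>

    extern const char *locale_charset (void);  /* the wrapper above */

    int
    main (void)
    {
      setlocale (LC_CTYPE, "");

      /* "UTF-8" as the target name is understood by every iconv I
         know of; the source name is the normalized result of
         nl_langinfo (CODESET). */
      iconv_t cd = iconv_open ("UTF-8", locale_charset ());
      if (cd == (iconv_t)(-1))
        {
          perror ("iconv_open");
          return 1;
        }
      /* ... convert with iconv (cd, ...) ... */
      iconv_close (cd);
      return 0;
    }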
> How to determine the quality of a locally installed iconv? For example
> I don't consider that in glibc-2.1.3 usable - recently I've seen
> "../iconv/skeleton.c:324: __gconv_transform_utf8_internal: Assertion
> `nstatus == GCONV_FULL_OUTPUT' failed.", there are several other
> errors, checking for illegal UTF-8 is poor.
These bugs are fixed in current glibc.
Solaris iconv is also quite usable, but its checking for invalid input
is poor.
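If you want to probe the installed iconv, e.g. from a configure test,
one crude smoke test is to feed it an invalid UTF-8 sequence and check
for EILSEQ (a sketch only, far from a complete validation; it assumes
the "UCS-4" name mentioned above is available):

    #include <errno.h>
    #include <iconv.h>
    #include <stddef.h>

    /* Returns 1 if iconv rejects an overlong UTF-8 sequence with
       EILSEQ, 0 if it silently accepts it, -1 if the conversion
       cannot even be opened. */
    int
    check_utf8_validation (void)
    {
      iconv_t cd = iconv_open ("UCS-4", "UTF-8");
      if (cd == (iconv_t)(-1))
        return -1;

      char inbuf[] = { (char) 0xC0, (char) 0x80 };  /* overlong NUL */
      char outbuf[16];
      char *inptr = inbuf;
      size_t inleft = sizeof inbuf;
      char *outptr = outbuf;
      size_t outleft = sizeof outbuf;

      size_t r = iconv (cd, &inptr, &inleft, &outptr, &outleft);
      iconv_close (cd);

      return (r == (size_t)(-1) && errno == EILSEQ) ? 1 : 0;
    }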
> How can an iconv implementation be portable if it has to know all
> charsets that are used on all OSes? What worries me is that it must
> do everything itself. If an OS provides an unusual charset, libiconv
> will not see it.
Then someone will hopefully report it to me, and I will add that
unusual charset. Btw, can someone provide conversion tables for
HP-UX's "ccdc" or Solaris' "sun_eu_greek" encodings?
> How do Java implementations find this locale dependent default value?
> Do they use e.g. iconv for the actual conversion? Or determine only the
> name of the encoding somehow and implement the conversion themselves?
The Sun JDK has a documented set of encoding names
(http://www.javasoft.com:80/products/jdk/1.1/docs/guide/intl/encoding.doc.html)
and implements the conversion in Java.
> What about Perl and Python?
Python implements the conversions in Python. I don't know about Perl.
Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/