Re: Utf-8 support in C functions on Linux

Glenn Maynard Thu, 13 Dec 2001 15:57:51 -0800

On Thu, Dec 13, 2001 at 09:48:29AM -0500, Richard, Francois M wrote:
> We were doing some testing with a piece of C code on Linux (Locale sensitive
> thanks to SetLocale() and with a system Locale set first to en_US.utf8 and


You mean setlocale().  C is case-sensitive, and it's good practice to
maintain that in writing.

> then to sv_SV.utf8)  and it looks like magically strcoll() was sorting the
> utf-8 file read in input(two characters only: ä and z). So in en_US.utf8, ä
> came first, then z. And in sv_SV.utf8, z came first, then ä.

Not magic; strcoll() honors the locale explicitly:

"STRCOLL(3)          Linux Programmer's Manual          STRCOLL(3)
...

The comparison is based on strings interpreted as appropriate for the
program's current locale for category  LC_COLLATE.  (See setlocale(3))."
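
To make that concrete, here is a minimal sketch of the setlocale() plus
strcoll() combination.  The helper name collate_in_locale() is made up
for illustration; only setlocale() and strcoll() are real library calls.
It assumes the named locale is installed, and it leaves the process's
LC_COLLATE setting changed afterwards.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: compare a and b under the LC_COLLATE rules of
 * the named locale.  Returns 0 and stores the strcoll() result in *out
 * on success, or -1 if setlocale() rejects the locale name (typically
 * because that locale is not installed).
 * Side effect: the process's LC_COLLATE category stays switched. */
int collate_in_locale(const char *locale, const char *a, const char *b,
                      int *out)
{
    assert(locale != NULL && a != NULL && b != NULL && out != NULL);

    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;

    *out = strcoll(a, b);   /* <0, 0 or >0, like strcmp() */
    return 0;
}
```

Calling this with "en_US.utf8" and then with the Swedish locale (usually
named sv_SE.utf8) on the strings "ä" and "z" should show the sign of the
result changing, provided both locales exist on the system.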

> Does it mean strcoll() properly handle utf-8 data??? I would be very
> surprised. But how to explain the proper sorting results we got?

Don't be--there's been a lot of work done to make glibc honor locales.

> Is there somewhere an extensive list indicating which C char functions do
> handle utf-8 properly and which ones do not (and as a result need to be
> replaced with wide C functions to correctly manipulate utf-8 data)? That
> would save us a lot of time since interpreting our test results is not in
> fact that obvious.

There's rarely any need to replace them with wchar versions--if you want
to deal with UTF-8, you're usually better off dealing with the multibyte
encoding directly.  (In other words, rethink your "as a result". :) The
only case where it's a clear win is where you need fast random access,
and that's very rarely needed.  (Okay, one other: when you want to
handle arbitrary encodings, though I think you'd still be better off
decoding iteratively, rather than converting the whole thing to wchar.)
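
As a rough sketch of what "decoding iteratively" can look like when the
data is known to be UTF-8: the function below pulls one code point at a
time out of a byte string.  utf8_next() is a name invented here, not a
glibc function, and the decoder is deliberately simplified (it does not
reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the code point that starts at s.
 * Stores it in *cp and returns the number of bytes consumed (1 to 4).
 * Returns 0 at the terminating NUL or on a malformed sequence.
 * Simplification: overlong encodings and surrogates are accepted. */
size_t utf8_next(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    assert(s != NULL && cp != NULL);

    if (p[0] == 0)
        return 0;                       /* end of string */
    if (p[0] < 0x80) {
        *cp = p[0];                     /* plain ASCII */
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0)      { *cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }
    else
        return 0;                       /* stray continuation or bad lead byte */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                   /* truncated or malformed sequence */
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}
```

A caller walks the string with something like
"while ((n = utf8_next(s, &cp)) != 0) s += n;" and never needs a wchar_t
copy of the whole buffer.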

Another note: if you need the width of a string frequently, a string
object that caches this can help.  Too bad C++ strings have no
provisions whatsoever for MBCS.  (They drop a whole ton of arbitrary
"character types" and "character traits" junk into that class, trying to
make it "reusable", and it can't even do UTF-8.  Ugh.)

Are there any glibc functions which should honor the locale but as of yet
do not?  I'm not aware of any.  You probably already know of the
functions that don't handle the locale and aren't supposed to
(particularly strlen()).  For background, see
http://www.cl.cam.ac.uk/~mgk25/unicode.html, of course.
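
To illustrate the strlen() point: it counts bytes, so for "ä" followed
by "z" in UTF-8 it reports 3, not 2.  If a character count is really
wanted, a sketch like the following works on input assumed to be valid
UTF-8.  utf8_count() is a hypothetical name, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of code points in a NUL-terminated
 * string that is assumed to be valid UTF-8.  Continuation bytes have
 * the bit pattern 10xxxxxx; every other byte starts a new code point,
 * so counting the non-continuation bytes gives the answer. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

Note that this counts code points, which is not the same thing as
display width or user-perceived characters.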

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
