On Thu, Dec 13, 2001 at 09:48:29AM -0500, Richard, Francois M wrote:
> We were doing some testing with a piece of C code on Linux (Locale sensitive
> thanks to SetLocale() and with a system Locale set first to en_US.utf8 and
You mean setlocale(). C is case-sensitive, and it's good practice to maintain that in writing.

> then to sv_SV.utf8) and it looks like magically strcoll() was sorting the
> utf-8 file read in input(two characters only: ä and z). So in en_US.utf8, ä
> came first, then z. And in sv_SV.uft8, z came first, then ä.

Not magic; strcoll honors the locale explicitly:

"STRCOLL(3) Linux Programmer's Manual STRCOLL(3)
...
The comparison is based on strings interpreted as appropriate for the program's current locale for category LC_COLLATE. (See setlocale(3))."

> Does it mean strcoll() properly handle utf-8 data??? I would be very
> surprised. But how to explain the proper sorting results we got?

Don't be--there's been a lot of work done to make glibc honor locales.

> Is there somewhere an extensive list indicating which C char functions do
> handle utf-8 properly and which ones do not (and as a result need to be
> replaced with wide C functions to correctly manipulate utf-8 data)? That
> would save us a lot of time since interpreting our test results is not in
> fact that obvious.

There's rarely any need to replace them with wchar versions--if you want to deal with UTF-8, you're usually better off dealing with the multibyte encoding directly. (In other words, rethink your "as a result". :)

The only case where it's a clear win is where you need fast random access, and that's very rarely needed. (Okay, one other: when you want to handle arbitrary encodings, though I think you'd still be better off decoding iteratively, rather than converting the whole thing to wchar.)

Another note: if you need the width of a string frequently, a string object that caches this can help. Too bad C++ strings have no provisions whatsoever for MBCS. (They drop a whole ton of arbitrary "character types" and "character traits" junk into that class, trying to make it "reusable", and it can't even do UTF-8. Ugh.)

Are there any glibc functions which should honor the locale but as of yet do not?
I'm not aware of any. You probably already know of the functions that don't handle the locale and aren't supposed to (particularly strlen().)

http://www.cl.cam.ac.uk/~mgk25/unicode.html, of course.

--
Glenn Maynard
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
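Appended illustration of the points above: a minimal sketch, not anything from the original test code. The helper names collate_in_locale() and mb_char_count() are made up for this example. The first shows strcoll() following whatever locale setlocale() installed; the second counts characters by decoding the multibyte string iteratively with mbrlen(), in contrast to strlen(), which counts bytes. The sketch assumes the locales you pass are actually installed (note that glibc names the Swedish locale sv_SE, not sv_SV), and it reports failure rather than guessing when they are not.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: switch the whole program to locale_name and compare
 * a and b with strcoll(). Returns 0 and stores the strcoll() result in
 * *result, or returns -1 if the locale is not available (the current
 * locale is then left unchanged). Note that this changes global state. */
int collate_in_locale(const char *locale_name, const char *a,
                      const char *b, int *result)
{
    assert(a != NULL && b != NULL && result != NULL);
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    *result = strcoll(a, b);   /* ordering comes from LC_COLLATE */
    return 0;
}

/* Hypothetical helper: count characters (not bytes) in s by stepping
 * through it one multibyte character at a time under the current
 * LC_CTYPE. Returns (size_t)-1 on an invalid or truncated sequence. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t remaining, count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);
    remaining = strlen(s);     /* byte length, locale-independent */

    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)
            break;
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

Usage would look something like collate_in_locale("en_US.UTF-8", "ä", "z", &r) followed by the same call with "sv_SE.UTF-8"; the sign of r should differ between the two if both locales are installed, matching what was reported above. With a UTF-8 locale active, mb_char_count("äz") gives the character count while strlen("äz") gives the byte count.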