Re: Unicode support under Linux

Markus Kuhn Thu, 04 Oct 2001 13:48:25 -0700

On Wed, 3 Oct 2001, Richard, Francois M wrote:
> In the GNU library glibc 2.2, is it true to say that all wide character C
> functions are based on UTF-32 (since their character arguments are wchar_t)
> and properly handle such character encoding according to the Locale
> (assuming setlocale(LC_ALL,"") has been called).


Yes. There is an optional macro symbol in the ISO C 99 standard that - if
defined by the compiler - signals that wchar_t is always encoded according
to ISO 10646, independent of the locale:

  __STDC_ISO_10646__

Glibc 2.2 defines that symbol. Other operating systems such as Solaris are
not likely to follow, because they need to keep historic locales with
wchar_t != UCS around for backwards compatibility reasons (though they now
do provide for each historic locale an alternative with wchar_t = UCS).
GNU didn't have any real wchar_t support before glibc 2.2, so there was no
backwards compatibility problem.
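
A minimal sketch of how a program could test for that symbol at compile
time. The function name report_wchar_encoding is made up for this
illustration; only the macro itself comes from the standard.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reports whether the implementation promises that
   wchar_t values are ISO 10646 code points in every locale.
   Returns 1 if __STDC_ISO_10646__ is defined, 0 otherwise. */
int report_wchar_encoding(FILE *out)
{
    assert(out != NULL);
#ifdef __STDC_ISO_10646__
    /* The macro expands to an integer constant of the form yyyymmL. */
    fprintf(out, "wchar_t is ISO 10646 (edition %ld)\n",
            (long)__STDC_ISO_10646__);
    return 1;
#else
    fprintf(out, "wchar_t encoding is locale-dependent\n");
    return 0;
#endif
}
```

Code that relies on wchar_t holding UCS values can use the same #ifdef
to refuse to build, or to fall back to another path, where the promise
is not made.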

> If true, it means that my application can use these wide character C
> functions to process Unicode character data to correctly process Unicode
> character data according to the Locale.

Yes.

> But, is it also true to say that under Linux utf-8 Locales, all C functions
> handle properly char data representing utf-8 character encoded data?

Of course not. How could they? The char functions operate on bytes only.
UTF-8 was specifically designed such that these functions remain as useful
as possible on UTF-8, but they will not work 100% on UTF-8 without
modifications of your application code. For example, most of the ctype.h
functions are completely useless on UTF-8. Most of the string.h functions
are still useful, if you understand what they do (e.g., if you understand
that strlen counts bytes, but not characters or terminal columns).
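
To illustrate the byte/character difference, here is a small sketch that
counts code points by skipping UTF-8 continuation bytes (those of the
form 10xxxxxx). The helper name utf8_count_code_points is hypothetical,
and the code assumes the input is already valid UTF-8. It still does not
give you terminal columns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of code points in a NUL-terminated string
   that is assumed to be valid UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Count every byte that is NOT a continuation byte (10xxxxxx). */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "gr\xC3\xBC\xC3\x9F" (four characters, two of them encoded
in two bytes each), strlen reports 6 while this helper reports 4.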

> Do
> strlen, strchr, strcmp, strcpy, toupper process char data correctly when the
> Locale character encoding is utf-8?

strlen and strchr work on bytes, which can be useful for UTF-8 strings as
well. strcmp and strcpy work fine on UTF-8. toupper is hopeless on UTF-8:
for character class testing and case conversion, you will have to convert
to wchar_t first and use the wide equivalents.
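
A rough sketch of that conversion, using mbstowcs, towupper and wcstombs.
The function name mb_toupper is invented for this example. It assumes
the caller has already selected a suitable locale, for instance with
setlocale(LC_ALL, "") in a UTF-8 environment; in a non-UTF-8 locale the
same code interprets the bytes in that locale's encoding instead.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upper-cases the multibyte string `in` (encoded
   according to the current LC_CTYPE locale) into `out`.
   Returns 0 on success, -1 if the input is invalid in the current
   locale, memory runs out, or the result does not fit in `outsize`. */
int mb_toupper(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);

    /* A string never has more characters than bytes. */
    size_t cap = strlen(in) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (w == NULL)
        return -1;

    size_t n = mbstowcs(w, in, cap);
    if (n == (size_t)-1) {              /* invalid multibyte sequence */
        free(w);
        return -1;
    }

    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t)towupper((wint_t)w[i]);

    /* Convert back; m == outsize would mean no room for the NUL. */
    size_t m = wcstombs(out, w, outsize);
    free(w);
    if (m == (size_t)-1 || m >= outsize)
        return -1;
    return 0;
}
```

Note that the upper-case form of a character need not occupy the same
number of bytes as the original, which is why the output size is checked
separately.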

The problem with these functions is mostly the man pages. They say things
like

  strchr - locate character in string

whereas they should actually say

  strchr - locate byte in string

or perhaps better

  strchr - locate char in string

Before multibyte character sets came up, there wasn't a big difference
between characters, chars and bytes, so the man pages are still a bit
naive here. Even ISO C 99 still uses that language in some places,
though they did make a bit of an effort to say char instead of
character.

> OR do I need to use the wide character
> functions after specific conversion from char to wchar_t of my character
> data?

Sometimes, e.g. for converting to uppercase or in regular expression
processing. Other times, no, for example for substring searching.

> Which functions (wide and non wide character) correctly handle utf-8 data
> and which ones do not?

That should be obvious if you just think about what they actually do. I
don't think it makes sense to give you a list if you are not able to
produce this list yourself. But as a hint: In particular, any function
that receives a single char value as input will not work for UTF-8,
because a single char value can't hold most UTF-8 characters.
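
As an illustration of the hint: strchr takes a single char value, so it
cannot be asked to find a multibyte character, but strstr can be given
the character's whole UTF-8 byte sequence. The wrapper below is only a
sketch with a made-up name (utf8_find_char), and it assumes both
arguments are valid UTF-8. Because UTF-8 lead bytes and continuation
bytes use disjoint value ranges, a match of a complete sequence cannot
begin in the middle of another character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the first occurrence of one UTF-8 encoded
   character, passed as a short NUL-terminated string, in a UTF-8 string.
   Returns a pointer to its first byte, or NULL if it does not occur. */
const char *utf8_find_char(const char *haystack, const char *utf8_char)
{
    assert(haystack != NULL && utf8_char != NULL);
    return strstr(haystack, utf8_char);
}
```

For example, searching "5 \xE2\x82\xAC net" for the euro sign
"\xE2\x82\xAC" yields a pointer to byte offset 2.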

> And what is the recommended option if I want to
> develop a fully i18ned application (use the utf-8 Locale support; or
> convert internally to utf-32 by using wchar_t and wide character functions;
> or use ICU).

It depends on the algorithm. Most of the time, treating UTF-8 strings just
like ASCII strings works fine. As soon as you start to operate on
individual characters (chars, bytes), not on strings, it usually becomes
necessary to decode the UTF-8 into a wide representation. You can do that
either for the entire text you process, or just locally for the single
character in question. The speed tradeoffs should be obvious in each
particular situation.
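
A sketch of the "just locally" option: decoding one character at the
current position without converting the whole text. This is a
hand-written decoder with a hypothetical name (utf8_decode), shown
instead of the locale-dependent mbrtowc so that it does not depend on
the current locale. It rejects overlong forms, surrogates and values
above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the UTF-8 sequence starting at `s` into
   *cp. Returns the number of bytes consumed (1 to 4), or 0 if the
   sequence is malformed or truncated. A NUL byte decodes as U+0000 with
   length 1, so callers should test for end of string themselves. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);

    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        len = 2; c = p[0] & 0x1F; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {
        len = 3; c = p[0] & 0x0F; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {
        len = 4; c = p[0] & 0x07; min = 0x10000;
    } else {
        return 0;       /* stray continuation byte or invalid lead byte */
    }

    for (size_t i = 1; i < len; i++) {
        /* A terminating NUL fails this test too, so we never read past
           the end of a truncated string. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

A loop that advances by the returned length can then hand each decoded
value to a per-character routine while the rest of the program keeps
working on plain UTF-8 strings.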

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
