Re: Roundtripping in Unicode

Marcin 'Qrczak' Kowalczyk Wed, 15 Dec 2004 08:57:38 -0800

Lars Kristan <[EMAIL PROTECTED]> writes:

> OK, strcpy does not need to interpret UTF-8. But strchr probably should.


No. Its argument is a byte, even though it's passed as type int.
By "byte" here I mean "C char value, which is an octet in virtually
all modern C implementations; the C standard doesn't guarantee this
but POSIX does".

Many C functions are not suitable for processing UTF-8, or are
suitable only as long as we consider all non-ASCII characters opaque
bags of bytes. For example, isalpha takes a byte, toupper transforms
a byte to a byte, and strncpy copies up to n bytes even if that cuts
a UTF-8 character in the middle.
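
To illustrate the strncpy problem, here is a minimal sketch of how a caller
could choose a cut point that does not split a character. The name
utf8_safe_prefix is hypothetical (it is not a standard function), and the
sketch assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any standard library.
   Returns the largest length <= n at which s can be cut without
   splitting a UTF-8 character. Assumes s holds valid UTF-8. */
size_t utf8_safe_prefix(const char *s, size_t n)
{
    size_t len = strlen(s);

    if (n >= len)
        return len;
    /* Bytes of the form 10xxxxxx are continuation bytes. Back up
       until the cut lands on the first byte of a character. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;
    return n;
}
```

The result can then be used as the byte count for memcpy or strncpy, with
the terminating zero byte written separately.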

There are wide character versions like iswalpha and towupper. But then
data must be converted from a sequence of char to a sequence of wchar_t.
Standard and semi-standard functions which do this conversion for UTF-8
reject invalid UTF-8 (they all have a means of reporting errors).
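
As a sketch of what this error reporting looks like with one such function,
mbrtowc from <wchar.h>: it returns (size_t)-1 for an invalid sequence and
(size_t)-2 for an incomplete one. The helper name count_chars is hypothetical,
and which encoding gets decoded depends on the current LC_CTYPE locale, so
UTF-8 is only handled if the program has selected a UTF-8 locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters in a multibyte string by
   decoding it with mbrtowc under the current LC_CTYPE locale.
   Returns -1 if the string is invalid or ends in mid-character. */
long count_chars(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    long count = 0;

    memset(&st, 0, sizeof st);      /* initial conversion state */
    while (left > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, left, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (r == 0)
            break;                  /* decoded a null character */
        s += r;
        left -= r;
        count++;
    }
    return count;
}
```

A program would normally call setlocale(LC_CTYPE, "") once at startup before
using this; without that call it runs in the "C" locale, where only ASCII is
guaranteed to be accepted.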

The assumption that wchar_t has something to do with Unicode is not as
common as the assumption that char is a byte. I don't know whether
FreeBSD finally changed their wchar_t to Unicode. And where it is
Unicode, it can be UTF-32 (Unix) or UTF-16 (Windows).

> But then all languages are supposed to provide functions for
> processing opaque strings in addition to their Unicode functions.

Yes, IMHO all general-purpose languages should support processing
arrays of bytes, in addition to Unicode strings.

It's not clear, however, what the API for filenames should look like,
especially for languages that wish to be portable to Windows.

> But sooner or later you need to incorporate the filename in some
> UTF-8 text. An error report, for example.

While it's not clear what a well-behaved application should do by
default, in order to be 100% robust and preserve all information
you must change the usual conventions anyway. Remember that any byte
except "\0" and "/" is valid in a filename, so you must either escape
some characters, or delimit the filename with "\0", or prefix it with
the length, or something similar. Backup software should do this
and not pay attention to the locale. But for end-user software like
an image viewer, processing arbitrary filenames is less important.
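
A minimal sketch of the escaping option, for embedding a filename in an
error report without losing information. The name escape_filename and the
escape syntax (backslash doubled, every byte outside printable ASCII written
as \xHH) are my own assumptions for illustration, not an established
convention.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the zero-terminated filename src into dst.
   Printable ASCII is copied unchanged, a backslash is doubled, and every
   other byte is written as \xHH, so the original bytes can be recovered.
   Returns 0 on success, -1 if dst is too small. */
int escape_filename(const char *src, char *dst, size_t dstsize)
{
    size_t o = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;

        if (c == '\\') {
            if (dstsize - o < 3)        /* 2 chars + terminator */
                return -1;
            dst[o++] = '\\';
            dst[o++] = '\\';
        } else if (c >= 0x20 && c < 0x7F) {
            if (dstsize - o < 2)        /* 1 char + terminator */
                return -1;
            dst[o++] = (char)c;
        } else {
            if (dstsize - o < 5)        /* 4 chars + terminator */
                return -1;
            snprintf(dst + o, 5, "\\x%02X", (unsigned)c);
            o += 4;
        }
    }
    if (o >= dstsize)
        return -1;
    dst[o] = '\0';
    return 0;
}
```

The output is pure ASCII, so it can go into UTF-8 text regardless of what
bytes the filename contained. A buffer of 4 * strlen(src) + 1 bytes is
always large enough.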

> What are stdin, stdout and argv (command line parameters) when a
> process is running in a UTF-8 locale?

Technically they are binary (command line arguments must not contain
zero bytes). Users expect stdin and stdout to be treated as
text or binary depending on the program, while command line arguments
are generally interpreted as text or filenames.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
