Title: RE: Roundtripping in Unicode

Kenneth Whistler wrote:
> Lars said:
>
> > According to UTC, you need to keep processing
> > the UNIX filenames as BINARY data. And, also according to
> UTC, any UTF-8
> > function is allowed to reject invalid sequences. Basically,
> you are not
> > supposed to use strcpy to process filenames.
>
> This is a very misleading set of statements.
Perhaps deliberately so.

>
> First of all, the UTC has not taken *any* position on the
> processing of UNIX filenames.
At this point, I won't make any statement about whether the UTC should or should not do that.
Let me just ask whether it is appropriate to discuss such issues on this list.

>
> It is erroneous to imply that the UTC has indicated that "you
> are not supposed to use strcpy to process filenames."
As long as explanations about validation aren't misinterpreted by some people. Is there a thorough explanation of where and how to apply validation anywhere in the standard?

>
> Any process *interpreting* a UTF-8 code unit sequence as
> characters can and should recognize invalid sequences, but
> that is a different matter.
OK, strcpy does not need to interpret UTF-8. But strchr probably should. Or is it that strchr is for opaque strings and mbschr is for UTF-8 strings? Then strchr should remain as-is and be used for processing filenames. Hopefully you never need to search for a Unicode character in a filename, and strchr-ing for '/' is all you need. But then all languages are supposed to provide functions for processing opaque strings in addition to their Unicode functions. Or, alternatively, they need to carefully define how string functions should process invalid sequences. If that can be done at all.
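
To make concrete what I mean by opaque-string processing, here is a minimal sketch (my own illustration, nothing mandated anywhere; the filename is made up): since UTF-8 never uses bytes below 0x80 inside a multibyte sequence, a plain byte-wise strrchr for '/' works even on a filename that is not valid UTF-8.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* hypothetical filename containing an invalid UTF-8 byte (0xFF) */
        const char name[] = "dir\xff" "sub/file.txt";

        /* byte-wise search for the last '/'; 0x2F can never occur inside
           a UTF-8 multibyte sequence, so no interpretation is needed */
        const char *slash = strrchr(name, '/');
        printf("basename: %s\n", slash ? slash + 1 : name);
        return 0;
    }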

But sooner or later you need to incorporate the filename into some UTF-8 text. An error report, for example. You then need to handle the boundaries quite carefully.
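
By "handling the boundaries" I mean something like the following sketch (again only my own illustration, with a deliberately simplified validity check that ignores overlong forms and surrogates): before splicing the filename into a UTF-8 error message, escape every byte that is not part of a well-formed sequence, so the resulting message is always valid UTF-8.

    #include <stdio.h>
    #include <string.h>

    /* Length (1-4) of a well-formed UTF-8 sequence starting at s, or 0 if
       the bytes do not form one.  Deliberately simplified: no overlong or
       surrogate checks. */
    static int utf8_len(const unsigned char *s, size_t n)
    {
        int len;
        if (s[0] < 0x80) return 1;
        else if ((s[0] & 0xE0) == 0xC0) len = 2;
        else if ((s[0] & 0xF0) == 0xE0) len = 3;
        else if ((s[0] & 0xF8) == 0xF0) len = 4;
        else return 0;
        if ((size_t)len > n) return 0;
        for (int i = 1; i < len; i++)
            if ((s[i] & 0xC0) != 0x80) return 0;
        return len;
    }

    /* Copy the filename into the message, escaping ill-formed bytes as \xNN. */
    static void print_escaped(const char *name)
    {
        const unsigned char *p = (const unsigned char *)name;
        size_t left = strlen(name);
        while (left > 0) {
            int len = utf8_len(p, left);
            if (len == 0) {                     /* invalid byte: escape it */
                printf("\\x%02X", *p);
                p++; left--;
            } else {
                fwrite(p, 1, (size_t)len, stdout);
                p += len; left -= (size_t)len;
            }
        }
    }

    int main(void)
    {
        const char name[] = "report\xFF.txt";  /* made-up name with a stray 0xFF */
        printf("error: cannot open '");
        print_escaped(name);
        printf("'\n");
        return 0;
    }

Whether to escape like this, substitute U+FFFD, or reject outright is exactly the kind of decision every such boundary has to make.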

Not to mention the cost of maintaining existing programs. I think it makes sense to keep looking for other solutions.

>
> If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
> a process claiming conformance to UTF-8 and ask it to interpret
> that as Unicode characters, it should tell me that it is
> garbage. *How* it tells me that it is garbage is a matter of
> API design, code design, and application design.

What are stdin, stdout and argv (command-line parameters) when a process is running in a UTF-8 locale? Binary? Opaque strings? UTF-8?
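
At the level of the C runtime on a UNIX system they arrive as raw bytes whatever LC_CTYPE says; whether and where those bytes get interpreted as UTF-8 is exactly the open question. A trivial sketch of my own, just to make the point, that dumps argv without interpreting it at all:

    #include <stdio.h>

    /* Print each command-line argument as uninterpreted bytes.  Nothing in
       this program cares whether they happen to be valid UTF-8. */
    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++) {
            printf("argv[%d]:", i);
            for (const unsigned char *p = (const unsigned char *)argv[i]; *p; p++)
                printf(" %02X", *p);
            printf("\n");
        }
        return 0;
    }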

> Unicode did not invent the notion of conformance to character
> encoding standards. What is new about Unicode is that it has
> *3* interoperable character encoding forms, not just one, and
> all of them are unusual in some way, because they are designed
> for a very, very large encoded character repertoire, and
> involve multibyte and/or non-byte code unit representations.

The difference is that far more people will be faced with such problems.


Lars
