RE: Roundtripping in Unicode

Kenneth Whistler Tue, 14 Dec 2004 12:48:17 -0800

Lars said:

> According to UTC, you need to keep processing
> the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
> function is allowed to reject invalid sequences. Basically, you are not
> supposed to use strcpy to process filenames.


This is a very misleading set of statements.

First of all, the UTC has not taken *any* position on the
processing of UNIX filenames. That is an implementation issue
outside the scope of what the UTC normally deals with, and I
doubt that it will take a position on the issue.

It is erroneous to imply that the UTC has indicated that "you
are not supposed to use strcpy to process filenames." It has
done nothing of the kind, and I don't know of any reason why
anyone should think otherwise. I certainly use strcpy to process
filenames, UTF-8 or not, and expect that nearly every implementer
on the list has done so, too.

Any process *interpreting* a UTF-8 code unit sequence as
characters can and should recognize invalid sequences, but
that is a different matter.

If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
a process claiming conformance to UTF-8 and ask it to interpret
that as Unicode characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.
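
To make the distinction concrete, here is a minimal sketch in C of
copying bytes versus interpreting them. The names utf8_is_valid and
copy_and_check are hypothetical, invented for this illustration; they
are not part of any standard library, and returning 0 is only one of
many possible ways for such an API to report garbage.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. It only inspects the bytes; it never modifies them. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80) { i++; continue; }                 /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)     { need = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;  /* 0x80..0xC1 and 0xF5..0xFF cannot start a sequence */

        if (len - i - 1 < need) return 0;                /* truncated */
        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return 0;            /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                          /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* surrogate */
        if (cp > 0x10FFFF) return 0;                     /* out of range */
        i += need + 1;
    }
    return 1;
}

/* Hypothetical helper: copies a filename byte for byte with strcpy, then
 * reports whether the copy happens to be well-formed UTF-8.
 * The caller must guarantee that dst is large enough. */
int copy_and_check(char *dst, const char *src)
{
    strcpy(dst, src);  /* plain byte copy, no interpretation */
    return utf8_is_valid((const unsigned char *)dst, strlen(dst));
}
```

Under these assumptions, passing the byte stream above to
copy_and_check copies all six bytes unchanged and returns 0, because
only the interpreting step has any opinion about validity.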

But there is *nothing* new here.

If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to
a process claiming conformance to Shift-JIS and ask it to interpret
that as JIS characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.

Unicode did not invent the notion of conformance to character
encoding standards. What is new about Unicode is that it has
*3* interoperable character encoding forms, not just one, and
all of them are unusual in some way, because they are designed
for a very, very large encoded character repertoire, and
involve multibyte and/or non-byte code unit representations.

> Well, I just hope noone will listen to them and modify strcpy and strchr to
> validate the data when running in UTF-8 locale and start signalling
> something (really, where and how?!). The two statements from UTC don't make
> sense when put together. Unless we are really expected to start building
> everything from scratch.

This is bogus. The UTC has never asked anyone to modify strcpy
and strchr. What anyone implementing UTF-8 using a C runtime
library (or similar set of functions) has to do is completely
comparable to what they have to do for supporting any other
multibyte character encoding on such systems. If your system
handles euc-kr, euc-tw, and/or euc-jp correctly, then adding
UTF-8 support is comparable, in principle and in practice.
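
As a rough sketch of what "comparable" can look like in practice, the
standard C function mbrtowc decodes one character at a time in
whatever multibyte encoding the current locale selects, and reports
an invalid or incomplete sequence through its return value. The
wrapper name mb_count_chars is hypothetical, and the sketch assumes
the caller has already selected a locale with setlocale; the same
code would then serve a euc-jp locale or a UTF-8 locale unchanged.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters in s according to the
 * current locale's multibyte encoding. Returns -1 if mbrtowc reports
 * an invalid or incomplete sequence. */
long mb_count_chars(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    long count = 0;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* embedded NUL; not expected here */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

Note that strcpy and strchr are untouched here: validation lives in
the functions that interpret bytes as characters, not in the ones
that move bytes around.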

--Ken
