Lars said:

> According to UTC, you need to keep processing
> the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
> function is allowed to reject invalid sequences. Basically, you are not
> supposed to use strcpy to process filenames.

This is a very misleading set of statements.

First of all, the UTC has not taken *any* position on the processing of UNIX filenames. That is an implementation issue outside the scope of what the UTC normally deals with, and I doubt that it will take a position on the issue.

It is erroneous to imply that the UTC has indicated that "you are not supposed to use strcpy to process filenames." It has done nothing of the kind, and I don't know of any reason why anyone should think otherwise. I certainly use strcpy to process filenames, UTF-8 or not, and expect that nearly every implementer on the list has done so, too.

Any process *interpreting* a UTF-8 code unit sequence as characters can and should recognize invalid sequences, but that is a different matter.

If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to a process claiming conformance to UTF-8 and ask it to interpret that as Unicode characters, it should tell me that it is garbage. *How* it tells me that it is garbage is a matter of API design, code design, and application design.

But there is *nothing* new here.

If I pass the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> to a process claiming conformance to Shift-JIS and ask it to interpret that as JIS characters, it should tell me that it is garbage. *How* it tells me that it is garbage is a matter of API design, code design, and application design.

Unicode did not invent the notion of conformance to character encoding standards. What is new about Unicode is that it has *3* interoperable character encoding forms, not just one, and all of them are unusual in some way, because they are designed for a very, very large encoded character repertoire, and involve multibyte and/or non-byte code unit representations.

> Well, I just hope noone will listen to them and modify strcpy and strchr to
> validate the data when running in UTF-8 locale and start signalling
> something (really, where and how?!). The two statements from UTC don't make
> sense when put together.
> Unless we are really expected to start building
> everything from scratch.

This is bogus. The UTC has never asked anyone to modify strcpy and strchr. What anyone implementing UTF-8 using a C runtime library (or similar set of functions) has to do is completely comparable to what they have to do for supporting any other multibyte character encoding on such systems. If your system handles euc-kr, euc-tw, and/or euc-jp correctly, then adding UTF-8 support is comparable, in principle and in practice.

--Ken
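
To illustrate the distinction drawn above between copying bytes and *interpreting* them, here is a minimal sketch in C. It is only one possible design, not anything specified by the UTC: the names utf8_is_valid and copy_then_check are hypothetical, and returning 0 or 1 is just one answer to the "how it tells me" question. The validity rules assumed are the usual ones (no stray or missing continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any standard library.
   Returns 1 if the n bytes at s are well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned long cp, min;

        if (b < 0x80) { i++; continue; }            /* ASCII byte */
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte, or 0xF8..0xFF */

        if (n - i < len) return 0;                  /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0; /* not a continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                     /* overlong form */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
        i += len;
    }
    return 1;
}

/* Copy a name as opaque bytes, then separately ask whether it is UTF-8.
   dst must have room for strlen(src) + 1 bytes. */
int copy_then_check(char *dst, const char *src)
{
    strcpy(dst, src);   /* copies whatever bytes are there; no interpretation */
    return utf8_is_valid((const unsigned char *)dst, strlen(dst));
}
```

Given the byte stream <0x80 0xFF 0x80 0xFF 0x80 0xFF> as a filename, strcpy copies it unchanged, and only the separate interpreting step reports it as garbage. Nothing in strcpy or strchr has to change for that to work.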