Re: Unicode, ISO/IEC 10646 Synchronization Issues for UTF-8

Christopher Fynn Fri, 27 Apr 2007 11:16:06 -0700

Rich Felker wrote:

On Fri, Apr 27, 2007 at 05:15:16PM +0600, Christopher Fynn wrote:

N3266 was discussed and rejected by WG2 yesterday. As you pointed out
there are all sorts of problems with this proposal, and accepting it
would break many existing implementations.

That's good to hear. In followup, I think the whole idea of trying to
standardize error handling is flawed. What you should do when
encountering invalid data varies a lot depending on the application.
For filenames or text file contents you probably want to avoid
corrupting them at all costs, even if they contain illegal sequences,
to avoid catastrophic data loss or vulnerabilities. On the other hand,
when presenting or converting data, there are many approaches that are
all acceptable. These include dropping the corrupt data, replacing it
with U+FFFD, or even interpreting the individual bytes according to a
likely legacy codepage. This last option is popular for example in IRC
clients and works well to deal with the stragglers who refuse to
upgrade their clients to use UTF-8. Also, some applications may wish
to give fatal errors and refuse to process data at all unless it's
valid to begin with.
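
The strategies listed above can be sketched in a few lines of Python.
This is only an illustration, not anything proposed in the thread: the
function names, the "latin1-fallback" handler name, and the choice of
Latin-1 as the "likely legacy codepage" are all assumptions made for the
example. It relies on the standard error handlers of bytes.decode() and
on codecs.register_error().

```python
import codecs


def decode_strict(data: bytes) -> str:
    # Fatal-error policy: raises UnicodeDecodeError on any invalid sequence.
    return data.decode("utf-8", errors="strict")


def decode_drop(data: bytes) -> str:
    # Presentation policy: silently drop the invalid bytes.
    return data.decode("utf-8", errors="ignore")


def decode_replace(data: bytes) -> str:
    # Presentation policy: substitute U+FFFD for invalid bytes.
    return data.decode("utf-8", errors="replace")


def decode_preserving(data: bytes) -> str:
    # Non-corrupting policy (e.g. filenames): invalid bytes become lone
    # surrogates, so the original bytes can be recovered exactly.
    return data.decode("utf-8", errors="surrogateescape")


def encode_preserving(text: str) -> bytes:
    # Inverse of decode_preserving.
    return text.encode("utf-8", errors="surrogateescape")


def _latin1_fallback(err):
    # Hypothetical handler: reinterpret only the offending bytes as
    # Latin-1 (an assumed legacy codepage), then resume UTF-8 decoding.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1-fallback", _latin1_fallback)


def decode_legacy_fallback(data: bytes) -> str:
    return data.decode("utf-8", errors="latin1-fallback")
```

Which function is appropriate depends on the application, which is the
point being made: a file manager would want decode_preserving, a chat
client might prefer decode_legacy_fallback, and a strict validator would
use decode_strict.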

Rich

Yes. Someone who was there tells me the main reason it was rejected was
that it was considered out of scope for ISO 10646 or even Unicode to
dictate what a process should do in an error condition: should it throw
an exception, etc. etc. The UTF-8 validity specification is expressed in
terms of what constitutes a valid string or substring rather than what a
process needs to do in a given condition. Neither standard wants to get
into the game of standardizing API-type things like what processes
should do.


- Chris

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/


