Kent Karlsson wrote:

The Unicode 4.0 text further strengthens Conformance Clause
C12, to make this crystal clear:

"C12 When a process generates a code unit sequence which
purports to be in a Unicode character encoding form, it shall
not emit ill-formed code unit sequences.
"C12a When a process interprets a code unit sequence which
purports to be in a Unicode character encoding form, it
shall treat ill-formed code unit sequences as an error
condition, and shall not interpret such sequences as
characters."
And just in case anyone still has any trouble reading the
painfully detailed specification of the UTF-8
encoding form, an explicit note is included there:


"* Because surrogate code points are not Unicode scalar
values, any UTF-8 byte sequence that would otherwise
map to code points D800..DFFF is ill-formed."
So I don't think there is any hole here. If anyone still
thinks that they can use these 3-octet/3-octet encodings
of supplementary characters and call it UTF-8, then they
are either engaging in wishful thinking or are not reading
the standard carefully enough.
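As a concrete illustration (my addition, not part of the original post): the "3-octet/3-octet" encoding means encoding a supplementary character by UTF-8-encoding each half of its UTF-16 surrogate pair separately (the CESU-8 pattern). A conformant UTF-8 decoder must reject it, while the correct 4-byte form is accepted. A minimal Python sketch:

```python
# U+10000 encoded the wrong way: its surrogate pair D800/DC00,
# each half encoded as a 3-byte sequence (CESU-8 style).
cesu = b"\xed\xa0\x80\xed\xb0\x80"

# The correct UTF-8 form is a single 4-byte sequence.
utf8 = "\U00010000".encode("utf-8")
print(utf8)  # b'\xf0\x90\x80\x80'

# Python's UTF-8 decoder conformantly rejects the surrogate encoding.
try:
    cesu.decode("utf-8")
except UnicodeDecodeError:
    print("rejected: ill-formed UTF-8 (maps to surrogate code points)")
```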


The problem I need to deal with is not GENERATING such UTF-8, but how to handle such DATA when my code receives it. For example, suppose I receive a 10K UTF-8 file containing 1000 lines of text, and one UTF-8 sequence on line 990 is ill-formed. Should I fire the "error" for
1. the whole file (10K, 1000 lines),
2. all the lines after line 989,
3. line 990 itself,
4. the text from the leading byte of the ill-formed UTF-8 sequence to the end of the file,
5. the text from the leading byte of the ill-formed UTF-8 sequence to the end of line 990, or
6. the text from the leading byte of the ill-formed UTF-8 sequence to the next leading byte on line 990?


And there are other ways to scope the ERROR; given 20 more minutes I could probably list 10-20 more.

I do believe the error handling should be application specific.
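To make the choice concrete (again my addition, not from the original post), Python's decoder error handlers illustrate two ends of this spectrum: `errors="strict"` rejects the whole input at the first ill-formed sequence (option 1), while `errors="replace"` substitutes U+FFFD for just the ill-formed bytes and keeps decoding (roughly option 6). The application, not the standard, picks the handler:

```python
# A small input with one ill-formed sequence in the middle:
# \xed\xa0\x80 would map to surrogate U+D800, so it is ill-formed UTF-8.
data = b"line 1\nbad: \xed\xa0\x80 tail\nline 3\n"

# Option 1 style: treat the entire input as an error.
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as e:
    print(f"ill-formed sequence at byte offset {e.start}")

# Option 6 style: replace only the ill-formed bytes with U+FFFD
# and interpret everything around them normally.
print(data.decode("utf-8", errors="replace"))
```

Both behaviors conform to C12a: in neither case are the ill-formed bytes interpreted as characters; the standard leaves the error-recovery scope to the application.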





