Kent Karlsson wrote:

The Unicode 4.0 text further strengthens Conformance Clause
C12, to make this crystal clear:

"C12 When a process generates a code unit sequence which
purports to be in a Unicode character encoding form, it shall
not emit ill-formed code unit sequences.
"C12a When a process interprets a code unit sequence which
purports to be in a Unicode character encoding form, it
shall treat ill-formed code unit sequences as an error
condition, and shall not interpret such sequences as
characters."
And just in case anyone still has any trouble reading the
painfully detailed specification of the UTF-8
encoding form, an explicit note is included there:


"* Because surrogate code points are not Unicode scalar
values, any UTF-8 byte sequence that would otherwise
map to code points D800..DFFF is ill-formed."
So I don't think there is any hole here. If anyone still
thinks that they can use these 3-octet/3-octet encodings
of supplementary characters and call it UTF-8, then they
are either engaging in wishful thinking or are not reading
the standard carefully enough.
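As a concrete illustration (my addition, not part of the original post): the "3-octet/3-octet" encoding means encoding a supplementary character by UTF-8-encoding each half of its UTF-16 surrogate pair separately (the CESU-8 pattern). A conformant UTF-8 decoder must reject it, while the correct 4-byte form is accepted. A minimal Python sketch:

```python
# U+10000 encoded the wrong way: its surrogate pair D800/DC00,
# each half encoded as a 3-byte sequence (CESU-8 style).
cesu = b"\xed\xa0\x80\xed\xb0\x80"

# The correct UTF-8 form is a single 4-byte sequence.
utf8 = "\U00010000".encode("utf-8")
print(utf8)  # b'\xf0\x90\x80\x80'

# Python's UTF-8 decoder conformantly rejects the surrogate encoding.
try:
    cesu.decode("utf-8")
except UnicodeDecodeError:
    print("rejected: ill-formed UTF-8 (maps to surrogate code points)")
```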


The problem I need to deal with is not GENERATING such UTF-8, but how to handle such DATA when my code receives it. For example, suppose I receive a 10K UTF-8 file containing 1000 lines of text, and one UTF-8 sequence on line 990 is ill-formed. Should I fire the "error" for
1. the whole file (10K, 1000 lines),
2. all the lines after line 989,
3. line 990 itself,
4. the text from the leading byte of the ill-formed UTF-8 sequence to the end of the file,
5. the text from the leading byte of the ill-formed UTF-8 sequence to the end of line 990, or
6. the text from the leading byte of the ill-formed UTF-8 sequence to the next leading byte on line 990?


And there are other ways to scope the ERROR; given 20 more minutes I could probably list 10-20 more.

I do believe the error handling should be application specific.
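To make the choice concrete (again my addition, not from the original post), Python's decoder error handlers illustrate two ends of this spectrum: `errors="strict"` rejects the whole input at the first ill-formed sequence (option 1), while `errors="replace"` substitutes U+FFFD for just the ill-formed bytes and keeps decoding (roughly option 6). The application, not the standard, picks the handler:

```python
# A small input with one ill-formed sequence in the middle:
# \xed\xa0\x80 would map to surrogate U+D800, so it is ill-formed UTF-8.
data = b"line 1\nbad: \xed\xa0\x80 tail\nline 3\n"

# Option 1 style: treat the entire input as an error.
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as e:
    print(f"ill-formed sequence at byte offset {e.start}")

# Option 6 style: replace only the ill-formed bytes with U+FFFD
# and interpret everything around them normally.
print(data.decode("utf-8", errors="replace"))
```

Both behaviors conform to C12a: in neither case are the ill-formed bytes interpreted as characters; the standard leaves the error-recovery scope to the application.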





