RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

Kent Karlsson Sun, 02 Mar 2003 02:53:12 -0800


Michael (michka) Kaplan:
...
> then the conversion will simply strip the errant characters. Note that
> either solution meets the needs of refusal to interpret the errant
> sequences.


Simply stripping the errant byte sequences means that they are
each interpreted as the empty string of characters.  To me, that
contradicts:

   "C12a When a process interprets a code unit sequence which
    purports to be in a Unicode character encoding form, it
    shall treat ill-formed code unit sequences as an error
    condition, and shall not interpret such sequences as
    characters."

On the other hand I think C12a is too harsh.  It essentially
requires either an error stop, or at least division of the
input into a sequence of runs of text with possible error
byte (for UTF-8) sequences at the borders between the runs.
I think it would be ok to replace each errant byte sequence with
characters that indicate that there may have been an error
(which excludes the empty string).  SUBSTITUTE ("SUB is used
in the place of a character [sic] that has been found to be
invalid or in error, SUB is intended to be introduced by
automatic means") seems to fit that.
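
To make the three options concrete, here is a rough Python sketch
(not taken from any existing converter) that decodes the same
ill-formed UTF-8 input by stripping, by substituting SUB (U+001A),
and by stopping with an error.  The handler name "sub_replace" and
the sample bytes are assumptions made up for this illustration;
codecs.register_error and the "ignore"/"strict" handlers are
standard Python.

```python
import codecs

SUB = "\u001a"  # SUBSTITUTE control character


def _sub_handler(exc):
    # Hypothetical handler: emit SUB for the ill-formed bytes reported
    # by the decoder, then resume right after them.
    if isinstance(exc, UnicodeDecodeError):
        return (SUB, exc.end)
    raise exc


# "sub_replace" is a name invented for this sketch.
codecs.register_error("sub_replace", _sub_handler)

data = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

# Option 1: strip.  The errant byte becomes the empty string.
stripped = data.decode("utf-8", errors="ignore")

# Option 2: substitute.  The errant byte leaves a visible marker.
marked = data.decode("utf-8", errors="sub_replace")

# Option 3: error stop.  Decoding refuses to continue.
try:
    data.decode("utf-8", errors="strict")
    stopped = False
except UnicodeDecodeError:
    stopped = True
```

With stripping, the result cannot be told apart from the decoding
of a well-formed b"abcd"; with SUB, the position of the error
survives in the output.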

(Ken's "Titan" discussion earlier is at a much lower "protocol
level"; byte string, or even bit string level).

                /kent k
