Michael (michka) Kaplan:
...
> then the conversion will simply strip the errant characters. Note that
> either solution meets the needs of refusal to interpret the errant
> sequences.
Simply stripping the errant byte sequences means that each of them is interpreted as the empty string of characters. To me, that contradicts:

"C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition, and shall not interpret such sequences as characters."

On the other hand, I think C12a is too harsh. It essentially requires either an error stop, or at least a division of the input into a sequence of runs of text, with possible error byte sequences (for UTF-8) at the borders between the runs.

I think it would be ok to replace each errant byte sequence with characters that indicate that there may have been an error (which excludes the empty string). SUBSTITUTE ("SUB is used in the place of a character [sic] that has been found to be invalid or in error, SUB is intended to be introduced by automatic means") seems to fit that.

(Ken's "Titan" discussion earlier is at a much lower "protocol level": byte string, or even bit string, level.)

/kent k
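
For illustration, here is a minimal sketch of the three behaviours discussed above (error stop, stripping, and substituting SUB), assuming Python's built-in UTF-8 decoder. The handler name "sub" and the function sub_on_error are hypothetical, introduced only for this example; whether substituting SUB actually satisfies C12a is the open question, not something the sketch settles.

```python
import codecs

SUB = "\x1a"  # U+001A SUBSTITUTE


def sub_on_error(exc):
    """Hypothetical handler: replace each ill-formed sequence with SUB."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position at which to resume.
        return (SUB, exc.end)
    raise exc


# "sub" is an illustrative name, not a handler that Python ships with.
codecs.register_error("sub", sub_on_error)

data = b"ab\xffcd"  # 0xFF never occurs in well-formed UTF-8

# 1. Error stop: the strict decoder refuses the whole input.
try:
    data.decode("utf-8", errors="strict")
    stopped = False
except UnicodeDecodeError:
    stopped = True

# 2. Stripping: the errant byte is treated as the empty string.
stripped = data.decode("utf-8", errors="ignore")

# 3. Substitution: the errant byte leaves a visible trace.
marked = data.decode("utf-8", errors="sub")

print(stopped)         # True
print(repr(stripped))  # 'abcd'
print(repr(marked))    # 'ab\x1acd'
```

After stripping, the result is indistinguishable from decoding the well-formed input b"abcd", which is the sense in which the errant sequence has been interpreted as an empty string. With substitution, the receiver can still tell that something was wrong at that position.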