At 12:37 PM 2/16/02 -0800, Doug Ewell wrote:
>Why would anyone, faced with a UTF-8 file that contains invalid
>sequences, want to retain the invalid sequences, much less convert the
>file to another encoding form that either (a) preserves the invalid
>sequences or (b) leaves a marker showing where they were?  Invalid
>sequences are garbage.  They don't represent anything, and you can't
>always even tell what they were supposed to represent.

Marking up a UTF-8 file with the errors encountered allows two things:

a) reconstructing the original file (necessary if the file was mistakenly
assumed to be UTF-8, but was Latin-1 etc. instead.)

b) processing the data in the other encoding form with *precisely* the
same results as if the data had been processed in UTF-8.
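As a minimal sketch of point (a): the message above does not name any particular markup scheme, so this example assumes Python's built-in "surrogateescape" error handler as a stand-in. It marks each undecodable byte 0xNN as the lone surrogate U+DCNN, which is enough to rebuild the original bytes exactly. The sample bytes are invented for illustration.

```python
# Assumption: "surrogateescape" stands in for the error markup discussed above.
# Sample input: Latin-1 bytes that were mistakenly assumed to be UTF-8.
raw = b"caf\xe9 na\xefve"

# Decode as UTF-8, marking each invalid byte 0xNN as the code point U+DCNN
# instead of raising an error or discarding the byte.
text = raw.decode("utf-8", errors="surrogateescape")

# The markers carry enough information to reconstruct the original file.
restored = text.encode("utf-8", errors="surrogateescape")

# With the original bytes back, the correct encoding can be applied.
as_latin1 = restored.decode("latin-1")
```

Nothing is lost in the round trip, so the decision about what the bytes really were can be deferred until more is known about the file.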

Marking up the data has many advantages over duplicating the error in
the other encoding form:

c) markup can be removed and clean data can be generated

d) markup can be ignored, allowing the data to be processed as if it were
clean data (without committing to an irreversible cleanup step).
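Points (c) and (d) can be sketched the same way. This again assumes, purely for illustration, that the markup takes the form of Python's "surrogateescape" markers (code points U+DC80 through U+DCFF); the helper names is_marker, strip_markers and find_markers are hypothetical.

```python
# Assumption: invalid bytes are marked as lone surrogates U+DC80..U+DCFF,
# as produced by Python's "surrogateescape" error handler.

def is_marker(ch):
    """True if ch is a marker for an invalid input byte."""
    return "\udc80" <= ch <= "\udcff"

def strip_markers(text):
    """(c) Remove the markup, yielding clean data."""
    return "".join(ch for ch in text if not is_marker(ch))

def find_markers(text):
    """Report (position, original byte value) for each marked error."""
    return [(i, ord(ch) - 0xDC00) for i, ch in enumerate(text) if is_marker(ch)]

# Sample input: 0xFF and 0xFE can never appear in valid UTF-8.
raw = b"ok \xff\xfe data"
marked = raw.decode("utf-8", errors="surrogateescape")

# (d) Process the data while ignoring the markup; the markers pass through.
processed = marked.upper()

# (c) Only when clean output is wanted is the markup dropped.
clean = strip_markers(processed)
errors = find_markers(marked)
```

The cleanup happens only in strip_markers, so up to that point the marked text can still be turned back into the original bytes.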

I find these at least potentially compelling. I have not personally
run into a situation where they would be required, but note the existence
of DUTR#26 on CESU-8 (http://www.unicode.org/unicode/reports/tr26/
soon to be updated based on the decisions from the recent UTC meeting)
to see evidence that this is enough of an issue that people are casting
about for solutions...

A./
