Don't you think you are stretching things a bit? This is a UTF-8 parser stress test file. If an application opens it in a different encoding, then of course the results will be different and things will not look UTF-8-ish. Again, this is a non-issue. It's like distributing a Linux binary for testing something and then getting complaints that it doesn't work under DOS and that it shouldn't make assumptions about operating systems.
That's not the point I wanted to focus on. Things CANNOT look "UTF-8-ish" in a conforming UTF-8 editor or browser, because such an editor or browser will correctly detect all the encoding errors in that file and will therefore never present the text properly aligned. What a conforming editor or browser *may* do is recover and signal the positions of the errors to the user (possibly by using a replacement glyph, as if each error encoded a U+FFFD substitute), but how many errors will it signal, given that the level of error recovery is not defined in the Unicode/ISO/IEC UTF-8 standard?
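To make that concrete, here is a minimal sketch (in Python, my own illustration rather than anything taken from the test file or from a standard) of the difference between rejecting the malformed bytes outright and substituting a replacement glyph; the sample bytes are invented:

    # A strict decoder must reject the malformed bytes; a lenient one substitutes
    # U+FFFD at the error positions.  How many U+FFFD appear is exactly the
    # unspecified part discussed here.
    bad = b"valid \xc0\xaf text"        # 0xC0 0xAF: an overlong, invalid sequence

    try:
        bad.decode("utf-8")             # strict: must fail
    except UnicodeDecodeError as e:
        print("strict decoder rejects bytes", e.start, "to", e.end)

    print(bad.decode("utf-8", errors="replace"))   # lenient: U+FFFD at each error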
Even in the old ISO/IEC 10646 standard, recovery after errors is only possible if the uninterpretable byte sequences are still parsed into sub-sequences (of unspecified length) for which a substitute can be used.
The problem is the length of each invalid byte sequence. For example, with an old-style UTF-8 sequence of 4 bytes (or longer), the error will be detected at the first byte; recovery will take place at the second byte, after the first byte has been interpreted as an invalid sequence represented by a substitute glyph, but then each of the immediately following trailing bytes will signal an error of its own.
Suppose instead that the parser recovers only when it finds a new starter byte: it still has to parse that byte to see whether it is the leading byte of a longer sequence, so recovery is not necessarily possible immediately after the first invalid byte, or after the supposed end of the byte sequence. And if the parser recovers by skipping all bytes until a valid sequence is found, there will be only one encoding error, signalled on the leading byte, and only one substitution glyph.
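Both policies can be sketched in a few lines. The following is my own simplified scanner (in Python; the names seq_len and decode and the sample bytes are invented for illustration, and the range checks are much looser than RFC 3629 requires). It shows that the number of substitution glyphs produced for the same malformed input depends entirely on the recovery policy chosen:

    REPLACEMENT = "\ufffd"

    def seq_len(lead: int) -> int:
        """Sequence length implied by a leading byte; 0 if the byte cannot start one."""
        if lead < 0x80: return 1            # ASCII
        if lead < 0xC0: return 0            # stray continuation byte
        if lead < 0xE0: return 2
        if lead < 0xF0: return 3
        if lead < 0xF8: return 4
        if lead < 0xFC: return 5            # old-style 5-byte form, invalid today
        if lead < 0xFE: return 6            # old-style 6-byte form, invalid today
        return 0                            # 0xFE / 0xFF: never valid

    def decode(data: bytes, skip_to_next_lead: bool) -> str:
        out, i = [], 0
        while i < len(data):
            n = seq_len(data[i])
            chunk = data[i:i + n]
            try:
                if n == 0:
                    raise ValueError("not a leading byte")
                out.append(chunk.decode("utf-8"))      # strict check of one sequence
                i += n
            except (ValueError, UnicodeDecodeError):
                out.append(REPLACEMENT)                # one substitute per detected error
                i += 1
                if skip_to_next_lead:                  # policy 2: resynchronise on a starter byte
                    while i < len(data) and 0x80 <= data[i] <= 0xBF:
                        i += 1
        return "".join(out)

    sample = b"abc \xf8\x88\x80\x80\x80 def"           # old-style 5-byte sequence
    print(decode(sample, False).count(REPLACEMENT))    # per-byte policy: several substitutes
    print(decode(sample, True).count(REPLACEMENT))     # skip policy: a single substitute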
We are navigating in unspecified territory here: error recovery after decoding errors is not defined in the current UTF-8 standard itself (not even in the old RFC version aligned with ISO/IEC 10646-1:2000).
And as I said, the document itself is not complete enough, because it omits other invalid sequences, such as those for noncharacters.
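(For reference, here is what those sequences look like at the byte level; a quick check of my own in Python, not something from the test file. A general-purpose codec such as Python's encodes and decodes the noncharacters without complaint, so a stress file needs dedicated entries if it wants to exercise them.)

    # UTF-8 byte forms of a few noncharacters (U+FDD0, U+FFFE, U+FFFF).
    for cp in (0xFDD0, 0xFFFE, 0xFFFF):
        b = chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {b.hex(' ').upper()}")   # e.g. U+FFFE -> EF BF BE
        assert b.decode("utf-8") == chr(cp)            # round-trips without any error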

