Re: UTF-8 stress test file?

Philipp Reichmuth Tue, 12 Oct 2004 16:25:03 -0700

Philippe Verdy wrote:

Examples of bad assumptions that a reader could make:
- [quote](...) Experience so far suggests
that most first-time authors of UTF-8 decoders find at least one
serious problem in their decoder by using this file.[/quote]
This suggests to the reader that if their browser or editor does not display the contained test text as indicated, there is a problem in that application.

Well, to me it didn't. After all, the purpose of this file is to be a stress test for UTF-8 decoders, as indicated in line 1. By testing their decoders on this file, UTF-8 decoder authors tend to find problems of some kind in their programs. So where is the problem again?

But given that the file does not conform to UTF-8 because of the "errors" it contains *on purpose*, no assumption should be made about how the browser or text editor will behave with the content of that file.

Where is any such assumption being made? Actually, most of your statements about what is "wrong" with this file rest on the idea that it sets expectations about parser behaviour. However, paragraph 1 explicitly rules this out. So what is the point?

A conforming browser or editor should load that document without encoding violation problems, assuming it is encoded instead with ISO-8859-1 [...]

While that may be technically correct behaviour, it would rather defeat the purpose of testing a UTF-8 decoder, wouldn't it?
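
To illustrate why an ISO-8859-1 reading tests nothing, here is a minimal Python sketch. The sample bytes (an overlong two-byte encoding of "/") are my own assumption about the kind of malformed sequence such a file contains, and the variable names are purely illustrative.

```python
# Hypothetical sample: 0xC0 0xAF, an overlong two-byte encoding of "/",
# which is not valid UTF-8.
malformed = b"\xc0\xaf"

# ISO-8859-1 maps every byte to a character, so this decode cannot fail.
as_latin1 = malformed.decode("iso-8859-1")

# A strict UTF-8 decoder has to reject the same bytes.
try:
    malformed.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

print(repr(as_latin1), utf8_rejected)
```

Since the ISO-8859-1 path accepts any byte sequence at all, only the UTF-8 path says anything about how a decoder handles malformed input.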

Nothing is wrong if lines are displayed with more or fewer characters, or if "|" characters are not vertically aligned when using fixed fonts.

Assuming, however, that the file is used for its purpose of testing a UTF-8 decoder, all lines should indeed align.

You should see the Greek word 'kosme':       "κόσμε"
(...) [/quote]
You can see the Greek word here in this message (because this message is properly UTF-8 encoded), but nothing is wrong in your editor or browser if the word is not readable as indicated, and you see for example the string "Îºá½¹ÏÎ¼Îµ" when your editor or browser loads the file as an ISO-8859-1 text.

Don't you think you are stretching things a bit? This is a UTF-8 parser stress test file. If an application opens it in a different encoding, well, of course the results will be different, and things will not look UTF-8-ish. Again, this is a non-issue. It's like distributing a Linux binary for testing something and then getting complaints that it doesn't work under DOS and that it shouldn't make assumptions about operating systems.
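
For what it's worth, the effect of opening the file in the wrong encoding is easy to reproduce. The following is only a sketch in Python. It assumes the sample word is spelled with U+1F79 (omicron with oxia), and the variable names are made up for illustration.

```python
# The Greek word from the sample line, assumed here to use U+1F79
# (omicron with oxia) for its second letter.
word = "\u03ba\u1f79\u03c3\u03bc\u03b5"
raw = word.encode("utf-8")

# Same bytes, two interpretations.
as_utf8 = raw.decode("utf-8")          # round-trips to the Greek word
as_latin1 = raw.decode("iso-8859-1")   # one character per byte

print(len(raw), len(as_utf8), len(as_latin1))
```

Both decodes succeed, but only the first one says anything about a UTF-8 decoder. The second just shows the individual bytes as Latin-1 characters, which is exactly what one would expect.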

And so on.

Philipp
