From: "Terje Bless" <[EMAIL PROTECTED]>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Theodore H. Smith <[EMAIL PROTECTED]> wrote:

I'd like to see a UTF-8 stress test file.

The top result on Google for the query "UTF-8 Stress Test" is <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>.

This test file is out of date and incorrect. It says "Unicode" where it is really describing the old RFC definition of UTF-8 referenced by previous versions of ISO/IEC 10646. Under the current definition, all of the UTF-8 sequences of 5 bytes or more in that file are invalid (they are not "boundary cases").
So the list of "impossible bytes" is longer than documented there.
The more exact definition of UTF-8, now shared by Unicode and by the current version of ISO/IEC 10646, is documented in the conformance section of the Unicode standard.
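
To make the longer list of "impossible bytes" concrete, here is a minimal Python sketch. It assumes the byte ranges of the current definition (sequences of at most 4 bytes, no overlong 2-byte forms, nothing above U+10FFFF). The function names are made up for illustration, and Python's built-in strict decoder is used only as a cross-check.

```python
def impossible_utf8_bytes():
    """Byte values that can never occur in well-formed UTF-8 (current definition)."""
    allowed = set(range(0x00, 0x80))    # single-byte (ASCII) sequences
    allowed |= set(range(0x80, 0xC0))   # continuation (trail) bytes
    allowed |= set(range(0xC2, 0xE0))   # lead bytes of 2-byte sequences
    allowed |= set(range(0xE0, 0xF0))   # lead bytes of 3-byte sequences
    allowed |= set(range(0xF0, 0xF5))   # lead bytes of 4-byte sequences (up to U+10FFFF)
    return sorted(set(range(256)) - allowed)


def is_well_formed_utf8(data):
    """True if Python's strict decoder accepts the bytes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Prints the impossible byte values in hexadecimal.
    print(" ".join(f"{b:02X}" for b in impossible_utf8_bytes()))
    # A 5-byte form allowed by the old definition is rejected today.
    print(is_well_formed_utf8(b"\xf8\x88\x80\x80\x80"))
```

With these ranges the result is C0, C1 and F5 through FF, so thirteen byte values in all.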
Still, this file is useful for checking whether your browser or editor actually shows substitutes (like "?") where it should for all invalid sequences. But if your browser simply reports that this is not a UTF-8 encoded file and does not display it at all, it is right too:
- the file mixes UTF-8 and UTF-16
- invalid sequences may raise an exception that informs the user that the file can't be decoded.
- a browser or text editor may also trigger its charset-autodetection mechanism to try to find another charset. If the file is then displayed as ISO-8859-1, with each byte of the UTF-8 or UTF-16 sequences shown as an ISO-8859-1 character, that is not a conformance problem for the browser or text editor.
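
The three acceptable behaviours (showing substitutes, reporting an error, or falling back to another charset) can be sketched in Python as follows. This is only an illustration using the standard codecs. The function names are hypothetical, and a real browser's autodetection is far more involved than a single ISO-8859-1 fallback.

```python
def decode_with_substitutes(data):
    """Show a substitute (U+FFFD) wherever a sequence is invalid."""
    return data.decode("utf-8", errors="replace")


def decode_or_report(data):
    """Refuse to display: return (text, None) or (None, error message)."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, f"not valid UTF-8: {exc.reason} at byte offset {exc.start}"


def decode_with_fallback(data):
    """Try UTF-8 first; otherwise show each byte as an ISO-8859-1 character."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")


if __name__ == "__main__":
    sample = b"ok \xf8\x88\x80\x80\x80 end"  # contains an invalid 5-byte form
    print(decode_with_substitutes(sample))
    print(decode_or_report(sample))
    print(decode_with_fallback(sample))
```

The fallback never fails because ISO-8859-1 assigns a character to every byte value, which is also why the result tells you nothing about what the file was meant to contain.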




