I'd like to see a UTF-8 stress test file.
It should consist of lines of UTF-8, each separated by a newline. Each line should be malformed. Also, some notes on how to deal with the malformed UTF-8 should be provided in a separate file.
Really, I just want some way to verify that I can detect every kind of UTF-8 wrongness. I have some code I adapted from Unicode.org, but I want to make sure my adaptations haven't broken the code.
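
To illustrate what "every kind of UTF-8 wrongness" might cover, here is a minimal Python sketch. The category names and sample bytes are my own illustrative picks, not taken from any official test suite, and Python's built-in strict decoder is used only as a stand-in for the decoder under test.

```python
# Illustrative samples only: one byte string per common class of malformed UTF-8.
MALFORMED_SAMPLES = {
    "stray continuation byte": b"\x80",
    "truncated multi-byte sequence": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded UTF-16 surrogate": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
    "byte that never appears in UTF-8": b"\xff",
}


def rejected(raw: bytes) -> bool:
    """Return True if the strict UTF-8 decoder refuses the input."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    for name, raw in MALFORMED_SAMPLES.items():
        print(f"{name}: rejected={rejected(raw)}")
```

To test a different decoder, replace the body of `rejected` with a call to that decoder.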

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

"This file is not meant to be a conformance test. It does not prescribe any particular outcome and therefore there is no way to "pass" or "fail" this test file, even though the text suggests a preferable decoder behaviour at some places."

I'm wondering if Unicode.org has a proper conformance test? If not, I suggest they make one: one where each test is separated by a single newline and there are no non-test lines, unless they want some kind of "comment line" that is easy to parse (let's say, a line starting with "#").

For me to use that test programmatically, I'll need to break out my non-UTF-8-aware text editor, delete all the non-test lines, and then separate the good and the bad UTF-8 into different files. That way I can use readline-style code to do my UTF-8 verification.
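
As a rough sketch of the readline-style verification described above, assuming the hypothetical format I suggested (one test per line, lines starting with "#" are comments, good and bad samples in separate files), something like this Python would do. The function names are made up for illustration, and the built-in decoder stands in for whatever decoder is actually being tested.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Stand-in validator: swap in the decoder under test here."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_lines(data: bytes, expect_valid: bool) -> list[int]:
    """Return the 1-based line numbers whose validity differs from expect_valid.

    Assumed format: tests separated by b"\\n"; blank lines and lines
    starting with b"#" are skipped.
    """
    failures = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if not line or line.startswith(b"#"):
            continue
        if is_valid_utf8(line) != expect_valid:
            failures.append(number)
    return failures


if __name__ == "__main__":
    bad_file = b"# every line below should be malformed\n\xc0\xaf\n\xff\n"
    print(check_lines(bad_file, expect_valid=False))
```

The data is handled as bytes throughout, so the malformed lines are never decoded before they reach the validator. One limitation of a line-based format is that a test sequence cannot itself contain the byte 0x0A.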

It would be nice if someone had an "automated-test-ready" UTF-8 file.

If not, I'll modify this one and put the results up on my website someday (in a week or so).

--
    Theodore H. Smith - Software Developer.
    http://www.elfdata.com



