I'd like to see a UTF-8 stress test file.
It should consist of lines of UTF-8, each separated by a newline. Each line should be malformed. Some guidance on how to handle the malformed UTF-8 should also be given in a separate file.
Really, I just want some way to verify that I can detect every kind of malformed UTF-8. I have some code I adapted from Unicode.org, but I want to make sure my adaptations haven't broken it.
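To make "every kind" a bit more concrete, here is a rough Python sketch with one sample per common category of malformed UTF-8. The category labels and the helper name is_valid_utf8 are made up for this illustration, and Python's built-in strict decoder is used as the reference, not the Unicode.org code mentioned above.

```python
# One illustrative byte string per category of malformed UTF-8.
# The labels are informal names chosen for this sketch, not official terms.
MALFORMED_SAMPLES = {
    "stray continuation byte": b"\x80",
    "truncated 2-byte sequence": b"\xc3",
    "truncated 3-byte sequence": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "overlong encoding of NUL": b"\xe0\x80\x80",
    "UTF-16 surrogate U+D800": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
    "invalid lead byte 0xFF": b"\xff",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for label, sample in MALFORMED_SAMPLES.items():
        print(f"{label}: valid={is_valid_utf8(sample)}")
```

A decoder under test could be swapped in for is_valid_utf8 and run over the same samples; any sample it accepts points to a category it fails to detect.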
The closest thing I have found is http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, but the file itself says:
"This file is not meant to be a conformance test. It does not prescribes any particular outcome and therefore there is no way to "pass" or "fail" this test file, even though the texts suggests a preferable decoder behaviour at some places."
I'm wondering whether Unicode.org has a proper conformance test. If not, I suggest they make one: one where each test is separated by a single newline and there are no non-test lines, unless they wanted some kind of easy-to-parse "comment line" (let's say one starting with "#").
To use that test file programmatically, I'll need to break out my non-UTF-8-aware text editor, delete all the non-test lines, and then separate the good and the bad UTF-8 into different files! That way I can use readline-style code to do my UTF-8 verification.
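As a rough sketch of that readline-style check in Python, assume a hypothetical file layout: one test per line, lines starting with "#" treated as comments, and good and bad UTF-8 kept in separate files. The names check_lines and is_valid_utf8 are invented for this illustration, and Python's strict decoder stands in for whatever decoder is actually being tested.

```python
import io


def is_valid_utf8(data: bytes) -> bool:
    """Stand-in for the decoder under test: Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_lines(stream, expect_valid: bool):
    """Read a binary stream line by line and return the (line_number, raw_bytes)
    pairs whose validity does not match expect_valid.

    Empty lines and lines starting with b"#" are skipped as comments.
    """
    failures = []
    for number, raw in enumerate(stream, start=1):
        line = raw.rstrip(b"\n")  # drop only the line terminator
        if not line or line.startswith(b"#"):
            continue
        if is_valid_utf8(line) != expect_valid:
            failures.append((number, line))
    return failures


if __name__ == "__main__":
    # In-memory stand-ins for the two hypothetical files.
    bad_file = io.BytesIO(b"# overlong slash\n\xc0\xaf\n# stray continuation\n\x80\n")
    good_file = io.BytesIO(b"# plain ASCII\nhello\n# two-byte sequence\n\xc3\xa9\n")
    print("unexpected in bad file:", check_lines(bad_file, expect_valid=False))
    print("unexpected in good file:", check_lines(good_file, expect_valid=True))
```

With real files, the streams would be opened in binary mode, for example open(path, "rb"), so that no decoding happens before the check. One limitation of this layout is that a test line that itself begins with "#" would be skipped as a comment.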
It would be nice if someone had an "automated-test-ready" UTF-8 file.
If not, I'll modify this one and put the results up on my website someday (in a week or so).
--
Theodore H. Smith - Software Developer.
http://www.elfdata.com
