It gets worse with the file at: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

'  According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset"  '

That behaviour is clearly out of date. The Unicode Standard has since tightened its UTF-8 conformance rules for security reasons: ill-formed sequences (overlong forms, for example) must not be interpreted as characters. The text should be rejected instead, OR the malformed UTF-8 should be repaired on loading to make it conforming UTF-8, basically by stripping out the bad bytes or replacing them.

As long as we don't pass any invalid UTF-8 to client apps/code, and we don't process any invalid UTF-8 ourselves, we are fine. So modifying the bytes of the UTF-8 text before doing anything else with it can work in some circumstances.
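
To make the options concrete, here is a minimal sketch in Python of the three policies (reject, replace, strip). The function name load_utf8 and its policy parameter are made up for illustration; only bytes.decode and its standard error handlers are real. It is not meant as the definitive way to do this, just a picture of validating once at the point where the text is loaded.

```python
def load_utf8(raw: bytes, policy: str = "reject") -> str:
    """Decode raw bytes so that nothing malformed reaches later code.

    Hypothetical helper: the name and the policy values are assumptions
    made for this example, not part of any existing library.
    """
    if policy == "reject":
        # Strict decoding raises UnicodeDecodeError on any malformed sequence.
        return raw.decode("utf-8", errors="strict")
    if policy == "replace":
        # Malformed bytes are replaced with U+FFFD REPLACEMENT CHARACTER.
        return raw.decode("utf-8", errors="replace")
    if policy == "strip":
        # Malformed bytes are dropped from the result.
        return raw.decode("utf-8", errors="ignore")
    raise ValueError(f"unknown policy: {policy!r}")


# 0xC0 0xAF is an overlong (and therefore malformed) encoding of "/".
bad = b"ab\xc0\xafcd"

cleaned = load_utf8(bad, "strip")
# Re-encoding the cleaned text always yields conforming UTF-8.
safe_bytes = cleaned.encode("utf-8")
```

Whatever policy is chosen, the point is the same as above: client code only ever sees the decoded (or re-encoded) result, never the original bytes.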

--
    Theodore H. Smith - Software Developer.
    http://www.elfdata.com



