Thanks Phillippe,
> in that file, all UTF-8 sequences with 5 bytes or more are invalid (they are not "boundary cases").
Thanks.
> So the list of "impossible bytes" is longer than documented there.
Is it just a case of moving the boundary cases into the impossible bytes? Or are there impossible bytes that simply aren't in the file?
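
One way to check which byte values can never occur is to enumerate them by brute force. This is only a sketch: it assumes the RFC 3629 definition of UTF-8 (code points U+0000 to U+10FFFF, surrogates excluded) and relies on Python 3's built-in encoder. The names impossible_utf8_bytes and IMPOSSIBLE are made up for illustration.

```python
def impossible_utf8_bytes():
    """Return the byte values that occur in no valid UTF-8 encoding.

    Assumes RFC 3629 UTF-8: code points up to U+10FFFF, no surrogates.
    """
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no valid UTF-8 encoding
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


IMPOSSIBLE = impossible_utf8_bytes()
print(" ".join(f"{b:02X}" for b in IMPOSSIBLE))
# prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
```

Under that assumption the result includes the lead bytes of the old 5-byte and 6-byte forms (F8 to FD) alongside FE and FF, plus C0, C1 and F5 to F7.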
> - the file mixes UTF-8 and UTF-16
Does this file really mix UTF-8 and UTF-16? I thought it just had surrogate code points encoded as UTF-8 byte sequences. Of course, a surrogate should never appear in valid UTF-8.
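
To illustrate the difference, here is a small sketch using Python 3's strict UTF-8 decoder as the validity check. The helper is_valid_utf8 and the sample byte strings are hypothetical, chosen only for the example; the "surrogatepass" error handler is used solely to produce the byte form of a lone surrogate.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+D800 forced through the UTF-8 bit layout: bytes ED A0 80.
surrogate_as_utf8 = "\ud800".encode("utf-8", "surrogatepass")
# A 5-byte sequence in the old extended layout, starting with lead byte F8.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

print(surrogate_as_utf8.hex())           # eda080
print(is_valid_utf8(surrogate_as_utf8))  # False
print(is_valid_utf8(five_byte))          # False
```

Both inputs are rejected, but for different reasons: the first is built only from byte values that are legal elsewhere in UTF-8, while the second begins with a byte that can never occur at all.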
--
Theodore H. Smith - Software Developer.
http://www.elfdata.com
