Stefan Persson wrote:

> This links to a different page on the same server:
> 
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
> 
> That page contains a strange UTF-8 table:
> ...
> The last two byte sequences are invalid.


Markus Kuhn's page shows the original ISO 10646 definition of UTF-8.
This necessarily includes all codes up to 7FFFFFFF.
It also includes D800..DFFF, which are not allowed in Unicode 3.2 or in the RFC on UTF-8,
and which I think are implicitly not allowed in ISO 10646 either.

In my personal opinion, a decoder should at least recognize all those illegal 
sequences and generate an error of some kind for the whole sequence, so that it 
resynchronizes well.
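
To illustrate what I mean, here is a rough Python sketch, not taken from any
existing decoder. The names decode_lenient, sequence_length and MIN_FOR_LENGTH
are made up for this example, and using U+FFFD as the "error of some kind" is
an assumption. It parses the original 1..6 byte forms, then rejects surrogates,
anything above 10FFFF, and overlong forms, emitting one error per sequence and
resuming at the next lead byte.

```python
REPLACEMENT = "\uFFFD"  # assumption: signal an error with U+FFFD

# Smallest code point that legitimately needs a sequence of each length
# (index = sequence length); anything smaller is an overlong form.
MIN_FOR_LENGTH = [0, 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000]


def sequence_length(lead):
    """Length implied by a lead byte in the original ISO 10646 scheme, or 0."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # stray continuation byte (80..BF), or FE / FF


def decode_lenient(data):
    """Decode bytes, emitting one REPLACEMENT per illegal sequence."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        length = sequence_length(lead)
        if length == 0:
            # Byte that cannot start a sequence: one error, skip one byte.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Payload bits of the lead byte (all 7 bits for ASCII).
        cp = lead & (0x7F >> length) if length > 1 else lead
        j = i + 1
        # Consume continuation bytes, stopping early if one is missing.
        while j < i + length and j < n and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j == i + length)
        legal = (
            complete
            and cp >= MIN_FOR_LENGTH[length]      # not overlong
            and cp <= 0x10FFFF                    # within the Unicode range
            and not (0xD800 <= cp <= 0xDFFF)      # not a surrogate
        )
        out.append(chr(cp) if legal else REPLACEMENT)
        i = j  # resume right after the whole sequence
    return "".join(out)
```

With this approach a six-byte sequence or an encoded surrogate between two
ASCII characters yields a single error, and the following character is decoded
normally.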

markus

