On 27 Oct 2001, H. Peter Anvin wrote:

> I don't think it's that hard, actually.  The best way to think of a
> malformed UTF-8 sequence is as a unique character with no semantic
> meaning.  Of course, what constitutes the "sequence" is somewhat
> arbitrary, but ideally such an encoding should have the following
> properties:
>
> a) It cannot be used to encode valid characters.
> b) It's unambiguous (only one possible encoding for any one possible
>    sequence.)

Ermm, Peter, I have a problem with those rules.

Without (a) it's just a 'rawbytes' encoding for all the bytes in the
input stream that are part of an invalid sequence. The only, minor,
issue is that you could edit a sequence of rawbytes to make them
into a valid sequence ... is that really an 'issue'?
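
To make that minor issue concrete, here is a small sketch using Python's
'surrogateescape' error handler. It is much newer than this thread and is
used here only as one example of a per-byte rawbytes scheme; the sample
bytes are made up for illustration.

```python
# Two stray bytes, a lead byte (0xC3) and a continuation byte (0xA9),
# separated by an ordinary ASCII character. Neither is valid on its own.
raw = b"\xc3x\xa9"

# Each invalid byte becomes its own 'rawbyte' code point
# (a lone surrogate in the range U+DC80..U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

# An edit removes the character sitting between the two rawbytes.
edited = text.replace("x", "")

# Writing the text back out now yields a *valid* UTF-8 sequence.
out = edited.encode("utf-8", errors="surrogateescape")

# Re-reading it gives one ordinary character, not two rawbytes.
reread = out.decode("utf-8")
```

After the edit the two rawbytes have silently turned into a single valid
character, which is exactly what rule (a) is meant to prevent.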

If you enforce (a) then you need a 'rawbyte that must be kept with
the previous byte'. In 28 bits (assuming UTF-8) you can do a lead byte
plus up to 3 continuations:

Rawbyte:    8 bits
cont count: 2 bits
cont1:      6 bits
cont2:      6 bits
cont3:      6 bits
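
A minimal sketch of that 28-bit layout, assuming the fields are packed
most significant first in the order listed (the bit order and the function
names pack_raw/unpack_raw are assumptions for illustration, nothing more).
Only the low 6 bits of each continuation byte are stored, since the top two
bits of a continuation are always 10.

```python
def pack_raw(lead: int, conts: bytes = b"") -> int:
    """Pack one raw byte plus up to 3 continuation bytes into 28 bits."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte")
    if len(conts) > 3:
        raise ValueError("at most 3 continuation bytes fit")
    # Bits 27..20: raw byte, bits 19..18: continuation count.
    value = (lead << 20) | (len(conts) << 18)
    for i, c in enumerate(conts):
        if c & 0xC0 != 0x80:
            raise ValueError("not a continuation byte (10xxxxxx)")
        # cont1 -> bits 17..12, cont2 -> bits 11..6, cont3 -> bits 5..0.
        value |= (c & 0x3F) << (12 - 6 * i)
    return value


def unpack_raw(value: int) -> bytes:
    """Recover the original byte sequence from a packed 28-bit value."""
    lead = (value >> 20) & 0xFF
    count = (value >> 18) & 0x3
    # Restore the 10xxxxxx prefix on each stored continuation.
    conts = [0x80 | ((value >> (12 - 6 * i)) & 0x3F) for i in range(count)]
    return bytes([lead, *conts])
```

For example, a truncated 4-byte sequence such as F0 9F 98 packs into one
value and unpacks back to the same three bytes.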

In addition you must forbid the entry of rawbytes that are continuation
bytes _and_ the removal of characters that may bring two rawbyte sequences
together, e.g. a lead byte and its continuation that have been separated
by a non-UTF-8 folding program.

Personally I think (a) is impossible (unreasonable) for an editor,
even emacs.

-- 
Rob.                          (Robert de Bath <robert$ @ debath.co.uk>)
                                       <http://www.cix.co.uk/~mayday>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
