On 27 Oct 2001, H. Peter Anvin wrote:

> I don't think it's that hard, actually.  The best way to think of a
> malformed UTF-8 sequence is as a unique character with no semantic
> meaning.  Of course, what constitutes the "sequence" is somewhat
> arbitrary, but ideally such an encoding should have the following
> properties:
>
> a) It cannot be used to encode valid characters.
> b) It's unambiguous (only one possible encoding for any one possible
>    sequence.)

Ermm, Peter, I have a problem with those rules.

Without (a) it's just a 'rawbytes' encoding for all the bytes in the
input stream that are part of an invalid sequence. The only, minor,
issue is that you could edit a sequence of rawbytes to make them into
a valid sequence ... is that really an 'issue'?

If you enforce (a) then you need a 'rawbyte that must be kept with the
previous byte'. In 28 bits (assuming UTF-8) you can do a lead byte plus
up to 3 continuations:

    Rawbyte:    8 bits
    cont count: 2 bits
    cont1:      6 bits
    cont2:      6 bits
    cont3:      6 bits

In addition you must forbid the entry of rawbytes that are continuation
bytes _and_ the removal of characters that may bring two rawbyte
sequences together, e.g. a lead byte and its continuation that have been
separated by a non-UTF-8 folding program.

Personally I think (a) is impossible (unreasonable) for an editor,
even emacs.

-- 
Rob.                          (Robert de Bath <robert$ @ debath.co.uk>)
                                       <http://www.cix.co.uk/~mayday>
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
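
For illustration only, here is a minimal sketch of how the 28-bit layout described in the message could be packed and unpacked. The function names pack_raw and unpack_raw are hypothetical, and the bit positions are an assumption: the message gives only the field widths, so the fields are simply placed from the most significant bit down in the order listed. Only the low 6 bits of each continuation byte are stored, since the top two bits of a continuation byte are always 10.

```python
def pack_raw(seq: bytes) -> int:
    """Pack one raw lead byte plus 0 to 3 continuation bytes into 28 bits.

    Assumed layout (most significant bit first):
      rawbyte: 8 bits | cont count: 2 bits | cont1, cont2, cont3: 6 bits each
    """
    if not 1 <= len(seq) <= 4:
        raise ValueError("need 1 raw byte plus 0 to 3 continuation bytes")
    lead, conts = seq[0], seq[1:]
    # Every trailing byte must look like 10xxxxxx.
    if any(b & 0xC0 != 0x80 for b in conts):
        raise ValueError("trailing bytes must be continuation bytes")
    value = (lead << 20) | (len(conts) << 18)
    for i, b in enumerate(conts):
        # Keep only the 6 payload bits of each continuation byte.
        value |= (b & 0x3F) << (12 - 6 * i)
    return value


def unpack_raw(value: int) -> bytes:
    """Recover the original byte sequence from a packed 28-bit value."""
    lead = (value >> 20) & 0xFF
    count = (value >> 18) & 0x3
    # Restore the 10 prefix that was dropped when packing.
    conts = [0x80 | ((value >> (12 - 6 * i)) & 0x3F) for i in range(count)]
    return bytes([lead] + conts)


if __name__ == "__main__":
    overlong_nul = b"\xc0\x80"  # invalid UTF-8: overlong encoding of U+0000
    packed = pack_raw(overlong_nul)
    print(hex(packed), unpack_raw(packed))
```

The round trip is lossless for any sequence that fits the layout. Note that the sketch does nothing about the editing restrictions raised in the message (forbidding entry of raw continuation bytes, or deletions that bring two rawbyte sequences together).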