Followup to:  <[EMAIL PROTECTED]>
By author:    Markus Kuhn <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> Not entirely.
> 
> Internal representation does matter somewhat when it comes to the handling
> of malformed UTF-8 sequences. I think it is highly desirable that the
> UTF-8 -> emacs internal -> UTF-8 conversion roundtrip is made 100% binary
> transparent. Loading and saving a file that contains malformed UTF-8
> sequences should not change them, but character encoding conversions are
> prone to throw away information in the case of invalid source byte
> streams.
> 
> Using UTF-8 as the internal Emacs encoding is one way of achieving
> continued guaranteed binary transparency, coming up with a tricky encoding
> for malformed UTF-8 sequences is another one. I favour the former
> approach, which is also what other UTF-8 capable modern editors do today.
> 

I don't think it's that hard, actually.  The best way to think of a
malformed UTF-8 sequence is as a unique character with no semantic
meaning.  Of course, what constitutes the "sequence" is somewhat
arbitrary, but ideally such an encoding should have the following
properties:

a) It cannot be used to encode valid characters.
b) It's unambiguous (only one possible encoding for any one possible
   sequence).

If that can be obtained, it also solves all the security issues
involved with malformed sequences, since the fundamental cause of the
security hazards is aliasing.

For example, in a 32-bit word, one could use negative numbers for
these sequences.  Valid sequences up to 31 bits are of course
represented by their respective valid, non-negative numbers.

One definition of "sequence" that is reasonably easy to implement is
"lead byte followed by the number of continuation bytes it is supposed
to have, or until terminated by another lead byte; an isolated
continuation byte; or an isolated FE or FF."  Under that definition
the malformed byte stream "E0 80 80 80 80 80 80 80 80 80" consists of
an overlong 3-byte sequence followed by 7 isolated continuation bytes.
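
The definition above can be sketched as a small splitter.  This is
only an illustration in Python; the names expected_length and
split_sequences are made up for the example, and lead bytes F8..FD are
treated as 5- and 6-byte leads to match the 31-bit range mentioned
earlier.  The splitter only cuts the byte stream into units; it does
not decide whether a unit is valid.

```python
def expected_length(first: int) -> int:
    """Total length announced by a first byte; 0 if it cannot start a sequence."""
    if first < 0x80:
        return 1          # ASCII
    if first < 0xC0:
        return 0          # continuation byte, stands alone
    if first < 0xE0:
        return 2
    if first < 0xF0:
        return 3
    if first < 0xF8:
        return 4
    if first < 0xFC:
        return 5
    if first < 0xFE:
        return 6
    return 0              # FE or FF, stands alone


def split_sequences(data: bytes) -> list:
    """Cut a byte stream into sequences according to the definition above."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        j = i + 1
        # Take continuation bytes until the announced length is reached
        # or something other than a continuation byte shows up.
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        out.append(data[i:j])
        i = j
    return out


if __name__ == "__main__":
    stream = bytes.fromhex("E0 80 80 80 80 80 80 80 80 80")
    for unit in split_sequences(stream):
        print(unit.hex(" "))
```

Run on the example stream, this yields one 3-byte unit (E0 80 80) and
then seven 1-byte units (80 each).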

The malformed sequences that need to be handled are then:

a) Isolated continuation bytes;
b) FE or FF bytes;
c) Overlong sequences (e.g. E0 80 80);
d) Truncated sequences (e.g. the first two bytes in E3 8C 61);
e) Encoded surrogate characters (e.g. ED A0 80).

In the case of (e) it may be possible to represent the surrogates by
their respective encoding, as if UTF-16 had never existed.  If so,
however, one needs to take care that nothing else will try to
interpret them.
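
To make the five classes concrete, here is a hedged Python sketch that
labels a single sequence.  The function name classify and the label
strings are invented for the example, and it assumes its input is one
unit as produced by a splitter following the definition above (a first
byte followed only by continuation bytes, never more than the lead
byte announces).

```python
# Smallest value that legitimately needs an n-byte sequence.
MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def classify(seq: bytes) -> str:
    """Label one sequence as valid or as one of the malformed classes a) to e)."""
    first = seq[0]
    if first < 0x80:
        return "valid"
    if first <= 0xBF:
        return "isolated continuation byte"   # class a
    if first >= 0xFE:
        return "FE or FF byte"                # class b
    # Lead byte: work out the announced length (2..6).
    n = 2
    for limit in (0xE0, 0xF0, 0xF8, 0xFC):
        if first >= limit:
            n += 1
    if len(seq) < n:
        return "truncated"                    # class d
    value = first & (0x7F >> n)               # payload bits of the lead byte
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)
    if value < MIN_VALUE[n]:
        return "overlong"                     # class c
    if 0xD800 <= value <= 0xDFFF:
        return "encoded surrogate"            # class e
    return "valid"


if __name__ == "__main__":
    for hexstr in ("80", "FF", "E0 80 80", "E3 8C", "ED A0 80", "E2 82 AC"):
        print(hexstr, "->", classify(bytes.fromhex(hexstr)))
```

Whether class e is reported as malformed or passed through as an
ordinary value is exactly the choice discussed above; the sketch
reports it separately so that either policy can be applied afterwards.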

        -hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt    <[EMAIL PROTECTED]>
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
