----- Original Message -----
From: "Darren Reed" <[EMAIL PROTECTED]>
To: "Tom Petch" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, January 16, 2006 10:51 PM
Subject: Re: [Syslog] Sec 6.1: Truncation


> > Truncation of UTF-8 is actually slightly worse than has been described.
> >
> > It is possible to determine from the UTF-8 octets where one coded
> > character ends and another begins.  But because Unicode contains
> > combining characters, with no limit on how many of these there can
> > be, and these modify the meaning of previous or later coded characters,
> > it is not possible to determine where one 'symbol' ends.  So truncation
> > at a UTF-8 boundary could subtly change the meaning of a message,
> > even breach security.  Not something we can guard against
> > but should mention.
>
> The above seems a little confused to me.  How can there be a problem
> if a message is truncated on the boundary of a complex character?
>
> Darren

I lack the precise terminology.  Unicode includes base characters and modifying
characters, such as diacritic marks, as well as characters that combine the two.
Where the combination exists as a single code point, there is no problem.  Where
it does not, what the user would see as a single character is actually sent as
several code points, each separately encoded in UTF-8.  It is fairly easy for a
truncating relay to work out the boundaries of the UTF-8 encodings and so ensure
that each encoding is either kept whole or dropped whole.  It is much harder,
probably impossible, to work out which character any modifying characters belong
to, and so whether they should be removed or left in.  And the character 'o'
with a diacritic mark is not the same as that character without the mark, so
removing trailing modifying characters changes the meaning, which could be a
security exposure.
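
A minimal sketch of the point, in Python.  The helper name truncate_utf8 and the
sample strings are hypothetical, chosen only for illustration; nothing here is
taken from the syslog drafts.  It shows a relay cutting cleanly on a UTF-8
boundary and still turning 'o' with a combining diaeresis into a plain 'o'.

```python
import unicodedata


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most `limit` octets without splitting a UTF-8 sequence.

    Hypothetical helper: assumes `data` is already valid UTF-8.
    """
    if len(data) <= limit:
        return data
    end = limit
    # Continuation octets look like 10xxxxxx.  If the first octet we would
    # drop is a continuation octet, the cut falls inside a sequence, so back
    # up until the cut sits just before a lead (or ASCII) octet.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# 'o' followed by U+0308 COMBINING DIAERESIS: two code points, one symbol.
decomposed = "zo\u0308".encode("utf-8")   # 4 octets
# U+00F6, the same symbol as a single precomposed code point.
precomposed = "z\u00f6".encode("utf-8")   # 3 octets

cut_decomposed = truncate_utf8(decomposed, 3)
cut_precomposed = truncate_utf8(precomposed, 2)

# Both results are valid UTF-8, but the decomposed one now ends in a bare 'o'.
print(cut_decomposed.decode("utf-8"))
print(cut_precomposed.decode("utf-8"))

# The relay can see that the first dropped code point is a combining mark,
# but that alone does not tell it what the sender meant the symbol to be.
dropped = decomposed[len(cut_decomposed):].decode("utf-8")
print(unicodedata.combining(dropped[0]) != 0)
```

With the single code point the whole symbol is dropped, so the reader sees a
shorter message.  With the separate code points only the mark is dropped, so
the reader sees a different character.  The combining-class check at the end is
only a partial test, since not every modifying code point has a non-zero
combining class.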
Tom Petch


_______________________________________________
Syslog mailing list
Syslog@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/syslog
