RE: [Syslog-sec] Syslog protocol - UTF-8 encoding

Robert Horn Fri, 03 Jun 2005 09:04:25 -0700

A MUST in the header with SHOULD elsewhere would be sufficient, but I think
that there is little risk in making it a MUST everywhere.  ISO made it into a
MUST with the extensions to 10646-2.   The problem is an oversight in the
UTF-8 specification.  It specifies how to take an m-bit character and break
it down into 8-bit chunks.  It was assumed that people would always
minimize the number of 8-bit chunks used, and this is the general practice.
So if I have a character with a 10-bit code point, it will get encoded in
two bytes, carrying a 4-bit and a 6-bit chunk.  Then malicious programmers
discovered that they could get programs to malfunction by using more chunks,
e.g. encoding a 10-bit code point in three bytes padded with leading zero
bits.  Sometimes this caused buffer overflows and sometimes it let them
evade 8-bit oriented regular expression parsers.  These are legitimate UTF-8
encodings because the UTF-8 specification failed to require minimal size
encodings be used.
I am not aware of any reasonable UTF-8 encoder that does not generate
minimal size encodings.
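
To make the overlong-encoding problem concrete, here is a minimal Python
sketch.  The helper names (utf8_pack, is_strict_utf8) are hypothetical and
chosen only for illustration; utf8_pack deliberately skips the minimality
check so that it can produce both the minimal and a padded form of the same
code point.  Python's built-in strict decoder is used as a stand-in for an
incoming filter that rejects non-minimal forms.

```python
def utf8_pack(cp, nbytes):
    """Pack code point cp into exactly nbytes bytes using the UTF-8 bit layout.

    Hypothetical helper: it skips the minimality check on purpose, so it
    can also produce overlong (non-minimal) forms.
    """
    if nbytes == 1:
        if cp > 0x7F:
            raise ValueError("code point does not fit in 1 byte")
        return bytes([cp])
    lead_prefix = {2: 0xC0, 3: 0xE0, 4: 0xF0}[nbytes]
    payload_bits = {2: 11, 3: 16, 4: 21}[nbytes]
    if cp >= (1 << payload_bits):
        raise ValueError("code point does not fit in %d bytes" % nbytes)
    tail = []
    for _ in range(nbytes - 1):
        # The low 6 bits go into a continuation byte (10xxxxxx).
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte; tail was built low-to-high.
    return bytes([lead_prefix | cp] + tail[::-1])


def is_strict_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


minimal = utf8_pack(0x2F8, 2)    # 10-bit code point U+02F8 in two bytes
overlong = utf8_pack(0x2F8, 3)   # same code point padded out to three bytes
slash_overlong = utf8_pack(0x2F, 2)  # ASCII "/" padded out to two bytes
```

A filter that scans for the single byte 0x2F would not see the padded
two-byte form of "/", which is why a strict check on the incoming side
matters.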


I didn't have the text with me while traveling, hence the uncertainty over
the "space".  Specifying the ASCII code value is sufficient.  We probably
should note to readers that the code value used for backslash in ASCII is
used for the yen symbol in Japan, and that they should be prepared for user
interface confusion.  It is inevitable that there will be people who use
the Japanese backslash character (a valid UTF-8 character) instead of the
correct ASCII code value because they are matching what they see on the
screen with what they see on the keyboard.  We should alert them to the
problem.  (Or we could pick another character, but most of the good
characters have already been used for other purposes.)
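
As an illustration of the user-interface confusion, the following Python
sketch flags characters that look like a backslash but are not the ASCII
code value.  The list of lookalikes and the function name are assumptions
made for the example, not something taken from the draft; which characters a
given Japanese keyboard or input method actually produces varies by system.

```python
import unicodedata

ASCII_BACKSLASH = "\u005c"  # REVERSE SOLIDUS, decimal 92

# Illustrative, not exhaustive: characters a user might type or see
# in place of the ASCII backslash.
LOOKALIKES = ("\u00a5", "\uff3c", "\uffe5")


def find_backslash_lookalikes(text):
    """Return (index, code point, Unicode name) for each lookalike in text."""
    hits = []
    for i, ch in enumerate(text):
        if ch in LOOKALIKES:
            hits.append((i, "U+%04X" % ord(ch), unicodedata.name(ch)))
    return hits


# None of the lookalikes contains the byte 0x5C in its UTF-8 encoding,
# so a parser looking for decimal 92 will never treat them as an escape.
encodings = {ch: ch.encode("utf-8") for ch in LOOKALIKES}
hits = find_backslash_lookalikes("C:\u00a5temp\uff3clog")
```

A sender-side tool could use a check like this to warn the user before the
message is emitted, rather than silently rewriting what was typed.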

R Horn


                                                                                
                                                     
From:     "Anton Okmianski (aokmians)" <[EMAIL PROTECTED]>
Date:     06/02/2005 03:53 PM
To:       Robert Horn/WIL/AGFA/US/[EMAIL PROTECTED], <[EMAIL PROTECTED]>
cc:       "Alexander Clemm (alex)" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>,
          "Steve Chang (schang99)" <[EMAIL PROTECTED]>,
          <syslog-sec@employees.org>
Subject:  RE: [Syslog-sec] Syslog protocol - UTF-8 encoding




Robert:

> Potential confusions:
>
>   1) Saying UTF-8 is insufficient.  To really cover all the
> bases (especially from a security and string parsing
> perspective) you need to say:
>
> "Unicode characters encoded in UTF-8 using the minimal
> encoding."  UTF-8 permits a variety of encodings for the same
> character, but only one is the minimal encoding.

Are you suggesting we make minimum encoding a MUST or a SHOULD? Everywhere?

I am fine with a SHOULD everywhere and maybe making it a MUST for certain
parts of the HEADER, like the space separator.  However, I think before we
require minimal encoding in PARAM-VALUE and MSG, we should explore the
reasons why UTF-8 allows for different encodings.  There may be a good
reason for it.  We need to have a good reason to re-define the use of the
standard for parts of the message which may be received by a library from
third-party applications.  My concern is that some perfectly legitimate
UTF-8 code in the field may not do minimum encoding.  Then, we are making
syslog protocol adoption more difficult by requiring it.

> For more
> info you can also reference the most recent ISO
> 10646-1 and 10646-2 (with extensions).  With minimal
> encodings you eliminate some potential buffer overflows and
> you simplify the use of regular expression matching.  It is
> easy enough for an incoming message filter to detect and
> recode UTF-8 into minimal encoding, but you need to say this
> in the specification to inform people that they need the
> filter on the incoming side and that the emitters of messages
> should use the minimal form.
>
>   2) There are multiple blank space characters defined in
> Unicode.  These are typographically different.  There is only
> one that corresponds to the ASCII blank character and its
> minimal encoding using UTF-8 is intentionally identical to
> the encoding of the ASCII blank character.  The confusion may
> be resolved by identifying this Unicode code point by number
> rather than just saying "blank".

I could not find the word "blank" anywhere in the latest draft. The
encoding defines the space explicitly as:

SP = %d32

Do you think we need to specify more?

Does UTF-8 allow more than one encoding for the basic ASCII character
subset or only for characters with larger Unicode code points?
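
For reference, a small Python sketch of how SP = %d32 behaves at the byte
level.  The sample list of space-like characters and the helper names are
assumptions made for the example.  Only U+0020 encodes to the single byte
0x20, and because every byte of a multi-byte UTF-8 sequence is 0x80 or above,
splitting on 0x20 cannot cut a character in half.  The padded two-byte form
of the space (0xC0 0xA0) is rejected by Python's strict decoder.

```python
SP = b"\x20"  # SP = %d32

# Illustrative sample of Unicode characters that render as blank space.
SPACE_LIKE = ("\u0020", "\u00a0", "\u2003", "\u3000")


def encodes_to_sp(ch):
    """True only if the character's UTF-8 encoding is the single byte 0x20."""
    return ch.encode("utf-8") == SP


def split_fields(raw, maxsplit=-1):
    """Split raw message bytes on SP only, never on other space characters."""
    return raw.split(SP, maxsplit)


def strict_decode_ok(data):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


matches = [encodes_to_sp(ch) for ch in SPACE_LIKE]
fields = split_fields("host\u00a0name app".encode("utf-8"))
overlong_space_ok = strict_decode_ok(b"\xc0\xa0")
```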

>   3) Not mentioned originally, but also a potential problem,
> are the other homotype and semi-homotype characters.  For
> example, there are multiple backslash characters.  In fact
> there are three of them in common use, one the ASCII
> character (whose minimal UTF-8 encoding matches the ASCII
> character) and two that are used in Japanese.  These are
> pseudo-homotype characters in that a close examination will
> reveal that in a high precision font they are all different
> in size and slope.  But in many situations they look the same.
>
> More importantly from the perspective of regular use, the
> ASCII backslash character was replaced in the Japanese 7-bit
> Latin character set by the Yen symbol.  So the Japanese will
> have significant problems regarding use of backslash.  Even
> if you specify the use of the proper Unicode character set,
> encoded using minimal size UTF-8, all the backslashes will
> be presented to Japanese users as Yen symbols on most
> systems.  These systems make the assumption that what they
> are seeing is the older modified 7-bit ASCII that is standard
> in Japan.  This is almost always the correct assumption.
>
> There is no simple solution to the backslash problem.  The
> backslash should not be given any special meaning in any
> protocol.  The various default workarounds for conflicts
> between the older and newer systems introduce a lot of
> confusion around this character.  If it has special meaning
> to computers there will always be confusion and problems.  If
> you leave it an ordinary non-special character the humans who
> read the message usually have enough context to decide
> whether the character is intended to mean yen or backslash
> and will know from their application context how to interpret
> the text.
>
> If you have messages that must be composed by people and must
> contain backslashes you have an even worse problem.  They
> have a backslash character on the keyboard, but it will
> generate the Japanese backslashes, not the ASCII backslash.
> This effectively guarantees problems with entering backslash
> in Japan because people will forget that they need to do
> something special and will just use the keyboard.

Will this issue be addressed if, instead of referring to "\" when we talk
about escaping it in PARAM-VALUE and using it as an escape sequence, we were
to specifically refer to ASCII character %d92 instead?
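
A sketch of what escaping by code value could look like, in Python.  The
function name and the set of escaped characters (double quote and the escape
character itself) are assumptions for illustration; the draft text is the
authority on which characters PARAM-VALUE actually escapes.  The point is
that the comparison is made against the byte value 92, not against whatever
glyph the user sees.

```python
ESC = 0x5C    # %d92, the ASCII backslash code value
QUOTE = 0x22  # %d34, assumed here to need escaping inside PARAM-VALUE


def escape_param_value(text):
    """Encode text as UTF-8 and prefix each ESC or QUOTE byte with ESC."""
    out = bytearray()
    for b in text.encode("utf-8"):
        if b in (ESC, QUOTE):
            out.append(ESC)
        out.append(b)
    return bytes(out)


escaped_ascii = escape_param_value('a"b\\c')   # contains %d34 and %d92
escaped_yen = escape_param_value("\u00a5100")  # yen sign is left untouched
```

Because bytes 0x5C and 0x22 never occur inside a multi-byte UTF-8 sequence,
the byte-wise loop is safe for any valid input string.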

Thanks,
Anton.


