RE: [Syslog-sec] Syslog protocol - UTF-8 encoding

Anton Okmianski (aokmians) Thu, 02 Jun 2005 12:53:42 -0700

Robert:
 
> Potential confusions:
> 
>   1) Saying UTF-8 is insufficient.  To really cover all the 
> bases (especially from a security and string parsing 
> perspective) you need to say:
> 
> "Unicode characters encoded in UTF-8 using the minimal 
> encoding."  UTF-8 permits a variety of encodings for the same 
> character, but only one is the minimal encoding.


Are you suggesting we make minimal encoding a MUST or a SHOULD? Everywhere?

I am fine with a SHOULD everywhere and maybe making it a MUST for certain parts 
of the HEADER, like the space separator.  However, I think that before we 
require minimal encoding in PARAM-VALUE and MSG, we should explore the reasons 
why UTF-8 allows for different encodings.  There may be a good reason for it. 
We need a good reason to redefine the use of the standard for parts of the 
message that a library may receive from third-party applications.  My concern 
is that some perfectly legitimate UTF-8 code in the field may not do minimal 
encoding.  If so, we would make syslog protocol adoption more difficult by 
requiring it.

> For more 
> info you can also reference the most recent ISO
> 10646-1 and 10646-2 (with extensions).  With minimal 
> encodings you eliminate some potential buffer overflows and 
> you simplify the use of regular expression matching.  It is 
> easy enough for an incoming message filter to detect and 
> recode UTF-8 into minimal encoding, but you need to say this 
> in the specification to inform people that they need the 
> filter on the incoming side and that the emitters of messages 
> should use the minimal form.
> 
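
As a rough sketch of the incoming filter described above (not something the 
draft specifies), the Python below decodes UTF-8 leniently, accepting 
non-minimal forms, and re-emits every code point in its shortest form.  The 
function name recode_minimal is hypothetical.  Whether a receiver should recode 
such input or reject it outright is a policy question this sketch does not 
settle.

```python
def recode_minimal(data: bytes) -> bytes:
    """Leniently decode UTF-8 (overlong forms allowed) and re-encode minimally.

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Pick the sequence length and payload bits from the lead byte pattern.
        if lead < 0x80:
            cp, length = lead, 1
        elif lead & 0xE0 == 0xC0:
            cp, length = lead & 0x1F, 2
        elif lead & 0xF0 == 0xE0:
            cp, length = lead & 0x0F, 3
        elif lead & 0xF8 == 0xF0:
            cp, length = lead & 0x07, 4
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02x} at offset {i}")

        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")

        # Accumulate six payload bits from each continuation byte.
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)

        chars.append(chr(cp))  # ValueError if cp is above U+10FFFF
        i += length

    # Python's encoder always emits the shortest form.
    return "".join(chars).encode("utf-8")


print(recode_minimal(bytes([0xC0, 0xA0])))  # overlong two-byte form of SP
```

Input that is already minimal passes through unchanged, so a filter like this 
would only alter messages from emitters that do not use the minimal form.
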
>   2) There are multiple blank space characters defined in 
> Unicode.  These are typographically different.  There is only 
> one that corresponds to the ASCII blank character and  its 
> minimal encoding using UTF-8 is intentionally identical to 
> the encoding of the ASCII blank character.  The confusion may 
> be resolved by identifying this Unicode code point by number 
> rather than just saying "blank".

I could not find the word "blank" anywhere in the latest draft. The draft 
defines the space explicitly as:

SP = %d32

Do you think we need to specify more?
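
To illustrate the point about multiple space characters, here is a small 
Python sketch.  The code points are standard Unicode; the sample header text 
and the variable names are made up for illustration.  It shows that, in minimal 
UTF-8, only U+0020 produces the single octet 32, so a parser that splits on 
that octet does not treat the other spaces as separators.

```python
SP = bytes([32])  # the octet the draft calls SP (%d32)

spaces = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "EN SPACE": "\u2002",
    "IDEOGRAPHIC SPACE": "\u3000",
}

# Print each space character with its minimal UTF-8 octets in hex.
for name, ch in spaces.items():
    print(f"U+{ord(ch):04X} {name}: {ch.encode('utf-8').hex()}")

# Made-up sample: a no-break space inside the first field, SP before "msg".
sample = "host\u00a0app msg".encode("utf-8")
fields = sample.split(SP)
print(fields)
```

Only the real SP splits the sample; the no-break space stays inside the first 
field.
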

Does UTF-8 allow more than one encoding for the basic ASCII character subset, 
or only for characters with larger Unicode code points?
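
To make the question concrete, a short Python sketch follows.  It assumes 
Python 3's built-in strict UTF-8 decoder; the helper names are hypothetical. 
The two-byte bit pattern can physically carry a value below 128, so a 
non-minimal form of SP can be constructed, but a decoder that enforces the 
shortest form (as RFC 3629 requires) treats it as invalid.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict decoder accepts the octets as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def raw_two_byte_value(data: bytes) -> int:
    """Payload bits of a 110xxxxx 10xxxxxx pattern, ignoring minimality."""
    return ((data[0] & 0x1F) << 6) | (data[1] & 0x3F)


minimal_space = bytes([0x20])
overlong_space = bytes([0xC0, 0xA0])  # same payload value in a two-byte pattern

print(raw_two_byte_value(overlong_space))  # value carried by the overlong form
print(is_strict_utf8(minimal_space))
print(is_strict_utf8(overlong_space))
```

So the ambiguity, where a decoder tolerates it, does reach the ASCII subset, 
which is why it matters for the HEADER separator.
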

>   3) Not mentioned originally, but also a potential problem, 
> are the other homotype and semi-homotype characters.  For 
> example, there are multiple backslash characters.  In fact 
> there are three of them in common use, one the ASCII 
> character (whose minimal UTF-8 encoding matches the ASCII
> character) and two that are used in Japanese.  These are 
> pseudo-homotype characters in that a close examination will 
> reveal that in a high precision font they are all different 
> in size and slope.  But in many situations they look the same.
> 
> More importantly from the perspective of regular use, the 
> ASCII backslash character was replaced in the Japanese 7-bit 
> Latin character set by the Yen symbol.  So the Japanese will 
> have significant problems regarding use of backslash.  Even 
> if you specify the use of the proper Unicode character set, 
> encoded using minimal size UTF-8, all the backslashes will 
> be presented to Japanese users as Yen symbols on most 
> systems.  These systems make the assumption that what they 
> are seeing is the older modified 7-bit ASCII that is standard 
> in Japan.  This is almost always the correct assumption.
> 
> There is no simple solution to the backslash problem.  The 
> backslash should not be given any special meaning in any 
> protocol.  The various default workarounds for conflicts 
> between the older and newer systems introduce a lot of 
> confusion around this character.  If it has special meaning 
> to computers there will always be confusion and problems.  If 
> you leave it an ordinary non-special character the humans who 
> read the message usually have enough context to decide 
> whether the character is intended to mean yen or backslash 
> and will know from their application context how to interpret 
> the text.
> 
> If you have messages that must be composed by people and must 
> contain backslashes you have an even worse problem.  They 
> have a backslash character on the keyboard, but it will 
> generate the Japanese backslashes, not the ASCII backslash.  
> This effectively guarantees problems with entering backslash 
> in Japan because people will forget that they need to do 
> something special and will just use the keyboard.

Would this issue be addressed if, instead of referring to "\" when we talk 
about escaping it in PARAM-VALUE and using it as the escape character, we 
referred specifically to ASCII character %d92?

Thanks,
Anton.
_______________________________________________
Syslog-sec mailing list
Syslog-sec@www.employees.org
http://www.employees.org/mailman/listinfo/syslog-sec
