A MUST in the header with a SHOULD elsewhere would be sufficient, but I think there is little risk in making it a MUST everywhere. ISO made it a MUST with the extensions to 10646-2.

The problem is an oversight in the UTF-8 specification. It specifies how to take an m-bit character and break it down into chunks that each fit in one 8-bit byte. It was assumed that people would always minimize the number of bytes used, and this is the general practice. So if I have a character with a 10-bit code point, it will get encoded in two bytes, carrying 5 and 6 bits of the code point. Then malicious programmers discovered that they could get programs to malfunction by using more bytes than necessary, e.g. encoding the same 10-bit code point in three bytes (carrying 4, 6 and 6 bits) with the unused high-order bits set to zero. Sometimes this caused buffer overflows, and sometimes it let them evade 8-bit oriented regular expression parsers. These are legitimate UTF-8 encodings because the UTF-8 specification failed to require that minimal size encodings be used. I am not aware of any reasonable UTF-8 encoder that does not generate minimal size encodings.
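To make the minimal-encoding point concrete, here is a small Python sketch. It is not from the thread: the names `decode_lenient`, `MIN_CODE_POINT` and `is_minimal_utf8` are hypothetical, and the decoder is deliberately lenient so that the padded forms can be inspected. It only illustrates the length check that an incoming message filter might apply.

```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_lenient(data: bytes) -> list[tuple[int, int]]:
    """Decode UTF-8 style sequences WITHOUT rejecting padded (overlong) forms.

    Returns (code_point, bytes_used) pairs. Illustration only: it does not
    check for surrogates or values above U+10FFFF, and a real receiver
    should not accept overlong input.
    """
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: 7 payload bits
            length, value = 1, lead
        elif lead & 0xE0 == 0xC0:    # 110xxxxx: 5 payload bits
            length, value = 2, lead & 0x1F
        elif lead & 0xF0 == 0xE0:    # 1110xxxx: 4 payload bits
            length, value = 3, lead & 0x0F
        elif lead & 0xF8 == 0xF0:    # 11110xxx: 3 payload bits
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02x} at offset {i}")
        if i + length > len(data):
            raise ValueError("truncated sequence")
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:  # continuation bytes are 10xxxxxx
                raise ValueError("invalid continuation byte")
            value = (value << 6) | (byte & 0x3F)
        result.append((value, length))
        i += length
    return result


def is_minimal_utf8(data: bytes) -> bool:
    """True if every sequence uses the fewest bytes possible for its code point."""
    return all(value >= MIN_CODE_POINT[length]
               for value, length in decode_lenient(data))
```

With this sketch, `decode_lenient(b"\xc0\xa0")` yields the ASCII space (code point 0x20) spread over two bytes, and `is_minimal_utf8` reports it as non-minimal. For comparison, Python's built-in strict decoder, `bytes.decode("utf-8")`, raises `UnicodeDecodeError` on the same input rather than returning a space.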
I didn't have the text with me while traveling, hence the uncertainty over the "space". Specifying the ASCII code value is sufficient.

We probably should note to readers that the code value used for backslash in ASCII is used for the yen symbol in Japan, and that they should be prepared for user interface confusion. It is inevitable that there will be people who use the Japanese backslash character (a valid UTF-8 character) instead of the correct ASCII code value because they are matching what they see on the screen with what they see on the keyboard. We should alert them to the problem. (Or we could pick another character, but most of the good characters have already been used for other purposes.)

R Horn

"Anton Okmianski (aokmians)" <[EMAIL PROTECTED]>
06/02/2005 03:53 PM
To: Robert Horn/WIL/AGFA/US/[EMAIL PROTECTED], <[EMAIL PROTECTED]>
cc: "Alexander Clemm (alex)" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, "Steve Chang (schang99)" <[EMAIL PROTECTED]>, <syslog-sec@employees.org>
Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding

Robert:

> Potential confusions:
>
> 1) Saying UTF-8 is insufficient. To really cover all the
> bases (especially from a security and string parsing
> perspective) you need to say:
>
> "Unicode characters encoded in UTF-8 using the minimal
> encoding." UTF-8 permits a variety of encodings for the same
> character, but only one is the minimal encoding.

Are you suggesting we make minimum encoding a MUST or a SHOULD? Everywhere?

I am fine with a SHOULD everywhere and maybe making it a MUST for certain parts of the HEADER, like the space separator. However, I think before we require minimal encoding in PARAM-VALUE and MSG, we should explore the reasons why UTF-8 allows for different encodings. There may be a good reason for it. We need to have a good reason to re-define the use of the standard for parts of the message which may be received by a library from third-party applications.
My concern is that some perfectly legitimate UTF-8 code in the field may not do minimum encoding. Then we are making syslog protocol adoption more difficult by requiring it.

> For more info you can also reference the most recent ISO
> 10646-1 and 10646-2 (with extensions). With minimal
> encodings you eliminate some potential buffer overflows and
> you simplify the use of regular expression matching. It is
> easy enough for an incoming message filter to detect and
> recode UTF-8 into minimal encoding, but you need to say this
> in the specification to inform people that they need the
> filter on the incoming side and that the emitters of messages
> should use the minimal form.
>
> 2) There are multiple blank space characters defined in
> Unicode. These are typographically different. There is only
> one that corresponds to the ASCII blank character and its
> minimal encoding using UTF-8 is intentionally identical to
> the encoding of the ASCII blank character. The confusion may
> be resolved by identifying this Unicode code point by number
> rather than just saying "blank".

I could not find the word "blank" anywhere in the latest draft. The encoding defines the space explicitly as:

SP = %d32

Do you think we need to specify more? Does UTF-8 allow more than one encoding for the basic ASCII character subset, or only for characters with larger Unicode code points?

> 3) Not mentioned originally, but also a potential problem,
> are the other homotype and semi-homotype characters. For
> example, there are multiple backslash characters. In fact
> there are three of them in common use, one the ASCII
> character (whose minimal UTF-8 encoding matches the ASCII
> character) and two that are used in Japanese. These are
> pseudo-homotype characters in that a close examination will
> reveal that in a high precision font they are all different
> in size and slope. But in many situations they look the same.
>
> More importantly from the perspective of regular use, the
> ASCII backslash character was replaced in the Japanese 7-bit
> Latin character set by the Yen symbol. So the Japanese will
> have significant problems regarding use of backslash. Even
> if you specify the use of the proper Unicode character set,
> encoded using minimal size UTF-8, all the backslashes will
> be presented to Japanese users as Yen symbols on most
> systems. These systems make the assumption that what they
> are seeing is the older modified 7-bit ASCII that is standard
> in Japan. This is almost always the correct assumption.
>
> There is no simple solution to the backslash problem. The
> backslash should not be given any special meaning in any
> protocol. The various default workarounds for conflicts
> between the older and newer systems introduce a lot of
> confusion around this character. If it has special meaning
> to computers there will always be confusion and problems. If
> you leave it an ordinary non-special character the humans who
> read the message usually have enough context to decide
> whether the character is intended to mean yen or backslash
> and will know from their application context how to interpret
> the text.
>
> If you have messages that must be composed by people and must
> contain backslashes you have an even worse problem. They
> have a backslash character on the keyboard, but it will
> generate the Japanese backslashes, not the ASCII backslash.
> This effectively guarantees problems with entering backslash
> in Japan because people will forget that they need to do
> something special and will just use the keyboard.

Will this issue be addressed if, instead of referring to "\" when we talk about escaping it in PARAM-VALUE and using it as an escape sequence, we were to refer specifically to ASCII character %d92?

Thanks,
Anton.
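As a rough illustration of what pinning the escape character to %d92 does and does not buy, here is a short Python sketch. It is not part of the thread: the names `ESCAPE_CODE_POINT`, `LOOKALIKES`, `find_lookalikes` and `contains_escape_byte` are hypothetical, and the look-alike list is an assumption (the message above does not say which two Japanese characters are meant). It can flag a distinct character entered in place of %d92. It cannot help with the display problem, where the same code value 0x5C is simply drawn as a yen symbol.

```python
import unicodedata

ESCAPE_CODE_POINT = 0x5C  # ASCII %d92, Unicode name REVERSE SOLIDUS

# Assumed, non-exhaustive set of characters that may be confused with %d92:
# U+00A5 YEN SIGN, U+FF3C FULLWIDTH REVERSE SOLIDUS, U+FFE5 FULLWIDTH YEN SIGN.
LOOKALIKES = (0x00A5, 0xFF3C, 0xFFE5)


def find_lookalikes(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for each assumed look-alike in text."""
    return [(index, unicodedata.name(char))
            for index, char in enumerate(text)
            if ord(char) in LOOKALIKES]


def contains_escape_byte(encoded: bytes) -> bool:
    """True if a UTF-8 byte string contains the single byte 0x5C.

    In minimally encoded UTF-8 every byte of a multi-byte sequence is
    0x80 or above, so byte 0x5C can only be the ASCII character itself.
    """
    return bytes([ESCAPE_CODE_POINT]) in encoded
```

A parser that matches on the byte 0x5C will therefore never treat the fullwidth or yen characters as an escape, which is the behavior asked about above. The cost is that text typed with one of those characters silently loses its escaping, so a sender-side check along the lines of `find_lookalikes` may be worth a warning to the user.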