Robert:

> Potential confusions:
>
> 1) Saying UTF-8 is insufficient. To really cover all the bases
> (especially from a security and string parsing perspective) you
> need to say:
>
> "Unicode characters encoded in UTF-8 using the minimal encoding."
> UTF-8 permits a variety of encodings for the same character, but
> only one is the minimal encoding.

Are you suggesting we make minimal encoding a MUST or a SHOULD?
Everywhere? I am fine with a SHOULD everywhere, and maybe with making it
a MUST for certain parts of the HEADER, like the space separator.
However, before we require minimal encoding in PARAM-VALUE and MSG, I
think we should explore the reasons why UTF-8 allows for different
encodings. There may be a good reason for it. We need a good reason to
redefine the use of the standard for parts of the message which may be
received by a library from third-party applications. My concern is that
some perfectly legitimate UTF-8 code in the field may not do minimal
encoding. If so, requiring it makes syslog protocol adoption more
difficult.

> For more info you can also reference the most recent ISO 10646-1 and
> 10646-2 (with extensions). With minimal encodings you eliminate some
> potential buffer overflows and you simplify the use of regular
> expression matching. It is easy enough for an incoming message
> filter to detect and recode UTF-8 into minimal encoding, but you need
> to say this in the specification to inform people that they need the
> filter on the incoming side and that the emitters of messages should
> use the minimal form.
>
> 2) There are multiple blank space characters defined in Unicode.
> These are typographically different. There is only one that
> corresponds to the ASCII blank character and its minimal encoding
> using UTF-8 is intentionally identical to the encoding of the ASCII
> blank character. The confusion may be resolved by identifying this
> Unicode code point by number rather than just saying "blank".

I could not find the word "blank" anywhere in the latest draft. The
encoding defines the space explicitly as:

SP = %d32

Do you think we need to specify more? Does UTF-8 allow more than one
encoding for the basic ASCII character subset, or only for characters
with larger Unicode code points?

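A rough sketch on that last question, in Python and not part of the draft. As far as I can tell, the UTF-8 bit layout does let even an ASCII character such as SP be written in a longer form (for example the two octets C0 A0), but RFC 3629 forbids such non-shortest forms and requires decoders to protect against them. The function names below are made up for illustration, and the "naive" decoder is deliberately wrong: it only shows what a decoder that skips the minimality check would do.

```python
def naive_decode_2byte(seq: bytes) -> int:
    """Decode a 2-octet sequence 110xxxxx 10yyyyyy WITHOUT checking that
    the result actually needed two octets (hypothetical, unsafe decoder)."""
    if len(seq) != 2 or (seq[0] & 0xE0) != 0xC0 or (seq[1] & 0xC0) != 0x80:
        raise ValueError("not a 2-octet UTF-8 pattern")
    return ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)


def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the octets.
    Overlong (non-minimal) forms are rejected, as are other malformed
    sequences such as truncated characters."""
    try:
        data.decode("utf-8")  # error handling is 'strict' by default
    except UnicodeDecodeError:
        return False
    return True


SP = bytes([32])                    # SP = %d32, one octet
OVERLONG_SP = bytes([0xC0, 0xA0])   # same code point in a 2-octet form

if __name__ == "__main__":
    # The naive decoder maps the overlong form back to code point 32.
    print(naive_decode_2byte(OVERLONG_SP))
    # A strict decoder accepts SP and refuses the overlong form.
    print(is_strict_utf8(SP), is_strict_utf8(OVERLONG_SP))
```

So a receiver that splits HEADER fields on the single octet %d32 and then hands the rest to a permissive decoder could see a "space" that the splitter never saw. That is the kind of mismatch a minimal-encoding requirement is meant to rule out.
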
> 3) Not mentioned originally, but also a potential problem, are the
> other homotype and semi-homotype characters. For example, there are
> multiple backslash characters. In fact there are three of them in
> common use, one the ASCII character (whose minimal UTF-8 encoding
> matches the ASCII character) and two that are used in Japanese.
> These are pseudo-homotype characters in that a close examination
> will reveal that in a high precision font they are all different in
> size and slope. But in many situations they look the same.
>
> More importantly from the perspective of regular use, the ASCII
> backslash character was replaced in the Japanese 7-bit Latin
> character set by the Yen symbol. So the Japanese will have
> significant problems regarding use of backslash. Even if you specify
> the use of the proper Unicode character set, encoded using minimal
> size UTF-8, all the backslashes will be presented to Japanese users
> as Yen symbols on most systems. These systems make the assumption
> that what they are seeing is the older modified 7-bit ASCII that is
> standard in Japan. This is almost always the correct assumption.
>
> There is no simple solution to the backslash problem. The backslash
> should not be given any special meaning in any protocol. The various
> default workarounds for conflicts between the older and newer
> systems introduce a lot of confusion around this character. If it
> has special meaning to computers there will always be confusion and
> problems. If you leave it an ordinary non-special character the
> humans who read the message usually have enough context to decide
> whether the character is intended to mean yen or backslash and will
> know from their application context how to interpret the text.
>
> If you have messages that must be composed by people and must
> contain backslashes you have an even worse problem.
> They have a backslash character on the keyboard, but it will
> generate the Japanese backslashes, not the ASCII backslash. This
> effectively guarantees problems with entering backslash in Japan
> because people will forget that they need to do something special
> and will just use the keyboard.

Would this issue be addressed if, instead of referring to "\" when we
talk about escaping it in PARAM-VALUE and using it as the escape
character, we referred specifically to ASCII character %d92?

Thanks,
Anton.

_______________________________________________
Syslog-sec mailing list
Syslog-sec@www.employees.org
http://www.employees.org/mailman/listinfo/syslog-sec