David,

thanks for your wake-up call...

> I believe we should move to UTF-8 to allow operators who

UTF-8 is actually a MUST in syslog-protocol.

I have to admit that I did not fully understand UNICODE until now... I
always read RFC 2279 (UTF-8 encoding). It specifies (page 2):

- Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
  correspond to octets 00 to 7F (7 bit US-ASCII values). A direct
  consequence is that a plain ASCII string is also a valid UTF-8
  string.

- US-ASCII values do not appear otherwise in a UTF-8 encoded
  character stream.  This provides compatibility with file systems
  or other software (e.g. the printf() function in C libraries) that
  parse based on US-ASCII values but are transparent to other
  values.
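The two quoted properties can be checked mechanically. The following is only a minimal sketch: the helper name and the sample strings are made up for illustration and do not come from RFC 2279.

```python
# Illustrative sketch only; helper name and sample strings are assumptions.

def ascii_octets_only_for_ascii(text: str) -> bool:
    """Return True if every octet below 0x80 in the UTF-8 form of `text`
    stems from a US-ASCII character, and never from a multi-octet sequence."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # A US-ASCII character encodes to one identical octet.
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # All octets of a multi-octet sequence have the high bit set.
            if any(octet < 0x80 for octet in encoded):
                return False
    return True

plain = "syslog message"      # pure US-ASCII
mixed = "Gr\u00fc\u00dfe \u20ac\n"  # non-ASCII letters, a euro sign, and a line feed

# Property 1: a plain ASCII string is byte-for-byte a valid UTF-8 string.
plain_is_same = plain.encode("ascii") == plain.encode("utf-8")

# Property 2: ASCII octets (including control octets such as 0x0A)
# appear only where an ASCII character was encoded.
mixed_ok = ascii_octets_only_for_ascii(mixed)
```

Note that this only concerns octet values below 0x80. It says nothing about whether code points above that range can themselves be non-printing.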

So I thought that control characters (US-ASCII below 0x20, plus 0x7f)
are only present in this range. I also assumed that UNICODE as such
does not provide control characters, but only printable characters.

Obviously, I was wrong. I did some more research this morning and found
that at least the 20xx range does contain control characters:

http://www.unicode.org/charts/PDF/U2000.pdf

For a sample, see 0x200C and the characters following it. So my basic
assumption "just exclude US-ASCII control chars and you are done" is
wrong.
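To illustrate why the "exclude US-ASCII control chars" approach falls short, here is a small sketch. The function names and the sample string are hypothetical. It relies on Python's unicodedata module, which reports U+200C as general category Cf (format) rather than Cc (control); either way, the character is non-printing and survives an ASCII-only filter.

```python
import unicodedata

def strip_ascii_controls(text: str) -> str:
    """Drop only US-ASCII control characters (below 0x20, plus 0x7F)."""
    return "".join(ch for ch in text if not (ord(ch) < 0x20 or ord(ch) == 0x7F))

def non_printing(text: str) -> list:
    """List code points whose general category is Cc (control) or Cf (format)."""
    return [f"U+{ord(ch):04X}" for ch in text
            if unicodedata.category(ch) in ("Cc", "Cf")]

# Hypothetical sample: BEL (0x07), then U+200C and U+200E from the 20xx range.
sample = "a\x07b\u200cc\u200ed"

filtered = strip_ascii_controls(sample)   # BEL is gone
leftover = non_printing(filtered)         # the 20xx characters are still there
```

After filtering, `leftover` still lists U+200C and U+200E, so a check based only on US-ASCII values does not remove all non-printing characters.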

Having said this, I think we now have a bigger issue than I initially
thought. In the light of Unicode control characters, we are more or less
forced to allow any control characters inside the message part. If we
don't, we can't comply with the (well thought-out) IETF Unicode
requirement (RFC 2277/BCP 18) for new RFCs.

As I wrote in my initial message, this affects not only -protocol, but
also -sign (though not too badly when it refers to -protocol for the
format description).

I think allowing all character values also affects most of the
existing syslog software, as many implementations work with C strings,
where 0x00 is the terminating character. I don't say it can't be dealt
with in new implementations. I just would like to mention that this will
probably get us off to a slow start, because the initial effort will be
much higher for an implementor: lots of existing code could not be
re-used.
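The C string problem can be shown with a short sketch. It uses Python's ctypes module to mimic a C char buffer; the payload is a made-up example, not real syslog traffic.

```python
import ctypes

# Hypothetical message body that contains a 0x00 octet in the middle.
payload = b"user=root\x00 injected"
length = len(payload)   # the length has to be carried separately

# create_string_buffer() allocates a C char array holding the payload
# plus one trailing NUL.
buf = ctypes.create_string_buffer(payload)

# .value reads the buffer as a NUL-terminated C string, the way
# strlen() or printf("%s") would: it stops at the first 0x00.
as_c_string = buf.value

# .raw returns the whole array; slicing by the known length recovers
# the complete message.
full_message = buf.raw[:length]
```

Code that treats the message as a C string sees only `user=root`. Code that keeps an explicit length sees the full payload, which is the kind of change existing implementations would need.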

Anyhow, I don't see an alternative to allowing all control characters.

I have included some Unicode links in my summary on this issue at
http://www.syslog.cc/ietf/protocol/issue9.html - this may be helpful for
others who need to dig a little into the Unicode requirement.

What does the rest of the WG think?

Rainer



