RE: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
Hi Rainer,

I don't believe that we need to follow up with Patrik immediately. It looks like we have some general consensus on the charset issue. Please update the ID with the consensus points that we have reached at this time.

Thanks,
Chris

On Thu, 8 Dec 2005, Rainer Gerhards wrote:

> Chris,
>
> I can agree to what you propose. So it's fine with me.
>
> Question: does it make any sense to answer some of Patrik's questions (in order to obtain some more advice)? I guess he is pretty busy, so we might save this for later. I'd appreciate your advice.
>
> Rainer
Terminator: was Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
As RFCs such as 2130 state, protocol designers should differentiate between protocol - which does not have language, charset etc. - and text, which has.

I see the text elements of syslog as MSG and PARAM-VALUE; these are the ones defined as UTF-8-STRING and so the ones to which these issues apply.

The problem for me is termination of the string. For the latter, syslog-protocol says the characters '"', '\' and ']' MUST be escaped, so these can then be used to tell us where the PARAM-VALUE ends. In essence, we are defining a transfer syntax for these fields (not IMHO a very elegant one, but I don't have a better idea - I note that syslog-sign uses base64 to transfer-encode its binary).

So how do we terminate MSG? Using a count has been suggested; ASCII NUL is obviously used in some implementations; elsewhere I assume that it is determined implicitly by the length of the UDP packet. I believe this is not enough, since TCP is around as a transport and should be on our radar.

Allowing UTF-8 as the charset (IETF's preferred term for CCS+CES) allows all octet values from +0 to +127 and most of +128 to +255, so we lose the obvious terminating characters. MIME either uses a transfer syntax such as base64 or quoted-printable - which brings many values back into play - or allows the generating software to create its own terminating string, which can then be chosen not to appear in the free text. NETCONF uses a string that can never be valid XML in a similar manner.

My instinct is we should be doing more in this area, in particular having greater consistency between MSG and PARAM-VALUE in their transfer syntax and termination.

Anyone else agree or disagree?

Tom Petch
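The PARAM-VALUE escaping rule that Tom Petch describes above can be illustrated with a small sketch. It assumes only what his message states, namely that '"', '\' and ']' are escaped so that an unescaped '"' marks the end of the value. Using a backslash as the escape character is an assumption made here for illustration, and the function names escape_param_value and read_param_value are hypothetical, not taken from syslog-protocol.

```python
from typing import Tuple

# Characters that the message above says MUST be escaped in PARAM-VALUE.
ESCAPED = '"\\]'


def escape_param_value(value: str) -> str:
    """Prefix each special character with a backslash (assumed escape char)."""
    out = []
    for ch in value:
        if ch in ESCAPED:
            out.append("\\")
        out.append(ch)
    return "".join(out)


def read_param_value(text: str, start: int) -> Tuple[str, int]:
    """Parse a quoted PARAM-VALUE whose opening quote is at text[start].

    Returns the unescaped value and the index just past the closing quote.
    """
    if text[start] != '"':
        raise ValueError("PARAM-VALUE must start with a double quote")
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPED:
            # Escaped special character: keep the character, drop the backslash.
            out.append(text[i + 1])
            i += 2
        elif ch == '"':
            # First unescaped quote terminates the value.
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated PARAM-VALUE")


if __name__ == "__main__":
    raw = 'say "hi" ] and \\ done'
    wire = 'key="' + escape_param_value(raw) + '"]'
    value, end = read_param_value(wire, wire.index('"'))
    print(value == raw, wire[end])
```

MSG has no comparable in-band terminator in this sketch, which is the inconsistency the message points out: once any UTF-8 octet may appear in the free text, the end has to come from a count, a transfer encoding, or the transport.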
RE: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
Chris,

I can agree to what you propose. So it's fine with me.

Question: does it make any sense to answer some of Patrik's questions (in order to obtain some more advice)? I guess he is pretty busy, so we might save this for later. I'd appreciate your advice.

Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
> Sent: Wednesday, December 07, 2005 8:11 PM
> To: [EMAIL PROTECTED]
> Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
Hi Folks,

I asked Patrik Faltstrom to review this proposal. He has some comments below. Let's not get hung up on his details - he has looked this over without any knowledge of our prior discussions. He does have some good pointers.

We may want to consider a "belt and suspenders" approach.

- senders MAY indicate their charset in the SD-ID. If the SD-ID does not contain any indication of a charset, then the receiver will just have to guess (it may be US-ASCII or it may be something entirely different). Having the UTF-8 BOM there would be a good indication that it is UTF-8.

- senders are RECOMMENDED to include a charset indicator in the SD-ID. The ONLY one defined in the syslog-protocol will be [charset="UTF-8"]. When that is specified, then the BOM MUST be present.

To address Bazsi's concerns of too many charset definitions, Rainer could indicate that additional charset values can only be accepted by the IANA through Standards Action (RFC 2434).

As Patrik indicates, it would be good to see this separated into
- what can the sender send
- what will the receiver expect to receive.

I would like to see other comments on this proposal. I need to review the threads but I believe that we have rough consensus on all of the other issues so that Rainer can re-work syslog-protocol.

Thanks,
Chris

PAF's comments below >>>

---------- Forwarded message ----------
Date: Wed, 7 Dec 2005 17:23:24 +0100
From: "[ISO-8859-1] Patrik Fältström"
To: Chris Lonvick <[EMAIL PROTECTED]>
Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)

> Let's first quickly review what has been discussed on list:
>
> - current implementations sometimes use LF as a record delimiter

Ok

> - some implementations use LF inside the MSG part

Ok

> - some implementations include binary data in syslog messages and would like to continue to do so (but these seem to be few)

Ok

> - there are at least some use cases where a syslogd can not definitely detect the character encoding of a message (some of that might be related to the POSIX API, but there may be a work-around [I had no time yet to evaluate this in-depth]). It gets problematic if a message from a legacy sender is received (no encoding information) and transformed into a syslog-protocol message [I assume this is a valid use-case])

Ok

> - previous discussion showed the need for Unicode. With Unicode, the term "printable character" basically becomes useless, because there are so many non-printable characters in Unicode and new ones are potentially added constantly.

Well... define "printable"... I don't really know what that means.

> - previous consensus thus was that any valid UTF-8 string MUST be supported inside MSG (including NUL and LF)

NUL and LF are part of Unicode, and because of that, of UTF-8. UTF-8 encodes NUL and LF as one byte each, with the same values as the LF and NUL we are used to.

> - current discussion has shown that backwards-compatibility is not absolutely vital (but still desirable)

Ok. Solves some of the binary problems.

> - it was suggested that an "encoding SD-ID" be defined which carries the character set definition

Hmmm... why is the charset definition needed? That is then to be able to say UTF-8 or BIG5 or...? It seems to be better and more important to say whether it is UTF-8 or, for example, binhex-encoded binary data.

Remember that the main difference between text and binary is that text is to be converted regarding linebreak algorithms, while binary data is not.

> - as a side-note, Tom Petch has provided a very good digression on "character encoding" terminology which I have reproduced after my signature. I guess most people on this list already know the exact differences, but I still find it useful...

Ok. I cannot remember having seen it, but anyway...

> It is somewhat hard to find a good compromise. A compromise, in my point of view, must allow the following:

When looking at a protocol like this, you have to first of all define whether the charset translation/transformation is happening in the client or in the server. This is not really clear to me. If the transformation is in the client, the client translates to, for example, UTF-8. It can also be the server doing it. (Or of course a client that reads from wherever the syslog daemon stores the data, so that the storage can handle multiple charsets... but I think this is out of the question?)

> - transforming existing messages into -protocol format should not intentionally be forbidden - transformation is a very important "feature" when it comes to deploying new technology

Yup.

> - new receivers should be able to precisely "enough" understand the message content

Ok. Message content from old senders?

> - I also find it advisable that newer receivers are capable t
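Patrik's remark that UTF-8 encodes NUL and LF as single bytes with their familiar values can be checked directly. The following is a minimal Python sketch; the helper name control_byte_offsets and the sample text are made up for illustration. It also shows the related property that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a raw 0x00 or 0x0A byte in valid UTF-8 is always a real NUL or LF character and never part of another character.

```python
def control_byte_offsets(data: bytes) -> list:
    """Return the offsets of raw NUL (0x00) and LF (0x0A) bytes in data."""
    return [i for i, b in enumerate(data) if b in (0x00, 0x0A)]


# NUL and LF keep their ASCII byte values when encoded as UTF-8.
assert "\x00".encode("utf-8") == b"\x00"
assert "\n".encode("utf-8") == b"\n"

# Sample text mixing multi-byte characters with an embedded LF and NUL.
msg = "caf\u00e9\nna\u00efve\x00end"
wire = msg.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so such a sequence can never contain 0x00 or 0x0A.
for ch in msg:
    encoded = ch.encode("utf-8")
    if len(encoded) > 1:
        assert all(b >= 0x80 for b in encoded)

offsets = control_byte_offsets(wire)
print(offsets)
```

The consequence for framing is that splitting a byte stream on LF or NUL never cuts a character in half, but it also cannot tell a record delimiter from an LF or NUL that the sender placed inside MSG.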
Re: [Syslog] MSG encoding and content (#3, #4, #5)
On Wed, 2005-12-07 at 15:30 +0100, Rainer Gerhards wrote:
> Hi WG,
>
> the topic of MSG encoding as well as its content (e.g. NUL and LF characters) has not yet been solved. The past days, I've talked to a lot of my friends not on this list and I have also looked at various ways to solve the issue. Be prepared, this is another long mail, but I think it is appropriate as this is our top issue left open. It is complex and it requires a good amount of thinking, theory and arguments. I am trying to convey a proposal and the facts it builds on in this mail.

I am convinced. Although I first preferred coding the character set in an SD-ID over the Unicode BOM (simply because of its similarity to MIME), I think not allowing a myriad of encodings (with their associated security risks) is a very nice thing to have.

As I see it, we need a single utf8/undefined bit in the message; using the BOM for this purpose is perfectly fine by me.

--
Bazsi

_______________________________________________
Syslog mailing list
Syslog@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/syslog
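The "single utf8/undefined bit" carried by the BOM, as discussed in the message above, could look roughly like this on the receiving side. This is a sketch under stated assumptions, not text from syslog-protocol: the function names classify_msg and tag_as_utf8 are hypothetical, and falling back to "undefined" when the bytes after the BOM are not valid UTF-8 is a design choice made here for illustration.

```python
import codecs


def tag_as_utf8(text: str) -> bytes:
    """Sender side: prefix the UTF-8 encoded MSG with the UTF-8 BOM."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


def classify_msg(msg: bytes):
    """Receiver side: return ("utf-8", str) or ("undefined", original bytes).

    A leading BOM is treated as the one-bit signal that MSG is UTF-8.
    Without it, the encoding is unknown and the octets are left untouched.
    """
    if msg.startswith(codecs.BOM_UTF8):
        payload = msg[len(codecs.BOM_UTF8):]
        try:
            return "utf-8", payload.decode("utf-8")
        except UnicodeDecodeError:
            # BOM present but payload invalid: assumed fallback, keep raw bytes.
            return "undefined", msg
    return "undefined", msg


if __name__ == "__main__":
    print(classify_msg(tag_as_utf8("disk full on /var")))
    print(classify_msg(b"legacy message \xe4 in an unknown charset"))
```

A relay that transforms a legacy message without knowing its charset would simply omit the BOM, so the receiver treats the octets as undefined rather than guessing.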
[Syslog] MSG encoding and content (#3, #4, #5)
Hi WG,

the topic of MSG encoding as well as its content (e.g. NUL and LF characters) has not yet been solved. Over the past days, I've talked to a lot of my friends not on this list and I have also looked at various ways to solve the issue. Be prepared, this is another long mail, but I think it is appropriate as this is our top issue left open. It is complex and it requires a good amount of thinking, theory and arguments. I am trying to convey a proposal and the facts it builds on in this mail.

Let's first quickly review what has been discussed on list:

- current implementations sometimes use LF as a record delimiter
- some implementations use LF inside the MSG part
- some implementations include binary data in syslog messages and would like to continue to do so (but these seem to be few)
- there are at least some use cases where a syslogd cannot definitely detect the character encoding of a message (some of that might be related to the POSIX API, but there may be a work-around [I had no time yet to evaluate this in-depth]). It gets problematic if a message from a legacy sender is received (no encoding information) and transformed into a syslog-protocol message [I assume this is a valid use-case])
- previous discussion showed the need for Unicode. With Unicode, the term "printable character" basically becomes useless, because there are so many non-printable characters in Unicode and new ones are potentially added constantly.
- previous consensus thus was that any valid UTF-8 string MUST be supported inside MSG (including NUL and LF)
- current discussion has shown that backwards-compatibility is not absolutely vital (but still desirable)
- it was suggested that an "encoding SD-ID" be defined which carries the character set definition
- as a side-note, Tom Petch has provided a very good digression on "character encoding" terminology which I have reproduced after my signature. I guess most people on this list already know the exact differences, but I still find it useful...

It is somewhat hard to find a good compromise. A compromise, in my point of view, must allow the following:

- transforming existing messages into -protocol format should not intentionally be forbidden - transformation is a very important "feature" when it comes to deploying new technology
- new receivers should be able to precisely "enough" understand the message content
- I also find it advisable that newer receivers are capable of processing both old-style and new-style messages concurrently. While this is an implementation issue, it might be a hint for us that some subtleties in character encoding must be dealt with in any case.
- we should try NOT to include the myriad of possible encoding technologies, at least not promote this for needs other than backwards compatibility

To solve the encoding issue, an "encoding" SD-ID has been proposed that describes the encoding of the MSG part (I do not use precise wording on which encoding, simply because it is not relevant in this context - read on...). This SD-ID would by its very nature be optional. I follow Darren's reminder that truncation can always make SD-IDs (all or part) disappear. As such, the encoding specification would not be guaranteed to be received by the final destination. This contradicts the intention of that SD-ID: its ultimate purpose was to enable the receiver to use proper decoding for the MSG part.

Of course, this also raises the question of whether the SD-ID concept is good enough. For obvious reasons it suffers from the lack of reliability. I think this in general is acceptable. The only cure would be to bring reliability and thus full-duplex communication to syslog. This is way beyond our charter (if you like this, you should probably join NETCONF and help on NETCONF notifications). We have addressed this concern by moving all absolutely vital data to the header.

If we allow multiple encodings, the information about the encoding belongs in the header, so we would have another header field. While this is a solution, I think it is overengineered for what we actually need. Let us keep in mind that our ultimate desire is to have as many messages as possible use Unicode (CCS) and be UTF-8 encoded (CES), with UTF-8 also being the transfer encoding (Tom: I hope I got it right ;)). Any other encoding should only be supported for backward compatibility, either at the protocol level (transforming relays) or to leverage existing APIs (POSIX et al). So we are accepting the fact that other encodings need to be used, but we do not really like it (at least I don't). Assigning a header field for such a somewhat auxiliary feature would put too much weight on it and may even promote its use.

So I am now back to the proposal with the Unicode BOM. Let's keep in mind that we either a) know the character set [then we can convert to Unicode] or b) we do not know it [then we can convey no information about it, because else we would act