RE: [Syslog] #5 - character encoding (was: Consensus?)

Rainer Gerhards Thu, 01 Dec 2005 04:14:49 -0800

Tom,

I appreciate your point. My intention is:


-15 specifies that MSG must contain UTF-8 exclusively (the full character 
set). During implementation, I have found that I cannot obtain the encoding 
information for to-be-sent messages under Unix. In the meantime, Balazs 
Scheidler has suggested a potential way to do that; if it works, this would 
probably be a non-issue. For the time being, let's assume it cannot be 
obtained. In many cases, I have different encodings, like ISO 8859-1 or EUC, 
at least in parts of the message. As I do not know the encoding, I cannot 
properly convert it to Unicode. So the syslogd would send a non-compliant 
message. As there is a high chance of invalid UTF-8 sequences (because it is 
not UTF-8), a compliant receiving syslogd must drop this message because it is 
invalid. My point here is that I am not really interested in whether the 
sending syslogd is to blame or not. My point is that the message cannot be 
received.
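
To illustrate the failure case, here is a minimal Python sketch (not part of 
the draft). The helper name is_valid_utf8 and the sample message are 
assumptions made for the example. It shows why an ISO 8859-1 payload passed 
on unchanged would be rejected by a receiver that strictly checks for UTF-8:

```python
def is_valid_utf8(payload: bytes) -> bool:
    """Return True if the octets form a well-formed UTF-8 sequence."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same text as an application might hand it to syslogd.
latin1_msg = "Müller logged in".encode("iso-8859-1")  # 'ü' is the single octet 0xFC
utf8_msg = "Müller logged in".encode("utf-8")         # 'ü' is the octets 0xC3 0xBC

# A strict receiver would drop the first message and accept the second.
print(is_valid_utf8(latin1_msg))
print(is_valid_utf8(utf8_msg))
```

The sender never learns which case it is in, which is exactly the problem: 
the octet 0xFC is ordinary text in ISO 8859-1 but can never appear in 
well-formed UTF-8.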

My proposal was to recommend UTF-8 whenever possible, but allow MSGs with 
unknown encoding when we cannot obtain encoding information. To differentiate, 
I suggested that the Unicode BOM be used if it is UTF-8. Though there might 
still be a small window of misinterpretation, I'd expect that a UTF-8 encoded 
BOM is very unlikely to appear in the first three octets of an ordinary syslog 
message. I found this easy and acceptable.
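
A receiver-side check along these lines could look like the following Python 
sketch. The function name classify_msg and the "unknown" label are 
hypothetical, chosen only for illustration; codecs.BOM_UTF8 is the standard 
three-octet UTF-8 encoding of the BOM (EF BB BF):

```python
import codecs


def classify_msg(msg: bytes) -> tuple[str, bytes]:
    """Split a MSG into (encoding label, payload) based on a leading BOM.

    If the first three octets are the UTF-8 encoded BOM, the rest is
    declared to be UTF-8. Otherwise the encoding is simply unknown and
    the octets are passed through untouched.
    """
    if msg.startswith(codecs.BOM_UTF8):
        return "utf-8", msg[len(codecs.BOM_UTF8):]
    return "unknown", msg


with_bom = codecs.BOM_UTF8 + "Müller logged in".encode("utf-8")
without_bom = "Müller logged in".encode("iso-8859-1")

print(classify_msg(with_bom)[0])
print(classify_msg(without_bom)[0])
```

The small window of misinterpretation is a message in some other encoding 
that happens to start with the octets EF BB BF. In ISO 8859-1 that would be 
the text "ï»¿", which is unlikely at the start of an ordinary log line.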

If the syslogd can reliably obtain the encoding in use - as Balazs thankfully 
mentioned - we could stick with UTF-8 only, as it would then be no issue. The 
only issue that might remain is whether we can expect implementors to 
implement a converter from any given character set to Unicode - but that's a 
different story.

The ever-changing fragile WG consensus at this time of the year seems to be 
that we are back to supporting all possible encodings to address the need I 
mentioned. While I do not really like this approach, it will allow me to do 
what I need to do. So I do not object to it.

I agree with you that we should not try to focus too much on backwards 
compatibility. But on the other hand, Vancouver told us people would like to 
see it. The list then said "oh no". A few days later we have multiple voices 
saying we must support this and that. I have to admit that I lose my sense of 
a stable consensus the longer I discuss this.

For me, I have decided to only voice my concerns if I believe something will be 
broken. Field order, field semantics and a lot of the other issues currently 
being re-re-re-re-considered are not really that important. Even if we end up 
with something totally horrible, I am sure it is possible to program a parser 
that handles it. After all, our parsers handle today's syslog - can it really 
become worse? I think there would be huge value in a syslog standard, no matter 
how ugly the details may look to some of us. After all, beauty is a very 
subjective concept ;)

I hope I have been able to convey my root concern on the encoding. On the other 
issues, I am waiting for WG consensus to be declared and then I will include 
that consensus, whatever it is, into the I-D. I just hope it'll stay stable 
long enough so that the I-D can proceed...

Rainer

> -----Original Message-----
> From: Tom Petch [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, December 01, 2005 9:25 AM
> To: Rainer Gerhards; Chris Lonvick
> Cc: [EMAIL PROTECTED]
> Subject: Re: [Syslog] #5 - character encoding (was: Consensus?)
> 
> Rainer
> 
> I think I detect an approach I do not agree with, in this and perhaps
> other issues.
> 
> You seem to be saying that the (eg POSIX) syslogd must emit perfect
> syslog messages and is responsible for anything that is wrong with them
> no matter what it received from the application (I exaggerate slightly).
> 
> I would say that if the application passes incomprehensible garbage,
> something criminal or illegal, then it is the application that is at
> fault; syslogd can only be held responsible if it produces messages that
> are invalid for the parts over which it has control, eg header syntax.
> 
> So if syslogd has no idea what the transfer encoding is because the rest
> of the system does not tell it, then syslogd cannot be held responsible
> for the absence of a field saying what the transfer encoding actually is.
> Or put differently, if our RFC specifies what the application MUST or
> SHOULD do, as well as syslogd, then that is ok with me.
> 
> What syslogd would be responsible for, IMO, would be allowing characters
> that have a special meaning in the syntax (eg NUL is end of message)
> appearing unescaped (or otherwise encoded).  Whether we have such
> problems depends on the resolution of other issues; I am not saying that
> we have any at present.
> 
> Tom Petch
> 
> ----- Original Message -----
> From: "Rainer Gerhards" <[EMAIL PROTECTED]>
> To: "Chris Lonvick" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Wednesday, November 30, 2005 2:48 PM
> Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)
> 
> 
> Chris,
> 
> I fully agree - thanks ;)
> 
> Rainer
> 
> > -----Original Message-----
> > From: Chris Lonvick [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, November 30, 2005 2:39 PM
> > To: Rainer Gerhards
> > Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)
> >
> > Hi Rainer,
> >
> > I believe that we are saying the same thing.  :)
> >
> > If there is no indicator of encoding or language then a receiver will
> > not know what it is receiving - just like receivers don't know what
> > they are receiving today.  They MAY make an assumption that it is
> > something in US-ASCII (but may be disappointed).
> >
> > If there is an indicator of the encoding and language then the
> > receiver will know exactly what it is.  Having an indicator should be
> > RECOMMENDED but not REQUIRED for ease of migration.
> >
> > Is that what we're all saying?
> >
> > Thanks,
> > Chris
> >
> >
> >
> > On Wed, 30 Nov 2005, Rainer Gerhards wrote:
> >
> > > Chris,
> > >
> > >> Let's use this email as an example.  :)  There is no
> > >> indication that I'm
> > >> using US-ASCII encoding or that I'm writing in English.
> > >
> > > I think there actually is. If I am right, the SMTP RFCs require mail
> > > text to be US-ASCII. Only via MIME and/or escape characters can you
> > > include 8-bit data. For example, Müller and Möller might create some
> > > problems in some mailers (but I guess my mail system will encode them
> > > with =<hexval>). Dropping messages with octets > 127 in the subject is
> > > a common spam protection setting...
> > >
> > >> However, you're able to receive this and read it.  Similarly, you
> > >> could write an email in German and send it to me.  I would still be
> > >> able to receive it but I'd have a difficult time parsing the meaning.
> > >>
> > >> I'm suggesting that same approach for the transmission of the syslog
> > >> content.  If I really wanted you to know what encoding and language
> > >> I'm using in an email, I would specify a MIME header.  syslog senders
> > >> will continue to pump out whatever encoding and language they've been
> > >> using and receivers will continue to do their best to parse them.  If
> > >> a vendor wants to get very specific about that, then they will have
> > >> to use an SD-ID to identify the contents of the message.
> > >
> > > Here I agree with you. What I was saying is that only IF the header
> > > says it is US-ASCII should we assume it actually is. If there is no
> > > "enc" SD-ID, then we do not know what it is but can assume ... whatever
> > > we assume. Let me phrase it this way:
> > >
> > > If the message contains
> > >
> > > [enc="us-ascii" lang="en"]
> > >
> > > then the receiver can honestly expect it to be US-ASCII. But if it does
> > > not contain any "enc", the receiver does not know exactly and can
> > > assume anything it finds useful (may be ASCII, may not).
> > >
> > > Does this clarify? I somehow have the impression we mean the same
> > > thing and I simply do not manage to convey what I intend to ;)
> > >
> > > Rainer
> > >
> > >>
> > >> Mit Aufrichtigkeit,
> > >> Chris
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, 30 Nov 2005, Rainer Gerhards wrote:
> > >>
> > >>> Andrew,
> > >>>
> > >>>>> Hi Rainer,
> > >>>>>
> > >>>>> Why don't we look at it from the other direction?  We could state
> > >>>>> that any encoding is acceptable - for ease-of-use/migration with
> > >>>>> existing syslog implementations.  It is RECOMMENDED that UTF-8 be
> > >>>>> used.  When it is used, an SD-ID element will be REQUIRED.  e.g. -
> > >>>>> [enc="utf-8" lang="en"]
> > >>>>
> > >>>> I like that idea too.
> > >>>>
> > >>>> So, if no SD-ID encoding element is specified, then we must
> > >>>> assume US-ASCII
> > >>>> and deal with it accordingly??
> > >>>
> > >>> I think not. If it is not present, we know that we do not know it.
> > >>> If it is US-ASCII, I would expect something like
> > >>>
> > >>> [enc="us-ascii" lang="en"]
> > >>>
> > >>> Of course, we could also say if it is non-present, we can assume
> > >>> US-ASCII. But then we would need to introduce
> > >>>
> > >>> [enc="unknown"]
> > >>>
> > >>> for the (common) case where we simply do not know it (again: think
> > >>> POSIX). I find this somewhat confusing.
> > >>>
> > >>> Rainer
> > >>>
> > >>
> > >
> >
> 
> _______________________________________________
> Syslog mailing list
> Syslog@lists.ietf.org
> https://www1.ietf.org/mailman/listinfo/syslog
> 
> 
