mmmm ....

As RFCs like 2130 state, protocol designers should differentiate between
protocol elements - which have no language, charset and so on - and text, which
does.  I see the text elements of syslog as MSG and PARAM-VALUE; these are the
ones defined as UTF-8-STRING and so the ones to which these issues apply.

The problem for me is termination of the string.  For the latter,
syslog-protocol says the characters '"', '\' and ']' MUST be escaped, so these
can then be used to tell us where the PARAM-VALUE ends.  In essence, we are
defining a transfer syntax for these fields (not IMHO a very elegant one, but
I don't have a better idea - I note that syslog-sign uses base64 to
transfer-encode its binary).
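
To make that concrete, here is a rough sketch (mine, in Python; the function
names are illustrative, not from the draft) of such an escape/unescape pair
for PARAM-VALUE:

    # Escape a PARAM-VALUE so that an unescaped ']' or '"' can safely mark
    # where the field ends; unescape reverses it.  Illustrative only.

    def escape_param_value(value):
        out = []
        for ch in value:
            if ch in ('\\', '"', ']'):
                out.append('\\')
            out.append(ch)
        return ''.join(out)

    def unescape_param_value(value):
        out = []
        chars = iter(value)
        for ch in chars:
            if ch == '\\':
                out.append(next(chars, ''))   # take the escaped character literally
            else:
                out.append(ch)
        return ''.join(out)

    # The ']' inside the value no longer looks like the end of the SD-ELEMENT.
    assert escape_param_value('a]b"c\\d') == 'a\\]b\\"c\\\\d'
    assert unescape_param_value('a\\]b\\"c\\\\d') == 'a]b"c\\d'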

So how do we terminate MSG?  Using a count has been suggested; ASCII NUL is
obviously used in some implementations; elsewhere I assume that it is
determined implicitly by the length of the UDP packet.

I believe this is not enough, since TCP is around as a transport and should be
on our radar.  Allowing UTF-8 as the charset (IETF's preferred term for CCS+CES)
allows all octet values from 0 to 127 and most of 128 to 255, so we lose the
obvious terminating characters.  MIME either uses a transfer syntax such as
base64 or quoted-printable - which brings many values back into play - or
allows the generating software to create its own terminating string, which can
then be chosen not to appear in the free text.  NETCONF uses a string that can
never be valid XML in a similar manner.
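
To make the TCP question concrete, a minimal sketch (mine, not from any draft)
of the two obvious framings: an octet-count prefix, which tolerates any
payload, versus a delimiter, which only works once the transfer syntax has
removed that octet from the payload:

    def send_counted(sock, msg):
        # Octet-count framing over a connected TCP socket: "<length> <payload>".
        # The payload may then contain any octet value, including NUL and LF.
        sock.sendall(str(len(msg)).encode('ascii') + b' ' + msg)

    def send_lf_delimited(sock, msg):
        # LF framing: only safe once the transfer syntax guarantees that LF
        # cannot appear inside the payload (escaped or forbidden).
        sock.sendall(msg + b'\n')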

My instinct is that we should be doing more in this area, in particular having
greater consistency between MSG and PARAM-VALUE in their transfer syntax and
termination.

Anyone else agree or disagree?

Tom Petch

----- Original Message -----
From: "Chris Lonvick" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, December 07, 2005 8:10 PM
Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)


Hi Folks,

I asked Patrik Faltstrom to review this proposal.  He has some comments
below.  Let's not get hung up on his details - he has looked this over
without any knowledge of our prior discussions.  He does have some good
pointers.

We may want to consider a "belt and suspenders" approach.

- senders MAY indicate their charset in the SD-ID.  If the SD-ID does not
contain any indication of a charset, then the receiver will just have to
guess (it may be US-ASCII or it may be something entirely different).
Having the UTF-8 BOM there would be a good indication that it is UTF-8.

- senders are RECOMMENDED to include a charset indicator in the SD-ID.
The ONLY one defined in the syslog-protocol will be [charset="UTF-8"].
When that is specified, then the BOM MUST be present.
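
Putting those two points together, the receiver side might look roughly like
this sketch (mine; the charset parameter name and the Latin-1 fallback are
purely illustrative assumptions):

    UTF8_BOM = b'\xef\xbb\xbf'   # the 3-octet UTF-8 encoding of U+FEFF

    def decode_msg(msg, sd_charset=None):
        # sd_charset is the value of the hypothetical charset="..." SD
        # parameter, or None if the SD-ID is absent (or was truncated away).
        if sd_charset == 'UTF-8' or msg.startswith(UTF8_BOM):
            body = msg[3:] if msg.startswith(UTF8_BOM) else msg
            return body.decode('utf-8')
        # No indication at all: the receiver can only guess.  Decoding as
        # Latin-1 never fails, so at least no octets are lost.
        return msg.decode('latin-1')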

To address Bazsi's concerns of too many charset definitions, Rainer could
indicate that additional charset values can only be accepted by the IANA
through Standards Action (RFC 2434).

As Patrik indicates, it would be good to see this separated into
- what can the sender send
- what will the receiver expect to receive.


I would like to see other comments on this proposal.  I need to review the
threads but I believe that we have rough consensus on all of the other
issues so that Rainer can re-work syslog-protocol.

Thanks,
Chris

PAF's comments below >>>


---------- Forwarded message ----------
Date: Wed, 7 Dec 2005 17:23:24 +0100
From: "[ISO-8859-1] Patrik Fältström"
To: Chris Lonvick <[EMAIL PROTECTED]>
Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)

> Let's first quickly review what has been discussed on list:
>
> - current implementations sometimes use LF as a record delimiter

Ok

> - some implementations use LF inside the MSG part

Ok

> - some implementations include binary data in syslog messages
>   and would like to continue to do so (but these seem to be few)

Ok

> - there are at least some use cases where a syslogd can not
>   definitely detect the character encoding of a message
>   (some of that might be related to the POSIX API, but there
>   may be a work-around [I had no time yet to evaluate this
>   in-depth]). It gets problematic if a message from a legacy
>   sender is received (no encoding information) and transformed
>   into a syslog-protocol message [I assume this is a valid use-case])

Ok

> - previous discussion showed the need for Unicode. With Unicode, the
>   term "printable character" basically becomes useless, because there
>   are so many non-printable characters in Unicode and new ones are
>   potentially added constantly.

Well...define "printable"... I don't really know what that means.

> - previous consensus thus was that any valid UTF-8 string MUST
>   be supported inside MSG (including NUL and LF)

NUL and LF are part of Unicode, and because of that part of UTF-8. UTF-8
encodes NUL and LF as one byte each, with the same values as the NUL and LF we
are used to.
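
A quick check (mine) makes the point:

    # NUL and LF keep their single-octet US-ASCII values under UTF-8.
    assert '\x00'.encode('utf-8') == b'\x00'
    assert '\n'.encode('utf-8') == b'\x0a'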

> - current discussion has shown that backwards-compatibility
>   is not absolutely vital (but still desirable)

Ok. Solves some of the binary problems.

> - it was suggested that an "encoding SD-ID" be defined which
>   carries the character set definition

Hmmm....why is the charset definition needed? That is then to be able to say
UTF-8 or BIG5 or...? It seems to be better and more important to say whether it
is UTF-8 or for example binhex encoded binary data.

Remember that the main difference between text and binary is that text is to be
converted regarding linebreak algorithms, while binary data is not.

> - as a side-note, Tom Petch has provided a very good digression
>   on "character encoding" terminology which I have reproduced after
>   my signature. I guess most people on this list already know the
>   exact differences, but I still find it useful...

Ok. I can not remember having seen it, but anyway...

> It is somewhat hard to find a good compromise. A compromise, in my point
> of view, must allow the following:

When looking at a protocol like this, you have to first of all define whether
the charset translation/transformation happens in the client or in the
server. This is not really clear to me. If the transformation is in the client,
the client translates to, for example, UTF-8. It can also be the server doing
it. (Or of course a client that reads from wherever the syslog daemon stores
the data, so that the storage can handle multiple charsets...but I think this
is out of the question?)

> - transforming existing messages into -protocol format should
>   not intentionally be forbidden - transformation is a very
>   important "feature" when it comes to deploying new technology

Yup.

> - new receivers should be able to understand the message content
>   precisely "enough"

Ok. Message content from old senders?

> - I also find it advisable that newer receivers are capable
>   of processing both old-style and new-style messages concurrently.
>   While this is an implementation issue, it might be a hint for
>   us that some subtleties in character encoding must be dealt
>   with in any case.

Ok.

> - we should try NOT to include the myriad of possible encoding
>   technologies, at least not promote this for needs other than
>   backwards compatibility

You have to distinguish between:

- whether the protocol has the ability to handle any encoding technology
- which encoding technologies are a MUST or SHOULD to implement

Two different things.

> To solve the encoding issue, an "encoding" SD-ID has been proposed that
> describes the encoding of the MSG part (I do not use precise wording on
> which encoding, simply because it is not relevant in this context - read
> on...). This SD-ID would by its very nature be optional. I follow
> Darren's reminder that truncation can always make SD-IDs (all or part)
> disappear. As such, the encoding specification would not be guaranteed
> to be received by the final destination. This contradicts the
> intention of that SD-ID: its ultimate purpose was to enable the
> receiver to use proper decoding for the MSG part.

Ok.

If you talk about truncation, the important thing is that the encoding
information comes before the data that is encoded, so that the data, and not
the meta-information, is truncated, if anything is.

> Of course, this also raises the question if the SD-ID concept is good
> enough. For obvious reasons it suffers from the lack of reliability. I
> think this in general is acceptable. The only cure would be to bring
> reliability and thus full-duplex communication to syslog. This is way
> beyond our charter (if you like this, you should probably join NETCONF
> and help on NETCONF notifications). We have addressed this concern by
> moving all absolutely vital data to the header. If we allow multiple
> encodings, the information about the encoding belongs in the header,
> so we would have another header field. While this is a solution, I think
> it is overengineered for what we actually need.

Ok.

> Let us keep in mind that our ultimate desire is to have as many messages
> as possible use Unicode (CCS) and be UTF-8 encoded (CES), with
> UTF-8 also being the transfer encoding (Tom: I hope I got it right ;)).

In IETF, we say "the charset is UTF-8", and with that we imply Unicode is the
character set.

So, don't get stuck in the details.

See RFC 3629. Just reference that.

Note byte order.

> Any other encoding should only be supported for backward compatibility
> either at the protocol level (transforming relays) or to leverage
> existing APIs (POSIX et al). So we are accepting the fact that other
> encodings need to be used, but we do not really like it (at least I
> don't).
>
> Assigning a header field for such a somewhat auxiliary feature would put
> too much weight on it and may even promote its use.
>
> So I am now back to the proposal with the Unicode BOM. Let's keep in
> mind that we either a) know the character set [then we can convert to
> Unicode]

No, not really. You can not do a proper conversion without losing data. The
question is whether you include the conversion as part of the protocol. Who is
doing the conversion? Is a non-UTF-8 charset allowed in the protocol? In that
case, the receiver of the message is supposed to do the translation...right?

> or b) we do not know it [then we can convey no information
> about it, because else we would actually have case a)]. So a simple
> indication whether or not MSG contains UTF-8 would be sufficient.

New-style is no problem. Old style is hard.

> I hereby propose that we RECOMMEND to use UTF-8 in all cases where this
> is possible. If UTF-8 is used, the MSG field MUST be prefixed by the
> properly-encoded Unicode BOM (a 3-octet overhead).
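
Purely as an illustration of what the quoted proposal would mean for a sender
(the function names are mine, not draft text):

    UTF8_BOM = b'\xef\xbb\xbf'   # the 3-octet UTF-8 encoding of U+FEFF

    def build_utf8_msg(text):
        # Proposed rule: a UTF-8 MSG carries the BOM as its first three octets.
        return UTF8_BOM + text.encode('utf-8')

    def build_other_msg(octets):
        # Proposed rule: MSG in any other encoding MUST NOT begin with those
        # three octet values; the proposal below inserts an SP in that case.
        if octets.startswith(UTF8_BOM):
            return b' ' + octets
        return octets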

See http://www.unicode.org/faq/utf_bom.html#29

You can not enforce this, I think. I think you should instead have a proper
header that says whether this is text and whether it is UTF-8.

> Any other encoding
> MAY be used. In this case the MSG field MUST NOT start with the octet
> values of the 3-octet UTF-8 encoded Unicode BOM.

I don't think you can say this. You don't know what other charsets might use
as bytes.

And, how do you know what charset is in use?

How do you know what is binary and not text?

> If necessary, a SP
> MUST be inserted before this sequence. Such a recommendation is within
> the expectation of a typical Unicode user/developer (at least I strongly
> think so).

What is "SP"? Space I guess. If one use UTF-16, space is not one byte...and in
EBCDIC I don't know what space is either. I think you talk about a specific
byte-value here, and not "space" as you don't know what to look for when you
don't know what charset is in use.

> The specification of other encodings, if there is an actual need for it,
> should be left for a separate document. That document should specify how
> to enhance syslog message content in a way inspired by MIME. I expect
> such a document to make use of SD-IDs to accomplish its goal. That would
> obviously again be subject to truncation. Here, I find this acceptable,
> because

Ok.

> a) any -protocol compliant receiver would still be able to process the
> message, at least in a basic way (thanks to the BOM)
> b) specific maximum minimum size restrictions can be placed on compliant
> receivers supporting such a specification
>
> That "encoding" document should also address the natural
> language/culture information, which I think we should not move into
> -protocol.

Ok.

Possible to have alternative formats?

> If we assume the encoding is solved, we still have not decided on NUL,
> LF and other US-ASCII control characters. If we look at existing syslog
> implementations, most of them use LF control characters as a kind of
> framing (End of Record - EOR - markers). Other control characters are
> simply escaped. Plain binary data is very seldom seen. NUL causes
> confusion to many existing receivers.

If you use UTF-8, you are fine.

> We can now ask ourselves: what problem does it cause if a sender sends a
> control character (e.g. BEL) and a relay transforms it to an escaped
> form (e.g. '^07'). If we follow this route, we see that there is nothing
> bad with it per se. It becomes a problem only if a digital signature of
> the message is transmitted (in the way syslog-sign intends to do).
>
> IMPORTANT FINDING: There is no problem with message transformation
> EXCEPT when the messages are digitally signed.
>
> IMPORTANT OBSERVATION: we do not yet have digital signatures in syslog.

Yup. Good catch.

> CONCLUSION: we do not need to care!
>
> As it looks, we are trying to solve a problem that does not yet even
> exist. And this not-yet-existing problem is the only issue that is
> causing us real grief here, especially if we look at backwards
> compatibility. syslog-sign is still in draft state right now. It is free
> to place further restrictions on whatever -protocol specifies. Of
> course, it should not do this in an unexpected and unnecessary way. It
> can be done quite non-intrusively, at least for the vast majority of
> syslog data. Please read on, the simple solution will be below, but I
> need to switch the topic back to syslog-protocol.
>
> With all that said, I propose the following for the MSG field in
> syslog-protocol (in regard to control characters):

Ok.

> MSG MAY contain any character including octets with values less than 32.
> This is the US-ASCII control character range without DEL, which I
> generally consider harmless. HOWEVER, it is RECOMMENDED that MSG does
> NOT include any characters with octet values less than 32.

Ok.

> This applies
> to both UTF-8 encoded data as well as other data.

No difference.

> If a syslog sender
> uses octet values less than 32, it MUST expect that a receiver modifies
> the message, which will lead to invalidation of any existing
> digital signatures.

Ok.

> If message transformation is not acceptable to the
> sender, it MUST escape octet values less than 32 before sending the
> message. All other Unicode control character sequences are not
> considered extremely problematic, but are best avoided if no message
> transformation is required. LF and NUL have no special meaning per se.
> Most importantly, they do NOT indicate the end of the MSG field.
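
Purely as an illustration of the escaping this asks of a sender (the '^hh'
form follows the '^07' example earlier in this message and is not normative; a
real scheme would also need to escape '^' itself to stay reversible):

    def escape_controls(msg):
        # Replace octet values below 32 with a visible '^hh' hex form so a
        # relay has nothing left to transform (and a signature would survive).
        out = bytearray()
        for b in msg:
            if b < 32:
                out += b'^%02x' % b      # e.g. BEL (0x07) becomes b'^07'
            else:
                out.append(b)
        return bytes(out)

    assert escape_controls(b'ring\x07me\n') == b'ring^07me^0a'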

Ok.

What about bidirectional text?

> I think this proposal
>
> a) provides an easy way to properly encode all currently-existing syslog
> MSG content
> b) provides guidelines for new implementations
> c) cautions against control character usage
> d) levels ground for syslog-sign
>
> While allowing everything, it tells the implementor what is bad.
> Syslog-sign could then use the hint provided here and restrict
> to-be-signed messages not to include the US-ASCII control character
> range without any transfer encoding (like base64).
>
> I think this proposal provides a backwards-compatible and yet extensible
> way to useful MSG content formatting.
>
> Please let me know any objections you might have and, if so, please
> precisely describe the problem you are seeing. Examples, external
> references, and/or lab test results would be appreciated in those cases.
>
> Many thanks,
> Rainer
>
> Tom Petch's Digression on "character encoding" terminology:
> ####
> Character Set is a set of characters (letters, numbers, symbols, glyphs
> ...)
> Coded Character Set [CCS] gives each a (numeric) code, as in ISO 10646.
> Character Encoding (Scheme/Syntax) [CES] specifies how the codes become
> octets as in
> UTF-8.
> Transfer Encoding/Syntax specifies how the octets are put on the wire,
> as in
> Base64.
>
> MIME conflates CCS and CES to charset but keeps (Content) Transfer
> Encoding
> distinct; they can be different in different parts of an e-mail.
> ####
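
To make the digression concrete, here is one character followed through the
three layers (my example, with base64 standing in as the transfer encoding):

    import base64

    ch = '\u00e9'                     # the CCS (ISO 10646) gives 'é' code point U+00E9
    octets = ch.encode('utf-8')       # the CES (UTF-8) turns that into the octets C3 A9
    wire = base64.b64encode(octets)   # a transfer encoding (base64) puts them on the wire
    assert octets == b'\xc3\xa9'
    assert wire == b'w6k='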

     paf






_______________________________________________
Syslog mailing list
Syslog@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/syslog
