RE: [Syslog] #5 - character encoding (was: Consensus?)

Chris Lonvick Wed, 30 Nov 2005 05:06:47 -0800

Hi Sheran,

On Tue, 29 Nov 2005, Shyyunn Lin (sheranl) wrote:

Chris:

I think having SD-ID with [enc="utf-8" lang="English"] may be a good
approach. If different languages use UTF-8 encoding, then "lang=" can
distinguish them.

We _should_ be using language codes from RFC 3066. That specifies ISO 639 language tags. 639-1 has 2 character codes ("en" is English) and 639-2 has 3 characters ("eng" is English). RFC 3066 will likely be replaced by the works of the Language Tag Registry Update (ltru) Working Group.

  http://www.ietf.org/html.charters/ltru-charter.html

They have IDs in the works. Until those become RFCs we should continue to reference RFC 3066.


Also want to clarify that you suggest that if the message is in ASCII,
it will not require an SD-ID, but for all other encodings, an SD-ID
will be required.


Yes - that's my suggestion.
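
A minimal sketch of the rule as clarified above, assuming the element takes the [enc="..." lang="..."] form used in this thread. The function names and the backslash-escaping of '"', '\' and ']' are illustrative assumptions, not text from any draft.

```python
def escape_param(value: str) -> str:
    # Assumption: backslash-escape characters that would otherwise
    # terminate the quoted value or the element. Backslash goes first.
    for ch in ('\\', '"', ']'):
        value = value.replace(ch, '\\' + ch)
    return value


def encoding_element(msg: bytes, enc: str, lang: str) -> str:
    """Return the element to attach, or '' when the message is plain ASCII."""
    if all(b < 0x80 for b in msg):
        # Pure ASCII: under the suggestion above, no element is needed.
        return ''
    return '[enc="{}" lang="{}"]'.format(escape_param(enc), escape_param(lang))
```

A sender would call encoding_element() with the raw message bytes and whatever encoding and language it knows the message to be in; how the sender learns those values is outside this sketch.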


Note that most other encoding methods already imply the language used.
For example, Chinese has several encoding methods: Traditional Chinese,
used in Taiwan and Hong Kong, is Big5, and Simplified Chinese, used in
Mainland China, is GBK. So if the message is in Traditional Chinese
characters, it will be shown as [enc="Big5", lang="Traditional
Chinese"], which is a little bit redundant. Big5 also includes all
English characters, so a message can be a mix of Chinese and English.

Good point. As far as I can tell, "Big5" is not recognized by any accredited standards developing organization. It is recognized by the Ideographic Rapporteur Group (IRG) which reports to the Unicode consortium. The recognized way to represent Chinese characters, traditional and simplified, is through ISO 639-2 with the subcodes to indicate traditional and simplified for the "zh" _language_. The ID on "Tags for Identifying Languages"


  http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-14.txt

identifies simplified Chinese as "zh-Hans" and traditional Chinese as "zh-Hant". Additional subtags could identify a locale such as "zh-Hant-TW" for Taiwan Chinese in traditional script. This is from the "Initial Language Subtag Registry" ID.


http://www.ietf.org/internet-drafts/draft-ietf-ltru-initial-06.txt
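
To illustrate how tags such as "zh-Hans" and "zh-Hant-TW" break down, here is a rough sketch that splits a tag into language, script and region subtags. The function name is hypothetical, and the length-based rules (4 letters for a script, 2 letters or 3 digits for a region) are a simplification that ignores variants, extensions and private-use subtags.

```python
def split_language_tag(tag: str) -> dict:
    """Split a simple language tag into language, script and region."""
    parts = tag.split('-')
    result = {'language': parts[0].lower(), 'script': None, 'region': None}
    for sub in parts[1:]:
        is_script = len(sub) == 4 and sub.isalpha()
        is_region = (len(sub) == 2 and sub.isalpha()) or \
                    (len(sub) == 3 and sub.isdigit())
        if is_script and result['script'] is None and result['region'] is None:
            # Script subtags are conventionally written in title case.
            result['script'] = sub.title()
        elif is_region and result['region'] is None:
            # Region subtags are conventionally written in upper case.
            result['region'] = sub.upper()
    return result
```

With this sketch, "zh-Hant-TW" yields language "zh", script "Hant" and region "TW", while "zh-Hans" yields language "zh" and script "Hans" with no region.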

I think that we should specify encoding and language tags as straightforward as possible and let others augment syslog-protocol (in the future) with other encoding mechanisms. We can RECOMMEND that encoding be in UTF-8 and language tags come from RFC 3066. We can allow that other encoding and language identifications are acceptable. In the worst case, a vendor will have the option of [EMAIL PROTECTED]"something" [EMAIL PROTECTED]"piglatin"].


Does this work for you?

Thanks,
Chris




Regards,

Sheran

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
(clonvick)
Sent: Tuesday, November 29, 2005 10:22 AM
To: Rainer Gerhards
Cc: [EMAIL PROTECTED]
Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)

Hi Rainer,

Why don't we look at it from the other direction?  We could state that
any encoding is acceptable - for ease-of-use/migration with existing
syslog implementations.  It is RECOMMENDED that UTF-8 be used.  When it
is used, an SD-ID element will be REQUIRED.  e.g. - [enc="utf-8"
lang="en"]

Thoughts?

All:  Let's discuss this and close this issue.

Thanks,
Chris

On Tue, 29 Nov 2005, Rainer Gerhards wrote:

Chris & WG,

#5 Character encoding in MSG: due to my proof-of-concept
  implementation, I have raised the (ugly) question if we need
  to allow encodings other than UTF-8. Please note that this
  question arises from needs introduced by e.g. POSIX. So we
  can't easily argue them away by wishful thinking ;)

Not even discussed yet.


I haven't reviewed that yet.  However, I'll note that allowing
different encoding can be accomplished in the future as long as we
establish a default encoding and a way to identify it in our current
work.


I have read a little in the mailing archive. Please note that in 2000
the consensus was that the MSG part may contain encodings other than
US-ASCII. Follow this thread:

http://www.syslog.cc/ietf/autoarc/msg00127.html

This discussion led to RFC 3164 saying "other encodings MAY be used".
While this was observed behaviour, we still need to be aware that the
POSIX (and glibc) API places the restriction on us that we simply do
not know the character encoding used by the application. As such, no
*nix syslogd can be programmed to be compliant with syslog-protocol if
we demand UTF-8 exclusively.

I propose that we RECOMMEND UTF-8 that MUST start with the Unicode
Byte Order Mark (BOM) if used. If the MSG part does not start with the
BOM, it may be any encoding just as in RFC 3164. I do not see any
alternative to this.
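
A receiver-side sketch of this proposal, assuming the marker in question is the three-byte UTF-8 BOM (EF BB BF) at the very start of MSG. The function name and the (payload, encoding) return shape are illustrative only.

```python
import codecs


def decode_msg(msg: bytes):
    """Return (payload, encoding) for a received MSG part.

    If MSG starts with the UTF-8 BOM, strip it and decode the rest as
    UTF-8. Otherwise the encoding is unknown, so the raw bytes are
    returned untouched with encoding None.
    """
    if msg.startswith(codecs.BOM_UTF8):
        text = msg[len(codecs.BOM_UTF8):].decode('utf-8')
        return text, 'utf-8'
    return msg, None
```

Note that decode() raises UnicodeDecodeError if a message carries the BOM but is not valid UTF-8; whether a receiver should reject or pass through such a message is not settled in this thread.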

Rainer

_______________________________________________
Syslog mailing list
Syslog@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/syslog

