Re: Proposed Successor to RFC 3066 (language tags)

Philippe Verdy Wed, 19 Nov 2003 17:45:21 -0800

From: Addison Phillips [wM]
> Please note that there is a discussion list for this topic at:
> [EMAIL PROTECTED]
>
> While Mark and I welcome your comments here or privately, off-list, you
> can best be a part of the discussion by joining that list. Join the list
> by sending a request email to: [EMAIL PROTECTED]


I note that the language tags proposal includes the following EBNF
productions for extensions that may be appended after the language code,
script code, region code, and variant code:

extensions  = "-x" 1* ("-" key "=" value)
key  = ALPHA *alphanum
value  = 1* utf8uri
alphanum  = (ALPHA / DIGIT)
utf8uri  = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))

Under this new scheme, the following language tag may be valid:
"sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
which here would mean: {
    language="sr"; // Serbian
    script="Latn"; // Latin
    region="SP"; // Serbia-Montenegro
    variant="2003";
    extensions="-x"; {
        href="http://www.iana.org/";
        version="1.0"
    }
}
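
To make the decomposition above concrete, here is a minimal Python sketch
that splits such a tag and decodes its extensions. The function name
parse_extensions and the regular expression are my own illustration, not
part of the draft; the pattern is slightly looser than the EBNF, since it
accepts any run of "%XX" escapes rather than limiting them to groups of
1 to 4.

```python
import re
from urllib.parse import unquote

# One "-key=value" pair: key = ALPHA *alphanum, value = alphanumerics or %XX escapes.
# (Assumption: looser than the draft's 1*4 limit on consecutive escapes.)
EXT_PAIR = re.compile(
    r"-([A-Za-z][A-Za-z0-9]*)=((?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+)"
)

def parse_extensions(tag):
    """Return (list of leading subtags, dict of decoded extensions)."""
    head, sep, tail = tag.partition("-x-")
    subtags = head.split("-")
    if not sep:
        return subtags, {}          # no "-x" extension section
    tail = "-" + tail               # restore the "-" that precedes the first key
    extensions = {}
    pos = 0
    while pos < len(tail):
        match = EXT_PAIR.match(tail, pos)
        if match is None:
            raise ValueError(f"malformed extension at offset {pos}: {tail[pos:]!r}")
        extensions[match.group(1)] = unquote(match.group(2))
        pos = match.end()
    return subtags, extensions

tag = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
subtags, extensions = parse_extensions(tag)
print(subtags)      # ['sr', 'Latn', 'SP', '2003']
print(extensions)   # {'href': 'http://www.iana.org/', 'version': '1.0'}
```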

However, the problem with that scheme is its new use of the characters "%"
and "=". Many applications do not expect anything in this field other than
alphanumerics and "-", "_" or ".", which is what lets a language tag be used
safely, without specific escaping, within URIs (for example in HTTP GET
URLs) or as an option of a MIME type (the sample below is only an
illustration and may not correspond to an existing option of the
"text/plain" MIME type):

Content-Type: text/plain; charset=UTF-8;
lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0

This may break the compatibility of some parsers if such "extended language
tags" are found there, as there are two more "=" signs within the value of
the "lang=" option.
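
As an illustration of the kind of breakage I mean, here is a deliberately
simplistic parameter parser in Python. It is a hypothetical stand-in for
parsers that assume exactly one "=" per parameter, not the code of any real
MIME library, and the "lang" option is the same invented one as above.

```python
def naive_params(header_value):
    """Split 'type; a=b; c=d' the way a simplistic parser might."""
    media_type, *params = [part.strip() for part in header_value.split(";")]
    result = {}
    for param in params:
        if not param:
            continue
        pieces = param.split("=")
        # Assumes one "=" per parameter, which extended tags violate.
        if len(pieces) != 2:
            raise ValueError(f"malformed parameter: {param!r}")
        result[pieces[0]] = pieces[1]
    return media_type, result

# A plain RFC 3066 style tag goes through.
print(naive_params("text/plain; charset=UTF-8; lang=sr-Latn"))

# The extended tag carries two more "=" signs and is rejected.
extended = ("text/plain; charset=UTF-8; "
            "lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0")
try:
    naive_params(extended)
except ValueError as error:
    print("rejected:", error)
```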

For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
through correctly, as the following would become possible and prone to
generate form data parsing errors:

http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
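
The effect can be checked with Python's standard urllib.parse module (my
choice of tool for the illustration; any form decoder that follows the
usual rules should behave the same way). Sent unescaped, the tag comes back
with its "%XX" sequences already decoded, so it is no longer the tag that
was sent; it only survives if "%" and "=" are themselves escaped first.

```python
from urllib.parse import parse_qs, quote

TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"

# Tag placed in the query string as-is: the form decoder unescapes it once.
raw_query = "lang=" + TAG
decoded_raw = parse_qs(raw_query)["lang"][0]
print(decoded_raw)          # sr-Latn-SP-2003-x-href=http://www.iana.org/-version=1.0
print(decoded_raw == TAG)   # False

# Tag escaped before use: "%" becomes "%25" and "=" becomes "%3D".
safe_query = "lang=" + quote(TAG, safe="")
decoded_safe = parse_qs(safe_query)["lang"][0]
print(decoded_safe == TAG)  # True
```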

I think it's quite strange that these extensions do not use the existing
restricted character set to encode them, instead of relying on "%" and "=".
Why not use "_" instead of "=" and "." instead of "%", like this:
"sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
(same meaning as the first example above).
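
A rough Python sketch of this alternative follows; the function names
to_restricted and decode_restricted are mine and purely illustrative. It
relies on the fact that, under the draft's grammar, keys and values contain
only alphanumerics and escapes, so "_", "." and "-" cannot occur inside
them and remain unambiguous as delimiters.

```python
import re
from urllib.parse import unquote

def to_restricted(tag):
    """Rewrite the extension part using "_" for "=" and "." for "%"."""
    head, sep, tail = tag.partition("-x-")
    return head + sep + tail.replace("=", "_").replace("%", ".")

# key "_" value, where value is alphanumerics or ".XX" escapes.
PAIR = re.compile(r"([A-Za-z][A-Za-z0-9]*)_((?:[A-Za-z0-9]|\.[0-9A-Fa-f]{2})+)")

def decode_restricted(tag):
    """Return the decoded extensions of a tag written in the restricted form."""
    head, sep, tail = tag.partition("-x-")
    extensions = {}
    # Values never contain "-", so splitting on it separates the pairs.
    for item in (tail.split("-") if sep else []):
        match = PAIR.fullmatch(item)
        if match is None:
            raise ValueError(f"malformed extension: {item!r}")
        # Turn ".XX" back into "%XX" and reuse the standard percent-decoder.
        percent_form = re.sub(r"\.([0-9A-Fa-f]{2})", r"%\1", match.group(2))
        extensions[match.group(1)] = unquote(percent_form)
    return extensions

draft_tag = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
restricted = to_restricted(draft_tag)
print(restricted)
# sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0
print(decode_restricted(restricted))
# {'href': 'http://www.iana.org/', 'version': '1.0'}
```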

But at least this draft offers a good starting point to indicate locales
more precisely.

I fully support the new reference to the ISO-15924 standard for the script
code, and the documentation of the legal values of variant codes (either a
year with a possible era, or a registered tag), as well as the clear
indication that language codes should be the shortest ISO-639 codes (does
this also hold for the few legacy languages that were previously coded with
3 letters and later upgraded to 2-letter codes, before the policy not to do
this anymore was adopted?).

How this affects Unicode, I don't know, except through its possible
normative data tables, which may contain future language code conditions, or
through language tags inserted in Unicode-encoded texts. Does Unicode need
these extensions?
