From: Addison Phillips [wM]
> Please note that there is a discussion list for this topic at: [EMAIL PROTECTED]
>
> While Mark and I welcome your comments here or privately, off-list, you can best be
> a part of the discussion by joining that list. Join the list by sending a request email
> to: [EMAIL PROTECTED]

I note that the language tags proposal includes the following EBNF productions for extensions that may be appended after the language code, script code, region code and variant code:

    extensions = "-x" 1* ("-" key "=" value)
    key        = ALPHA *alphanum
    value      = 1* utf8uri
    alphanum   = (ALPHA / DIGIT)
    utf8uri    = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))

Under this new scheme, the following language tag may be valid:

    "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"

which here would mean:

    {
      language="sr";      // Serbian
      script="Latn";      // Latin
      region="SP";        // Serbia-Montenegro
      variant="2003";
      extensions="-x";
      {
        href="http://www.iana.org/"
        version="1.0"
      }
    }

However, the problem with that scheme is its new use of the characters "%" and "=". Many applications were not expecting anything in this field other than alphanumerics and "-", "_" or ".", so that the language tag could safely be used without specific escaping within URIs (for example in HTTP GET URLs) or as an option of a MIME type. I take a sample here, which may not correspond to an existing option of the "text/plain" MIME type:

    Content-Type: text/plain; charset=UTF-8; lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0

This may break the compatibility of some parsers if such "extended language tags" are found there, as there are two "=" signs within the value of the "lang=" option.

For GET URLs, these extra "%" and "=" will need to be URL-encoded to get through correctly. Otherwise the following becomes possible, and it is prone to generating form data parsing errors:

    http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0

I think it's quite strange that these extensions do not use the existing restricted character set to encode them, instead of relying on "%" and "=". Why not use "_" instead of "=" and "."
instead of "%", like this:

    "sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"

(same meaning as the first example above).

But at least this draft offers a good starting point to indicate locales more precisely. I fully support the new reference to the ISO 15924 standard for the script code, and the documentation of the legal values of variant codes (either a year with possible era, or a registered tag), as well as the clear indication that language codes should be the shortest ISO 639 codes. (Is this true for the few legacy languages which were previously coded with 3 letters and later upgraded to 2-letter codes, before there was a policy not to do this anymore in the future?)

I don't know where this affects Unicode, except in its possible normative data tables which may contain future language code conditions, or in language tags inserted in Unicode-encoded texts. Does Unicode need these extensions?
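
To make the comparison between the two encodings concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names (parse_extensions, encode_restricted, query_roundtrip) are hypothetical, the regular expression is my reading of the productions quoted above, and the "_" / "." scheme is the alternative suggested in this message, not anything defined by the draft.

```python
import re
from urllib.parse import parse_qs, unquote

# One "key=value" item, following the quoted productions:
# key = ALPHA *alphanum ; value = 1*(ALPHA / DIGIT / "%" 2HEXDIG)
_ITEM = re.compile(r"([A-Za-z][A-Za-z0-9]*)=((?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+)")


def parse_extensions(tag):
    """Split a tag at the first "-x-" and decode its key=value extensions."""
    base, sep, ext = tag.partition("-x-")
    pairs = {}
    if sep:
        # Values cannot contain a literal "-", so "-" separates the items.
        for item in ext.split("-"):
            match = _ITEM.fullmatch(item)
            if match is None:
                raise ValueError("malformed extension item: %r" % item)
            pairs[match.group(1)] = unquote(match.group(2))
    return base, pairs


def encode_restricted(base, pairs):
    """Re-encode extensions using "_" for "=" and "." for "%" (assumed scheme)."""
    def escape(value):
        out = []
        for byte in value.encode("utf-8"):
            if byte < 128 and chr(byte).isalnum():
                out.append(chr(byte))        # ASCII letters and digits pass through
            else:
                out.append(".%02X" % byte)   # everything else becomes ".HH"
        return "".join(out)

    items = "".join("-%s_%s" % (key, escape(val)) for key, val in pairs.items())
    return base + "-x" + items if pairs else base


def query_roundtrip(tag):
    """Return what a form handler sees if the tag is placed unescaped in a GET URL."""
    return parse_qs("lang=" + tag)["lang"][0]
```

With the example tag from this message, parse_extensions yields the base "sr-Latn-SP-2003" plus href and version, and encode_restricted reproduces the "_" / "." form shown above. query_roundtrip shows one of the failure modes: Python's own query parser tolerates the extra "=" signs but decodes the "%" escapes, so the received value contains ":" and "/" and no longer matches the grammar. Other parsers may behave differently.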