Re: lowercased Unicode language tags ? (was: ISO 15924)

Philippe Verdy Mon, 03 May 2004 01:52:44 -0700

From: "Doug Ewell" <[EMAIL PROTECTED]>
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> As I mentioned before, this will never happen, because even if an ISO
> 3166-2 region code did appear in a language tag (by registration, as
> John Cowan points out), the country and region would still be separated
> by a hyphen.  The hypothetical region in Laos would be coded "LA-TN",
> and so the whole language tag would be "lo-LA-TN", distinguishable from
> "lo-Latn" regardless of capitalization.
>
> There is in fact no such region as LA-TN, but just for fun, I compiled a
> list of the codes that would be ambiguous if Philippe's hyphenless
> assumption were true.  It's not a long list.
>
> CA-NS:  Canada, Nova Scotia
> Cans:  Unified Canadian Aboriginal Syllabics
>
> IT-AL:  Italy, Alessandria
> Ital:  Old Italic (Etruscan, Oscan, etc.)
>
> This might qualify as the first recorded frivolous use of ISO 15924
> codes.


Note that I'm focussing on problems that may arise from RFC 3066. There is in
fact no problem with ISO 639, ISO 3166 or ISO 15924 taken in isolation. The
problem lies in the ambiguous syntax of RFC 3066 once it is modified to include
an optional script code followed by an optional country+region code.

OK, suppose now that a hyphen is required between the country code and the
region code. Isn't there some region code with 4 letters in ISO 3166-2 that may
collide with ISO 15924 codes? I have only a partial list of ISO 3166-2; most of
its codes are 1 or 2 letters or digits.

All ambiguities could be avoided if an updated RFC 3066 with script codes said
that letter case is significant for distinguishing ISO 15924 script codes from
ISO 3166 country/area codes.

Still, ISO 3166-3 also contains 4-letter codes, which have legal use. Are they
allowed in RFC 3066 language tags?

All the new combinations cause a problem when one wants to support all the
forms:

<languagecode>-<COUNTRYCODE>
<languagecode>-<ScriptCode>
<languagecode>-<ScriptCode>-<COUNTRYCODE>
<languagecode>-<COUNTRYCODE>-<SUBCOUNTRYCODE>

It's impossible for a parser to distinguish them without compiling a list of
allowed codes (and the 3 ISO standards are open to extensions...), unless the
case distinction is made mandatory in RFC 3066 language tags.
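
To make the case-based approach concrete, here is a minimal sketch in Python
of a parser that classifies each subtag by letter case alone, with no code
lists. The function names and the "language"/"script"/"region" labels are
hypothetical, and the rule itself (lowercase = language, Titlecase with 4
letters = script, UPPERCASE or digits = country/region) is only the convention
proposed here, not something RFC 3066 specifies.

```python
def classify_subtag(subtag: str) -> str:
    """Guess the kind of a subtag from its letter case only (assumed convention)."""
    if not subtag.isascii() or not subtag.isalnum():
        raise ValueError(f"not an alphanumeric ASCII subtag: {subtag!r}")
    # All-uppercase letters (possibly mixed with digits) or pure digits:
    # treated as an ISO 3166 country or subdivision code.
    if subtag.isdigit() or subtag.isupper():
        return "region"
    # All-lowercase: treated as an ISO 639 language code.
    if subtag.islower():
        return "language"
    # Exactly 4 letters, first uppercase, rest lowercase: ISO 15924 script code.
    if (len(subtag) == 4 and subtag.isalpha()
            and subtag[0].isupper() and subtag[1:].islower()):
        return "script"
    raise ValueError(f"cannot classify subtag by case: {subtag!r}")


def parse_tag(tag: str) -> list[tuple[str, str]]:
    """Split a hyphen-separated tag and label every subtag by its case."""
    subtags = tag.split("-")
    labelled = [(classify_subtag(s), s) for s in subtags]
    if labelled[0][0] != "language":
        raise ValueError(f"tag must start with a lowercase language code: {tag!r}")
    return labelled


if __name__ == "__main__":
    for example in ("lo-LA", "lo-Latn", "lo-Latn-LA", "lo-LA-TN"):
        print(example, parse_tag(example))
```

With this convention "lo-Latn" and a hypothetical 4-letter region "lo-LATN"
parse differently, while a fully lowercased "latn" can no longer be recognised
as a script code at all.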

In that case, the Unicode 4 TUS specification, which says that language tags
should be lowercased, would be non-conforming in the context of RFC 3066
language tags, where case distinction becomes important as soon as an optional
script code can be used as a subtag.

If Unicode does not want to change the legacy practice of converting ISO 3166
country/region codes to lowercase, an exception could be made so that the ISO
15924 script code is NOT lowercased but is written in its normative titlecased
form.
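
As a rough illustration of that exception, the following Python sketch
lowercases every subtag except one that already has the titlecased 4-letter
shape of an ISO 15924 code. The function name is hypothetical, and the shape
test is an assumption: it does not check the subtag against the actual ISO
15924 code list.

```python
def lowercase_except_script(tag: str) -> str:
    """Lowercase all subtags, but keep a Titlecase 4-letter subtag unchanged."""
    result = []
    for subtag in tag.split("-"):
        looks_like_script = (
            len(subtag) == 4
            and subtag.isascii()
            and subtag.isalpha()
            and subtag[0].isupper()
            and subtag[1:].islower()
        )
        # Assumed rule: only the script-shaped subtag escapes lowercasing.
        result.append(subtag if looks_like_script else subtag.lower())
    return "-".join(result)


if __name__ == "__main__":
    print(lowercase_except_script("lo-Latn-LA"))
    print(lowercase_except_script("lo-LATN"))
```

Under this rule a script subtag stays distinguishable from a lowercased
4-letter country/region code even after the tag has been normalised.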
