From: "Doug Ewell" <[EMAIL PROTECTED]>
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> As I mentioned before, this will never happen, because even if an ISO
> 3166-2 region code did appear in a language tag (by registration, as
> John Cowan points out), the country and region would still be separated
> by a hyphen. The hypothetical region in Laos would be coded "LA-TN",
> and so the whole language tag would be "lo-LA-TN", distinguishable from
> "lo-Latn" regardless of capitalization.
>
> There is in fact no such region as LA-TN, but just for fun, I compiled a
> list of the codes that would be ambiguous if Philippe's hyphenless
> assumption were true. It's not a long list.
>
> CA-NS: Canada, Nova Scotia
> Cans: Unified Canadian Aboriginal Syllabics
>
> IT-AL: Italy, Alessandria
> Ital: Old Italic (Etruscan, Oscan, etc.)
>
> This might qualify as the first recorded frivolous use of ISO 15924
> codes.

Note that I'm focusing on problems that may arise from RFC 3066. There is in fact no problem with ISO 639, ISO 3166 or ISO 15924 taken in isolation. The problem is clearly in the ambiguous syntax of RFC 3066 once it is modified to include an optional script code followed by an optional country+region code.

OK, suppose now that a hyphen is required between a country code and a region code. Isn't there some region code with 4 letters in ISO 3166-2 that may collide with an ISO 15924 code? I have a partial list of ISO 3166-2; most codes are 1 or 2 letters or digits.

All ambiguities could be avoided if an updated RFC 3066 with script codes said that letter case is significant for distinguishing ISO 15924 script codes from ISO 3166 country/area codes. Still, ISO 3166-3 also contains 4-letter codes which have legal use. Are they allowed in RFC 3066 language tags?

All the new combinations cause a problem when one wants to support all the forms:

  <languagecode>-<COUNTRYCODE>
  <languagecode>-<Scriptcode>
  <languagecode>-<Scriptcode>-<COUNTRYCODE>
  <languagecode>-<COUNTRYCODE>-<SUBCOUNTRYCODE>

It's impossible for a parser to distinguish them without compiling a list of allowed codes (and the 3 ISO standards are open to extensions...), unless case distinction is made mandatory in RFC 3066 language tags.

In that case, the Unicode 4 TUS specification, which says that language tags should be lowercased, would be non-conforming in the context of RFC 3066 language tags where case distinction is important, as soon as an optional script code can be used as a subtag. If Unicode does not want to change the legacy practice of lowercasing ISO 3166 country/region codes, an exception could be made so that the ISO 15924 script code is NOT lowercased but is written in its normative titlecased form.
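
To make the case-based rule concrete, here is a minimal sketch in Python of a classifier that labels subtags from case and length alone, with no code lists. The function name `classify`, the labels, and the exact patterns are my own assumptions for illustration. They are not taken from RFC 3066 or any of the ISO standards, and the sketch does not check subtag order or validity.

```python
import re

def classify(tag):
    """Label each subtag of a language tag using only case and length.

    Assumed convention (hypothetical, for illustration only):
      - script code: 4 letters, titlecased (e.g. "Latn")
      - country code: 2 uppercase letters (e.g. "LA")
      - sub-country code: 1 to 4 uppercase letters or digits, after a country
    """
    language, *rest = tag.split("-")
    labels = [(language, "language")]
    seen_country = False
    for sub in rest:
        if re.fullmatch(r"[A-Z][a-z]{3}", sub):
            kind = "script"
        elif not seen_country and re.fullmatch(r"[A-Z]{2}", sub):
            kind = "country"
            seen_country = True
        elif seen_country and re.fullmatch(r"[A-Z0-9]{1,4}", sub):
            kind = "subcountry"
        else:
            # Case information is missing or the shape is unexpected.
            kind = "unknown"
        labels.append((sub, kind))
    return labels

print(classify("lo-Latn"))     # script recognized by its titlecase form
print(classify("lo-LA-TN"))    # country followed by sub-country
print(classify("lo-latn"))     # lowercased: the rule can no longer decide
```

Under these assumptions a 4-letter uppercase sub-country code stays distinct from a titlecased script code, but once the whole tag is lowercased the classifier has nothing left to go on, which is the conflict with the lowercasing recommendation described above.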