Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-18 23:37 From: Doug Ewell [EMAIL PROTECTED] To: [EMAIL PROTECTED] Bruce Lilly blilly at erols dot com wrote: If you can write a reasonable grandfathered production in ABNF that will allow this set of tags and no others, such that the ABNF can be used without also referring to the prose, then I salute you. If there really are only 24 items of less than 11 octets each, a trivial solution is to simply list them (with the usual ABNF syntax) as literal strings. That should take no more than a half-dozen lines. Listing the 24 literal strings doesn't seem like a particularly elegant solution. Perhaps it doesn't meet your subjective criteria for elegance. But it is a *reasonable* production that meets specific criteria, and that is what you asked for. A list of specific literal strings is not unusual (e.g. RFC 3464 sect. 2.3.3, RFC 3798 sect. 3.2.6, RFC 2156 (summarized in Appendix E)). Look, RFCs 1766 and 3066 both had ABNF that was insufficient to describe the range of valid language tags, and AFAIK they were not greatly criticized for this. [...] The same is true for RFC 3066bis. A crucial difference is that RFC 3066 and 1766 required registration before use, and community review before registration. If a tag were proposed that failed to meet some criteria not adequately detailed in the ABNF, the reviewer, the community, and the Area Director could explain the issue *before* the darned thing went into use. As that safety mechanism is being removed, it is more important that the specification be clear and precise and consistent. RFC 2231, which you have mentioned often in this thread, has the following as part of its ABNF: -begin pasted material- charset := registered character set name language := registered language tag [RFC-1766] -end pasted material- If this type of syntax specification is good enough for RFC 2231, why wouldn't it be good enough here? RFC 2231 isn't BCP and doesn't obsolete BCP; it does not remove any registration requirements. 
While it obsoletes another RFC (2184), it does not attempt to incorporate content of the obsoleted RFC or artifacts of its use by a vague reference. Reference to (unaffected) external specifications is fine; the draft uses RFC 2234 productions, for example, and that is not a problem. ___ Ietf mailing list Ietf@ietf.org https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-21 00:57 From: Doug Ewell [EMAIL PROTECTED] To: [EMAIL PROTECTED] The RFC 3066bis approach involves creating a registry of all the pieces that can make, or be combined to make, a language tag. This is much easier to implement and understand than chasing down the various standards and their history, and it permits stability that cannot exist if ISO maintenance agencies change their codes. Substituting a Numbers Authority for a Maintenance Agency might not solve the problem; indeed it may bring new problems. IANA isn't infallible, and has botched some registry entries. See http://mail.apps.ietf.org/ietf/charsets/msg01477.html for an example. Vernon Schryver [...] characterized debating RFC 3066bis (for over a year!) within the IETF-Languages group, and only presenting it to other groups during the Last Call period, as a process problem, OK. and charged this group with engaging in lawyerly talk such as whether 'accounts' is more appropriate than 'account' even though no such exchange ever took place (I checked the archives back to January 2002). No, he was referring to concurrent discussions on the IETF mailing list. Now Bruce wants us to wait a few more days before rolling out his suggestions to fix these perceived problems. This is a filibuster, an attempt to stall RFC 3066bis out of existence. I also (i.e. in addition to JFC) find that characterization offensive. I am responding to an IETF New Last Call in accordance with established procedures, and within the time period established. I had at one time entertained an informal approach to addressing the procedural issues, but given such an accusation, I am now inclined to use the formal procedure outlined in RFC 2026 section 6.5.2. ___ Ietf mailing list Ietf@ietf.org https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-18 20:33 From: Addison Phillips [wM] [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED], Bruce Lilly [EMAIL PROTECTED] CC: [EMAIL PROTECTED], [EMAIL PROTECTED] Reply to: [EMAIL PROTECTED] Hmm... That's as an editorial issue and not a technical issue. [...] The -CS subtag issue doesn't strike me as a technical issue with the draft. The draft stabilizes the meaning of subtags. There is a process in the draft for setting the initial (and thus stable) meaning of the -CS subtag. While it probably matters which value (Czechoslovakia or Serbia and Montenegro) that is selected, it is only of editorial interest to the draft itself... unless what Bruce is trying to prove is that stabilizing the meaning of the subtags is a Bad Idea, which I don't think is his point. I'm willing to entertain a debate about which meaning ought to be selected. But really it ought to be recognized as not an editorial issue with the draft and not a technical objection. I believe that it's more than an editorial issue, and that there are both technical and non-technical matters involved. While I wouldn't say that stabilizing the meaning of the subtags is a Bad Idea, I do believe that the particular approach taken raises some disturbing issues, and I suspect that there are process-related considerations that could have avoided them. Jefsey Morfin and Vernon Schryver have touched on procedural issues; I plan to discuss my specific concerns and suggestions, but it may take a few days due to the impending holidays and other work for me to collect and organize my thoughts on those matters.
RE: New Last Call: 'Tags for Identifying Languages' to BCP
On Sat, 18 Dec 2004, Brian Rosen wrote: I don't have any comment on the issue of language tags, but speaking as a reasonably avid ABNF hacker, I agree with Sam, and would not want to establish a convention that ABNF in IETF RFCs is expected to be precise. The counter-argument is the all-too-frequent occurrence when you deal with willful cretins who will *insist* that the specification says such-and-such when it really says the opposite, and will leap upon the most bizarre interpretation of text in order to bolster their arguments. This is unavoidable; however, it helps a lot if the ABNF firmly comes down on the side of the good guys. I've spent entirely too much of my life in the past few years fending off cretins, to agree knowingly to anything that makes me more vulnerable to them in the future. Nor do gentlemen's agreements work any more. We may all be (ladies and) gentlemen here, but out there there are individuals who are not. As painful as the process may be, I believe that the ABNF should be as tight as possible, preferably by ABNF rules but at least through ABNF comments. However, be careful about comments. I had one cretin insist that "between n and m inclusive" (where n and m were integers) had an implied restriction that n <= m, and that when n > m it meant an empty set of values. -- Mark -- http://staff.washington.edu/mrc Science does not emerge from voting, party politics, or public debate. Si vis pacem, para bellum.
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-14 16:01 From: Doug Ewell [EMAIL PROTECTED] To: [EMAIL PROTECTED] The grandfathered production in the RFC 3066bis ABNF is intended only for the 24 entries (not 46, as I wrote earlier) that are carried over from the RFC 3066 registry and that don't otherwise conform to the RFC 3066bis syntax. Take a look at the items marked grandfathered in the proposed registry: http://users.adelphia.net/~dewell/lstreg.html If you can write a reasonable grandfathered production in ABNF that will allow this set of tags and no others, such that the ABNF can be used without also referring to the prose, then I salute you. If there really are only 24 items of less than 11 octets each, a trivial solution is to simply list them (with the usual ABNF syntax) as literal strings. That should take no more than a half-dozen lines. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
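The "list them as literal strings" suggestion above can be sketched as follows. This is a minimal illustration, not the draft's actual production: SAMPLE_GRANDFATHERED is a hypothetical six-entry subset of tags registered under RFC 1766/3066, not the full set of 24 entries, and the helper names abnf_literal_rule and is_grandfathered are invented for this example.

```python
# Illustrative subset only: these tags were registered under RFC 1766/3066,
# but this is NOT the draft's complete list of 24 grandfathered entries.
SAMPLE_GRANDFATHERED = [
    "art-lojban", "en-GB-oed", "i-default",
    "i-klingon", "i-navajo", "zh-min-nan",
]


def abnf_literal_rule(name, literals, width=72):
    """Render `name = "a" / "b" / ...` as ABNF, wrapped at `width` columns."""
    quoted = ['"%s"' % lit for lit in literals]
    indent = " " * (len(name) + 1)
    lines = []
    current = "%s = %s" % (name, quoted[0])
    for q in quoted[1:]:
        piece = " / " + q
        if len(current) + len(piece) > width:
            # Start a continuation line; ABNF allows indented continuations.
            lines.append(current)
            current = indent + "/ " + q
        else:
            current += piece
    lines.append(current)
    return "\n".join(lines)


def is_grandfathered(tag, literals=SAMPLE_GRANDFATHERED):
    """Match a tag against the literal list, ignoring case as ABNF strings do."""
    return tag.lower() in {lit.lower() for lit in literals}


if __name__ == "__main__":
    print(abnf_literal_rule("grandfathered", SAMPLE_GRANDFATHERED))
```

With short tags, each wrapped line holds about four alternatives, so a list of 24 would come to roughly six lines, consistent with the "half-dozen lines" estimate in the message.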
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-15 14:41 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] [...] How is it possible to predict ahead of time what is the worst-case length for a RFC3066-registered language tag? In some contexts, the length is limited by the context (e.g. encoded-words, Content-Language fields in an Internet Message). Neither is possible. In light of that, I think it best to make sure implementers of the revised RFC 3066 be reminded that some implementations may impose limits (whether those implementers be constructing tags or passing them from one process to another), and for implementers to incorporate robustness into their implementations so that they can respond gracefully if an unexpectedly-long tag is encountered -- after all, no matter what limit could be imposed in a revision to RFC 3066, there's no way to stop malware from sending bad data. (How *do* encoded-word parsers react if a bogus charset or language tag that's 2k octets long is encountered? By definition, that cannot happen. No encoded-word may be longer than 75 octets. A sequence longer than that limit, even if it matches all other characteristics of an encoded-word, is treated as ordinary ASCII text (RFC 2047, section 6.1, paragraph marked (1)). No header field line may be longer than 998 octets (not counting the terminating CRLF pair), so 2k is simply not permitted. The encoded-word spec already allows for segmenting long strings; To be a bit more precise, it permits text to be encoded to be split across multiple encoded-words (with several restrictions); the encoded-words themselves cannot be in any way segmented or split. That is because an encoded-word is treated by a MIME-unaware application as a single RFC [2]822 word. could it not also be revised to allow segmenting for the parameters, which would also make it more robust?) If you're referring to RFC 2231 extensions to Content-Type and Content-Disposition field parameters, that's a separate matter. 
In general, though, as MIME has been around for more than a decade and Internet Messages for more than three decades, with a substantial installed base of interoperating implementations, in what has become one of the core Internet protocols, any changes would have to be backwards compatible or would have to be negotiated between sender and receiver at the same protocol level, or would require a lengthy transition period before pulling the rug out from under existing implementations. It's probably more likely that a separate next-generation system would be implemented first. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
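The 75-octet rule described above can be sketched as a small parser. This is a rough illustration under stated assumptions: the regular expression is a simplified stand-in for the RFC 2047/2231 token grammar, and parse_encoded_word is a hypothetical helper name, not part of any standard library.

```python
import re

MAX_ENCODED_WORD = 75  # total length limit for an encoded-word (RFC 2047)
MAX_LINE_OCTETS = 998  # header line limit, excluding CRLF, as cited above

# Simplified syntax: =?charset[*language]?encoding?encoded-text?=
# The character classes are an approximation of the real token rules.
_ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*\s]+)(?:\*(?P<language>[^?\s]+))?"
    r"\?(?P<encoding>[BbQq])\?(?P<text>[^?\s]+)\?="
)


def parse_encoded_word(token):
    """Return the parts of an encoded-word as a dict.

    Returns None when the token is too long or malformed, in which case
    the caller should treat it as ordinary text.
    """
    if len(token) > MAX_ENCODED_WORD:
        return None
    match = _ENCODED_WORD.fullmatch(token)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_encoded_word("=?US-ASCII*en?Q?Keith_Moore?="))
    oversized = "=?US-ASCII*" + "x" * 2048 + "?Q?a?="
    print(parse_encoded_word(oversized))  # too long, so not an encoded-word
```

Checking the length before anything else means a 2k-octet "language tag" never reaches the tag-handling code at all, which is the point made in the reply.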
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce == Bruce Lilly [EMAIL PROTECTED] writes: Bruce If there really are only 24 items of less than 11 octets Bruce each, a trivial solution is to simply list them (with the Bruce usual ABNF syntax) as literal strings. That should take no Bruce more than a half-dozen lines. Perhaps. I actually find a lot of ABNF specs are not as clear as they could be to humans because they are trying to describe the valid inputs as strictly as possible. In many cases I think the spec would be more clear if the ABNF were relaxed and other constraints were expressed at appropriate levels. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-15 13:22 From: John Cowan [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] The current process does *not* limit the length of non-private-use tags. It does by way of reviewer, community, and IETF Area Director review. But absolutely nothing except his good sense prevents Michael from registering en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholic-drug-users-who-live-in-flophouses. Aside from specific technical details already addressed by others, such a long name would certainly solicit the strong suggestion that the submitter should find a suitable shortened form, as such a long tag could not be used in an encoded-word. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
I don't have any comment on the issue of language tags, but speaking as a reasonably avid ABNF hacker, I agree with Sam, and would not want to establish a convention that ABNF in IETF RFCs is expected to be precise. One MUST read the text to understand what the limits of the syntax are. This is especially true with repetitions. It's usually tortuous to write ABNF that limits repetitions or string lengths. It's possible, but the result is very hard to understand. Brian -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sam Hartman Sent: Saturday, December 18, 2004 1:55 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP Bruce == Bruce Lilly [EMAIL PROTECTED] writes: Bruce If there really are only 24 items of less than 11 octets Bruce each, a trivial solution is to simply list them (with the Bruce usual ABNF syntax) as literal strings. That should take no Bruce more than a half-dozen lines. Perhaps. I actually find a lot of ABNF specs are not as clear as they could be to humans because they are trying to describe the valid inputs as strictly as possible. In many cases I think the spec would be more clear if the ABNF were relaxed and other constraints were expressed at appropriate levels. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
I am somewhat sympathetic to the idea of having some total limit (except for the late date for the proposed change). Earlier feedback would have been had if there had been some announcement of the proposed considerable changes on the ietf-822 mailing list, or via an IETF WG charter. This sort of thing is exactly why we last call non-WG documents for four weeks rather than two. Less review is assumed to have occurred and this may well mean the document is in some sense less done. So, while I know of no problems caused by inordinately long language tags, now that the issue has been brought up using this opportunity to add a max length restriction seems like a very reasonable thing to do. However, we got considerable pushback on having RFC 3066bis make any previously valid RFC3066 tag be invalid Entirely appropriate. And the proposed draft would invalidate the meaning of the valid RFC 3066 language tag sr-CS, which is currently in use. and any length restriction would do that. If it makes you happy, you can exclude private-use tags from an explicit limit. I would only suggest doing this if it helps us reach consensus. Ned
RE: New Last Call: 'Tags for Identifying Languages' to BCP
We (Mark and I) welcome the last call process and timelines and the feedback these generate. That's the whole point of having a Last Call. The -CS subtag issue doesn't strike me as a technical issue with the draft. The draft stabilizes the meaning of subtags. There is a process in the draft for setting the initial (and thus stable) meaning of the -CS subtag. While it probably matters which value (Czechoslovakia or Serbia and Montenegro) that is selected, it is only of editorial interest to the draft itself... unless what Bruce is trying to prove is that stabilizing the meaning of the subtags is a Bad Idea, which I don't think is his point. I'm willing to entertain a debate about which meaning ought to be selected. But really it ought to be recognized as not an editorial issue with the draft and not a technical objection. Best Regards, Addison Addison P. Phillips Director, Globalization Architecture http://www.webMethods.com Chair, W3C Internationalization Working Group http://www.w3.org/International Internationalization is an architecture. It is not a feature. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of [EMAIL PROTECTED] Sent: 20041218 15:41 To: Bruce Lilly Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP I am somewhat sympathetic to the idea of having some total limit (except for the late date for the proposed change). Earlier feedback would have been had if there had been some announcement of the proposed considerable changes on the ietf-822 mailing list, or via an IETF WG charter. This sort of thing is exactly why we last call non-WG documents for four weeks rather than two. Less review is assumed to have occurred and this may well mean the document is in some sense less done. 
So, while I know of no problems caused by inordinately long language tags, now that the issue has been brought up using this opportunity to add a max length restriction seems like a very reasonable thing to do. However, we got considerable pushback on having RFC 3066bis make any previously valid RFC3066 tag be invalid Entirely appropriate. And the proposed draft would invalidate the meaning of the valid RFC 3066 language tag sr-CS, which is currently in use. and any length restriction would do that. If it makes you happy, you can exclude private-use tags from an explicit limit. I would only suggest doing this if it helps us reach consensus. Ned ___ Ietf-languages mailing list [EMAIL PROTECTED] http://www.alvestrand.no/mailman/listinfo/ietf-languages
RE: New Last Call: 'Tags for Identifying Languages' to BCP
Hmm... That's as an editorial issue and not a technical issue. Addison Addison P. Phillips Director, Globalization Architecture http://www.webMethods.com Chair, W3C Internationalization Working Group http://www.w3.org/International Internationalization is an architecture. It is not a feature. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Addison Phillips [wM] Sent: 20041218 16:49 To: [EMAIL PROTECTED]; Bruce Lilly Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: RE: New Last Call: 'Tags for Identifying Languages' to BCP We (Mark and I) welcome the last call process and timelines and the feedback these generate. That's the whole point of having a Last Call. The -CS subtag issue doesn't strike me as a technical issue with the draft. The draft stabilizes the meaning of subtags. There is a process in the draft for setting the initial (and thus stable) meaning of the -CS subtag. While it probably matters which value (Czechoslovakia or Serbia and Montenegro) that is selected, it is only of editorial interest to the draft itself... unless what Bruce is trying to prove is that stabilizing the meaning of the subtags is a Bad Idea, which I don't think is his point. I'm willing to entertain a debate about which meaning ought to be selected. But really it ought to be recognized as not an editorial issue with the draft and not a technical objection. Best Regards, Addison Addison P. Phillips Director, Globalization Architecture http://www.webMethods.com Chair, W3C Internationalization Working Group http://www.w3.org/International Internationalization is an architecture. It is not a feature. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of [EMAIL PROTECTED] Sent: 20041218 15:41 To: Bruce Lilly Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP I am somewhat sympathetic to the idea of having some total limit (except for the late date for the proposed change). 
Earlier feedback would have been had if there had been some announcement of the proposed considerable changes on the ietf-822 mailing list, or via an IETF WG charter. This sort of thing is exactly why we last call non-WG documents for four weeks rather than two. Less review is assumed to have occurred and this may well mean the document is in some sense less done. So, while I know of no problems caused by inordinately long language tags, now that the issue has been brought up using this opportunity to add a max length restriction seems like a very reasonable thing to do. However, we got considerable pushback on having RFC 3066bis make any previously valid RFC3066 tag be invalid Entirely appropriate. And the proposed draft would invalidate the meaning of the valid RFC 3066 language tag sr-CS, which is currently in use. and any length restriction would do that. If it makes you happy, you can exclude private-use tags from an explicit limit. I would only suggest doing this if it helps us reach consensus. Ned
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-14 13:02 From: John Cowan [EMAIL PROTECTED] To: Addison Phillips [wM] [EMAIL PROTECTED] CC: [EMAIL PROTECTED] Addison Phillips [wM] scripsit: The IETF process is not really my concern. I will note that many IETF and non-IETF standards folks have participated in the process of developing and reviewing draft-langtags, though. Actually, we're all IETF people. If you're on an IETF mailing list discussing an IETF item of work like an RFC, you're part of the IETF. The process is designed to serve us, not vice versa. It's not quite that simple; IETF process has several specific requirements (as spelled out in RFC 2026) -- an IETF Working group requires a charter with a well-defined scope, specific milestones, etc. There is an official list of IETF working groups. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-14 23:35 From: John Cowan [EMAIL PROTECTED] To: Doug Ewell [EMAIL PROTECTED] CC: [EMAIL PROTECTED] Doug Ewell scripsit: * Region subtag 830, Channel Islands, is based on a UN M.49 code. Since that is an English-only standard, one must look elsewhere to find the French translation (it's not what you might expect, either). In fact, M.49 is available in all six official U.N. languages; it's just the *online* version that's English-only. No, I am quite certain (because I looked!) that the UN M.49 lists are available online as HTML-ized English and HTML-ized French. It's certainly not English-only (the online version might reasonably be called HTML-only, but that's another story). ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Peter Constable wrote: The definitions we have now will remain, they will continue to be referenced and available. I've no idea where you found en-NH. And what's the correct form, pt-TP or pt-TL ? And the fallback algorithm makes no sense for cases like en-US-boont, de-CH-1996, or se-Latn-AX, when en-boont, de-1996, or se-AX are available. Bye, Frank ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
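Frank's objection to the fallback algorithm can be illustrated with a short sketch. It assumes that the fallback in question works by repeatedly dropping the rightmost subtag; that is an assumption made for illustration, not a quotation from the draft, and truncation_fallback and lookup are hypothetical helper names.

```python
def truncation_fallback(tag):
    """Yield the tag itself, then each shorter form with the last subtag dropped."""
    subtags = tag.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags = subtags[:-1]


def lookup(requested, available):
    """Return the first available tag reached by truncating `requested`, else None."""
    by_lower = {tag.lower(): tag for tag in available}
    for candidate in truncation_fallback(requested):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return None


if __name__ == "__main__":
    print(list(truncation_fallback("en-US-boont")))
    # The variant tag en-boont is available but is never tried.
    print(lookup("en-US-boont", ["en-boont"]))
    # Falls through to plain "de" even though de-1996 is on offer.
    print(lookup("de-CH-1996", ["de-1996", "de"]))
```

Under that assumption the region subtag sits between the language and the variant, so truncation discards the variant first and can never produce en-boont or de-1996 as a candidate.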
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly The point is that under RFC 3066, the bilingual ISO language and country code lists are considered definitive. That is nowhere stated or even suggested in RFC 3066. RFC 3066 section 2.2 states, in part: - All 2-letter subtags are interpreted according to assignments found in ISO standard 639, Code for the representation of names of languages [ISO 639], or assignments subsequently made by the ISO 639 part 1 maintenance agency or governing standardization bodies. and has a similar statement regarding ISO 3166. interpreted according to assignments found in certainly sounds as if the ISO lists are considered definitive for their respective categories of subtags, since their interpretation is specified as that given in those lists. I don't see how the RFC 3066 text can be interpreted otherwise. You're now quoting things so far removed from their context that they are no longer being evaluated fairly. I believed we were talking about the specific strings, as you had made reference to implementers of bilingual products not having access to that data. Perhaps I misunderstood you, but whether or not, the relevant facts are that RFC 3066 referred to ISO source standards to establish the denotation of identifiers drawn from those standards, and the proposed revision does the same. Peter Constable Microsoft Corporation ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-13 02:05 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] If for whatever reason ISO and the UN decided that US should be used to designate the country of France[...] The only way that would be likely to happen would be if there were no longer a US *and* if the ISO and UN representatives of France were to initiate a request for such a change.[...] This scenario is not hypothetical; it actually occurred in the case of CS. In the case of CS, but *NOT* US a country had quite some time earlier ceased to exist. That is what makes your US scenario hypothetical. This is a situation we do not intend to repeat. That is precisely what would be repeated, and the problem would remain. CS currently means Serbia and Montenegro, and its use in accordance with RFC 3066 has precisely that meaning. Changing CS to mean something else at some future time (if/when the proposed draft goes into effect) would result in at least as many different definitions as exist at present, and adds yet another time epoch that needs to be considered in order to determine the meaning of CS. The usability flaw in treating ISO 639 and ISO 3166 as human-readable is evident in the confusion between ja and JP (or is it jp and JA?), [...] It is not uncommon for users to confuse JA and JP. Which clearly demonstrates why mere codes in the absence of definitions associated with the codes is a pointless proposition. And it illustrates the fact that the only practical way for a code to become associated with a particular piece of text is by way of the associated definition (or something derived from it) rather than directly. As for what is silly, if the UN country ID for Canada changed to CN (and that for PRC changed to something else)[] And it is precisely because of such problems that it is as unlikely to happen as your hypothetical FR-US change. Again, not hypothetical at all. 
Last time I checked, US didn't mean France, and CN didn't mean Canada -- I suggest that you might want to brush up on the definition of hypothetical, as it is difficult to have a rational discussion unless we're in agreement on basic definitions (just as it is difficult to have effective communications about what language is indicated by a code without agreement on the *definitions* of the codes). If you're really wanting to know what the meaning of CS would be per the proposed draft, the proposal is that it will forever remain valid with the meaning Czechoslovakia as it was originally defined in ISO 3166. But the current meaning under RFC 3066 is quite different. What about maintaining the stability of that meaning? I haven't specifically discussed display names; that is your assertion, and not my basis for objection. You didn't use the term display names, but it is clearly implied by your reference to bilingual implementations. Your inference (which you incorrectly claim as my implication) is different from my claim. My claim is that under RFC 3066, the definitions of the country and language codes is available in two languages (yes, it's true -- but irrelevant to that point -- that the IANA registered complete tags do not have that characteristic), and that the proposed registry would lack that characteristic of the current BCP (unnecessarily). I refer to the definitions and the need to map to and from those definitions at either end of the communications channel. Whether or not that happens by display is incidental to the issue of the number of languages that the definitions are provided in. Definitions in multiple languages are not a requisite to establishing the denotation of a coded element. True but irrelevant to the point. We now have definitions of specific types of elements (viz. country and language tags) in multiple languages, and the objection is to the unnecessary removal of that characteristic. 
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-13 01:05 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] RFC 3066 does not impose any restrictions on what its replacements might do. This is the case with any specification: a given technical specification is not a specification of human behaviour and cannot keep us from revising the spec or replacing it in any way we may choose. It's not clear exactly who is meant by us, but I'll leave that to a separate message. It is considered bad practice for a document which obsoletes another document to depend on the obsoleted document for definitions or other interpretation of the meaning of what is contained in the successor document. You have mentioned conflict with RFCs 2047 and 2231. RFC 2047 does not make reference to language tags. The ABNF of RFC 2231 does not impose any limit on the length of language tags. RFC does contain an implicit length issue in that it updates RFC 2047, allowing language tags within encoded words, but it does not explicitly identify any upper bound on the length of language tags. By reading both RFC 2047 and RFC 2231, one finds that they assume that a language tag must be at most 64 characters long: You have missed several important and not-so-subtle points. One of which is that RFC 2231 explicitly amends RFC 2047; it clearly so states in the first page heading and in the text, and is also indicated in the RFC Index. Another is that neither uses ABNF; both use EBNF as defined in RFC 822. More details on specific missed points below: - the shortest charset names are 2 characters long (e.g. IT) Not all charsets have 2-character names. Not all two-character names which might be assigned are suitable for MIME use. Where a preferred MIME name is indicated, that should be used. 
- the minimum encoded-text length is 1 character long That is strictly only true for text that meets all of the following conditions: a) is representable in a specified subset of ANSI X3.4, and therefore requires no encoding b) does not use any encoding, even if unnecessary c) does not use a charset and character sequence involving shift sequences (e.g. as in ISO 2022-like charsets) It also misses the point that using 76+ octets to represent a single octet is rather wasteful. Any use of B encoding will require a multiple of 4 octets of encoded text. Q encoding has some special cases, but typically requires 3 octets or more. An encoded-word must contain at least 11 characters that are not part of the language tag and have a total length of no more than 75 characters. Therefore, an upper bound on language tags that can be used in an RFC 2047/2231 encoded-word production is 64 characters. That is a best case upper bound, for text which requires no encoding at all, one character per encoded-word. In many cases, where the charset tag or encoding is longer, the upper bound on the length of language tags will be less, but the RFC gives no estimate or indication of how much less. The worst case appears to be the charset named Extended_UNIX_Code_Fixed_Width_for_Japanese (43 characters), which in fact uses ISO 2022-like sequences. That is the primary name for that charset; there is no preferred MIME alias, and the only other alias is the one specified for printer MIB use. Shifted characters are represented by two octets, each of which requires encoding. The shift sequences are 3 octets each, and RFC 2047 requires that an encoded-word begin and end in unshifted state. 
Therefore the minimum amount of encoded text for a single character in a shifted subset consists of an encoding of: a 3 octet shift sequence (one of which requires encoding), 2 octets representing the single character (both requiring encoding), and 3 octets restoring the unshifted state (one requiring encoding). Using B encoding results in 12 octets of encoded text as a minimum (Q-encoding would require a minimum of 16 octets). So a single character in a shifted subset of that particular charset, using B encoding, leaves at most 12 octets for a language-tag. As mentioned, use of an encoded-word plus the necessary whitespace around it to represent a single character is rather wasteful, so a brief language tag is indicated; fortunately ja suffices for text likely to be used with that charset. This is a constraint on an application of RFC 3066; it is not a constraint on RFC 3066 itself. It is possible that other applications of RFC 3066 may impose limits that may be longer or shorter than that imposed by RFC 2047/2231. Yes, and it is sometimes desirable to transfer text and tag from one application to another. For example, text in the body of a message can have language indicated by a Content-Language header field, where there is up to 997 octets available for a language tag. However a response regarding some portion of that message might well indicate the topic of the response in the response message's Subject field, where encoded-word limits apply. I
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly scripsit: I see no reason why limits must be added as a constraint in a revision of RFC 3066. The primary reason for specifying limits is due to the proposed removal of the review/registration process which currently limits the length of non-private-use tags. The current process does *not* limit the length of non-private-use tags. It's true that the process does not permit the registration of unlimited-length tags, as we do not have enough universe to represent them in full. But absolutely nothing except his good sense prevents Michael from registering en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholic-drug-users-who-live-in-flophouses. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan It's the old, old story. Droid meets droid. Droid becomes chameleon. Droid loses chameleon, chameleon becomes blob, droid gets blob back again. It's a classic tale. --Kryten, Red Dwarf ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of John Cowan But absolutely nothing except his good sense prevents Michael from registering en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholic-drug-users-who-live-in-flophouses. Sub-tags can be at most 8 chars long, so Michael would ask for it to be changed to something like en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholc-drug-users-who-live-in-flophses. :-) Peter Constable Microsoft Corporation
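For readers who want to check the 8-character rule mechanically, here is a minimal sketch. It assumes only the RFC 3066 syntax (a primary subtag of 1 to 8 letters, later subtags of 1 to 8 letters or digits); the function name offending_subtags is made up for this illustration and comes from no RFC or library.

```python
import re

# RFC 3066 syntax: the primary subtag is 1-8 letters,
# every later subtag is 1-8 letters or digits.
PRIMARY = re.compile(r"[A-Za-z]{1,8}")
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def offending_subtags(tag):
    """Return the subtags of `tag` that break the RFC 3066 syntax rule.

    Hypothetical helper for illustration only.
    """
    bad = []
    for index, part in enumerate(tag.split("-")):
        pattern = PRIMARY if index == 0 else SUBTAG
        if not pattern.fullmatch(part):
            bad.append(part)
    return bad


cowan_tag = ("en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-"
             "by-alcoholic-drug-users-who-live-in-flophouses")
constable_tag = ("en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-"
                 "by-alcoholc-drug-users-who-live-in-flophses")

print(offending_subtags(cowan_tag))      # the two over-long subtags
print(offending_subtags(constable_tag))  # nothing left to complain about
```

Note that the check says nothing about the overall length of the tag, which is the point under dispute in this thread: both tags above are more than 100 characters long.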
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-13 04:37 From: Mark Crispin [EMAIL PROTECTED] Silliness aside, the file may well have embedded language tags in the text of the file. Have you forgotten Plane 14? No, but I note that its introduction strongly discouraged its general use (specifically mentioning ACAP as the intended scope of usage, IIRC); the current version of the Unicode document continues that strong discouragement and further reinforces it by emphasis via italics. Another issue is that both RFC 3066 and the draft proposal call for language tags to be expressed in a subset of ANSI X3.4, corresponding to a subset of the first half of a particular Unicode plane -- and not plane 14. There may be an ambiguity as to whether such deprecated Unicode 3.x tags are in fact compliant with 3066 or the draft under discussion. I'm not eager to abolish uniqueness. There never was any guarantee that codes would never change. Both RFCs 1766 and 3066 specifically mention changes as a fact of life. That's what's now being fixed. No the problem will remain. Currently sr-CS has a specific meaning under RFC 3066; it has had for some time. For that meaning to remain stable, it will be necessary to take any change in the (current) meaning of the -CS part into account. I.e. for a future parse of language tags to do the right thing, it will have to recognize sr-CS generated under the RFC 3066 rules per the 3066/639 definitions. Why is this vestige of colonialism important in the IETF context? You seem to be making an incorrect assumption, one which renders your question meaningless. What magic attribute is there to French that provides definitiveness that is absent in English, or Mandarin, or Hindi, all of which are far more significant languages to the world? No such attribute of the language was claimed. It is the attribute of being used in the official ISO lists that provides the characteristic. 
A mandatory French translation to an English definition does not significantly increase the information content, and certainly does not double it. You are again making incorrect assumptions. The languages used in ISO documents are considered separate but equal, not a mandatory [...] translation of some other language. That is in fact why ISO is called ISO and not OIN or OIS -- you might wish to visit the ISO web site for details. [more nonsense about mandatory translation elided] You have not explained how the code came to be embedded within the text itself -- surely the author didn't say (or write, or sign) this text is in language QZ; most likely the language was indicated by name, or by some proxy representing the name (such as a locale). Plane 14. HTML and other markups. That provides no explanation of how a *code* came to be embedded in text -- authors in general do not refer to language by codes, and codes do not embed themselves by magic. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly This is a situation we do not intend to repeat. That is precisely what would be repeated, and the problem would remain. CS currently means Serbia and Montenegro, and its use in accordance with RFC 3066 has precisely that meaning. And that is a significant problem we wish to remedy as there is some unknown amount of data or implementations out there that use CS but with a different meaning intended. The usability flaw in treating ISO 639 and ISO 3166 as human-readable is evident in the confusion between ja and JP (or is it jp and JA?), [...] It is not uncommon for users to confuse JA and JP. Which clearly demonstrates why mere codes in the absence of definitions associated with the codes is a pointless proposition. I believe you have confirmed my point, that codes are not meant to be human readable. As for your concern regarding definition, it has been clearly pointed out that codes will not be lacking definitions -- the same definitions they have today from the same sources (with references made to the same sources) will still be available. Again, not hypothetical at all. Last time I checked, US didn't mean France, and CN didn't mean Canada -- I suggest that you might want to brush up on the definition of hypothetical... The case is hypothetical, but the hypothetical case serves to illustrate a general scenario, and the general scenario is not hypothetical. You didn't use the term display names, but it is clearly implied by your reference to bilingual implementations. Your inference (which you incorrectly claim as my implication) is different from my claim. My claim is that under RFC 3066, the definitions... You have failed to quote what you originally wrote which I claimed made this implication: you spoke not of definitions but of bilingual applications. Definitions in multiple languages are not a requisite to establishing the denotation of a coded element. 
True but irrelevant to the point. Oh? Simply because you make this assertion? We now have definitions of specific types of elements (viz. country and language tags) in multiple languages, and the objection is to the unnecessary removal of that characteristic. The definitions we have now will remain, they will continue to be referenced and available. I do not see how you say they are being removed? Peter Constable Microsoft Corporation ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly Currently sr-CS has a specific meaning under RFC 3066; it has had for some time. The meaning Serbia and Montenegro was introduced relatively recently (a little more than a year ago), was immediately received with alarm by many in the IT sector. There were vain attempts to get it reversed, and that failure was an impetus to introduce protection against such changes in the revision of RFC 3066. I am not aware of CS being used in the IT sector with the new meaning, though cannot guarantee that. Peter Constable Microsoft Corporation ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly By reading both RFC 2047 and RFC 2231, one finds that they assume that a language tag must be at most 64 characters long... - the shortest charset names are 2 characters long (e.g. IT) Not all charsets have 2-character names... In determining the longest language tag permitted, one must identify the shortest possibilities for all other components. - the minimum encoded-text length is 1 character long That is strictly only true for text that meets all of the following conditions... Hey, I just said what the EBNF said. An encoded-word must contain at least 11 characters that are not part of the language tag and have a total length of no more than 75 characters. Therefore, an upper bound on language tags that can be used in an RFC 2047/2231 encoded-word production is 64 characters. That is a best case upper bound... I identified it as such. The worst case appears to be the charset named Extended_UNIX_Code_Fixed_Width_for_Japanese (43 characters)... As mentioned, use of an encoded-word plus the necessary whitespace around it to represent a single character is rather wasteful, so a brief language tag is indicated; fortunately ja suffices for text likely to be used with that charset. Of course, the length limitations must be balanced between the charset tag, the language tag and the encoded-word itself. I see no reason why limits must be added as a constraint in a revision of RFC 3066. The primary reason for specifying limits is due to the proposed removal of the review/registration process which currently limits the length of non-private-use tags. The review/registration process for RFC 3066 registrations does not impose pre-defined limits that implementers of RFC 3066 can assume in their parsers. 
It would be a good idea, however, to point out in section 2.1 of the draft that some applications of this specification may impose limits on the length of accepted language tags, and perhaps to cite RFC 2231 as an example. As a general principle, that's fine, however I would point out that given the inability of experts to be able to accurately point out the limits quickly... I do not think it is sufficient merely to state the fact that there are limits, with or without a pointer to RFC 2231 as an example. Some indication of the magnitude of worst-case restrictions is at least advisable... How is it possible to identify what is the worst-case bound assumed in implementations that are out there? How is it possible to predict ahead of time what is the worst-case length for a RFC3066-registered language tag? Neither is possible. In light of that, I think it best to make sure implementers of the revised RFC 3066 be reminded that some implementations may impose limits (whether those implementers be constructing tags or passing them from one process to another), and for implementers to incorporate robustness into their implementations so that they can respond gracefully if an unexpectedly-long tag is encountered -- after all, no matter what limit could be imposed in a revision to RFC 3066, there's no way to stop malware from sending bad data. (How *do* encoded-word parsers react if a bogus charset or language tag that's 2k octets long is encountered? The encoded-word spec already allows for segmenting long strings; could it not also be revised to allow segmenting for the parameters, which would also make it more robust?) Peter Constable Microsoft Corporation ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
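The length arithmetic traded back and forth in this exchange can be reproduced with a short sketch. It assumes the RFC 2231 form of an encoded-word, "=?" charset "*" language "?" encoding "?" encoded-text "?=", and the 75-character limit from RFC 2047; the function names are invented for this illustration.

```python
import math

MAX_ENCODED_WORD = 75  # RFC 2047 limit on the length of one encoded-word


def max_language_tag_len(charset, encoded_text_len):
    """Characters left for the language tag in one RFC 2231 encoded-word."""
    # Fixed delimiters: "=?" + "*" + "?" + one-letter encoding + "?" + "?="
    fixed = len("=?") + len("*") + len("?") + 1 + len("?") + len("?=")
    return MAX_ENCODED_WORD - fixed - len(charset) - encoded_text_len


def b_encoded_len(octets):
    """Length of the B (base64) encoding of `octets` raw octets."""
    return 4 * math.ceil(octets / 3)


def q_encoded_len(literal_octets, escaped_octets):
    """Length of a Q encoding: literals take 1 character, escapes take 3."""
    return literal_octets + 3 * escaped_octets


# Best case: 2-character charset name, 1 character of unencoded text.
print(max_language_tag_len("IT", 1))

# Worst case from the thread: one shifted character is 3 + 2 + 3 = 8 octets,
# of which 1 + 2 + 1 = 4 need escaping under Q encoding.
worst_charset = "Extended_UNIX_Code_Fixed_Width_for_Japanese"
print(max_language_tag_len(worst_charset, b_encoded_len(8)))
print(max_language_tag_len(worst_charset, q_encoded_len(4, 4)))
```

With these assumptions the best case leaves 64 characters for the tag, and the worst case leaves 12 with B encoding, matching the figures quoted above. The Q-encoding case leaves 8, a figure derived here from the 16-octet minimum rather than stated in the thread.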
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Deborah Goldsmith scripsit: And here's hoping they go to four digits or otherwise extend the scheme instead of recycling when they run out, a non-hypothetical issue if they're already up to 891. Not much to worry about. Of 1000 possible codes, currently 232 are assigned to countries, 32 are assigned to regions, and 10 are retired, leaving 726 codes yet to be assigned. I think that provides a comfortable pad for the future. It's not obvious on what principles, if any, the individual codes were assigned. For the fun of it, the retired codes represent Czechoslovakia, Ethiopia+Eritrea (then known as Ethiopia), East Germany, West Germany, the Netherlands Antilles, the Pacific Islands Trust Territories (now split up), the USSR, Yemen, Democratic Yemen, and Yugoslavia. -- Do NOT stray from the path! John Cowan [EMAIL PROTECTED] --Gandalf http://www.ccil.org/~cowan ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
Mark and I have both worked extensively with time zone issues, so we're aware of the potential problems. RFC 3339 would be an appropriate substitute: its full-date production describes the ISO 8601 profile used by the draft. I would also tend to agree that lack of a timezone would be ambiguous in most applications. However, for this use I think that: a) the dates indicate the date of accession of each subtag to the registry. These dates will all be in the past. Since the registry itself is versioned and has its own date record, the question of time zone is probably not important because implementations will use their registry date and not an arbitrary date to determine compatibility. That is: the dates will all be used in the same context with one another. b) we can safely assume (or explicitly state) the use of UTC time based on the above. Best Regards, Addison Addison P. Phillips Director, Globalization Architecture http://www.webMethods.com Chair, W3C Internationalization Working Group http://www.w3.org/International Internationalization is an architecture. It is not a feature. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Joe Abley Sent: 20041213 17:51 To: Peter Constable Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP On 13 Dec 2004, at 18:34, Peter Constable wrote: 3. Re ISO 8601 time/date format: What is used in the registry is dates expressed in the format YYYY-MM-DD. It was agreed that it would be better to identify the format precisely rather than make the generic reference to ISO 8601. Why not require dates to be formatted as per RFC 3339? In general, YYYY-MM-DD is ambiguous unless a timezone is specified. Joe ___ Ietf-languages mailing list [EMAIL PROTECTED] http://www.alvestrand.no/mailman/listinfo/ietf-languages
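To make the RFC 3339 suggestion concrete, here is a small sketch of how a registry consumer might read such a date. It assumes the full-date shape (four-digit year, two-digit month, two-digit day, hyphen separated) and follows Addison's proposal to treat the value as UTC; the function names are hypothetical and nothing here is taken from the draft itself.

```python
import re
from datetime import date, datetime, time, timezone

# RFC 3339 full-date: 4-digit year, 2-digit month, 2-digit day.
FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_registry_date(text):
    """Parse a full-date string; raise ValueError if it is not one."""
    if not FULL_DATE.fullmatch(text):
        raise ValueError("not a full-date: %r" % text)
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)  # also rejects dates like 2004-02-30


def as_utc_midnight(day):
    """Pin a bare date to 00:00 UTC (an assumption, per the thread)."""
    return datetime.combine(day, time.min, tzinfo=timezone.utc)


added = parse_registry_date("2004-12-13")
print(added.isoformat(), as_utc_midnight(added).isoformat())
```

Comparing two registry dates with each other needs no time zone at all, which is Addison's point (a); the UTC step only matters when a registry date is compared against a timestamp from somewhere else.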
Re: New Last Call: 'Tags for Identifying Languages' to BCP
From: John Cowan ... For example, I'm unhappy about an apparent sentiment that would put ABNF on a lower footing than the English text. I think I'm like most implementors and perhaps unlike non-engineers in reversing that precedence. Whenever I read an RFC, I rely first and foremost on the ABNF. I use the English only for hints, and follow the ABNF instead of the English whenever there is a conflict. Then you would be incapable of implementing any programming language compiler, or an XML parser, for the specs for these things include literally hundreds of constraints that are specified only in technical English and not in the BNF. As far as the BNF is concerned, this is good sound C: main(argv, argc) { float Argv; int* Argc; print(32); } In contexts other than UNIX applications with modern compilers, that fragment is perfectly sound, if not something I'd write. An example context is before typing of formal args and in what ANSI/ISO 9899-1990 calls a freestanding environment where main() is not special. I've suppressed most of the memories, but I seem to recall that what Microsoft calls threaded WIN32 applications are such things, or were before the POSIX additions. Besides, I didn't say that one should ignore the English, but that implementors give precedence to the ABNF. When you are writing an RFC that you hope will be implemented, you MUST remember that programmers are lazy. We transliterate the ABNF to build the parser and so implement the syntax and read the English to figure out and so build the semantics. As I said, if you must have contradictions between your ABNF and your English, you must accept the fact that most technical people will assume your ABNF is right and your English is wrong. That fact seemed to me to conflict with statements in this thread, and that suggests a problem in your working group and your RFC. Vernon Schryver [EMAIL PROTECTED]
Re: New Last Call: 'Tags for Identifying Languages' to BCP
From: Vernon Schryver [EMAIL PROTECTED] Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP To: [EMAIL PROTECTED] Message-ID: [EMAIL PROTECTED] Besides, I didn't say that one should ignore the English, but that implementors give precedence to the ABNF. When you are writing an RFC that you hope will be implemented, you MUST remember that programmers are lazy. We transliterate the ABNF to build the parser and so implement the syntax and read the English to figure out and so build the semantics. As I said, if you must have contradictions between your ABNF and your English, you must accept the fact that most technical people will assume your ABNF is right and your English is wrong. That fact seemed to me to conflict with statements in this thread, and that suggests a problem in your working group and your RFC. This is somewhat moot since the author has indicated the relevant portion of the ABNF will be revised. In this case, though, the ABNF could not be said to be in contradiction with the English prose: anything permitted by the constraints specified in the English prose would be recognized using the ABNF. It is true that there are strings that could be recognized by the ABNF that would not be permitted by the English prose, but the revision being made to make the ABNF production in question match what Bruce Lilly thought it should be does not change that. The only way to write the ABNF in a way that it permits no more and no less than what is specified by the English prose would be to have the production rule simply enumerate a specific set of terminal strings, which does not seem to be particularly helpful, especially when the RFC would establish a machine-readable registry maintained by IANA in which those very strings are enumerated. Peter Constable
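The division of labour Peter describes, a permissive grammar plus an enumerated registry, can be sketched as a two-stage check. Everything named here is an assumption made for illustration: the regular expression mirrors the RFC 3066 shape rather than the draft's revised ABNF, SAMPLE_REGISTRY holds a few tags registered under RFC 3066 and is nowhere near the full list, and classify is a made-up function.

```python
import re

# Stage 1: the loose syntactic shape (RFC 3066 style), a superset of
# what the prose allows.
SYNTAX = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Stage 2: an enumerated list. Illustrative sample only; a real
# implementation would load the machine-readable IANA registry.
SAMPLE_REGISTRY = frozenset({"i-default", "i-klingon", "i-navajo", "en-gb-oed"})


def classify(tag):
    """Say whether `tag` fails the grammar, passes it, or is also listed."""
    if not SYNTAX.fullmatch(tag):
        return "malformed"
    if tag.lower() in SAMPLE_REGISTRY:  # tags are compared case-insensitively
        return "registered"
    return "well-formed but not registered"


for tag in ("i-klingon", "i-madeup", "i_klingon"):  # "i-madeup" is invented
    print(tag, "->", classify(tag))
```

The middle outcome is the gap under discussion: a string the grammar recognizes but the prose does not permit. Enumerating the literals in the ABNF, as Bruce Lilly proposes, would move stage 2 into the grammar; keeping them in the registry leaves the grammar short.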
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Hi - Perhaps it would be useful to consider http://www.ietf.org/IESG/STATEMENTS/pseudo-code-in-specs.txt Randy From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 2:16 PM Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP This is somewhat moot since the author has indicated the relevant portion of the ABNF will be revised. In this case, though, the ABNF could not be said to be in contradiction with the English prose: anything permitted by the constraints specified in the English prose would be recognized using the ABNF. It is true that there are strings that could be recognized by the ABNF that would not be permitted by the English prose, but the revision being made to make the ABNF production in question match what Bruce Lilly thought it should be does not change that. The only way to write the ABNF in a way that it permits no more and no less than what is specified by the English prose would be to have the production rule simply enumerate a specific set of terminal strings, which does not seem to be particularly helpful, especially when the RFC would establish a machine-readable registry maintained by IANA in which those very strings are enumerated. ...
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly scripsit: There is in fact an ietf-languages list; RFC 3066 and the draft under discussion give its submission mailbox as [EMAIL PROTECTED], which makes finding the real list an exercise since IANA's web site makes no mention of any mailing lists. I made an educated guess that I might find the list at alvestrand.no, and indeed the list submission mailbox is [EMAIL PROTECTED]. Both addresses seem to work for posting to this list. Not all IETF lists have ietf.org or iana.org mailing addresses anyhow; consider [EMAIL PROTECTED], which is not a W3C mailing list but an IETF one. The draft in question apparently seeks to get IANA into the business of defining countries (and languages), usurping those roles from ISO (as also noted in RFC 1591). This is doubly incorrect. To begin with, ISO defines neither countries nor languages. UNSD defines country-like objects for its purposes and assigns them numeric codes, specifying them using English and French names. Then ISO assigns alphabetic identifiers to the names. Languages are not defined at all; ISO assigns alpha and numeric identifiers to certain words which it believes to be the names of languages, without always specifying exactly which language among those so named is meant. Note the titles of the ISO 3166 and 639 standards. The proposed registry will merely serve to stabilize the ISO mappings, making it less likely that they will be gratuitously changed, because it will not be under the control of an MA with unfettered discretion to make changes. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan On the Semantic Web, it's too hard to prove you're not a dog. --Bill de hOra
Re: New Last Call: 'Tags for Identifying Languages' to BCP
There is a fundamental misunderstanding on two points. 1. Of course countries go in and out of existence, and change their borders; nobody disputes that. That is not the stability problem in question; it is where the meaning of tags changes so drastically as to refer to a completely different country. One can't willy-nilly change data that has significant effects on databases all over the world; when someone's birthplace is indicated by a stored country code, for example, it mustn't suddenly designate a different country! For more, see http://www.unicode.org/consortium/positions.html. 2. The fact that the 3066 registry is not in multiple languages (either currently or in the new draft) has nothing to do with any alleged discouragement of any language, French included. The names in the registry are simply to distinguish and identify the subtags, not to provide recommended localizations. The registry, and for that matter the ISO 639/3166 standards, are the wrong place for localization data. The language coverage (only 2!) is a very small fraction of what is really needed for any real product development -- and even for those languages that are present, the names used there are not optimal for user interfaces since they are sometimes not the customary form. For an example of a data repository that is designed for localization of language/region names, see http://www.unicode.org/cldr/. Mark - Original Message - From: Bruce Lilly [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Sunday, December 12, 2004 08:46 Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP Date: 2004-12-10 22:37 From: John Cowan [EMAIL PROTECTED] Bruce Lilly scripsit: It's not clear to me that the proposal will provide protection against the whims of politicians. If the definition of CS as a country code changes again under the proposed scheme, how is one to determine specifically what some archived language-tag referred to at some point in time? 
I'm not particularly concerned about that problem, as I am resigned to instability associated with anything specified by politicians (and that includes the UN region codes). The U.N. Statistics Division are only politicians in the sense that IETF WG members are. They are, in fact, statisticians. Their track record for stability is considerably longer than the IETF's. I hope that I need not repeat any of the well-known remarks about statistics. Nor that I need point to the many uses by politicians of statistics (and statisticians) for political purposes. Moreover, the point is that countries do change, and that use of country codes (as provided for in RFC 3066 and in the proposed draft) carries with it the inherent instability which is characteristic of politics. A quest for stability of countries seems Quixotic and oxymoronic. According to the principle of stability as that term is used in defense of the draft, I suppose we're all intended to refer to Malawi as Rhodesia because that's what it (in part) was called 50 years ago, or that we're supposed to ignore the breakup of the USSR, Yugoslavia, etc., the reunification of Germany, etc. A related problem with the use of country codes in language tags is that there is not necessarily an inherent relationship between language and country borders. The borders of Germany have changed many, many times. If one is referring to the German language as spoken by inhabitants of Alsace, using country codes would imply that that same language spoken by the same people would have been tagged at various times as de-DE and de-FR according to where the France-Germany border happened to have been determined by politicians of the time. That strikes me as being a rather silly way to tag language, but that's the precedent set by RFC 1766. 
As far as I can tell, the draft doesn't really deal with the issue of changing borders or changing country names -- it merely pretends that these things don't happen by attempting to declare a snapshot of the status at some point in time as being valid for all time. But if the proposed new registry's description of CS says foo and the ISO standard code list says bar, what's an implementor supposed to present to a user as *the* description associated with CS? The former. That's the whole point of having a registry. But the user has indicated that he speaks French, and the proposed registry contains a description in English only. Where is the implementor supposed to get the *official* translation for display? N.B. under the current (RFC 3066) situation, the definitive ISO lists provide an official description in French. One possibility would be two description fields. Why two? There are now two in the ISO lists (and, as noted, in the UN list). I have no objection to more, but I object to a reduction. The text accompanying the new last call states: This specification addresses each of these issues with a simple, elegant design
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly scripsit: There is a limited supply of 2-letter codes and the supply of 3-digit codes is only slightly greater. Reassignment of codes from such a limited supply is inevitable. In the very long run, yes; but even the 75-octet limit probably won't stand in the *very* long run. Countries and languages, as opposed to codes for them, don't come and go like IETF protocols: many of them have centuries of history, or half a century in the case of the post-colonialist countries; the events of 1991-93 were historically anomalous. Too late. King Canute commands the tide not to come in, but his feet still get wet. Canute was making a moral object lesson about the limitations of kingship, not acting like an idiot. But I'm not concerned with translations, but with the definitions. And currently the definitions are available in French and English. What of it? In what case does the provision of a French name significantly tighten the definition provided by the English name (or for that matter vice versa)? Removing that requirement [for registration] -- as the draft would do -- necessitates a specific upper bound on tag length that will work with existing core protocols, to replace the reviewer, Area Director, and community review process that ensure that current registered tags work with those protocols. Michael, I assume you're ignoring this kerfuffle, and rightly so. But for the record, have you ever been given cause to take into account a hard limit in the length of language tags? -- Here lies the Christian,John Cowan judge, and poet Peter, http://www.reutershealth.com Who broke the laws of God http://www.ccil.org/~cowan and man and metre. [EMAIL PROTECTED] ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly I don't know where the statement accompanying the announcement came from, According to the New Last Call issued by the IESG Secretary, the text is Author's discussion of drivers for this work. You singled out that one point to comment on as though it were the main factor. I mentioned a matter which was repeatedly indicated as a factor for existing implementations and with which I strongly disagree. You have not responded to the point that accessibility of source ISO standards is supposed to be a major factor, yet the draft itself clearly indicates otherwise. [regarding the proposed registry vs. internationally- standardized ISO lists for subtag definitions] It is certainly the case that only it should be consulted for determining what sub-tags are valid with what denotation, which was the intent. That is a problem for existing implementations of RFC 3066 tags, which can obtain official, internationally agreed descriptions of the codes in two languages. Descriptions (language names) are beyond the scope of RFC 3066. It is a non sequitur to claim that this draft creates a problem for existing implementations of RFC 3066 on this basis. By looking in the sub-tag registry. If ISO changed the meaning of US to something other than what it is now, its meaning for purposes of use in an IETF language tag would not change, because it would remain stable in the sub-tag registry. You would be fairly well protected against the whim of politicians. OK, continuing your hypothetical example and its relationship to language, suppose that there is another civil war and that what now corresponds to US is split into Blue America and Red America. Further suppose that in due course ISO assigns some other code to one of those countries and retains US for the other, and that that happens after the proposed registry is set up with a definition for US and some description referring to the old use. 
That is a scenario that has been well considered: it would be very bad IT practice to redefine a metadata tag US to have a narrower denotation than it previously did, as that immediately breaks an unknown amount of existing data. If ISO were to make such a change in the meaning of US, then IT implementations *absolutely should not* follow suit; the ID US must retain its prior, broader meaning. Now suppose that one wishes to produce an appropriate language tag for the text moral values (which clearly has different meaning in Blue America (telling the truth, admitting to mistakes, etc.) and in Red America (imposing totalitarian control over others)). How specifically would the proposed registry handle such a change in the meaning of US, and how would the registry help differentiate the meaning of a 1990s en-us tag from that of the hypothetical time described? It would leave US with its historic meaning, so that existing data is left intact. (You wouldn't want a document containing moral values created on the eve of the civil war by someone supporting the Blue America side of the divide to suddenly get assigned an interpretation of 'imposing totalitarian control over others'.) New identifiers would be assigned for use in IT applications to designate the two new countries. you already have to look beyond the ISO standards for anything more than English and French But existing RFC 3066 implementations can get official descriptions in *both* of those languages; the proposal would adversely affect those existing implementations by eliminating the French description. Of course, it is a more serious defect of the proposal that it would fail to reflect internationally-agreed codes and would fail to keep pace with changes... it would not be new that you have to look beyond the registry itself to decide what human-readable descriptors you should provide in a product. It would be new that one could not find a standard (i.e. 
official) French-language description in the list of codes. Incorrect. The registry for RFC 3066 did not provide a language/country name in *any* language for any ISO 639 or ISO 3166 identifier. Tags registered under RFC 3066 included an English-language name and an ASCII-transcription of the indigenous name; they did not contain French-language names. Again, you are trying to impose UI-localization concerns that have always been out of scope for the RFC 1766/3066/... sequence of specifications. One possibility would be two description fields. But the registry would need a charset closer to ISO-8859-1 than to ANSI X3.4 as currently specified. Or an encoding scheme. Personally, I don't see the value in something like that. Given the intent to have a registry that can be machine-readable, changing its charset from ANSI X3.4 in order to gain descriptors in just one more language is not worth it IMO. Fine,
Re: New Last Call: 'Tags for Identifying Languages' to BCP
And here's hoping they go to four digits or otherwise extend the scheme instead of recycling when they run out, a non-hypothetical issue if they're already up to 891. Deborah Goldsmith Internationalization, Unicode liaison Apple Computer, Inc. [EMAIL PROTECTED] On Dec 13, 2004, at 6:11 AM, John Cowan wrote: UNSD historically has assigned new numerical codes when new countries come into existence, and has managed to avoid reusing any of its 3-digit identifiers ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
On Sun, 12 Dec 2004, Peter Constable wrote: That is not at all the aim here wrt stability; rather, the aim is that a symbolic identifier used for metadata in IT systems not change because some government on a whim says, We would now prefer to use 'yz' rather than 'xy' to designate our country. This point needs to be stressed. If this registry does not do it, we'll need to create a new one which does. If anything, I am inclined to object to two: to avoid an Anglo-Franco colonial bias, Bravo! If there were to be just two languages, it would need to be Mandarin Chinese as primary entry, and English as secondary entry. either there is one name that is simply a reference name, or the registry be designed so that it could accommodate names in as many languages as may be available. In order to accommodate the Francophiles, we would need first to accommodate several other languages of greater international prominence than French; and by that point the registry would be so unwieldy as to be useless. Even worse is the matter of coordinating all these various descriptions and what happens when (not if) an ambiguity is created because the Lower Slobbovian version means something different than the English version? Among other things, that means that a developer in Lower Slobbovia can't use an abridged version of the registry that only has the Lower Slobbovian descriptions, because if he is unaware of the other texts he may make an unwarranted assumption as to the meaning of that description. What is done when (not if) international politics rears its ugly head? We have numerous instances where the name of a language is official in one place, and highly-offensive in another place. What's more, all of that effort is for naught, since the only thing that matters is the tag, a machine-readable token intended to identify a language, and not the description. 
RFC 3066 *does not at any point* suggest let alone state that implementations should use ISO 639 language names or ISO 3166 country names for UI purposes. IMO, you are creating an issue where none exists. Bravo! Another point which bears emphasis; these are machine-readable tags for the purpose of software, not user interface elements. IETF language tags are used in a wide variety of applications. The parties involved in development of this spec (the authors and others) have examined these issues for the past several years and have arrived at this architecture. And have done a fine job at it. -- Mark -- http://staff.washington.edu/mrc Science does not emerge from voting, party politics, or public debate. Si vis pacem, para bellum. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
On Sun, 12 Dec 2004, Bruce Lilly wrote: If by international agreement, 'yz' becomes the designation for that country, then it is rather silly to stick one's fingers in one's ears and shout NA-NA-NA-NA-NA I don't want to hear you. What is silly is saying that every language tag has to have a date/time attribute associated with it so that computer software managing that text knows the language of that text. But that is precisely what you are advocating. It's rather silly to change that correspondence simply because a few people are piqued that international agreement has been reached to change a few 2-letter codes. It's bad enough that TLDs get recycled. It is a disaster for language identifiers to get recycled. Something has to make those identifiers unique. Your notion will force the inclusion of a date/time stamp in language tags, to restore the uniqueness that you are so excruciatingly eager to abolish. Never mind the shortcomings of that particular example; consider de-DE -- does that mean Germany as it exists today, West Germany as it existed 25 years ago, Germany as it existed in the 1930s, the 1900s, ...? For the 98% case, it does not matter at all. But it does matter if, one day, DE becomes Denmark. As far as I can tell, the draft pretends that the meaning of CS hasn't changed, and would in fact change the meaning of the currently valid RFC 3066 language tag sr-CS. No, it restores the previous meaning of sr-CS. It is very different; under the proposed draft, there is only an English definition, somebody wishing to provide a French definition finds that he has none and must resort to an unofficial translation. Why is the situation for French different from somebody wishing to provide a Lower Slobbovian definition? So where are the French definitions? Ask a person who is bilingual in English and French to provide one. Well, sure. But the name is an important thing by itself. 
It is rather pointless to ask a user to indicate the language of a piece of text by selecting from a list AB, ACE, ACH,..., ZHA, ZUL, ZUN -- the user doesn't normally refer to languages by codes. It's quite a different matter to ask the user to select from Abkhaze, Aceh, Acoli,..., Zhuang (Chuang), Zoulou, Zuni. Abkhaze, Aceh, Acoli,..., Zhuang (Chuang), Zoulou, and Zuni are not language tags. So what's your point? Note that the RFC 3066 specifies a registry that does not include French language names. I suggest that this issue should be dropped. Yes, the current IANA registry has that problem for the non-ISO-based tags only. If the registry is to be changed to subsume ISO codes as well, that defect should be remedied. Why is it a problem? Why is it a defect? On the contrary, it is preposterous to suggest that codes will be attached to text by magic Here is where you are misled. Many of these tags are embedded within the text itself. That text may long outlive its author in an archive. My concern is the elimination of the French definition in the first place. Why is this a problem? -- Mark -- http://staff.washington.edu/mrc Science does not emerge from voting, party politics, or public debate. Si vis pacem, para bellum. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly [EMAIL PROTECTED] wrote on 2004-12-12 T 18:44:27 -0500 (...) RFC 1766 (and 3066) leave you little choice; if you wish to indicate a region, you either have to do it with ISO 639 codes or you have to register a separate tag (no separate tag for German as spoken in Alsace exists). Never mind the shortcomings of that particular example; consider de-DE -- does that mean Germany as it exists today, West Germany as it existed 25 years ago, Germany as it existed in the 1930s, the 1900s, ...? As far as I can tell, this is about /language/ tags, not about the tagging of borders or nationalities. These last two certainly have influence on the use of language, but should not on the name and tagging of the language itself. Unless there is some forced change of language, which needs to be reflected in tagging. (...) On the contrary, it is preposterous to suggest that codes will be attached to text by magic; some human somewhere, somehow is going to have to indicate the language to something, Tagging may also be done by some software instead of a human being. Whether you consider that magic is up to you, but I think both ways of finding a tag for a given text are possible. and it certainly isn't going to be by way of a 2- or 3-letter code without some reference to what those codes *mean*. I beg to differ. Many humans do not know the lists, not even their name, and yet they use the codes on a daily basis. They simply recall the codes they've seen and found relevant to them, like the TLD ones they are used to; or they are told by somebody use this tag when using language A and that tag when using language B. This information may be wrong or outdated, of course. Please do not assume tags will be assigned only by humans who have a recent list of the code(s) at hand. Just my 0.02. Best regards, J. Wilkes ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly The point is that under RFC 3066, the bilingual ISO language and country code lists are considered definitive. That is nowhere stated or even suggested in RFC 3066. RFC 3066 section 2.2 states, in part: - All 2-letter subtags are interpreted according to assignments found in ISO standard 639, Code for the representation of names of languages [ISO 639], or assignments subsequently made by the ISO 639 part 1 maintenance agency or governing standardization bodies. and has a similar statement regarding ISO 3166. interpreted according to assignments found in certainly sounds as if the ISO lists are considered definitive for their respective categories of subtags, since their interpretation is specified as that given in those lists. I don't see how the RFC 3066 text can be interpreted otherwise. RFC 3066 indicates that the *interpretation* is determined by the source ISO standards. You were discussing display names. (Though, now that I've shown that display names are out of scope, you appear to be attempting to change things as though you had been discussing definitions.) Peter Constable Microsoft Corporation ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
On the contrary, what the authors of a standard intend is not normative. As much as possible, every standard must say what it means, because what a standard says *is* its technical content. For example, I'm unhappy about an apparent sentiment that would put ABNF on a lower footing than the English text. I think I'm like most implementors and perhaps unlike non-engineers in reversing that precedence. Whenever I read an RFC, I rely first and foremost on the ABNF. I use the English only for hints, and follow the ABNF instead of the English whenever there is a conflict. The ABNF is not on a lower footing than the English text. But it is dependent on the English text in exactly the same way that the ABNF in RFC 3066 was. I think the suggestion to change the "grandfathered" production is a good one and will help implementers who start with the ABNF. I also think, though, that the establishment of a comprehensive (as opposed to fractional) registry is the real salient point for implementers here. An implementation of RFC 3066 that follows *only* the ABNF would happily proliferate garbage tags like "c57-x", not just valid ones. The existence of a registry in draft-langtags should focus implementers' attention on two things: the ABNF and the subtags that fit into them. In that regard draft-langtags will simplify the lives of implementers who do not read the text (in the same way that having a registry for character encoding names--"charsets"--does). There are a couple other issues that ought to be addressed. I think Bruce Lilly started by charging that a potentially disruptive document had reached last-call without any review by those concerned with related, affected IETF standards. That sounds like a process problem that needs at least 1% as many words as have been spent in this mailing list in lawyerly talk such as whether "accounts" is more appropriate than "account." 
The IETF process is not really my concern. I will note that many IETF and non-IETF standards folks have participated in the process of developing and reviewing draft-langtags, though. I don't know if a wider audience should have been invoked earlier in the process. Mark and I welcome comments and questions on the technical suitability of our draft. I think that we have fully and carefully considered the potential impact and, in fact, have helped to stabilize language tags, not just now but for the future as well. Peter made the argument that future I-D authors could write a draft that does whatever they please with regard to language tags. Which is true. However, draft-langtags lays down a framework that should guide the activities of these authors and constrain the changes they make in a manner that is completely compatible with implementations of draft-langtags (not to mention RFC 3066 and RFC 1766). I think that a guarantee of future stability---in implementations (including current ones), extensions, and the tags (data) themselves---is of great benefit to related and/or affected IETF standards. Best Regards, Addison Addison P. Phillips Director, Globalization Architecture http://www.webMethods.com Chair, W3C Internationalization Working Group http://www.w3.org/International Internationalization is an architecture. It is not a feature. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
The ABNF is an expression of the grammar that describes the set of all valid tags. No, this is simply incorrect. You cannot expect that any implementation that simply does the ABNF is conformant. There are a great many constraints on the tags that are not in the ABNF grammar, that are clearly required in any reading of the text. Most of these *cannot* be encompassed in any ABNF grammar. There are a few that could be expressed in the ABNF; some at little cost, some with a great deal of complication. This is not a technical problem for the draft. as reasonable as the current worst-case of 11 octets. Also simply untrue. You seem not to be reading all the messages on this subject. Look at the ABNF for RFC 3066. There is *no* limit in the ABNF there! The syntax of this tag in ABNF [RFC 2234] is: Language-Tag = Primary-subtag *( - Subtag ) Primary-subtag = 1*8ALPHA Subtag = 1*8(ALPHA / DIGIT) -- http://www.ietf.org/rfc/rfc3066.txt?number=3066 Mark - Original Message - From: Bruce Lilly [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Friday, December 10, 2004 20:39 Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP RE: New Last Call: 'Tags for Identifying Languages' to BCP Date: 2004-12-10 20:03 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: [EMAIL PROTECTED] Resuming my comments: Specifically, the draft allows, and RFC 3066 disallows: subtags more than 8 octets in length hyphens which do not separate subtags zero-length subtags primary tags which are not purely alphabetic Curiously, all of those are permitted by the draft ABNF production grandfathered... The grandfathered production in the current draft is grandfathered = ALPHA *(alphanum / -) which does permit the sequences claimed by Bruce (except for not-purely-alphabetic primary sub-tags), No exception. alphanum is ALPHA / DIGIT. 
In plain English, grandfathered as defined in the draft is a letter followed by any number of letters, digits, and/or hyphens, in any order. And that includes a123-xyz as I initially stated, and clearly 1, 2, and 3 are digits. syntactically; but the set of tags available for use is constrained by more than the ABNF syntax alone: the acceptable productions for each sub-tag must either be taken from one of the source standards or be registered. So what? The ABNF is an expression of the grammar that describes the set of all valid tags. If the grammar permits y-, a123-xyz, etc. (and it does) then a parser claiming to parse language tags as defined by that ABNF must be able to parse such tags. That is, the ABNF- specified grammar imposes requirements on parsers. If one doesn't intend to impose such requirements, the ABNF specifying the grammar should be changed accordingly. This is no different from RFC 3066, so it is no more of a problem in this specification than it was in RFC 3066. It is a very different grammar from RFC 3066, imposing very different requirements on parsers. It might be that the wording in 2.2 could be tightened up to eliminate any possible question regarding the source for grandfathered productions. It's not a matter of wording; the problem is with the ABNF. Alternately, there's no reason why the grandfathered production shouldn't be composed exactly to match what was used in RFC 3066: grandfathered = 1*8ALPHA *(- 1*8alphanum) I believe I said as much (though one then needs to look at reduce/reduce conflicts implied by the revised grammar): I see no reason for the ABNF to permit such content as is forbidden by RFC 3066; the actual ABNF for what RFC 3066 permits is contained within 3066, and could have been directly incorporated rather than producing a grandfathered production which opens up several cans of worms. This vastly overstates the problem. There is no can of worms unless it exists in tags currently available under RFC 3066. 
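[Editorial illustration, not part of the original message.] The disagreement about what the two grammars accept can be checked mechanically. The following is a minimal Python sketch under stated assumptions: it transcribes the RFC 3066 ABNF and the draft's grandfathered production, both as quoted in this thread and reading the bare - as the literal hyphen, into regular expressions. The names RFC3066_TAG, DRAFT_GRANDFATHERED and compare are illustrative and appear in neither document. It tests syntax only and says nothing about registration.

```python
import re

# RFC 3066 grammar as quoted in the thread:
#   Language-Tag = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA ; Subtag = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

# Draft production as quoted in the thread:
#   grandfathered = ALPHA *(alphanum / "-")
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def compare(tag):
    """Return (accepted by RFC 3066 ABNF, accepted by draft 'grandfathered')."""
    return (
        RFC3066_TAG.fullmatch(tag) is not None,
        DRAFT_GRANDFATHERED.fullmatch(tag) is not None,
    )


if __name__ == "__main__":
    # Samples: a trailing hyphen, digits in the primary subtag,
    # an empty subtag, an 11-letter subtag, and two ordinary tags.
    for sample in ["y-", "a123-xyz", "a--b", "abcdefghijk", "en-US", "i-klingon"]:
        print(sample, compare(sample))
```

Under this transcription, every string that the RFC 3066 pattern accepts is also accepted by the grandfathered pattern, but not the reverse, which is the asymmetry being argued about.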
I referred to the additional requirements imposed on parsers, as well as the unlimited tag length permitted. One defect related to tag length in RFC 3066 is not remedied by the draft; indeed the problem is greatly exacerbated... Unfortunately, a language- tag's length is unlimited by the ABNF in RFC 3066 (due to an unlimited number of subtags) and in the draft... In particular, tags other than private-use tags with more than two subtags require registration under RFC 3066 rules, and it is a trivial matter to determine the longest registered tag. The draft, however, encourages use of more subtags as well as removal of the subtag length upper bound; moreover, it permits infinite numbers of subtags without requiring registration of the resulting complete tag. Bruce states incorrectly that there is no upper bound on the length of sub-tags. Look again at the draft definition of grandfathered -- now show me where there's a limit in that production on subtag length. His other concern
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly scripsit: Moreover, the point is that countries do change, and that use of country codes (as provided for in RFC 3066 and in the proposed draft) carries with it the inherent instability which is characteristic of politics. A quest for stability of countries seems Quixotic and oxymoronic. Of course countries change, and then the numeric country codes change as well. The point is that the alpha codes change for political reasons when there has been *no* change in the underlying country: Romania's 3-alpha code changed from ROM to ROU without any change in Romania at all. The CS case is particularly gratuitous, as its denotation changed from Czechoslovakia (a no longer existent country) to Serbia and Montenegro (a newly created country). A related problem with the use of country codes in language tags is that there is not necessarily an inherent relationship between language and country borders. Of course not. But for the most part, variations in orthography do tend to follow national boundaries, since orthography in many languages is either de jure or de facto a national matter. As far as I can tell, the draft doesn't really deal with the issue of changing borders or changing country names -- it merely pretends that these things don't happen by attempting to declare a snapshot of the status at some point in time as being valid for all time. No, it attempts to freeze the code-to-country mapping at a single point. New countries or changes in old countries should involve only the additions of codes, not the reuse of old codes. Where is the implementor supposed to get the *official* translation for display? I don't know. Where is the implementor supposed to get the official German, or Catalan, or Mandarin translations? Not in the ISO registry, for sure. To say nothing of the cases where no official translations exist. There are 6000 languages spoken on Earth, of which perhaps 600 have a standard written form. ISO 639 lists about 650, not precisely 6000. 
ISO 639-2 is deliberately incomplete. The current draft of ISO 639-3, which is not yet an IS, lists over 7000 languages. It might be worthwhile considering the differences in the way languages tags are used, by whom they are used, and for what purpose. There may well be a substantial difference between use of a tag to represent an obscure dialect of a dead language in a research paper vs. tagging a piece of text in one of the core Internet protocols such as SMTP. That count does not include dead languages. Whether it includes dialects is a matter of terminology. -- Deshil Holles eamus. Deshil Holles eamus. Deshil Holles eamus. Send us, bright one, light one, Horhorn, quickening, and wombfruit. (3x) Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa! -- Joyce, Ulysses, Oxen of the Sun [EMAIL PROTECTED] ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
RE: New Last Call: 'Tags for Identifying Languages' to BCP
Resuming my comments: -Original Message- From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly [snip] Specifically, the draft allows, and RFC 3066 disallows: subtags more than 8 octets in length hyphens which do not separate subtags zero-length subtags primary tags which are not purely alphabetic Curiously, all of those are permitted by the draft ABNF production grandfathered... The grandfathered production in the current draft is grandfathered = ALPHA *(alphanum / -) which does permit the sequences claimed by Bruce (except for not-purely-alphabetic primary sub-tags), syntactically; but the set of tags available for use is constrained by more than the ABNF syntax alone: the acceptable productions for each sub-tag must either be taken from one of the source standards or be registered. This is no different from RFC 3066, so it is no more of a problem in this specification than it was in RFC 3066. It might be that the wording in 2.2 could be tightened up to eliminate any possible question regarding the source for grandfathered productions. Maybe it's not as obvious to someone coming to this cold as it is for us who have been discussing it for the past year. Alternately, there's no reason why the grandfathered production shouldn't be composed exactly to match what was used in RFC 3066: grandfathered = 1*8ALPHA *(- 1*8alphanum) So, perhaps there is room for technical improvement, but there are not any serious problems IMO -- certainly nothing as serious as the tone of Bruce's message conveyed. I see no reason for the ABNF to permit such content as is forbidden by RFC 3066; the actual ABNF for what RFC 3066 permits is contained within 3066, and could have been directly incorporated rather than producing a grandfathered production which opens up several cans of worms. This vastly overstates the problem. There is no can of worms unless it exists in tags currently available under RFC 3066. 
One defect related to tag length in RFC 3066 is not remedied by the draft; indeed the problem is greatly exacerbated... Unfortunately, a language- tag's length is unlimited by the ABNF in RFC 3066 (due to an unlimited number of subtags) and in the draft... In particular, tags other than private-use tags with more than two subtags require registration under RFC 3066 rules, and it is a trivial matter to determine the longest registered tag. The draft, however, encourages use of more subtags as well as removal of the subtag length upper bound; moreover, it permits infinite numbers of subtags without requiring registration of the resulting complete tag. Bruce states incorrectly that there is no upper bound on the length of sub-tags. His other concern, on the overall length of complete tags, is valid, however: in terms of the ABNF syntax for both RFC 3066 and RFC 3066bis, infinite-length productions are possible, but RFC 3066 would require registration of complete non-private-use tags while RFC 3066bis does not. There are three open doors for infinite-length productions in the ABNF of the current draft: - unlimited extlang sub-tags - unlimited variant sub-tags - the number of possible extensions is limited to 25, but the length of extensions is unlimited We could impose some upper limits on these things; e.g. Language-Tag = ... *8(- extlang) ... *8(- variant) ... 1*25(- extension) ... extension = singleton 1*8(- 2*8alphanum) If we also imposed limits on the length of private-use tags and defined the grandfathered production in a way that made clear there was an upper limit for those, then we could end up eliminating an issue that had existed in RFC 3066. So, I think Bruce has identified a valid issue here. 
I personally would not have characterized it as greatly exacerbating, though, as the issue was present in RFC 3066: private-use tags did not need to be registered in RFC 3066, so there was no way an implementation could be written with certain knowledge that tags beyond some given length would not be encountered. The new registry provides a complete, easily parseable file which provides the precise contents of valid tags for any point in time. That is the first time I have ever heard ISO 8601 date format described as easily parseable. Perhaps the draft authors meant to say that a specific subset of the tortuously complex ISO 8601 date format is used, but that is not what the draft states... It seems very clear that the authors intended only a specific subset: YYYY-MM-DD. This is a minor technical issue that the authors can very easily remedy. I am absolutely shocked that a draft dealing with language lacks an Internationalization considerations section as recommended by RFC 2277 (a.k.a. BCP 18). No more or less shocking than for RFC 3066, regarding which I'm not aware of any complaints. I don't quite understand what the critique is here: what is there to internationalize about language tags? They are
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Show me a general-use RFC 3066 language tag which is too long to fit on an RFC 2822/3282 Content-Language header field line. Your claim was that RFC 3066bis (the informal name we've been using for the new draft) permits language tags that are longer than those permitted by RFC 3066. That is clearly false, as many people have pointed out. Any subsequent niggling that particular *types* of language tags can be longer or not is not relevant to the conformance implications of the two documents for language tags. The new draft neither extends nor contracts the maximum length of language tags conformant to RFC 3066. Your claim that the RFC 3066 ABNF itself has a restriction in length is also clearly false. I will quote that again since you seem somehow not to have seen it: The syntax of this tag in ABNF [RFC 2234] is: Language-Tag = Primary-subtag *( - Subtag ) Primary-subtag = 1*8ALPHA Subtag = 1*8(ALPHA / DIGIT) Both documents establish many further limitations on the contents of language tags in the text of each document. Ignoring those stated limitations will, in both documents, result in nonconformant language tags. Mark - Original Message - From: Bruce Lilly [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Sunday, December 12, 2004 09:16 Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP Date: 2004-12-11 00:52 From: Mark Davis [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] CC: [EMAIL PROTECTED] The ABNF is an expression of the grammar that describes the set of all valid tags. No, this is simply incorrect. You cannot expect that any implementation that simply does the ABNF is conformant. I made no such claim. I do claim that if the ABNF contradicts the normative text, as is the case in your draft w.r.t. acceptance of several constructs not permitted by RFC 3066 ABNF, that there is an error in either the normative text or the ABNF. 
There are a great many constraints on the tags that are not in the ABNF grammar, that are clearly required in any reading of the text. Most of these *cannot* be encompassed in any ABNF grammar. If your claim is that the ABNF cannot express a grammar consistent with the RFC 3066 ABNF, that is clearly false. There are a few that could be expressed in the ABNF; some at little cost, some with a great deal of complication. Are you claiming that it is unduly difficult to make the ABNF match RFC 3066's? This is not a technical problem for the draft. It is a problem due to the conflict between the ABNF and the text. It is a problem because it opens a loophole for future revisions to formalize content which is incompatible with RFC 3066 implementations. as reasonable as the current worst-case of 11 octets. Also simply untrue. You seem not to be reading all the messages on this subject. Look at the ABNF for RFC 3066. There is *no* limit in the ABNF there! The draft proposes closing RFC 3066-style registrations. Show me a registered RFC 3066 language tag longer than 11 octets. Show me a general-use (i.e. not private-use) RFC 3066 language tag which is too long to be used in an RFC 2047/2231 encoded-word. Show me a general-use RFC 3066 language tag which is too long to fit on an RFC 2822/3282 Content-Language header field line. ___ Ietf-languages mailing list [EMAIL PROTECTED] http://www.alvestrand.no/mailman/listinfo/ietf-languages ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
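[Editorial illustration, not part of the original message.] The two positions in this exchange concern different things: the RFC 3066 ABNF by itself places no bound on total tag length, while the challenge above is about tags actually registered. The syntactic half can be sketched in Python as follows. Assumptions: the 998-character hard line limit of RFC 2822 section 2.1.1, and that a tag contains no whitespace and therefore cannot be folded across lines. The names synthetic_tag and fits_on_one_line are hypothetical helpers, and the generated tags are deliberately meaningless and unregistered.

```python
import re

# RFC 3066 grammar as quoted in the thread (syntax only, no registry check).
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

RFC2822_MAX_LINE = 998                  # hard limit per line, excluding CRLF
HEADER_PREFIX = "Content-Language: "    # 18 characters


def synthetic_tag(extra_subtags):
    """Build a syntactically valid, unregistered tag with maximal-length subtags."""
    return "abcdefgh" + "-abcdefgh" * extra_subtags


def fits_on_one_line(tag):
    """True if the header field name plus the unfolded tag stays within the limit."""
    return len(HEADER_PREFIX) + len(tag) <= RFC2822_MAX_LINE


if __name__ == "__main__":
    for n in (0, 1, 108, 109):
        tag = synthetic_tag(n)
        ok = RFC3066_TAG.fullmatch(tag) is not None
        print(n, len(tag), ok, fits_on_one_line(tag))
```

Each additional subtag adds 9 octets, so the grammar alone admits tags of any length; whether such tags could ever be legitimately used is a question for the registration rules in the prose, which this sketch does not model.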
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly

The point is that under RFC 3066, the bilingual ISO language and country code lists are considered definitive.

That is nowhere stated or even suggested in RFC 3066.

Peter Constable
Microsoft Corporation
Re: New Last Call: 'Tags for Identifying Languages' to BCP
On Mon, Dec 13, 2004 at 01:37:04AM -0800, Mark Crispin wrote:

When I retrieve a file via FTP, HTTP, etc. the time stamp of that file on my computer is the date/time of retrieval, not the date/time of the file on the source. Unless, of course, both systems are running TOPS-20 and thus use that wonderful XTP mode that copies file metadata. Now, if you want to mandate that all UNIX and Windows systems be replaced with TOPS-20, I might support that... :-)

Actually, some FTP and HTTP transfer programs, including wget and ncftp, keep the original date stamp.

With Xmas greetings
Keld
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly

Moreover, the point is that countries do change, and that use of country codes (as provided for in RFC 3066 and in the proposed draft) carries with it the inherent instability which is characteristic of politics. A quest for stability of countries seems Quixotic and oxymoronic. According to the principle of stability as that term is used in defense of the draft, I suppose we're all intended to refer to Malawi as Rhodesia because that's what it (in part) was called 50 years ago, or that we're supposed to ignore the breakup of the USSR, Yugoslavia, etc., the reunification of Germany, etc.

That is not at all the aim here wrt stability; rather, the aim is that a symbolic identifier used for metadata in IT systems not change because some government on a whim says, We would now prefer to use 'yz' rather than 'xy' to designate our country. Sure, there will be changes that we need to deal with; but there's no reason to subject all implementations, users and data to changes that are purely cosmetic changes to things that are not designed to be read by humans.

A related problem with the use of country codes in language tags is that there is not necessarily an inherent relationship between language and country borders.

That is not what country IDs within a language tag are intended to suggest. In fact, if there were inherent relationships, we probably would never have needed to use country IDs in a language tag.

The borders of Germany have changed many, many times. If one is referring to the German language as spoken by inhabitants of Alsace, using country codes would imply that that same language spoken by the same people would have been tagged at various times as de-DE and de-FR according to where the France-Germany border happened to have been determined by politicians of the time. That strikes me as being a rather silly way to tag language, but that's the precedent set by RFC 1766.
I agree that that's a silly way to tag that language; I disagree that RFC 1766 suggests I should tag it that way.

As far as I can tell, the draft doesn't really deal with the issue of changing borders or changing country names -- it merely pretends that these things don't happen by attempting to declare a snapshot of the status at some point in time as being valid for all time.

That may be your reading of the situation, but it is not how it is seen by those of us who have been working on this spec and examining these issues closely.

But the user has indicated that he speaks French, and the proposed registry contains a description in English only. Where is the implementor supposed to get the *official* translation for display? N.B. under the current (RFC 3066) situation, the definitive ISO lists provide an official description in French.

Neither RFC 1766 nor RFC 3066 has ever presented official translations; this is no different for RFC 3066bis. Under RFC 3066, one is pointed to ISO 639-1 and ISO 639-2 to get the alpha-2 and alpha-3 IDs, but it does not anywhere state that implementors should use the English and French language names in those ISO standards; exactly the same situation holds for RFC 3066bis. (Note, btw, that the names listed by ISO 639-1/-2 have no particular official status; they are normative in those standards to the extent that they indicate what language variety a given ID denotes, but they do not claim that the particular form of the language names has any particular status.)

One possibility would be two description fields.

Why two?

There are now two in the ISO lists (and, as noted, in the UN list). I have no objection to more, but I object to a reduction.

If anything, I am inclined to object to two: to avoid an Anglo-Franco colonial bias, either there is one name that is simply a reference name, or the registry be designed so that it could accommodate names in as many languages as may be available.
Note that RFC 3066 specifies a registry that does not include French language names. I suggest that this issue should be dropped.

I have an implementation which (in accordance with RFC 3066) uses the official ISO lists. It has provision for displaying ISO 639 language tags with their descriptions in either of the two languages supported by the official 639 lists, and likewise for the ISO 3166 country codes.

RFC 3066 *does not at any point* suggest, let alone state, that implementations should use ISO 639 language names or ISO 3166 country names for UI purposes. IMO, you are creating an issue where none exists.

The specification of the draft is *NOT* compatible with that existing implementation because it removes the existing functionality of official descriptions in French of language and country codes. As a result of that incompatibility, the newly proposed specification does not work with (at least that one) existing implementation
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly scripsit:

Feh. Whatever. The human-readable stuff that corresponds to the code which you say shouldn't be read. The stuff without which codes are meaningless. The stuff without which two communicating parties cannot agree on the meaning of XX.

Two communicating parties can unquestionably agree on the meaning of XX without both English and French definitions. Either will suffice. Indeed, if either definition provided a nuance not available to the other, they would not be interchangeable, and one would have to be the authentic definition and the other a mere aide-memoire.

--
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
[W]hen I wrote it I was more than a little febrile with foodpoisoning from an antique carrot that I foolishly ate out of an illjudged faith in the benignancy of vegetables. --And Rosta
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Are you claiming that sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu is nonconformant per some specification in the draft proposal?

Clearly not. But x-sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu is already absolutely conformant with the current RFC 3066. And the current RFC 3066 clearly permits the registration of something as long as sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu (although of course this particular combination would certainly never get in).

Inutile d'aller plus loin... There is no use in trying to declare a difference in conformant lengths between these two documents when one doesn't exist.

If you want to do something productive, you should make a practical suggestion for a change in the current text of the new draft. If the new draft is to be backward compatible, then it has to be worded carefully. I haven't thought it through at length, but it would need to be something like:

- A conformant implementation need not support the storage of language tags which exceed a specified length. However, such a limitation must be clearly documented, including the disposition of any longer tags (for example, whether an error value is generated or the language tag is truncated -- and if so, how it is to be truncated).

Mark

- Original Message -
From: Bruce Lilly [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Sunday, December 12, 2004 12:20
Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP

Date: 2004-12-12 13:00
From: Mark Davis [EMAIL PROTECTED]
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]

Your claim that the RFC 3066 ABNF itself has a restriction in length is also clearly false. I will quote that again since you seem somehow not to have seen it:

I made no such claim; indeed it was I who pointed out that RFC 3066 *theoretically* permits an infinite-length tag.
On that basis alone (even if you missed the fact that I am an implementor of RFC 3066 language tags) you can be sure that I am well aware of the RFC 3066 ABNF. Both documents establish many further limitations on the contents of language tags in the text of each document. Ignoring those stated limitations will, in both documents, result in nonconformant language tags. Are you claiming that sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu is nonconformant per some specification in the draft proposal? It is certainly too long to be used in an RFC 2047/2231 encoded-word. It is much longer than any registered RFC 3066 language tag, and the draft proposes removing full tag registration procedure restrictions as well as decoupling use from registration that would combine to permit such an abomination. ___ Ietf-languages mailing list [EMAIL PROTECTED] http://www.alvestrand.no/mailman/listinfo/ietf-languages ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
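The wording Mark suggests in this message (an implementation may cap the length of stored tags, provided it documents what happens to longer ones) could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the function name `fit_language_tag`, the `on_overflow` parameter, and in particular the truncation rule, which drops whole trailing subtags. Neither RFC 3066 nor the draft is quoted in the thread as defining a truncation rule, and dropping subtags changes what the tag identifies.

```python
def fit_language_tag(tag: str, max_len: int, on_overflow: str = "error") -> str:
    """Apply a documented length limit to a language tag.

    on_overflow="error":    raise ValueError for tags longer than max_len.
    on_overflow="truncate": drop trailing subtags until the tag fits
                            (hypothetical policy, chosen for this sketch).
    """
    if len(tag) <= max_len:
        return tag
    if on_overflow == "error":
        raise ValueError(f"language tag exceeds {max_len} octets: {tag!r}")
    if on_overflow != "truncate":
        raise ValueError(f"unknown overflow policy: {on_overflow!r}")

    kept = []
    length = 0
    for subtag in tag.split("-"):
        # Every subtag after the first costs one extra octet for the hyphen.
        extra = len(subtag) + (1 if kept else 0)
        if length + extra > max_len:
            break
        kept.append(subtag)
        length += extra
    if not kept:
        raise ValueError(f"primary subtag alone exceeds {max_len} octets")
    return "-".join(kept)


if __name__ == "__main__":
    long_tag = "sr-CS-891" + "-boont-gaulish-guoyu" * 3
    print(fit_language_tag(long_tag, 11, on_overflow="truncate"))
```

Cutting at a hyphen rather than at a fixed octet count at least keeps the result syntactically well formed, but a real policy would also need to say what happens to private-use tags, where a bare `x` is not a usable tag.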
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly

What is silly is saying that every language tag has to have a date/time attribute associated with it so that computer software managing that text knows the language of that text.

In the specific cases of the core Internet protocols that I have mentioned, there *is* a date/time attribute in the form of an RFC [2]822 Date field. If we're talking about some file stored on some machine, every OS that I know of has a date/time stamp associated with that file. If you have something else in mind, a concrete description and/or example might help.

That is not sufficient for many other implementations of RFC 3066. For instance, an XML document may well be stored in a file system that has date/time stamps associated with the file; it might also be stored in a content management system that does not report creation dates when returning content. And elements from within that XML document may be returned as the result of an XPath query or a call into a DOM API, and those surely cannot be assumed to have creation date/time stamps, though one certainly must assume that they can have RFC 3066 tags as xml:lang attributes.

I'm not eager to abolish uniqueness.

There never was any guarantee that codes would never change. Both RFCs 1766 and 3066 specifically mention changes as a fact of life.

Some of us consider that fact and the instability particularly of ISO 3166 to be a serious problem. That (not accessibility) was one of the key reasons for this revision.

SO where are the French definitions?

Ask a person who is bilingual in English and French to provide one.

That would lack definitiveness which characterizes the ISO lists.

You started out this thread by talking about display names, not definitions; hence Mark's suggestion. Now you have switched to talking about definitions.
The draft clearly indicates where one finds the definitions:

    o All 2-character language subtags were defined in the IANA registry according to the assignments found in the standard ISO 639...

I.e. the definition is provided in the registry on the basis of what is defined in ISO 639; hence if what is indicated in the registry is for any reason insufficient for your purposes, you consult the definitive source, the ISO standard.

Peter Constable
Microsoft Corporation
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly

The grandfathered production in the current draft is

    grandfathered = ALPHA *(alphanum / "-")

which does permit the sequences claimed by Bruce (except for not-purely-alphabetic primary sub-tags),

No exception. alphanum is ALPHA / DIGIT.

My mistake; again, I had on my mind constraints beyond the ABNF.

syntactically; but the set of tags available for use is constrained by more than the ABNF syntax alone: the acceptable productions for each sub-tag must either be taken from one of the source standards or be registered.

So what? The ABNF is an expression of the grammar that describes the set of all valid tags.

It is *part* of the expression of the grammar. Even in RFC 3066 this is the case: you know that t-abc is not valid under RFC 3066, but not because that is constrained by the ABNF of RFC 3066.

I will accept that the ABNF of the draft should be changed to better reflect what the form of grandfathered productions can be, which, as I stated in my previous message, would be the equivalent of the ABNF of RFC 3066:

    grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I think that's an improvement, though technically I don't think it changes anything.

If one doesn't intend to impose such requirements, the ABNF specifying the grammar should be changed accordingly.

This is no different from RFC 3066, so it is no more of a problem in this specification than it was in RFC 3066.

It is a very different grammar from RFC 3066, imposing very different requirements on parsers.

Our disagreement amounts to a basic question of whether parsers should be written based on the ABNF alone, or based on the ABNF plus other constraints provided in the spec. Clearly, I think anyone writing a parser should consider other constraints as well.
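To make the difference between the two 'grandfathered' productions discussed above concrete, here is a small Python sketch that transcribes each into a regular expression and compares what they accept. The pattern names and the sample strings are invented for this example, and the literal hyphen in each rule is assumed to be the ABNF string "-". This covers syntax only; as the message says, the registry is what actually decides which tags are valid.

```python
import re

# Draft production as quoted:   grandfathered = ALPHA *(alphanum / "-")
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# Proposed replacement:         grandfathered = 1*8ALPHA *("-" 1*8alphanum)
PROPOSED_GRANDFATHERED = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def accepted_by(tag: str) -> tuple[bool, bool]:
    """Return (matches draft rule, matches proposed rule) for a candidate tag."""
    return (
        DRAFT_GRANDFATHERED.fullmatch(tag) is not None,
        PROPOSED_GRANDFATHERED.fullmatch(tag) is not None,
    )


if __name__ == "__main__":
    samples = [
        "i-klingon",          # shape of an RFC 3066 registered tag
        "a1",                 # digit inside the primary subtag
        "en-abcdefghijkl",    # 12-character subtag
        "a--b",               # empty subtag
    ]
    for tag in samples:
        print(tag, accepted_by(tag))
```

The draft rule, read on its own, accepts digits in the first subtag, subtags of any length, and doubled hyphens; the proposed rule has the same shape as the RFC 3066 ABNF and rejects all three.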
In particular, tags other than private-use tags with more than two subtags require registration under RFC 3066 rules, and it is a trivial matter to determine the longest registered tag. The draft, however, encourages use of more subtags as well as removal of the subtag length upper bound; moreover, it permits infinite numbers of subtags without requiring registration of the resulting complete tag.

Bruce states incorrectly that there is no upper bound on the length of sub-tags.

Look again at the draft definition of grandfathered -- now show me where there's a limit in that production on subtag length.

As mentioned, the limit is imposed by other tight constraints on 'grandfathered'; you have already identified that the longest registered tag under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be at most 11 octets in length.

There are three open doors for infinite-length productions in the ABNF of the current draft:

- unlimited extlang sub-tags
- unlimited variant sub-tags
- the number of possible extensions is limited to 25 ... , but the length of extensions is unlimited

You have missed several others:

1. privateuse length is unlimited (either tacked on after lang etc., or directly as an alternative in Language-Tag)

I disregarded this since it is identical to the case for RFC 3066, and you were, after all, charging that the draft creates problems that were worse than for RFC 3066.

2. grandfathered, which as already discussed permits unlimited length.

But as already stated is very tightly constrained, with a de-facto upper limit of 11 (subject to change if new tags are registered before the proposed spec is accepted).

We could impose some upper limits on these things...

That leaves the extension portions' length at up to 25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts of a tag into account!
That's way too long (the RFC 2047 limit for an encoded-word is 75 octets, including charset tag, some text, and some syntactic glue in addition to the language tag).

The problem already exists in RFC 3066. Even apart from private-use tags, tomorrow someone could request a registration for a tag that's 87 octets long, and there's nothing in RFC 3066 that would prohibit acceptance.

So, I think Bruce has identified a valid issue here. I personally would not have characterized it as greatly exacerbating, though,

IMO, an increase from 11 octets worst-case, which is tolerable for constructing RFC 2047/2231 encoded-words, to 1850 octets, which exceeds by a large margin what can be handled in a Content-Language or Accept-Language message header field, constitutes greatly exacerbated.

Repeating my previous point, RFC 3066 doesn't stop a registered tag from being 10^100 octets in length. Of course, all of us know that such a tag wouldn't be useful. At some point, we have to engage common sense, even for RFC 3066.

The draft would allow a tag en-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont (over
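The 1850-octet figure in the message above comes from the expression 25 * (1 + 1 + 8 * 9). The following Python sketch simply reproduces that arithmetic and sets it against the two other lengths cited in the thread. The breakdown in the comments (a separator, a singleton, then eight subtags of a hyphen plus 8 characters each) is a guess at what the terms stand for; the message itself does not spell it out, so treat the variable names as assumptions.

```python
# Figures quoted in the thread.
MAX_EXTENSIONS = 25          # "the number of possible extensions is limited to 25"
ENCODED_WORD_LIMIT = 75      # RFC 2047 limit on an encoded-word, in octets
LONGEST_REGISTERED_TAG = 11  # longest registered RFC 3066 tag cited in the thread

# Assumed reading of (1 + 1 + 8 * 9): one hyphen, one singleton character,
# and eight subtags that each cost a hyphen plus up to 8 characters.
SEPARATOR = 1
SINGLETON = 1
SUBTAGS_PER_EXTENSION = 8
OCTETS_PER_SUBTAG = 1 + 8

per_extension = SEPARATOR + SINGLETON + SUBTAGS_PER_EXTENSION * OCTETS_PER_SUBTAG
extensions_total = MAX_EXTENSIONS * per_extension

if __name__ == "__main__":
    print("per extension:", per_extension)
    print("all extensions:", extensions_total)
    print("ratio to encoded-word limit:", extensions_total / ENCODED_WORD_LIMIT)
    print("ratio to longest registered tag:", extensions_total / LONGEST_REGISTERED_TAG)
```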
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Peter Constable scripsit:

My suggestions, then, in response to Bruce Lilly's comments are:

I heartily support all of this, despite the extra burden it imposes on our esteemed editors, and hope that none of it is in any way controversial.

--
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say Gosh!
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
--The Linux-nationale by Greg Baker
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Peter Constable scripsit:

The ISO 3166 MA maintains that standard in accordance with the identifiers specified by the UN Statistics Division; a change by the UN is all the convincing that is required.

Umm, not quite. The UNSD defines what a country is, and assigns it a 3-digit code (normative) and a name (informative); the ISO 3166 MA then specifies 2-letter and 3-letter codes for that name.

This scenario is not hypothetical; it actually occurred in the case of CS. The change was solely under the control of the UN Statistics Division; it is not part of their process to consult with developers and users of IT systems in general, and they were not consulted in this case. They were completely powerless to influence the change, learning about it only after the fact.

UNSD had nothing to do with this. It assigned the hitherto-unused code 891 for the country now called Serbia and Montenegro. (Yugoslavia had the code 890, Czechoslovakia the code 200). This was a reasonable judgment in the circumstances: the question of when a country has changed into another country is always fuzzy.

It was the ISO 3166 MA and no one else who chose to assign the 2-letter code CS to the new country. UNSD historically has assigned new numerical codes when new countries come into existence, and has managed to avoid reusing any of its 3-digit identifiers, which is precisely why those identifiers are being used as trusted backups in RFC 3066bis for the unstable ISO 3166 identifiers.

This is a situation we do not intend to repeat.

Agreed, but let's make sure not to blame the innocent.

It is not uncommon for users to confuse JA and JP.

*blush* I've done it myself, and in implementation, not merely in discussion. Fortunately, the evidence is now buried.

--
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly

That is not at all the aim here wrt stability; rather, the aim is that a symbolic identifier used for metadata in IT systems not change because some government on a whim says, We would now prefer to use 'yz' rather than 'xy' to designate our country.

If by international agreement, 'yz' becomes the designation for that country, then it is rather silly to stick one's fingers in one's ears and shout NA-NA-NA-NA-NA I don't want to hear you.

That misses the point entirely. The point is that IDs used by political administrations may change for any number of reasons, and those administrations may have no qualms with such changes; but in IT systems, we cannot afford changes that break existing implementations and data. If for whatever reason ISO and the UN decided that US should be used to designate the country of France, I doubt you'd expect every software vendor to update all of their deployed installations to use fr-US instead of fr-FR, and for every user to go through every data repository they manage to make such changes in their data.

The people that maintain time zone definitions may have their means for changing times; that's fine for them. They are not dealing with the same concerns as we are dealing with. The group here that has focused specifically on language-tagging issues for several years has evaluated issues that affect language tags and the impact of changes and has decided what is best practice for *this* domain, and it is to maintain stability of data rather than cater to whims of political administrations.

Designed or not, country codes *are* read by humans; they appear in top-level domain names. Currently the ISO 639 2-letter codes mean the same thing as the last component of a domain name

I think you mean ISO 3166 2-letter codes.

and as the second component of a language-tag.
It's rather silly to change that correspondence simply because a few people are piqued that international agreement has been reached to change a few 2-letter codes. The usability flaw in treating ISO 639 and ISO 3166 as human-readable is evident in the confusion between ja and JP (or is it jp and JA?), and GB vs UK. As for what is silly, if the UN country ID for Canada changed to CN (and that for PRC changed to something else), I'm sure it would cause far greater problems for users to have to change the last two letters in domain names than for them to keep doing what they always did. In fact, I would have thought it would create a rather significant problem on the Internet if such a change were made. (URIs don't come with versioning dates for domain names, so how would a DNS server know what the cn meant?) Neither RFC 1766 or RFC 3066 has ever presented official translations; Both defer to the ISO lists for definitions (not translations) of the various codes. Definitions; not language names for display use. this is no different for RFC 3066bis. It is very different; under the proposed draft, there is only an English definition, somebody wishing to provide a French definition finds that he has none and must resort to an unofficial translation. The more you press this, the more silly it seems. RFC 3066 does not anywhere discuss display names; localization data is beyond its scope. The registry it defines does not give provision for French language names. The source ISO standards are every bit as accessible as they ever were, and just as RFC 3066 gave the user no option but to refer to the source ISO standard, so users should and can continue to do so. After this response, I will not waste my time any further on this foolishness. I'm willing to postpone the discussion (other problems with the proposed registry format dictate a broader solution which could easily have provision for an arbitrary number of descriptions). 
I strongly object to the suggestion that progress on this draft be delayed to deal with this non-issue that caters to implementation issues that are well beyond the scope of either RFC 3066 or its proposed replacement.

No, you are overlooking the fact that a set of codes with no corresponding definitions is useless. RFC 3066 defers the code/definition pairs to ISO, which provides multilingual definitions. The proposed draft would remove that multilingual characteristic.

What if the registry provided no name, just the ID? Then people would have to refer to the source ISO standard as they did in the past, and we would be able to specify which ISO IDs were or were not valid. That would achieve the goal that we had wrt stability while eliminating the concern that English-only annotations for some reason apparently create for you. Personally, I think the English annotation is helpful, but it seems that the real solution you're looking for is to remove any annotation whatsoever so that the situation is closer to what we have under RFC 3066.

Display
RE: New Last Call: 'Tags for Identifying Languages' to BCP
From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly

As mentioned, the limit is imposed by other tight constraints on 'grandfathered'; you have already identified that the longest registered tag under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be at most 11 octets in length.

But the constraints probably aren't as tight as you believe; the draft specifically permits a future revision to allow a primary subtag longer than 8 octets, or not purely alphabetic, etc.

RFC 3066 does not impose any restrictions on what its replacements might do. This is the case with any specification: a given technical specification is not a specification of human behaviour and cannot keep us from revising the spec or replacing it in any way we may choose.

One would hope that under RFC 3066 rules, that the reviewer, a list subscriber, or an Applications Area Director would recognize the conflict with RFCs 2047/2231 and would object.

You have mentioned conflict with RFCs 2047 and 2231. RFC 2047 does not make reference to language tags. The ABNF of RFC 2231 does not impose any limit on the length of language tags. RFC 2231 does contain an implicit length issue in that it updates RFC 2047, allowing language tags within encoded words, but it does not explicitly identify any upper bound on the length of language tags. By reading both RFC 2047 and RFC 2231, one finds that they assume that a language tag must be at most 64 characters long:

- the maximum length for the encoded-word production is 75 characters (not stated in the ABNF of RFC 2047 but rather in the prose)
- the encoded-word production of RFC 2047 includes 6 literal characters
- RFC 2231 adds one delimiting character * between the charset and language tag
- the shortest charset names are 2 characters long (e.g. IT)
- the shortest encoding length is 1 character long
- the minimum encoded-text length is 1 character long

An encoded-word must contain at least 11 characters that are not part of the language tag and have a total length of no more than 75 characters. Therefore, an upper bound on language tags that can be used in an RFC 2047/2231 encoded-word production is 64 characters. In many cases, where the charset tag or encoding is longer, the upper bound on the length of language tags will be less, but the RFC gives no estimate or indication of how much less.

This is a constraint on an application of RFC 3066; it is not a constraint on RFC 3066 itself. It is possible that other applications of RFC 3066 may impose limits that may be longer or shorter than that imposed by RFC 2047/2231. I see no reason why limits must be added as a constraint in a revision of RFC 3066. It would be a good idea, however, to point out in section 2.1 of the draft that some applications of this specification may impose limits on the length of accepted language tags, and perhaps to cite RFC 2231 as an example.
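The 64-character bound derived above is plain subtraction, and the same budget shows how quickly the room shrinks with a longer charset name. The Python sketch below works it through; the function names are invented for this example, and the encoded-word layout is the RFC 2047 form with the RFC 2231 `*language` suffix on the charset, as described in the message.

```python
ENCODED_WORD_MAX = 75   # RFC 2047 prose limit on an encoded-word
LITERALS = 6            # "=?" + "?" + "?" + "?="
LANG_DELIMITER = 1      # "*" between charset and language (RFC 2231)


def max_language_tag_length(charset: str = "IT", encoding: str = "Q",
                            encoded_text_len: int = 1) -> int:
    """Characters left for the language tag in a single encoded-word."""
    used = (LITERALS + LANG_DELIMITER + len(charset)
            + len(encoding) + encoded_text_len)
    return ENCODED_WORD_MAX - used


def build_encoded_word(charset: str, language: str, encoding: str,
                       encoded_text: str) -> str:
    """Assemble =?charset*language?encoding?encoded-text?= as a string."""
    return f"=?{charset}*{language}?{encoding}?{encoded_text}?="


if __name__ == "__main__":
    print(max_language_tag_length())                      # shortest possible overhead
    print(max_language_tag_length(charset="ISO-8859-1"))  # a longer charset name
    word = build_encoded_word("IT", "x" * max_language_tag_length(), "Q", "a")
    print(len(word))
```

With the minimal overhead the budget is exactly used up: a 64-character tag yields a 75-character encoded-word carrying a single character of text, which is why the practical limit is lower than 64.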
My suggestions, then, in response to Bruce Lilly's comments are:

- that we add a note prominently in section 2.1 of the draft explaining that some applications may impose limits on the lengths of language tags, and cite RFC 2231 as an example

- that we revise the ABNF for the 'grandfathered' production rule to

      grandfathered = 1*3ALPHA *("-" 1*8alphanum)

- that we add a note in the discussion of extensions stating that, when a language tag instance is to be used in a specific, known protocol, it is advisable that the language tag not include extensions not supported by that protocol (text can be added pointing out the inadvisability of including unrecognized extensions in the case of protocols that impose upper limits on the length of strings that may contain a language tag)

- that recommendation 4 in section 2.4.2 be changed to say that extensions should not be removed except in the case that the language tag instance is to be inserted into a specific protocol known not to support the extension

- that the language subtag registration form include an additional field following #7 (recommended prefixes for variants) asking for a reasonable estimate and exemplar of the maximum length anticipated for language tags using the requested variant

- that a requirement on extension RFCs be added in section 3.4 stating that they must include some explicit discussion of concerns related to upper bounds on length of language tags using the given extension

- that we do not attempt any other changes to the ABNF to impose an upper bound on the length of language tags

- that we add a note in section 3.1 indicating that descriptions in registry entries for ISO 639, ISO 3166 or ISO 15924 identifiers are intended only to indicate the meaning of that identifier as defined in the source ISO standard at the time it was added to the registry, and that the descriptions are not replacements for content of the source standards themselves

- that we do not need to change the proposed format of the registry to
Re: New Last Call: 'Tags for Identifying Languages' to BCP
On Sun, 12 Dec 2004, Bruce Lilly wrote: In the specific cases of the core Internet protocols that I have mentioned, there *is* a date/time attribute in the form of an RFC [2]822 Date field. If we're talking about some file stored on some machine, every OS that I know of has a date/time stamp associated with that file. If you have something else in mind, a concrete description and/ or example might help. When I retrieve a file via FTP, HTTP, etc. the time stamp of that file on my computer is the date/time of retrieval, not the date/time of the file on the source. Unless, of course, both systems are running TOPS-20 and thus use that wonderful XTP mode that copies file metadata. Now, if you want to mandate that all UNIX and Windows systems be replaced with TOPS-20, I might support that... :-) Silliness aside, the file may well have embedded language tags in the text of the file. Have you forgotten Plane 14? I'm not eager to abolish uniqueness. There never was any guarantee that codes would never change. Both RFCs 1766 and 3066 specifically mention changes as a fact of life. That's what's now being fixed. French is an official language used by the ISO in its publications. Why is this vestige of colonialism important in the IETF context? SO where are the French definitions? Ask a person who is bilingual in English and French to provide one. That would lack definitiveness which characterizes the ISO lists. What magic attribute is there to French that provides definitiveness that is absent in English, or Mandarin, or Hindi, all of which are far more significant languages to the world? Why is it a problem? Why is it a defect? Because it unnecessarily reduces by 50% the information content currently available. A mandatory French translation to an English definition does not significantly increase the information content, and certainly does not double it. The only increase in the information content would be to those individuals who comprehend French but not English. 
This is a very small number of individuals. If there is to be a mandatory translation into a second language to increase information content, then that language should be Mandarin. Among individuals who do not comprehend English, far more comprehend Mandarin than comprehend French. If there is to be a mandatory translation into a third language, that would probably be Hindi. You have not explained how the code came to be embedded within the text itself -- surely the author didn't say (or write, or sign) this text is in language QZ; most likely the language was indicated by name, or by some proxy representing the name (such as a locale). Plane 14. HTML and other markups. -- Mark -- http://staff.washington.edu/mrc Science does not emerge from voting, party politics, or public debate. Si vis pacem, para bellum. ___ Ietf mailing list [EMAIL PROTECTED] https://www1.ietf.org/mailman/listinfo/ietf
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly scripsit:

If by international agreement, 'yz' becomes the designation for that country, then it is rather silly to stick one's fingers in one's ears and shout NA-NA-NA-NA-NA I don't want to hear you.

Actually, 'yz' doesn't designate the country in the ISO standard, as I explained yesterday. Rather, it designates the *name* of the country, which is of course subject to change *without* international agreement. In RFC 1766/3066, we attempt to use it to designate the country, which requires some straining of the concept.

As I have pointed out, politicians change the definitions of time zones frequently, and those who have to deal with time zone issues have found a way to cope with such change without trying to declare international standardization organizations irrelevant.

Ah, but you kick the ball through your own goalposts here. The Olson time zone system is excellent -- but it becomes so only by totally ignoring the customary names of time zones and inventing its own! (Thus U.S. Eastern time is named America/New_York, e.g.) The customary names are carried only as time zone abbreviations such as EST, which are not unique, are English-only, and most of which are also made up. (Countries with a single time zone generally don't bother with an official name for it, with some obvious exceptions.)

It's rather silly to change that correspondence simply because a few people are piqued that international agreement has been reached to change a few 2-letter codes.

Not much of an international agreement, really.

-- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan
Samuel Johnson on playing the violin: Difficult do you call it, Sir? I wish it were impossible.
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce Lilly has posted comments on the IETF list in response to the last-call announcement for a proposed revision to RFC 3066. His comments were generally negative, raising a number of concerns. I and others involved in preparation of the revision have discussed Bruce's concerns with him, but they were not made available on the IETF list since those of us other than Bruce were not subscribed to this list. I wish to briefly summarize the outcome of that discussion for the benefit of people here.

Some of Bruce's comments were purely editorial (e.g. formatting of draft); I will not review those. Bruce's substantive concerns were:

- Accessibility of source ISO standards was referred to in the announcement as a major reason for the proposed revision, but accessibility has not been a problem in his experience.
- RFC 3066 directed users to source ISO standards; the proposed revision would establish a registry that includes all ISO identifiers considered valid for use in language tags, but the documentation for those identifiers in this registry does not include both English and French language / country names.
- The proposed revision makes reference to ISO 8601 time/date format being used in the registry, which is a complex and not-readily-available specification.
- The ABNF used in the proposed draft permits many strings that do not conform with RFC 3066.
- The proposed revision imposes no bounds on the length of tags (same as RFC 3066), and does not require registration of complete tags (different from RFC 3066).
- The lack of an Internationalization considerations section as recommended by RFC 2277 (a.k.a. BCP 18).

As a result of Bruce's comments, those of us contributing to the development of this revision have suggested certain revisions to which the authors have indicated openness. As I will explain, these revisions would provide clarification on various matters, but would not constitute technical changes in the draft.

1. Re accessibility: it was pointed out that the draft itself does not identify accessibility of source ISO standards as one of the primary reasons for the revision. There are some minor accessibility concerns having to do with uncertainty of the on-going availability of the relevant ISO code tables, and of change histories for each of the relevant ISO standards. The proposed changes to the language-tag registry address these concerns, though there were bigger reasons for the proposed registry changes, particularly having to do with stability.

2. Re the lack of French descriptions in the registry: it was pointed out that the registry defined by RFC 3066 did not include French descriptions, and that the revised registry is not intended to replace the source ISO standards or make them irrelevant. The meaning of IDs would still be established from the ISO standards from which they were drawn, and the proposed revision would continue to make reference to them. As a result of Bruce's comments, it was suggested that wording be revised in the draft to make this relationship clearer.

3. Re ISO 8601 time/date format: What is used in the registry is dates expressed in the format YYYY-MM-DD. It was agreed that it would be better to identify the format precisely rather than make the generic reference to ISO 8601.

4. Re the less restrictive ABNF: the one place that had less restrictive syntax was a production rule that was subject to additional strict constraints, namely that only certain pre-existing tags registered under RFC 3066 could fall under that production. A change to the ABNF has been suggested that would make the ABNF at that point consistent with the ABNF for RFC 3066. This does not constitute a change having any technical consequence as there is no resulting change in the set of valid tags.

5. Re upper bounds on length of tags: It was pointed out that private-use tags for both RFC 3066 and the proposed revision have no bounds on their length. The greater concern was for non-private-use tags. For these, it was pointed out that RFC 3066 also imposes no bounds on length. Admittedly, though, there is a difference because RFC 3066 requires registration of complete tags, so one can determine at any time what is the longest valid tag that may be encountered, whereas the proposed revision requires registration of sub-tags which can then be combined productively, and one cannot predict with certainty what combinations may be used. (This, IMO, is the most significant of the concerns Bruce raised.) While the proposed revision allows productive combinations of registered sub-tags, there are some limits on how combinations can be made, as specified by the ABNF. The ABNF does allow unlimited numbers of certain elements, specifically three. One of these (extlang) is defined by the ABNF in anticipation of possible future extension of the language tag specification to incorporate mechanisms expected in a new part to ISO 639 that is in preparation, but
Re: New Last Call: 'Tags for Identifying Languages' to BCP
From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] This is a multi-part message in MIME format. --===1521567419== Content-class: urn:content-classes:message Content-Type: multipart/alternative; boundary=_=_NextPart_001_01C4E16C.40BF0707 This is a multi-part message in MIME format. --_=_NextPart_001_01C4E16C.40BF0707 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Bruce Lilly has posted comments on the IETF list in response to the last-call announcement for a proposed revision to RFC 3066. His comments were generally negative, raising a number of concerns. I and others involved in preparation of the revision have discussed Bruce's concerns with him, but they were not made available on the IETF list since those of us other than Bruce were not subscribed to this list. I wish to briefly summarize the outcome of that discussion for the benefit of people here. =20 ... In conclusion, I think that some of Bruce's concerns were valid, and suggestions for changes have been presented to the authors accordingly. I believe all of these changes can be considered to be for clarification purposes, rather than technical changes. (No changes affecting the set of valid tags have been made.) ... --_=_NextPart_001_01C4E16C.40BF0707 Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: quoted-printable html head meta http-equiv=3DContent-Type content=3Dtext/html; = charset=3Dus-ascii meta name=3DGenerator content=3DMicrosoft Word 11 (filtered) style !-- /* Font Definitions */ @font-face {font-family:Wingdings; panose-1:5 0 0 0 0 0 0 0 0 0;} @font-face {font-family:SimSun; panose-1:2 1 6 0 3 1 1 1 1 1;} @font-face On the contrary, what the authors of a standard intend is not normative. As much as possible, every standard must say what it means, because what a standard says *is* its technical content. 
For example, I'm unhappy about an apparent sentiment that would put ABNF on a lower footing than the English text. I think I'm like most implementors and perhaps unlike non-engineers in reversing that precedence. Whenever I read an RFC, I rely first and foremost on the ABNF. I use the English only for hints, and follow the ABNF instead of the English whenever there is a conflict.

There are a couple of other issues that ought to be addressed. I think Bruce Lilly started by charging that a potentially disruptive document had reached last-call without any review by those concerned with related, affected IETF standards. That sounds like a process problem that needs at least 1% as many words as have been spent in this mailing list in lawyerly talk such as whether accounts is more appropriate than account.

The other issue is that some of us consider the completely unnecessary and gratuitous use of duplicate-copy/quoted-printable/HTML email somewhere among aggressive, offensive, and a security attack. In purely text contexts like this mailing list QP/HTML never contributes to an impression of technical accuracy and relevance of whatever message it enciphers. Then there is the use of Microsoft's XML flavor of HTML mail ...

Vernon Schryver [EMAIL PROTECTED]
Re: New Last Call: 'Tags for Identifying Languages' to BCP
On 13 Dec 2004, at 18:34, Peter Constable wrote: 3. Re ISO 8601 time/date format: What is used in the registry is dates expressed in the format YYYY-MM-DD. It was agreed that it would be better to identify the format precisely rather than make the generic reference to ISO 8601. Why not require dates to be formatted as per RFC 3339? In general, YYYY-MM-DD is ambiguous unless a timezone is specified. Joe
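The suggestion to pin the registry date format down precisely can be illustrated with a short sketch. It assumes the registry dates are meant to follow the RFC 3339 "full-date" shape (four-digit year, two-digit month, two-digit day); the function name `parse_full_date` is hypothetical and is not taken from the draft or from RFC 3339.

```python
import re
from datetime import date

# Assumed shape: RFC 3339 "full-date", i.e. YYYY-MM-DD with fixed widths.
FULL_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def parse_full_date(text):
    """Return a datetime.date for a strict YYYY-MM-DD string, else None."""
    match = FULL_DATE.fullmatch(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # date() rejects impossible calendar dates such as month 13.
        return date(year, month, day)
    except ValueError:
        return None
```

A date on its own still carries no time zone, which is the ambiguity raised above; a registry that needed to order events within a single day would have to store a full date-time with an offset instead.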
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-10 22:37 From: John Cowan [EMAIL PROTECTED] Bruce Lilly scripsit: It's not clear to me that the proposal will provide protection against the whims of politicians. If the definition of CS as a country code changes again under the proposed scheme, how is one to determine specifically what some archived language-tag referred to at some point in time? I'm not particularly concerned about that problem, as I am resigned to instability associated with anything specified by politicians (and that includes the UN region codes). The U.N. Statistics Division are only politicians in the sense that IETF WG members are. They are, in fact, statisticians. Their track record for stability is considerably longer than the IETF's. I hope that I need not repeat any of the well-known remarks about statistics. Nor that I need point to the many uses by politicians of statistics (and statisticians) for political purposes. Moreover, the point is that countries do change, and that use of country codes (as provided for in RFC 3066 and in the proposed draft) carries with it the inherent instability which is characteristic of politics. A quest for stability of countries seems Quixotic and oxymoronic. According to the principle of stability as that term is used in defense of the draft, I suppose we're all intended to refer to Malawi as Rhodesia because that's what it (in part) was called 50 years ago, or that we're supposed to ignore the breakup of the USSR, Yugoslavia, etc., the reunification of Germany, etc. A related problem with the use of country codes in language tags is that there is not necessarily an inherent relationship between language and country borders. The borders of Germany have changed many, many times. 
If one is referring to the German language as spoken by inhabitants of Alsace, using country codes would imply that that same language spoken by the same people would have been tagged at various times as de-DE and de-FR according to where the France-Germany border happened to have been determined by politicians of the time. That strikes me as being a rather silly way to tag language, but that's the precedent set by RFC 1766. As far as I can tell, the draft doesn't really deal with the issue of changing borders or changing country names -- it merely pretends that these things don't happen by attempting to declare a snapshot of the status at some point in time as being valid for all time. But if the proposed new registry's description of CS says foo and the ISO standard code list says bar, what's an implementor supposed to present to a user as *the* description associated with CS? The former. That's the whole point of having a registry. But the user has indicated that he speaks French, and the proposed registry contains a description in English only. Where is the implementor supposed to get the *official* translation for display? N.B. under the current (RFC 3066) situation, the definitive ISO lists provide an official description in French. One possibility would be two description fields. Why two? There are now two in the ISO lists (and, as noted, in the UN list). I have no objection to more, but I object to a reduction. The text accompanying the new last call states: This specification addresses each of these issues with a simple, elegant design that is compatible with existing language tags and implementations. and One concern that is crucial to acceptance of the new language tag design is how it works with existing implementations of RFC 3066 and how existing implementations will interact with implementations of the newer language tags. 
and It is important to recognize that all language tags that were valid under the existing RFC 3066 will remain valid, with their meanings intact, under this specification. I have an implementation which (in accordance with RFC 3066) uses the official ISO lists. It has provision for displaying ISO 639 language tags with their descriptions in either of the two languages supported by the official 639 lists, and likewise for the ISO 3166 country codes. The specification of the draft is *NOT* compatible with that existing implementation because it removes the existing functionality of official descriptions in French of language and country codes. As a result of that incompatibility, the newly proposed specification does not work with (at least that one) existing implementation (but I agree that that is a crucial concern). Language tags remaining valid, I presume that the tag sr-CS will continue to mean Serbian as used in Serbia and Montenegro (officially equivalent to Serbe par Serbie et Monténégro) as that is a valid RFC 3066 language tag and its corresponding meaning... but I can see no evidence of that in the draft -- indeed it appears that the draft would change that meaning significantly. There are 6000 languages spoken on Earth, of which perhaps 600 have a standard written form. ISO 639
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: Sat, 11 Dec 2004 12:14:42 -0800 From: Randy Presuhn [EMAIL PROTECTED] Subject: Re: Ietf-languages Digest, Vol 24, Issue 5 To: [EMAIL PROTECTED], [EMAIL PROTECTED] Message-ID: [EMAIL PROTECTED] Hi - From: Bruce Lilly [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Friday, December 10, 2004 4:54 PM Subject: Re: Ietf-languages Digest, Vol 24, Issue 5 ... Eliminating bilingual descriptions for the language, country (and UN region) codes leaves implementors in a quandary. ... Huh? These are language TAGS. If, for some reason, some implementor thought it made sense to display one of these in a localized form (rather than just using them to determine what locale, etc. should be used in rendering some text) there's no requirement that the English-language country names that appear in the registration be used. That's not the point. The point is that under RFC 3066, the bilingual ISO language and country code lists are considered definitive. An implementor can (and has) therefore use those lists for (e.g.) providing users with menus (in either language) from which a language or country code may be selected. By declaring the ISO lists no longer definitive, and by providing only English descriptions of the codes in the proposed revised registry which would be used instead of the ISO lists, the draft proposal deprives implementors of being able to provide that functionality (viz. an official description in French of codes). Indeed, a UI could just as well draw a map as display a name. That would be awfully difficult for a character-based UI, and would not be useful for language codes. Nor would it be helpful for users who lack map-reading skills, but who recognize Allemagne when they see it.
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-11 00:52 From: Mark Davis [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] CC: [EMAIL PROTECTED] The ABNF is an expression of the grammar that describes the set of all valid tags. No, this is simply incorrect. You cannot expect that any implementation that simply does the ABNF is conformant. I made no such claim. I do claim that if the ABNF contradicts the normative text, as is the case in your draft w.r.t. acceptance of several constructs not permitted by RFC 3066 ABNF, that there is an error in either the normative text or the ABNF. There are a great many constraints on the tags that are not in the ABNF grammar, that are clearly required in any reading of the text. Most of these *cannot* be encompassed in any ABNF grammar. If your claim is that the ABNF cannot express a grammar consistent with the RFC 3066 ABNF, that is clearly false. There are a few that could be expressed in the ABNF; some at little cost, some with a great deal of complication. Are you claiming that it is unduly difficult to make the ABNF match RFC 3066's? This is not a technical problem for the draft. It is a problem due to the conflict between the ABNF and the text. It is a problem because it opens a loophole for future revisions to formalize content which is incompatible with RFC 3066 implementations. as reasonable as the current worst-case of 11 octets. Also simply untrue. You seem not to be reading all the messages on this subject. Look at the ABNF for RFC 3066. There is *no* limit in the ABNF there! The draft proposes closing RFC 3066-style registrations. Show me a registered RFC 3066 language tag longer than 11 octets. Show me a general-use (i.e. not private-use) RFC 3066 language tag which is too long to be used in an RFC 2047/2231 encoded-word. Show me a general-use RFC 3066 language tag which is too long to fit on an RFC 2822/3282 Content-Language header field line.
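For readers following the ABNF argument, here is a minimal sketch of what "checking against the RFC 3066 ABNF alone" amounts to. The regular expression transcribes the RFC 3066 grammar (a primary subtag of 1 to 8 letters, then any number of subtags of 1 to 8 letters or digits); the helper names `is_rfc3066_syntax` and `octet_length` are hypothetical. It tests syntax only and says nothing about whether a tag is registered or meaningful, which is exactly the gap the messages above disagree about.

```python
import re

# RFC 3066 grammar, transcribed as a regular expression:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

def is_rfc3066_syntax(tag):
    """True if the tag matches the RFC 3066 ABNF (syntax only)."""
    return RFC3066_TAG.fullmatch(tag) is not None

def octet_length(tag):
    """Length of the tag in octets; tags are restricted to US-ASCII."""
    return len(tag.encode("ascii"))
```

Note that the grammar places no bound on the number of subtags, so a string of many 8-character subtags passes this check; under RFC 3066 any practical bound came from the registration process rather than from the syntax.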
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-11 11:53 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] Our disagreement amounts to a basic question of whether parsers should be written based on the ABNF alone, or based on the ABNF plus other constraints provided in the spec. Clearly, I think anyone writing a parser should consider other constraints as well. No, I agree that a parser should take normative text into account, but I feel that there should be a reasonable effort made to make the ABNF agree with that normative text -- otherwise there's little point in providing ABNF. As mentioned, the limit is imposed by other tight constraints on 'grandfathered'; you have already identified that the longest registered tag under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be at most 11 octets in length. But the constraints probably aren't as tight as you believe; the draft specifically permits a future revision to allow a primary subtag longer than 8 octets, or not purely alphabetic, etc. a de-facto upper limit of 11 (subject to change if new tags are registered before the proposed spec is accepted). We're agreed on that, for the present draft, but apparently Mark Davis disagrees. And I am concerned about the loophole left for future revisions. We could impose some upper limits on these things... That leaves the extension portions' length at up to 25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts of a tag into account! That's way too long (the RFC 2047 limit for an encoded-word is 75 octets, including charset tag, some text, and some syntactic glue in addition to the language tag). The problem already exists in RFC 3066. Even apart from private-use tags, tomorrow someone could request a registration for a tag that's 87 octets long, and there's nothing in RFC 3066 that would prohibit acceptance. 
One would hope that under RFC 3066 rules, the reviewer, a list subscriber, or an Applications Area Director would recognize the conflict with RFCs 2047/2231 and would object. If indeed that were to happen literally tomorrow, I am quite sure that an objection would be made. The situation is quite different under the draft proposal, where registration of a complete tag is not required, and where there are no upper bounds on length of a tag. So, I think Bruce has identified a valid issue here. I personally would not have characterized it as greatly exacerbating, though, IMO, an increase from 11 octets worst-case, which is tolerable for constructing RFC 2047/2231 encoded-words, to 1850 octets, which exceeds by a large margin what can be handled in a Content-Language or Accept-Language message header field, constitutes greatly exacerbated. Repeating my previous point, RFC 3066 doesn't stop a registered tag from being 10^100 octets in length. RFC 3066 provides a registration mechanism that can be trusted to prevent that; in particular, the Applications Area Directors are supposed to look out for issues affecting the core Internet applications protocols. I suggest that wording be added to the draft giving a strong recommendation to users that they not use tags the complete length of which exceeds 75 characters. 75 octets would be too large for a language-tag used in an encoded word (perhaps different limits could be specified for different uses, but one would have to be careful about implicit re-use between applications). An encoded-word has the form: =?charset*language-tag?encoding?text?= and is limited to a total of 75 octets. Eliminating the syntactic glue (7 octets, unbracketed above) leaves a total of at most 68 octets for text, charset, encoding, and language-tag. There are at present two encodings, specified with 1-octet tags. Assuming that longer encoding tags are not required, that leaves 67 octets for charset, language-tag, and text.
The text must be at least four octets in order to accommodate B encoded text, leaving 63 octets at most for charset and language-tag (ideally, one would prefer to leave more room than that for text). It is guaranteed (in theory, if not in practice) that there will be a charset name of no more than 40 octets for each charset, but that is not necessarily the preferred name (there has been some discussion about possibly reducing that limit). That leaves about 23 octets for a language-tag as an upper bound for use in an encoded-word. Obviously that hasn't been a problem in practice to date; the longest registered language tag is less than half that length. By deferring to the bilingual ISO lists for language and country tags, 3066 at least provided a minimal degree of internationalization. By explicitly limiting description fields to English and restricting the charset to US-ASCII, the draft proposal takes a giant leap backwards. The US-ASCII limitation existed in RFC 3066, so is not new. No, I'm talking about the character set of
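The octet budget worked out in this message can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation as stated above (75-octet encoded-word limit, 7 octets of syntactic glue, a 1-octet encoding tag, at least 4 octets of text, a charset name of up to 40 octets); the function names are hypothetical.

```python
ENCODED_WORD_MAX = 75  # total octet limit for an encoded-word, as cited above

def language_tag_budget(charset_len=40, encoding_len=1, min_text_len=4):
    """Octets left for the language tag in =?charset*language-tag?encoding?text?="""
    # Syntactic glue: "=?" + "*" + "?" + "?" + "?=" is 7 octets.
    glue = len("=?") + len("*") + len("?") + len("?") + len("?=")
    return ENCODED_WORD_MAX - glue - encoding_len - min_text_len - charset_len

def fits_in_encoded_word(charset, tag, encoding, text):
    """True if the assembled encoded-word stays within the 75-octet limit."""
    word = f"=?{charset}*{tag}?{encoding}?{text}?="
    return len(word.encode("ascii")) <= ENCODED_WORD_MAX
```

With the default arguments the budget comes to 23 octets, matching the upper bound derived above; a shorter charset name such as us-ascii leaves correspondingly more room.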
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Bruce == Bruce Lilly [EMAIL PROTECTED] writes:

Date: Sat, 11 Dec 2004 12:14:42 -0800 From: Randy Presuhn [EMAIL PROTECTED] Subject: Re: Ietf-languages Digest, Vol 24, Issue 5 To: [EMAIL PROTECTED], [EMAIL PROTECTED] Message-ID: [EMAIL PROTECTED] Hi - From: Bruce Lilly [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Friday, December 10, 2004 4:54 PM Subject: Re: Ietf-languages Digest, Vol 24, Issue 5 ... Eliminating bilingual descriptions for the language, country (and UN region) codes leaves implementors in a quandary. ...

Huh? These are language TAGS. If, for some reason, some implementor thought it made sense to display one of these in a localized form (rather than just using them to determine what locale, etc. should be used in rendering some text) there's no requirement that the English-language country names that appear in the registration be used.

Bruce> That's not the point. The point is that under RFC 3066, the bilingual ISO language and country code lists are considered definitive. An implementor can (and has) therefore use those lists for (e.g.) providing users with menus (in either language) from which a language or country code may be selected. By declaring the ISO lists no longer definitive, and by providing only English descriptions of the codes in the proposed revised registry which would be used instead of the ISO lists, the draft proposal deprives implementors of being able to provide that functionality (viz. an official description in French of codes).

Programming lore has the rule of zero, one or infinity; it goes by many other names but the concept is in part that by the time you need more than one of something, you'll probably need a lot of that thing. Language descriptions seem to fit this rule fairly well. By the time we need to support multilingual language descriptions, we'll need more than just English and French.

That means implementers today already have to deal with the fact that they only have some of the language descriptions they need from definitive standards. They will already have to get descriptions for other languages. Since they are already using non-definitive language descriptions, implementers can feel free to take the French descriptions from the ISO standard for the many cases where the IANA registry and ISO standard overlap.

Why is two definitive languages better than one definitive language and one set of descriptions from an ISO standard?

--Sam
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-11 11:59 From: JFC (Jefsey) Morfin [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] Gentlemen, I see several points discussed here which are/are not of the same order and seem to be confusing the issue. 1. the discussion creeps from Harald's RFC 3066 to Multilingual Internet. It seems strange to discuss byte oriented details without having first a Multilingual framework telling what is the scope of the discussion and its implications (which are certainly major) on the whole Internet architecture. I submit that an IAB guidance is first necessary. Before going any further a true WG-Multilingualism should be created and open to everyone (a private IETF-Language list should be an interim situation towards such a WG) There is in fact an ietf-languages list; RFC 3066 and the draft under discussion give its submission mailbox as [EMAIL PROTECTED], which makes finding the real list an exercise since IANA's web site makes no mention of any mailing lists. I made an educated guess that I might find the list at alvestrand.no, and indeed the list submission mailbox is [EMAIL PROTECTED], and the list archive is available at http://www.alvestrand.no/mailman/listinfo/ietf-languages Neither RFC 3066 nor the draft provides any instruction for joining the mailing list, and from the remarks above it should be clear that IANA's web site provides no clear clue either. 2. I see quoted RFC 3066bis as a document. The RFC Editor seems unaware of that RFC? Where can I find it? It is apparently an unofficial term for the Phillips draft mentioned in the new last call and to which you have repeated the URI. 3. there are at least four different levels: - what is Multilingualism vs. vernacularism (there are 6000 human languages but a standard should be able to support non scripted and computer generated and past languages, which may lead to millions of references). One should then consider different types of tags for different uses -- a tag for a non-scripted language makes no sense in an RFC 2047/2231 encoded-word, which is strictly text. - vernacular granularity has nothing to do with geography and countries. True in general; but can we reverse the precedent set by RFC 1766? The way this inserts into the general digital convergence (is the IANA the proper register?). [...] The same as the IANA is not in the business of defining countries (Jon Postel, RFC 1591) it should not be in the business of defining languages. The draft in question apparently seeks to get IANA into the business of defining countries (and languages), usurping those roles from ISO (as also noted in RFC 1591). I also submit that IANA is not the proper place anymore to support such a Register. Experience has shown that IANA (now a function of ICANN) is subject to controversies in this or in parallel real life areas: ccTLD delegation, ccTLD entries in the root file, accepted MINC reaction to the Polish non concerted introduction of Arabic, Russian and Hebraic tables, ICANN strategy for internationalized rather than multilingual TLDs, etc. I also submit that UNESCO, MPEG or other standard/cultural organizations involved in the daily reality (universities, editors, posts, governments, copyrights, WIPO, etc. etc.) are more concerned and may make their own standard prevail after an unnecessary and harassing dispute. It seems that any semantic able to support open sub-tags wherever they originate from, is useful. Going any further would push in favor of a less and less [unilingual or internationalized] network centric market against a market evolution toward user centric [multilingual/multicultural] networked relations [P2P, VoIP, NAT, coreboxes, OPES, etc.]. Good points all, though I do sympathize with the concern about loss of information about what a tag meant at a given time due to changes in the ISO lists. I would support provision for a definitive time-stamped registry of changes of some sort; ideally that would be provided by ISO as part of (or a supplement to) the lists, and I would be quite surprised if ISO were not receptive to such a suggestion if made appropriately and with a clear indication of the problems.
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 15:31 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly Moreover, the point is that countries do change, and that use of country codes (as provided for in RFC 3066 and in the proposed draft) carries with it the inherent instability which is characteristic of politics. A quest for stability of countries seems Quixotic and oxymoronic. According to the principle of stability as that term is used in defense of the draft, I suppose we're all intended to refer to Malawi as Rhodesia because that's what it (in part) was called 50 years ago, or that we're supposed to ignore the breakup of the USSR, Yugoslavia, etc., the reunification of Germany, etc. That is not at all the aim here wrt stability; rather, the aim is that a symbolic identifier used for metadata in IT systems not change because some government on a whim says, We would now prefer to use 'yz' rather than 'xy' to designate our country. If by international agreement, 'yz' becomes the designation for that country, then it is rather silly to stick one's fingers in one's ears and shout NA-NA-NA-NA-NA I don't want to hear you. A more rational approach would be to say that before such-and-such a date/time the designation was 'xy' and after that date/time (until further notice) it is 'yz'. As I have pointed out, politicians change the definitions of time zones frequently, and those who have to deal with time zone issues have found a way to cope with such change without trying to declare international standardization organizations irrelevant. Sure, there will be changes that we need to deal with; but there's no reason to subject all implementations, users and data to changes that are purely cosmetic changes to things that are not designed to be read by humans. Designed or not, country codes *are* read by humans; they appear in top-level domain names. 
Currently the ISO 639 2-letter codes mean the same thing as the last component of a domain name and as the second component of a language-tag. It's rather silly to change that correspondence simply because a few people are piqued that international agreement has been reached to change a few 2-letter codes. A related problem with the use of country codes in language tags is that there is not necessarily an inherent relationship between language and country borders. That is not what country IDs within a language tag is intended to suggest. In fact, if there were inherent relationships, we probably would never have needed to use country IDs in a language tag. I submit that it was never a good idea. Language evolves over time, even in a given place. The borders of Germany have changed many, many times. If one is referring to the German language as spoken by inhabitants of Alsace, using country codes would imply that that same language spoken by the same people would have been tagged at various times as de-DE and de-FR according to where the France-Germany border happened to have been determined by politicians of the time. That strikes me as being a rather silly way to tag language, but that's the precedent set by RFC 1766. I agree that that's a silly way to tag that language; I disagree that RFC 1766 suggests I should tag it that way. RFC 1766 (and 3066) leave you little choice; if you wish to indicate a region, you either have to do it with ISO 639 codes or you have to register a separate tag (no separate tag for German as spoken in Alsace exists). Never mind the shortcomings of that particular example; consider de-DE -- does that mean Germany as it exists today, West Germany as it existed 25 years ago, Germany as it existed in the 1930s, the 1900s, ...? 
As far as I can tell, the draft doesn't really deal with the issue of changing borders or changing country names -- it merely pretends that these things don't happen by attempting to declare a snapshot of the status at some point in time as being valid for all time. That may be your reading of the situation, but it is not how it is seen by those of us who have been working on this spec and examining these issues closely. As far as I can tell, the draft pretends that the meaning of CS hasn't changed, and would in fact change the meaning of the currently valid RFC 3066 language tag sr-CS. But the user has indicated that he speaks French, and the proposed registry contains a description in English only. Where is the implementor supposed to get the *official* translation for display? N.B. under the current (RFC 3066) situation, the definitive ISO lists provide an official description in French. Neither RFC 1766 or RFC 3066 has ever presented official translations; Both defer to the ISO lists for definitions (not translations) of the various codes. this is no different for RFC 3066bis. It is very different;
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 15:33 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly The point is that under RFC 3066, the bilingual ISO language and country code lists are considered definitive. That is nowhere stated or even suggested in RFC 3066. RFC 3066 section 2.2 states, in part: - All 2-letter subtags are interpreted according to assignments found in ISO standard 639, Code for the representation of names of languages [ISO 639], or assignments subsequently made by the ISO 639 part 1 maintenance agency or governing standardization bodies. and has a similar statement regarding ISO 3166. The phrase "interpreted according to assignments found in" certainly sounds as if the ISO lists are considered definitive for their respective categories of subtags, since their interpretation is specified as that given in those lists. I don't see how the RFC 3066 text can be interpreted otherwise.
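[Editorial illustration] The deferral to external lists described in the message above can be sketched roughly as follows. This is only a minimal sketch of the length-based rules in RFC 3066 section 2.2 as I understand them; the function names `primary_source` and `second_source` are hypothetical, and the code only names which list would be consulted, it does not validate that a code is actually assigned.

```python
def primary_source(subtag):
    """Name the authority that defines a primary (first) subtag."""
    if len(subtag) == 2:
        return "ISO 639 (2-letter codes)"
    if len(subtag) == 3:
        return "ISO 639 part 2 (3-letter codes)"
    if subtag.lower() == "i":
        return "IANA registration"
    if subtag.lower() == "x":
        return "private use"
    return "not assigned"


def second_source(subtag):
    """Name the authority that defines a second subtag."""
    if len(subtag) == 2:
        return "ISO 3166 alpha-2 country code"
    if 3 <= len(subtag) <= 8:
        return "IANA registration"
    return "not assigned"


if __name__ == "__main__":
    first, second = "sr-CS".split("-")
    print(primary_source(first), "/", second_source(second))
```

The point of the sketch is that, under these rules, an implementation cannot say what `sr` or `CS` means without going to the ISO lists; the RFC itself carries no definitions for them.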
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 15:34 From: John Cowan [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] Of course countries change, and then the numeric country codes change as well. The point is that the alpha codes change for political reasons when there has been *no* change in the underlying country: Romania's 3-alpha code changed from ROM to ROU without any change in Romania at all. The CS case is particularly gratuitous, as its denotation changed from Czechoslovakia (a no longer existent country) to Serbia and Montenegro (a newly created country). There is a limited supply of 2-letter codes and the supply of 3-digit codes is only slightly greater. Reassignment of codes from such a limited supply is inevitable. Better to deal with the fact of tides than to try to command the tide not to flow in. As far as I can tell, the draft doesn't really deal with the issue of changing borders or changing country names -- it merely pretends that these things don't happen by attempting to declare a snapshot of the status at some point in time as being valid for all time. No, it attempts to freeze the code-to-country mapping at a single point. New countries or changes in old countries should involve only the additions of codes, not the reuse of old codes. Too late. King Canute commands the tide not to come in, but his feet still get wet. Better to deal with such change appropriately rather than commanding countries (or international standards bodies) not to change. I don't know. Where is the implementor supposed to get the official German, or Catalan, or Mandarin translations? Not in the ISO registry, for sure. To say nothing of the cases where no official translations exist. But I'm not concerned with translations, but with the definitions. And currently the definitions are available in French and English. It might be worthwhile considering the differences in the way languages tags are used, by whom they are used, and for what purpose. 
There may well be a substantial difference between use of a tag to represent an obscure dialect of a dead language in a research paper vs. tagging a piece of text in one of the core Internet protocols such as SMTP. That count does not include dead languages. Whether it includes dialects is a matter of terminology. Fine. The point is that the draft provides for language tags that are so long that they cannot be used with the core Internet protocols. A tag associated with audio media doesn't need a means to indicate script or other orthography -- they're irrelevant for spoken material. RFC 3066's provision for registry worked well. Removing that requirement -- as the draft would do -- necessitates a specific upper bound on tag length that will work with existing core protocols, to replace the reviewer, Area Director, and community review process that ensure that current registered tags work with those protocols.
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 15:55 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] You have not responded to the point that accessibility of source ISO standards is supposed to be a major factor, yet the draft itself clearly indicates otherwise. The source for the statement claiming accessibility as a major factor has been indicated to be the author or authors of the draft. I can't explain why it says what it says; I suggest that you direct that question to the author(s). That is a problem for existing implementations of RFC 3066 tags, which can obtain official, internationally agreed descriptions of the codes in two languages. Descriptions (language names) are beyond the scope of RFC 3066. It is a non sequitur to claim that this draft creates a problem for existing implementations of RFC 3066 on this basis. 3066 refers to interpretation of the codes and defers that interpretation to that given by the ISO lists. One cannot have an interpretation based on the lists without the natural language definitions which are paired with the codes. It is a fact that those definitions are available in two languages in the ISO lists, and that the proposed replacement for the ISO lists would eliminate one of those languages. OK, continuing your hypothetical example and its relationship to language, suppose that there is another civil war and that what now corresponds to US is split into Blue America and Red America. Further suppose that in due course ISO assigns some other code to one of those countries and retains US for the other, and that that happens after the proposed registry is set up with a definition for US and some description referring to the old use. That is a scenario that has been well considered: it would be very bad IT practice to redefine a metadata tag US to have a narrower denotation than it previously did, as that immediately breaks an unknown amount of existing data. 
If ISO were to make such a change in the meaning of US, then IT implementations *absolutely should not* follow suit; the ID US must retain its prior, broader meaning. So long as it is known what definition of US applied at the time, there is no problem. This is dealt with in IT all the time; EST has had many definitions in terms of exact offset from UTC, and when it goes into and out of effect (likewise for other time zones). Yet we manage to be able to state with precision the offset and effective times of EST well into the past, and without declaring that a single value must hold true for all time. I have provided a URI to the time zone data; a similar mechanism could be used to track historical values for ISO language and country codes. Given the existence of such proven technology, there is no need for the incompatible approach outlined in the draft.
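[Editorial illustration] The time-zone-style versioning argued for in the message above could look something like the following. This is a minimal sketch under stated assumptions, not a format proposed anywhere in the thread: the `HISTORY` table and `meaning_at` function are hypothetical, and the dates for CS are approximate and included only to make the example concrete.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical change log: code -> list of (effective_from, meaning),
# sorted by date. A meaning of None marks a period in which the code
# was withdrawn. Dates are approximate and for illustration only.
HISTORY = {
    "CS": [
        (date(1974, 1, 1), "Czechoslovakia"),
        (date(1993, 1, 1), None),
        (date(2003, 7, 1), "Serbia and Montenegro"),
    ],
}


def meaning_at(code, when):
    """Return the meaning of `code` in force on `when`, or None if unassigned."""
    entries = HISTORY.get(code, [])
    effective_dates = [start for start, _ in entries]
    index = bisect_right(effective_dates, when)
    return entries[index - 1][1] if index else None


if __name__ == "__main__":
    print(meaning_at("CS", date(1990, 6, 1)))
    print(meaning_at("CS", date(2004, 12, 12)))
```

With a table of this shape, the date already carried by a message (for example an RFC [2]822 Date field) is enough to resolve which meaning of a code applied, without freezing any single meaning for all time.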
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 17:34 From: Mark Davis [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] CC: [EMAIL PROTECTED] Are you claiming that sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu is nonconformant per some specification in the draft proposal? Clearly not. But x-sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu So what? A private-use tag has to be agreed to by the communicating parties; in this case they'll find that such an unwieldy tag is unusable in an encoded-word and will have to agree to use something more manageable. That's a problem for the parties involved and nobody else, since it doesn't affect the rest of us. That's a different matter from a public tag that everybody is expected to be able to use. is already absolutely conformant with the current RFC 3066. And the current RFC 3066 clearly permits the registration of something as long as sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu (although of course this particular combination would certainly never get in). I agree that that would never be registered -- because of the review process which is part of RFC 3066. But the draft under discussion has no mechanism to prevent it, unlike 3066.
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 19:20 From: Mark Crispin [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] On Sun, 12 Dec 2004, Bruce Lilly wrote: If by international agreement, 'yz' becomes the designation for that country, then it is rather silly to stick one's fingers in one's ears and shout NA-NA-NA-NA-NA I don't want to hear you. What is silly is saying that every language tag has to have a date/time attribute associated with it so that computer software managing that text knows the language of that text. In the specific cases of the core Internet protocols that I have mentioned, there *is* a date/time attribute in the form of an RFC [2]822 Date field. If we're talking about some file stored on some machine, every OS that I know of has a date/time stamp associated with that file. If you have something else in mind, a concrete description and/ or example might help. It is a disaster for language identifiers to get recycled. Something has to make those identifiers unique. Your notion will force the inclusion of a date/time stamp in language tags, to restore the uniqueness that you are so excruciatingly eager to abolish. I'm not eager to abolish uniqueness. There never was any guarantee that codes would never change. Both RFCs 1766 and 3066 specifically mention changes as a fact of life. Never mind the shortcomings of that particular example; consider de-DE -- does that mean Germany as it exists today, West Germany as it existed 25 years ago, Germany as it existed in the 1930s, the 1900s, ...? For the 98% case, it does not matter at all. But it does matter if, one day, DE becomes Denmark. In either case, to understand precisely what geographical area is referred to requires knowing the date to more or less degree of accuracy. As far as I can tell, the draft pretends that the meaning of CS hasn't changed, and would in fact change the meaning of the currently valid RFC 3066 language tag sr-CS. No, it restores the previous meaning of sr-CS. 
But what of the current meaning under the current standard (RFC 3066 + ISO 639 + ISO 3166)? Surely the draft would change the meaning of that valid RFC 3066 language-tag. It is very different; under the proposed draft, there is only an English definition, somebody wishing to provide a French definition finds that he has none and must resort to an unofficial translation. Why is the situation for French different from somebody wishing to provide a Lower Slobbobian definition? French is an official language used by the ISO in its publications. Lower Slobbobian is probably about as meaningful as BLURDYBOOP. So where are the French definitions? Ask a person who is bilingual in English and French to provide one. That would lack definitiveness which characterizes the ISO lists. Well, sure. But the name is an important thing by itself. It is rather pointless to ask a user to indicate the language of a piece of text by selecting from a list AB, ACE, ACH,..., ZHA, ZUL, ZUN -- the user doesn't normally refer to languages by codes. It's quite a different matter to ask the user to select from Abkhaze, Aceh, Acoli,..., Zhuang (Chuang), Zoulou, Zuni. Abkhaze, Aceh, Acoli,..., Zhuang (Chuang), Zoulou, and Zuni are not language tags. So what's your point? They are the human-readable names corresponding to codes. For interoperability, it is insufficient to label any and all languages as ZZ with no definition of what ZZ means. Moreover, it is necessary for two (or more) communicating parties to *agree* on the meaning of ZZ; that is done by assigning the code ZZ to an agreed-upon name. The code ZZ is nothing more than shorthand for that agreed-upon name. If one produces some text in the BCP 18 sense of text (spoken, written, signed, etc.), it is useful to indicate the language of that text; languages are known to humans by names of languages -- the codes are, as noted, merely shorthand for those names. 
Likewise, somebody presented with some text may desire or need to know the language of that text; informing that person that the language has code QZ is unlikely to mean anything to most people -- only the name corresponding to the shorthand code is likely to be meaningful to persons other than those involved in standardizing the codes. Note that the RFC 3066 specifies a registry that does not include French language names. I suggest that this issue should be dropped. Yes, the current IANA registry has that problem for the non-ISO-based tags only. If the registry is to be changed to subsume ISO codes as well, that defect should be remedied. Why is it a problem? Why is it a defect? Because it unnecessarily reduces by 50% the information content currently available. On the contrary, it is preposterous to suggest that codes will be attached to text by magic Here is where you are misled. Many of these tags are embedded within the text itself. That text
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-11 10:48 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] -Original Message- From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly My comments are in response to the New Last Call made on the ietf-announce list. They are in response to the text which accompanied that new last call and the text of draft-phillips-langtags-08.txt dated November 2002. The specific claim that accessibility has been a problem was made in the text accompanying the new last call I don't know where the statement accompanying the announcement came from, According to the New Last Call issued by the IESG Secretary, the text is Author's discussion of drivers for this work. You singled out that one point to comment on as though it were the main factor. I mentioned a matter which was repeatedly indicated as a factor for existing implementations and with which I strongly disagree. There are points with which I do not necessarily disagree, and there are points with which I have not yet had time to study in detail, due to the surprise of the announcement of an impending decision (I do not understand why no announcement of work on an RFC 3066 replacement was made to the ietf-822 list, especially as the core Internet protocols discussed there are affected by this draft), the shortness of the time before a decision (deadline for comments was given as 5 Jan 2005), and the impending holidays. [regarding the proposed registry vs. internationally- standardized ISO lists for subtag definitions] It is certainly the case that only it should be consulted for determining what sub-tags are valid with what denotation, which was the intent. That is a problem for existing implementations of RFC 3066 tags, which can obtain official, internationally agreed descriptions of the codes in two languages. By looking in the sub-tag registry. 
If ISO changed the meaning of US to something other than what it is now, its meaning for purposes of use in an IETF language tag would not change, because it would remain stable in the sub-tag registry. You would be fairly well protected against the whim of politicians. OK, continuing your hypothetical example and its relationship to language, suppose that there is another civil war and that what now corresponds to US is split into Blue America and Red America. Further suppose that in due course ISO assigns some other code to one of those countries and retains US for the other, and that that happens after the proposed registry is set up with a definition for US and some description referring to the old use. Now suppose that one wishes to produce an appropriate language tag for the text moral values (which clearly has different meaning in Blue America (telling the truth, admitting to mistakes, etc.) and in Red America (imposing totalitarian control over others)). How specifically would the proposed registry handle such a change in the meaning of US, and how would the registry help differentiate the meaning of a 1990's en-us tag to that of the hypothetical time described? I suspect that it won't help, and I recommend review of how another artifact of politics (viz. time zones) are handled by the (unofficial) database of time zones maintained at ftp://elsie.nci.hih.gov/pub/tzdata2004g.tar.gz. The format used handles multiple changes in definitions that went into effect at different times, something that the proposed registry doesn't appear to handle. But if the proposed new registry's description of CS says foo and the ISO standard code list says bar, what's an implementor supposed to present to a user as *the* description associated with CS? The *meaning* of the sub-tag is determined by the sub-tag registry. If you want human-readable descriptors, The draft says that the proposed registry will contain a description, in English (only). 
you already have to look beyond the ISO standards for anything more than English and French But existing RFC 3066 implementations can get official descriptions in *both* of those languages; the proposal would adversely affect those existing implementations by eliminating the French description. Of course, it is a more serious defect of the proposal that it would fail to reflect internationally-agreed codes and would fail to keep pace with changes... it would not be new that you have to look beyond the registry itself to decide what human-readable descriptors you should provide in a product. It would be new that one could not find a standard (i.e. official) French-language description in the list of codes. One possibility would be two description fields. But the registry would need a charset closer to ISO-8859-1 than to ANSI X3.4 as currently specified. Or an encoding scheme. Personally, I don't see the value in something like that. Given the intent to have a registry that can be machine-readable, changing
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 13:00 From: Mark Davis [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] CC: [EMAIL PROTECTED] Your claim that the RFC 3066 ABNF itself has a restriction in length is also clearly false. I will quote that again since you seem somehow not to have seen it: I made no such claim; indeed it was I who pointed out that RFC 3066 *theoretically* permits an infinite-length tag. On that basis alone (even if you missed the fact that I am an implementor of RFC 3066 language tags) you can be sure that I am well aware of the RFC 3066 ABNF. Both documents establish many further limitations on the contents of language tags in the text of each document. Ignoring those stated limitations will, in both documents, result in nonconformant language tags. Are you claiming that sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu is nonconformant per some specification in the draft proposal? It is certainly too long to be used in an RFC 2047/2231 encoded-word. It is much longer than any registered RFC 3066 language tag, and the draft proposes removing full tag registration procedure restrictions as well as decoupling use from registration that would combine to permit such an abomination.
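[Editorial illustration] The length concern in the message above can be checked with simple arithmetic. RFC 2047 limits an encoded-word to 75 characters including delimiters, and RFC 2231 places the language tag inside the encoded-word in the form `=?charset*language?encoding?text?=`. The sketch below is only a rough illustration; the function name `room_for_text` and the choice of `utf-8` as charset are assumptions made for the example.

```python
MAX_ENCODED_WORD = 75  # RFC 2047 limit, delimiters included

LONG_TAG = ("sr-CS-891-boont-gaulish-guoyu"
            "-boont-gaulish-guoyu-boont-gaulish-guoyu")


def room_for_text(charset, language, encoding="Q"):
    """Characters left for encoded text in =?charset*language?encoding?text?=."""
    overhead = (len("=?") + len(charset) + len("*") + len(language)
                + len("?") + len(encoding) + len("?") + len("?="))
    return MAX_ENCODED_WORD - overhead


if __name__ == "__main__":
    for tag in ("en-US", LONG_TAG):
        print(len(tag), room_for_text("utf-8", tag))
```

A negative result means the fixed parts of the encoded-word alone already exceed the limit, so not even one character of text can be carried with that tag.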
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-12 20:57 From: Peter Constable [EMAIL PROTECTED] To: [EMAIL PROTECTED], [EMAIL PROTECTED] From: [EMAIL PROTECTED] [mailto:ietf-languages- [EMAIL PROTECTED] On Behalf Of Bruce Lilly That is not at all the aim here wrt stability; rather, the aim is that a symbolic identifier used for metadata in IT systems not change because some government on a whim says, We would now prefer to use 'yz' rather than 'xy' to designate our country. If by international agreement, 'yz' becomes the designation for that country, then it is rather silly to stick one's fingers in one's ears and shout NA-NA-NA-NA-NA I don't want to hear you. That misses the point entirely. The point is that IDs used by political administrations may change for any number of reasons, and those administrations may have no qualms with such changes; For such changes to become enshrined in an ISO standard requires a bit more than a mere whim on the part of one party; in the case of the particular ISO standards under discussion, it requires convincing the duly appointed maintenance authority to make the change. but in IT systems, we cannot afford changes that break existing implementations and data. Any implementations that depend on country/language codes never changing are by definition broken implementations, since there was never any guarantee that codes would never change. Change happens, and IT knows how to cope; it's a versioning problem, and that's not a particularly difficult problem. Now I fully agree that in hindsight the ISO and its appointed MAs could have provided a better record of changes. If for whatever reason ISO and the UN decided that US should be used to designate the country of France, I doubt you'd expect every software vendor to update all of their deployed installations to use fr-US instead of fr-FR, and for every user to go through every data repository they manage to make such changes in their data. 
The only way that would be likely to happen would be if there were no longer a US *and* if the ISO and UN representatives of France were to initiate a request for such a change. One would presume that they would have good reason to do so, and could explain said reasons in order to convince their ISO and UN counterparts to agree to the change. Under those hypothetical circumstances, I can only assume that software vendors who care about such matters would either agree with the hypothetical reasons or would have acted to convince those in favor of the change of reasons to avoid the change. And while I would not expect users to retroactively change documents any more than I would expect coins and paper money to be reissued with old dates but new designations of country name, I would expect that as of the agreed-upon effective date of the change that new documents would be prepared in accordance with the new standard. It's difficult to be more precise about such a wild hypothetical, but consider similar changes made to time zones... The people that maintain time zone definitions may have their means for changing times; that's fine for them. They are not dealing with the same concerns as we are dealing with. Sure they are; it's another instance of the same sort of versioning problem, with the same root causes, viz. items which are changed (more frequently than some would like) by politicians. The group here that has focused specifically on language-tagging issues for several years has evaluated issues that affect language tags and the impact of changes and has decided what is best practice for *this* domain, and it is to maintain stability of data rather than cater to whims of political administrations. Now that the horses have all run away, you'd better make sure the stable doors are locked. :-) There was never any guarantee of stability of country codes or of language codes. 
Declaring at some time in the future that today's meaning of sr-CS never meant what it in fact does mean doesn't create stability; it creates instability -- it doesn't make the versioning problem go away; it adds yet a third version to the existing two. Designed or not, country codes *are* read by humans; they appear in top-level domain names. Currently the ISO 639 2-letter codes mean the same thing as the last component of a domain name I think you mean ISO 3166 2-letter codes. Yes, my error. and as the second component of a language-tag. It's rather silly to change that correspondence simply because a few people are piqued that international agreement has been reached to change a few 2-letter codes. The usability flaw in treating ISO 639 and ISO 3166 as human-readable is evident in the confusion between ja and JP (or is it jp and JA?), and GB vs UK. Without looking I can easily tell that jp and uk are country codes precisely *because* they are well-known as TLDs. As for what is silly, if the UN country ID for Canada changed to CN
Re: New Last Call: 'Tags for Identifying Languages' to BCP
Hi, two problems in draft-phillips-langtags-08.txt : 1 - ISO 3166-1 is dead This memo should not be used in new Internet standards, see http://www.iab.org/documents/correspondance/2003-09-25-iso-cs-code.html A reference to some obscure 1998 edition of ISO 3166-1 doesn't help, would it include TL ? What about the numerous dubious countries in 3166, not the simple cases like CS, EU, or PS, but RB, RC, FX, EH, BX, SF, or NT ? The draft is about languages, an appendix listing relevant country codes copied from an old ISO 3166-1 version (before CS) should be good enough, and future changes could be handled as IANA registry. Where can I find the NH in en-NH ? It's not in the public list http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/iso_3166-1_decoding_table.html?printable=true#AA 2 - Fallback The text explains why en-US-boont matches en-US or en. But it apparently does not match en-boont. That's ugly. If I'd use de-CH-1996, then I want it to match de-CH or de-1996 before a plain de. (de-1996 = new orthography, de-CH = no ß) Another example in the draft is fr-Latn-CA. I've no idea what other scripts are popular in fr-CA, but maybe fr-CA is somewhat different from fr-FX, and then I wouldn't want a match with fr if fr-CA is also available. A counterexample is sr-Latn-YU, a match with sr-YU or sr won't help if it's in fact sr-Cyrl-YU or sr-Cyrl. In that case the priority script before region is okay. In other cases like se-Latn-AX the script is less important than the region. Bye, Frank
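[Editorial illustration] The fallback behaviour objected to in the message above is the usual "drop the last subtag" rule. The following is a minimal sketch of that rule only, not the draft's full matching text; the function names `fallback_chain` and `best_match` are hypothetical.

```python
def fallback_chain(tag):
    """Candidates from most to least specific, dropping the last subtag each step."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def best_match(requested, available):
    """First candidate in the chain that is available (case-insensitive), else None."""
    lookup = {tag.lower(): tag for tag in available}
    for candidate in fallback_chain(requested):
        if candidate.lower() in lookup:
            return lookup[candidate.lower()]
    return None


if __name__ == "__main__":
    print(fallback_chain("de-CH-1996"))
    print(best_match("de-CH-1996", ["de-1996", "de"]))
```

Because only trailing subtags are removed, `de-1996` never appears in the chain for `de-CH-1996`, and `en-boont` never appears in the chain for `en-US-boont`; a request falls through to the bare language even when the variant-only tag is on offer.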
Re: New Last Call: 'Tags for Identifying Languages' to BCP
On Thu December 9 2004 12:23, [EMAIL PROTECTED] wrote: New Last Call: 'Tags for Identifying Languages' to BCP Date: 2004-12-08 17:56 From: The IESG [EMAIL PROTECTED] To: IETF-Announce [EMAIL PROTECTED] Reply to: [EMAIL PROTECTED] The IESG has been considering - 'Tags for Identifying Languages ' draft-phillips-langtags-08.txt as a BCP There have been considerable changes to the document since the initial last call, and the IESG would like the community to consider the changes. In addition, the authors have prepared text describing why this mechanism is needed as a replacement for the existing procedure; it is included below. The IESG plans to make a decision in the next few weeks, and solicits final comments on this action. Please send any comments to the [EMAIL PROTECTED] or [EMAIL PROTECTED] mailing lists by 2005-01-05. The file can be obtained via http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt I have some comments below. They should not be construed as a complete or thorough critique of the draft; they're initial comments based on a quick review of the draft. One overall comment; I'm surprised to hear that this was already at last call -- some notice to mailing lists which are heavily affected by the proposed changes (e.g. ietf-822) would have been nice... Considering the depth and breadth of the specific issues discussed below, I'm not sure that surprise is adequate... This specification, the proposed successor to RFC 3066, addresses a number of issues that implementers of language tags have faced in recent years: [...] * Accessibility of the underlying ISO standards for implementers [...] There are problems with the RFC 3066 definition of generative tags, however. The ISO 639 and ISO 3166 standards are not freely available and evolve over time. Accessibility has not been a problem for this implementor (who, incidentally, was unaware of this draft until the New Last Call). 
ISO 639 language code lists are readily available in HTML-ized English and French via http://www.loc.gov/standards/iso639-2/englangn.html and http://www.loc.gov/standards/iso639-2/frenchlangn.html ISO 3166 country code lists are readily available in plain text in English and French via http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1-semic.txt and http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-fr1-semic.txt The ISO registered code lists are freely available at the URIs given above. This implementor has used those URIs for years without difficulty. The ISO standards themselves are not free, but neither are they required for an implementor to identify the valid codes -- the free lists suffice for that purpose. The largest change in the specification is that it modifies the structure of the language tag registry. Instead of having to obtain lists of codes from five separate external standards (not all of which are easily available), the IANA registry will maintain a comprehensive list of valid subtags that can be used in the generative mechanism in a machine-parseable text format. Contrary to the implicit claim, the ISO documents mentioned above comprise two standards (available in two languages each), not five separate external standards. The availability of those two definitive standards in bilingual forms allows implementors to (for example) construct menus of available language and country code tags in BOTH languages used in ISO standards. The draft proposes declaring those standards effectively irrelevant, being replaced by a single monolingual (English) IANA registry. While it has become fashionable in recent years among some factions within the United States to bash France, the French people, their culture, and their language, it seems inappropriate to extend such bashing to technical standards which supposedly apply in an international context. 
Especially when dealing with the subject matter of language itself. The unavailability of the registered value description in 50% of the languages traditionally used for international standards publication, including the existing ISO 639 and 3166 codes, is a serious defect in the proposal, and a departure from the status quo under RFC 3066 (which directly refers to the bilingual ISO standards as definitive). [N.B. I am not accusing the draft authors of French-bashing; it's just that some of us are a bit more sensitive to Anglo-centricity than others. And it remains a fact that the draft has no provision for bilingual descriptions of any subtag fields. (I note in passing that the UN regional codes newly referenced by this draft are available in HTML-ized (ostensibly) English (though I've never seen an A-ring in English text before...) and French).] It is claimed that: In addition, and very importantly, language tags that are newly defined by this specification are compatible with the ABNF syntax,
Re: New Last Call: 'Tags for Identifying Languages' to BCP
On Fri, 10 Dec 2004 14:46:52 EST, Bruce Lilly said:

Accessibility has not been a problem for this implementor (who, incidentally, was unaware of this draft until the New Last Call). ISO 639 language code lists are readily available in HTML-ized English and French via http://www.loc.gov/standards/iso639-2/englangn.html and http://www.loc.gov/standards/iso639-2/frenchlangn.html ISO 3166 country code lists are readily available in plain text in English and French via http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1-semic.txt and http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-fr1-semic.txt The ISO registered code lists are freely available at the URIs given above. This implementor has used those URIs for years without difficulty. The ISO standards themselves are not free, but neither are they required for an implementor to identify the valid codes -- the free lists suffice for that purpose.

I'm certainly belaboring the obvious (in that the standards in question are basically useless unless at least this subset of information is freely accessible so everybody uses the same values), but is there any statement from the ISO side that this state of affairs (or equivalent access) is going to continue for at least the code lists we need?

(I'd not even ask, except this seems to be the month we spend time worrying about explosive bolts attached to our *own* infrastructure - seems to be a good time to worry about institutional insanity on the part of a totally separate standards organization.. ;)
Re: New Last Call: 'Tags for Identifying Languages' to BCP
RE: New Last Call: 'Tags for Identifying Languages' to BCP
Date: 2004-12-10 20:03
From: Peter Constable [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]

Resuming my comments:

Specifically, the draft allows, and RFC 3066 disallows:

- subtags more than 8 octets in length
- hyphens which do not separate subtags
- zero-length subtags
- primary tags which are not purely alphabetic

Curiously, all of those are permitted by the draft ABNF production grandfathered...

The grandfathered production in the current draft is

grandfathered = ALPHA *(alphanum / "-")

which does permit the sequences claimed by Bruce (except for not-purely-alphabetic primary sub-tags),

No exception. alphanum is ALPHA / DIGIT. In plain English, grandfathered as defined in the draft is a letter followed by any number of letters, digits, and/or hyphens, in any order. And that includes a123-xyz as I initially stated, and clearly 1, 2, and 3 are digits.

syntactically; but the set of tags available for use is constrained by more than the ABNF syntax alone: the acceptable productions for each sub-tag must either be taken from one of the source standards or be registered.

So what? The ABNF is an expression of the grammar that describes the set of all valid tags. If the grammar permits y-, a123-xyz, etc. (and it does) then a parser claiming to parse language tags as defined by that ABNF must be able to parse such tags. That is, the ABNF-specified grammar imposes requirements on parsers. If one doesn't intend to impose such requirements, the ABNF specifying the grammar should be changed accordingly.

This is no different from RFC 3066, so it is no more of a problem in this specification than it was in RFC 3066.

It is a very different grammar from RFC 3066, imposing very different requirements on parsers.

It might be that the wording in 2.2 could be tightened up to eliminate any possible question regarding the source for grandfathered productions.

It's not a matter of wording; the problem is with the ABNF.
Alternately, there's no reason why the grandfathered production shouldn't be composed exactly to match what was used in RFC 3066:

grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I believe I said as much (though one then needs to look at reduce/reduce conflicts implied by the revised grammar):

I see no reason for the ABNF to permit such content as is forbidden by RFC 3066; the actual ABNF for what RFC 3066 permits is contained within 3066, and could have been directly incorporated rather than producing a grandfathered production which opens up several cans of worms.

This vastly overstates the problem. There is no can of worms unless it exists in tags currently available under RFC 3066.

I referred to the additional requirements imposed on parsers, as well as the unlimited tag length permitted.

One defect related to tag length in RFC 3066 is not remedied by the draft; indeed the problem is greatly exacerbated... Unfortunately, a language-tag's length is unlimited by the ABNF in RFC 3066 (due to an unlimited number of subtags) and in the draft... In particular, tags other than private-use tags with more than two subtags require registration under RFC 3066 rules, and it is a trivial matter to determine the longest registered tag. The draft, however, encourages use of more subtags as well as removal of the subtag length upper bound; moreover, it permits infinite numbers of subtags without requiring registration of the resulting complete tag.

Bruce states incorrectly that there is no upper bound on the length of sub-tags.

Look again at the draft definition of grandfathered -- now show me where there's a limit in that production on subtag length.

His other concern, on the overall length of complete tags, is valid, however: in terms of the ABNF syntax for both RFC 3066 and RFC 3066bis, infinite-length productions are possible, but RFC 3066 would require registration of complete non-private-use tags while RFC 3066bis does not.
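The disagreement over the two grandfathered productions can be checked mechanically. The following Python sketch translates each production into a regular expression and runs the tags mentioned in the discussion through both. The regexes are a hand translation (an assumption, not text from either document): ALPHA is taken as A-Z / a-z, alphanum as ALPHA / DIGIT, and the function names are hypothetical.

```python
import re

# Draft -08:   grandfathered = ALPHA *(alphanum / "-")
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# RFC 3066 shape: 1*8ALPHA *("-" 1*8alphanum)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def matches_draft(tag):
    """True if the whole tag is generated by the draft production."""
    return DRAFT_GRANDFATHERED.fullmatch(tag) is not None


def matches_rfc3066(tag):
    """True if the whole tag is generated by the RFC 3066 shaped production."""
    return RFC3066_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    samples = [
        "en-US",         # ordinary tag
        "cel-gaulish",   # registered tag
        "y-",            # trailing hyphen, zero-length subtag
        "a--b",          # hyphen that does not separate subtags
        "a123-xyz",      # primary subtag containing digits
        "abcdefghijkl",  # 12-octet subtag
    ]
    for tag in samples:
        print(f"{tag:14} draft={matches_draft(tag)!s:5} rfc3066={matches_rfc3066(tag)}")
```

Under this translation the draft production accepts all six samples, while the RFC 3066 shaped production accepts only en-US and cel-gaulish. That is the practical meaning of "imposes requirements on parsers": a parser built from the draft ABNF alone has to accept y- and a123-xyz, whatever the prose says about registration.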
Yes, and a quick look at the registry reveals that the longest tag is 11 octets (cel-gaulish).

There are three open doors for infinite-length productions in the ABNF of the current draft:

- unlimited extlang sub-tags
- unlimited variant sub-tags
- the number of possible extensions is limited to 25

The ABNF indicates no such limit.

, but the length of extensions is unlimited

You have missed several others:

1. privateuse length is unlimited (either tacked on after lang etc., or directly as an alternative in Language-Tag)
2. grandfathered, which as already discussed permits unlimited length.

We could impose some upper limits on these things; e.g.

Language-Tag = ... *8("-" extlang) ... *8("-" variant) ... 1*25("-" extension)

I think you mean *25("-" extension), not 1*25...

extension = singleton 1*8("-" 2*8alphanum)

That leaves the extension portions' length at up to 25 * (1