Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-21 Thread Bruce Lilly
  Date: 2004-12-18 23:37
  From: Doug Ewell [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  
 Bruce Lilly blilly at erols dot com wrote:
 
  If you can write a reasonable grandfathered production in ABNF that
  will allow this set of tags and no others, such that the ABNF can be
  used without also referring to the prose, then I salute you.
 
  If there really are only 24 items of less than 11 octets each,
  a trivial solution is to simply list them (with the usual ABNF
  syntax) as literal strings. That should take no more than a
  half-dozen lines.
 
 Listing the 24 literal strings doesn't seem like a particularly elegant
 solution.

Perhaps it doesn't meet your subjective criteria for elegance.
But it is a *reasonable* production that meets specific criteria,
and that is what you asked for.  A list of specific literal
strings is not unusual (e.g. RFC 3464 sect. 2.3.3, RFC 3798
sect. 3.2.6, RFC 2156 (summarized in Appendix E)).

 Look, RFCs 1766 and 3066 both had ABNF that was insufficient to describe
 the range of valid language tags, and AFAIK they were not greatly
 criticized for this. [...] The same is true for RFC 3066bis.

A crucial difference is that RFC 3066 and 1766 required
registration before use, and community review before
registration.  If a tag were proposed that failed to meet
some criteria not adequately detailed in the ABNF, the
reviewer, the community, and the Area Director could
explain the issue *before* the darned thing went into use.
As that safety mechanism is being removed, it is more
important that the specification be clear and precise and
consistent.

 RFC 2231, which you have mentioned often in this thread, has the
 following as part of its ABNF:
 
 -begin pasted material-
  charset := registered character set name
 
  language := registered language tag [RFC-1766]
 -end pasted material-
 
 If this type of syntax specification is good enough for RFC 2231, why
 wouldn't it be good enough here?

RFC 2231 isn't BCP and doesn't obsolete BCP; it does not
remove any registration requirements.  While it obsoletes
another RFC (2184), it does not attempt to incorporate
content of the obsoleted RFC or artifacts of its use by a
vague reference.  Reference to (unaffected) external
specifications is fine; the draft uses RFC 2234 productions,
for example, and that is not a problem.

___
Ietf mailing list
Ietf@ietf.org
https://www1.ietf.org/mailman/listinfo/ietf


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-21 Thread Bruce Lilly
  Date: 2004-12-21 00:57
  From: Doug Ewell [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]

 The RFC 3066bis approach involves creating a registry of all the pieces
 that can make, or be combined to make, a language tag. This is much
 easier to implement and understand than chasing down the various
 standards and their history, and it permits stability that cannot exist
 if ISO maintenance agencies change their codes.

Substituting a Numbers Authority for a Maintenance Agency
might not solve the problem; indeed it may bring new problems.
IANA isn't infallible, and has botched some registry entries.
See http://mail.apps.ietf.org/ietf/charsets/msg01477.html
for an example.
 
 Vernon Schryver [...] characterized debating RFC 3066bis (for over a year!)
 within the IETF-Languages group, and only presenting it to other groups
 during the Last Call period, as a process problem,

OK.

 and charged this 
 group with engaging in lawyerly talk such as whether 'accounts' is more
 appropriate than 'account' even though no such exchange ever took place
 (I checked the archives back to January 2002).

No, he was referring to concurrent discussions on the
IETF mailing list.
 
 Now Bruce wants us to wait a few more days before rolling out his
 suggestions to fix these perceived problems.
 
 This is a filibuster, an attempt to stall RFC 3066bis out of existence.

I also (i.e. in addition to JFC) find that characterization
offensive.  I am responding to an IETF New Last Call in
accordance with established procedures, and within the time
period established.  I had at one time entertained an
informal approach to addressing the procedural issues, but
given such an accusation, I am now inclined to use the
formal procedure outlined in RFC 2026 section 6.5.2.



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-20 Thread Bruce Lilly
  Date: 2004-12-18 20:33
  From: Addison Phillips [wM] [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED], Bruce Lilly [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED], [EMAIL PROTECTED]
  Reply to: [EMAIL PROTECTED]
  
 Hmm...
 
 That's as an editorial issue and not a technical issue.
[...]
  The -CS subtag issue doesn't strike me as a technical issue with 
  the draft. The draft stabilizes the meaning of subtags. There is 
  a process in the draft for setting the initial (and thus stable) 
  meaning of the -CS subtag. While it probably matters which value 
  (Czechoslovakia or Serbia and Montenegro) that is selected, it is 
  only of editorial interest to the draft itself... unless what 
  Bruce is trying to prove is that stabilizing the meaning of the 
  subtags is a Bad Idea, which I don't think is his point.
  
  I'm willing to entertain a debate about which meaning ought to be 
  selected. But really it ought to be recognized as not an 
  editorial issue with the draft and not a technical objection.

I believe that it's more than an editorial issue, and that
there are both technical and non-technical matters involved.
While I wouldn't say that stabilizing the meaning of the
subtags is a Bad Idea, I do believe that the particular
approach taken raises some disturbing issues, and I suspect
that there are process-related considerations that could have
avoided them.  Jefsey Morfin and Vernon Schryver have touched
on procedural issues; I plan to discuss my specific concerns
and suggestions, but it may take a few days due to the
impending holidays and other work for me to collect and
organize my thoughts on those matters.



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-19 Thread Mark Crispin
On Sat, 18 Dec 2004, Brian Rosen wrote:
 I don't have any comment on the issue of language tags, but speaking as a
 reasonably avid ABNF hacker, I agree with Sam, and would not want to
 establish a convention that ABNF in IETF RFCs is expected to be precise.

The counter-argument is the all-too-frequent occurrence when you deal with 
willful cretins who will *insist* that the specification says 
such-and-such when it really says the opposite, and will leap upon the 
most bizarre interpretation of text in order to bolster their arguments.

This is unavoidable; however, it helps a lot if the ABNF firmly comes down 
on the side of the good guys.

I've spent entirely too much of my life in the past few years fending off 
cretins, to agree knowingly to anything that makes me more vulnerable to 
them in the future.

Nor do gentlemen's agreements work any more.  We may all be (ladies and) 
gentlemen here, but out there there are individuals who are not.

As painful as the process may be, I believe that the ABNF should be as 
tight as possible, preferably by ABNF rules but at least through ABNF 
comments.

However, be careful about comments.  I had one cretin insist that "between 
n and m inclusive" (where n and m were integers) had an implied 
restriction that n <= m, and that when n > m it meant an empty set of 
values.

-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread Bruce Lilly
  Date: 2004-12-14 16:01
  From: Doug Ewell [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]

 The grandfathered production in the RFC 3066bis ABNF is intended only
 for the 24 entries (not 46, as I wrote earlier) that are carried over
 from the RFC 3066 registry and that don't otherwise conform to the RFC
 3066bis syntax. Take a look at the items marked grandfathered in the
 proposed registry:
 
 http://users.adelphia.net/~dewell/lstreg.html
 
 If you can write a reasonable grandfathered production in ABNF that
 will allow this set of tags and no others, such that the ABNF can be
 used without also referring to the prose, then I salute you.

If there really are only 24 items of less than 11 octets each,
a trivial solution is to simply list them (with the usual ABNF
syntax) as literal strings.  That should take no more than a
half-dozen lines. 
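
To make the suggestion concrete, here is a minimal sketch of generating such a rule from a list of tags. The tags below are only an illustrative subset of tags registered under RFC 3066, not the draft's actual list of 24 (that list is in the proposed registry), and the helper name `abnf_literal_rule` is hypothetical.

```python
# Illustrative subset only; the draft's real "grandfathered" list lives in
# its proposed registry and is NOT reproduced here.
SAMPLE_TAGS = ["art-lojban", "en-GB-oed", "i-default",
               "i-klingon", "no-bok", "zh-min-nan"]


def abnf_literal_rule(name, literals, width=72):
    """Build an ABNF rule whose alternatives are quoted literal strings.

    ABNF quoted strings match case-insensitively, which suits language
    tags. Continuation lines are indented and begin with "/".
    """
    head = f"{name} = "
    pad = " " * (len(head) - 2) + "/ "
    lines = []
    current = head
    for literal in literals:
        piece = f'"{literal}"'
        if current == head:
            current += piece                 # first alternative on the rule line
        elif len(current) + len(" / ") + len(piece) > width:
            lines.append(current)            # wrap before exceeding the width
            current = pad + piece
        else:
            current += " / " + piece
    lines.append(current)
    return "\n".join(lines)


rule = abnf_literal_rule("grandfathered", SAMPLE_TAGS)
print(rule)
```

With only a couple of dozen short tags, a rule built this way wraps onto a handful of lines, which is the point of the "half-dozen lines" estimate.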



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread Bruce Lilly
  Date: 2004-12-15 14:41
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
[...]
 How is it possible to predict ahead of time what is the worst-case
 length for a RFC3066-registered language tag?

In some contexts, the length is limited by the context
(e.g. encoded-words, Content-Language fields in an
Internet Message).
 
 Neither is possible. In light of that, I think it best to make sure
 implementers of the revised RFC 3066 be reminded that some
 implementations may impose limits (whether those implementers be
 constructing tags or passing them from one process to another), and for
 implementers to incorporate robustness into their implementations so
 that they can respond gracefully if an unexpectedly-long tag is
 encountered -- after all, no matter what limit could be imposed in a
 revision to RFC 3066, there's no way to stop malware from sending bad
 data.
 
 (How *do* encoded-word parsers react if a bogus charset or language tag
 that's 2k octets long is encountered?

By definition, that cannot happen. No encoded-word may be
longer than 75 octets.  A sequence longer than that limit,
even if it matches all other characteristics of an
encoded-word, is treated as ordinary ASCII text (RFC 2047,
section 6.1, paragraph marked (1)).  No header field
line may be longer than 998 octets (not counting the
terminating CRLF pair), so 2k is simply not permitted.
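
As a rough illustration of that rule, a parser can reject an over-long candidate before looking at anything else. This is a minimal sketch, not a full RFC 2047 parser: the function name is hypothetical, the token is assumed to be ASCII (so characters equal octets), and the shape check is deliberately loose.

```python
MAX_ENCODED_WORD = 75   # RFC 2047 limit on the length of an encoded-word
MAX_HEADER_LINE = 998   # header line limit, not counting the CRLF


def looks_like_encoded_word(token):
    """Loose check: length first, then the =?charset?encoding?text?= shape."""
    if len(token) > MAX_ENCODED_WORD:
        return False                      # too long: treat as ordinary text
    if not (token.startswith("=?") and token.endswith("?=")):
        return False
    parts = token[2:-2].split("?")
    if len(parts) != 3:
        return False
    charset_and_language, encoding, text = parts
    return (bool(charset_and_language)
            and encoding.upper() in ("B", "Q")
            and bool(text))


ok = "=?US-ASCII*en?Q?hello?="
bogus = "=?US-ASCII*" + "x" * 2000 + "?Q?hello?="
print(looks_like_encoded_word(ok), looks_like_encoded_word(bogus))
```

Because the length test comes first, a 2k-octet "language tag" never reaches charset or language handling at all; the sequence is simply not an encoded-word.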

 The encoded-word spec already 
 allows for segmenting long strings;

To be a bit more precise, it permits text to be encoded to
be split across multiple encoded-words (with several
restrictions); the encoded-words themselves cannot be in
any way segmented or split.  That is because an encoded-word
is treated by a MIME-unaware application as a single RFC
[2]822 word.

 could it not also be revised to 
 allow segmenting for the parameters, which would also make it more
 robust?)

If you're referring to RFC 2231 extensions to Content-Type
and Content-Disposition field parameters, that's a separate
matter.

In general, though, as MIME has been around for more than a
decade and Internet Messages for more than three decades,
with a substantial installed base of interoperating
implementations, in what has become one of the core Internet
protocols, any changes would have to be backwards compatible
or would have to be negotiated between sender and receiver
at the same protocol level, or would require a lengthy
transition period before pulling the rug out from under
existing implementations.  It's probably more likely that a
separate next-generation system would be implemented first.



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread Sam Hartman
 Bruce == Bruce Lilly [EMAIL PROTECTED] writes:

Bruce If there really are only 24 items of less than 11 octets
Bruce each, a trivial solution is to simply list them (with the
Bruce usual ABNF syntax) as literal strings.  That should take no
Bruce more than a half-dozen lines.

Perhaps.  I actually find a lot of ABNF specs are not as clear as they
could be to humans because they are trying to describe the valid
inputs as strictly as possible.  In many cases I think the spec would
be more clear if the ABNF were relaxed and other constraints were
expressed at appropriate levels.




Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread Bruce Lilly
  Date: 2004-12-15 13:22
  From: John Cowan [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]

 The current process does *not* limit the length of non-private-use
 tags.

It does by way of reviewer, community, and IETF Area
Director review.

 But absolutely nothing except his good sense prevents Michael from registering
 en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholic-drug-users-who-live-in-flophouses.

Aside from specific technical details already addressed
by others, such a long name would certainly solicit the
strong suggestion that the submitter should find a
suitable shortened form, as such a long tag could not
be used in an encoded-word.



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread Brian Rosen
I don't have any comment on the issue of language tags, but speaking as a
reasonably avid ABNF hacker, I agree with Sam, and would not want to
establish a convention that ABNF in IETF RFCs is expected to be precise.
One MUST read the text to understand what the limits of the syntax are.
This is especially true with repetitions.  It's usually tortuous to write
ABNF that limits repetitions or string lengths.  It's possible, but the
result is very hard to understand.

Brian



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread ned . freed
  I am somewhat sympathetic to the idea of having some
  total limit (except for the late date for the proposed change).

 Earlier feedback would have been had if there had been
 some announcement of the proposed considerable changes
 on the ietf-822 mailing list, or via an IETF WG
 charter.

This sort of thing is exactly why we last call non-WG documents for four weeks
rather than two. Less review is assumed to have occurred and this may well mean
the document is in some sense less done.

So, while I know of no problems caused by inordinately long language tags, now
that the issue has been brought up, using this opportunity to add a max length
restriction seems like a very reasonable thing to do.

  However, we
  got considerable pushback on having RFC 3066bis make any previously valid
  RFC3066 tag be invalid

 Entirely appropriate.  And the proposed draft would
 invalidate the meaning of the valid RFC 3066 language
 tag sr-CS, which is currently in use.

  and any length restriction would do that.

 If it makes you happy, you can exclude private-use
 tags from an explicit limit.

I would only suggest doing this if it helps us reach consensus.

Ned



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread Addison Phillips [wM]
We (Mark and I) welcome the last call process and timelines and the feedback 
these generate. That's the whole point of having a Last Call.

The -CS subtag issue doesn't strike me as a technical issue with the draft. The 
draft stabilizes the meaning of subtags. There is a process in the draft for 
setting the initial (and thus stable) meaning of the -CS subtag. While it 
probably matters which value (Czechoslovakia or Serbia and Montenegro) that is 
selected, it is only of editorial interest to the draft itself... unless what 
Bruce is trying to prove is that stabilizing the meaning of the subtags is a 
Bad Idea, which I don't think is his point.

I'm willing to entertain a debate about which meaning ought to be selected. But 
really it ought to be recognized as not an editorial issue with the draft and 
not a technical objection.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-18 Thread Addison Phillips [wM]
Hmm...

That's as an editorial issue and not a technical issue.

Addison




Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-17 Thread Bruce Lilly
  Date: 2004-12-14 13:02
  From: John Cowan [EMAIL PROTECTED]
  To: Addison Phillips [wM] [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED]
  
 Addison Phillips [wM] scripsit:
 
  The IETF process is not really my concern. I will note that many IETF and
  non-IETF standards folks have participated in the process of developing and
  reviewing draft-langtags, though. 
 
 Actually, we're all IETF people. If you're on an IETF mailing list
 discussing an IETF item of work like an RFC, you're part of the IETF.
 The process is designed to serve us, not vice versa.

It's not quite that simple; IETF process has several specific requirements
(as spelled out in RFC 2026) -- an IETF Working group requires a charter
with a well-defined scope, specific milestones, etc.  There is an official
list of IETF working groups. 



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-17 Thread Bruce Lilly
  Date: 2004-12-14 23:35
  From: John Cowan [EMAIL PROTECTED]
  To: Doug Ewell [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED]
  
 Doug Ewell scripsit:
 
 
  * Region subtag 830, Channel Islands, is based on a UN M.49 code.
  Since that is an English-only standard, one must look elsewhere to find
  the French translation (it's not what you might expect, either).
 
 In fact, M.49 is available in all six official U.N. languages; it's
 just the *online* version that's English-only.

No, I am quite certain (because I looked!) that the UN M.49 lists
are available online as HTML-ized English and HTML-ized French. It's
certainly not English-only (the online version might reasonably be
called HTML-only, but that's another story).



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-17 Thread Frank Ellermann
Peter Constable wrote:

 The definitions we have now will remain, they will continue
 to be referenced and available.

I've no idea where you found en-NH.  And what's the correct
form, pt-TP or pt-TL ?  And the fallback algorithm makes no
sense for cases like en-US-boont, de-CH-1996, or se-Latn-AX,
when en-boont, de-1996, or se-AX are available.  Bye, Frank
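
The objection can be illustrated with a minimal sketch, assuming the fallback in question works by simple right-to-left truncation of subtags (an assumption; the draft's wording is not quoted in this message). The function name is hypothetical.

```python
def truncation_fallbacks(tag):
    """Return the tag followed by each shorter prefix, dropping the
    rightmost subtag at every step."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


chain = truncation_fallbacks("en-US-boont")
print(chain)                 # ['en-US-boont', 'en-US', 'en']
print("en-boont" in chain)   # False
```

Under that assumption the chain for en-US-boont never visits en-boont, and likewise de-CH-1996 never reaches de-1996, even when those shorter tags are the ones actually available.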





RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-16 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


   The point is that under RFC 3066,
   the bilingual ISO language and country code lists are
   considered definitive.
 
  That is nowhere stated or even suggested in RFC 3066.
 
 RFC 3066 section 2.2 states, in part:
 
   - All 2-letter subtags are interpreted according to assignments found
     in ISO standard 639, "Code for the representation of names of
     languages" [ISO 639], or assignments subsequently made by the ISO
     639 part 1 maintenance agency or governing standardization bodies.
 
 and has a similar statement regarding ISO 3166.
 
 "interpreted according to assignments found in" certainly
 sounds as if the ISO lists are considered definitive for
 their respective categories of subtags, since their
 interpretation is specified as that given in those lists.
 I don't see how the RFC 3066 text can be interpreted
 otherwise.

You're now quoting things so far removed from their context that they
are no longer being evaluated fairly. I believed we were talking about
the specific strings, as you had made reference to implementers of
bilingual products not having access to that data. Perhaps I
misunderstood you, but whether or not, the relevant facts are that RFC
3066 referred to ISO source standards to establish the denotation of
identifiers drawn from those standards, and the proposed revision does
the same.


Peter Constable
Microsoft Corporation



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread Bruce Lilly
  Date: 2004-12-13 02:05
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]

   If for whatever reason ISO and the UN decided that US should
   be used to designate the country of France[...]

  The only way that would be likely to happen would be if
  there were no longer a US *and* if the ISO and UN
  representatives of France were to initiate a request for
  such a change.[...]

 This scenario is not hypothetical; it actually occurred in the case of
 CS.

In the case of CS, but *NOT* US, a country had quite
some time earlier ceased to exist.  That is what makes
your US scenario hypothetical.

 This is a situation we do not intend to repeat.

That is precisely what would be repeated, and the problem
would remain.  CS currently means Serbia and Montenegro,
and its use in accordance with RFC 3066 has precisely that
meaning.  Changing CS to mean something else at some
future time (if/when the proposed draft goes into effect)
would result in at least as many different definitions as
exist at present, and adds yet another time epoch that
needs to be considered in order to determine the meaning
of CS.

   The usability flaw in treating ISO 639 and ISO 3166 as human-readable
   is evident in the confusion between ja and JP (or is it jp and JA?),
[...]
 It is not uncommon for users to confuse JA and JP. 

Which clearly demonstrates why mere codes in the absence
of definitions associated with the codes are a pointless
proposition. And it illustrates the fact that the only
practical way for a code to become associated with a
particular piece of text is by way of the associated
definition (or something derived from it) rather than
directly.

   As for what is silly, if the UN country ID for Canada changed to
   CN (and that for PRC changed to something else)[...]

  And it is precisely because of such problems that it is
  as unlikely to happen as your hypothetical FR-US change.
 
 Again, not hypothetical at all.

Last time I checked, US didn't mean France, and CN
didn't mean Canada -- I suggest that you might want to
brush up on the definition of "hypothetical", as it is
difficult to have a rational discussion unless we're in
agreement on basic definitions (just as it is difficult
to have effective communications about what language is
indicated by a code without agreement on the *definitions*
of the codes).

 If you're really wanting to know what the meaning of CS would be per
 the proposed draft, the proposal is that it will forever remain valid
 with the meaning Czechoslovakia as it was originally defined in ISO
 3166.

But the current meaning under RFC 3066 is quite different.  What
about maintaining the stability of that meaning?

  I haven't specifically discussed display names; that is your
  assertion, and not my basis for objection.
 
 You didn't use the term "display names", but it is clearly implied by
 your reference to bilingual implementations.

Your inference (which you incorrectly claim as my implication)
is different from my claim.  My claim is that under RFC 3066,
the definitions of the country and language codes are available
in two languages (yes, it's true -- but irrelevant to that
point -- that the IANA registered complete tags do not have
that characteristic), and that the proposed registry would
lack that characteristic of the current BCP (unnecessarily).
 
  I refer to the
  definitions and the need to map to and from those definitions
  at either end of the communications channel. Whether or not
  that happens by display is incidental to the issue of the
  number of languages that the definitions are provided in.
 
 Definitions in multiple languages are not a requisite to establishing
 the denotation of a coded element.

True but irrelevant to the point.  We now have definitions of
specific types of elements (viz. country and language tags) in
multiple languages, and the objection is to the unnecessary
removal of that characteristic.



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread Bruce Lilly
  Date: 2004-12-13 01:05
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]

 RFC 3066 does not impose any restrictions on what its replacements might
 do. This is the case with any specification: a given technical
 specification is not a specification of human behaviour and cannot keep
 us from revising the spec or replacing it in any way we may choose.

It's not clear exactly who is meant by "us", but I'll leave
that to a separate message.  It is considered bad practice
for a document which obsoletes another document to depend
on the obsoleted document for definitions or other interpretation
of the meaning of what is contained in the successor document.
 
 You have mentioned conflict with RFCs 2047 and 2231. RFC 2047 does not
 make reference to language tags. The ABNF of RFC 2231 does not impose
 any limit on the length of language tags. RFC 2231 does contain an implicit
 length issue in that it updates RFC 2047, allowing language tags within
 encoded words, but it does not explicitly identify any upper bound on
 the length of language tags. By reading both RFC 2047 and RFC 2231, one
 finds that they assume that a language tag must be at most 64 characters
 long:

You have missed several important and not-so-subtle points.
One of which is that RFC 2231 explicitly amends RFC 2047; it
clearly so states in the first page heading and in the text,
and is also indicated in the RFC Index. Another is that
neither uses ABNF; both use EBNF as defined in RFC 822.
More details on specific missed points below:

 - the shortest charset names are 2 characters long (e.g. IT)

Not all charsets have 2-character names. Not all two-character
names which might be assigned are suitable for MIME use. Where
a preferred MIME name is indicated, that should be used.

 - the minimum encoded-text length is 1 character long

That is strictly only true for text that meets all of the
following conditions:
a) is representable in a specified subset of ANSI X3.4, and
   therefore requires no encoding
b) does not use any encoding, even if unnecessary
c) does not use a charset and character sequence involving
   shift sequences (e.g. as in ISO 2022-like charsets)

It also misses the point that using 76+ octets to represent
a single octet is rather wasteful.

Any use of B encoding will require a multiple of 4 octets
of encoded text. Q encoding has some special cases, but
typically requires 3 octets or more.

 An encoded-word must contain at least 11 characters that are not part of
 the language tag and have a total length of no more than 75 characters.
 Therefore, an upper bound on language tags that can be used in an RFC
 2047/2231 encoded-word production is 64 characters.

That is a best case upper bound, for text which requires
no encoding at all, one character per encoded-word.
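
The arithmetic behind the 64-character figure can be sketched as follows. This is an illustration only: the function name and the field breakdown are assumptions made for this sketch, following the RFC 2231 form =?charset*language?encoding?encoded-text?= and the 75-character cap on an encoded-word.

```python
# Hypothetical helper (not from any RFC or library): how many characters
# remain for the language tag inside one encoded-word of the form
#   =?charset*language?encoding?encoded-text?=
ENCODED_WORD_MAX = 75  # total length cap on an encoded-word (RFC 2047)

def language_tag_budget(charset_len: int, encoded_text_len: int) -> int:
    # Delimiters "=?", "*", "?", "?", "?=" plus the one-letter encoding.
    fixed = len("=?") + len("*") + len("?") + 1 + len("?") + len("?=")
    return ENCODED_WORD_MAX - fixed - charset_len - encoded_text_len

# Best case discussed above: 2-character charset name, 1 character of text.
print(language_tag_budget(charset_len=2, encoded_text_len=1))  # 64
```

Any longer charset name or any text that actually needs encoding reduces the result accordingly.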

 In many cases, where 
 the charset tag or encoding is longer, the upper bound on the length of
 languages tags will be less, but the RFC gives no estimate or indication
 of how much less.

The worst case appears to be the charset named
Extended_UNIX_Code_Fixed_Width_for_Japanese (43 characters),
which in fact uses ISO 2022-like sequences. That is the
primary name for that charset; there is no preferred MIME
alias, and the only other alias is the one specified for
printer MIB use. Shifted characters are represented by two
octets, each of which requires encoding. The shift sequences
are 3 octets each, and RFC 2047 requires that an encoded-word
start and end in unshifted state.  Therefore the
minimum amount of encoded text for a single character in
a shifted subset consists of an encoding of: a 3 octet
shift sequence (one of which requires encoding), 2 octets
representing the single character (both requiring encoding),
and 3 octets restoring the unshifted state (one requiring
encoding). Using B encoding results in 12 octets of encoded
text as a minimum (Q-encoding would require a minimum of 16
octets). So a single character in a shifted subset of that
particular charset, using B encoding, leaves at most 12 octets
for a language-tag.  As mentioned, use of an encoded-word
plus the necessary whitespace around it to represent a
single character is rather wasteful, so a brief language tag
is indicated; fortunately ja suffices for text likely to
be used with that charset.
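
The octet counts in the preceding paragraph can be checked with a short calculation. The sketch below assumes the 8-octet sequence described above (3 + 2 + 3 octets, four of which need escaping under Q encoding); the helper names are made up for illustration.

```python
import math

ENCODED_WORD_MAX = 75   # RFC 2047 cap on one encoded-word
CHARSET_LEN = len("Extended_UNIX_Code_Fixed_Width_for_Japanese")  # 43
FIXED = 8               # "=?" + "*" + "?" + encoding letter + "?" + "?="

def b_length(raw_octets: int) -> int:
    """Base64 emits 4 characters per (possibly padded) group of 3 octets."""
    return 4 * math.ceil(raw_octets / 3)

def q_length(raw_octets: int, needing_escape: int) -> int:
    """Q encoding turns each escaped octet into 3 characters (=XX)."""
    return (raw_octets - needing_escape) + 3 * needing_escape

raw = 3 + 2 + 3          # shift in, one two-octet character, shift out
escaped = 1 + 2 + 1      # octets the text above says require encoding

for name, text_len in (("B", b_length(raw)), ("Q", q_length(raw, escaped))):
    budget = ENCODED_WORD_MAX - FIXED - CHARSET_LEN - text_len
    print(name, text_len, budget)
```

Under these assumptions B encoding leaves 12 octets for the tag. The Q-encoding budget of 8 octets is an inference from the same figures, not a number stated in the thread.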
 
 This is a constraint on an application of RFC 3066; it is not a
 constraint on RFC 3066 itself. It is possible that other applications of
 RFC 3066 may impose limits that may be longer or shorter than that
 imposed by RFC 2047/2231.

Yes, and it is sometimes desirable to transfer text and
tag from one application to another.  For example, text in
the body of a message can have language indicated by a
Content-Language header field, where there are up to 997
octets available for a language tag.  However a response
regarding some portion of that message might well indicate
the topic of the response in the response message's Subject
field, where encoded-word limits apply.

 I 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread John Cowan
Bruce Lilly scripsit:

  I see no reason why limits must be added as a 
  constraint in a revision of RFC 3066.
 
 The primary reason for specifying limits is due to the
 proposed removal of the review/registration process
 which currently limits the length of non-private-use
 tags.

The current process does *not* limit the length of non-private-use
tags.  It's true that the process does not permit the registration of
unlimited-length tags, as we do not have enough universe to represent them in 
full.

But absolutely nothing except his good sense prevents Michael from registering
en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholic-drug-users-who-live-in-flophouses.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
It's the old, old story.  Droid meets droid.  Droid becomes chameleon. 
Droid loses chameleon, chameleon becomes blob, droid gets blob back
again.  It's a classic tale.  --Kryten, Red Dwarf



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of John Cowan


 But absolutely nothing except his good sense prevents Michael from
 registering

 en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholic-drug-users-who-live-in-flophouses.

Sub-tags can be at most 8 chars long, so Michael would ask for it to be
changed to something like
en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-by-alcoholc-drug-users-who-live-in-flophses. :-)
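
The 8-character rule is easy to check mechanically. The following sketch (the function name is invented for this example) lists the subtags of a tag that exceed the limit:

```python
MAX_SUBTAG_LEN = 8  # per-subtag limit in the RFC 3066 syntax

def overlong_subtags(tag: str) -> list[str]:
    """Return the hyphen-separated subtags longer than MAX_SUBTAG_LEN."""
    return [s for s in tag.split("-") if len(s) > MAX_SUBTAG_LEN]

original = ("en-the-dialect-spoken-on-the-bowery-between-1933-and-1945-"
            "by-alcoholic-drug-users-who-live-in-flophouses")
print(overlong_subtags(original))  # ['alcoholic', 'flophouses']
```

The shortened form proposed above passes the same check, although nothing in the syntax bounds the number of subtags.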


Peter Constable
Microsoft Corporation



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread Bruce Lilly
  Date: 2004-12-13 04:37
  From: Mark Crispin [EMAIL PROTECTED]

 Silliness aside, the file may well have embedded language tags in the text 
 of the file. Have you forgotten Plane 14?

No, but I note that its introduction strongly discouraged
its general use (specifically mentioning ACAP as the
intended scope of usage, IIRC); the current version of the
Unicode document continues that strong discouragement and
further reinforces it by emphasis via italics.

Another issue is that both RFC 3066 and the draft proposal
call for language tags to be expressed in a subset of ANSI
X3.4, corresponding to a subset of the first half of a
particular Unicode plane -- and not plane 14.  There may
be an ambiguity as to whether such deprecated Unicode 3.x
tags are in fact compliant with 3066 or the draft under
discussion.

  I'm not eager to abolish uniqueness. There never was
  any guarantee that codes would never change. Both RFCs
  1766 and 3066 specifically mention changes as a fact of
  life.
 
 That's what's now being fixed.

No, the problem will remain. Currently sr-CS has a specific
meaning under RFC 3066; it has had for some time.  For that
meaning to remain stable, it will be necessary to take any
change in the (current) meaning of the -CS part into
account. I.e. for a future parse of language tags to do the
right thing, it will have to recognize sr-CS generated under
the RFC 3066 rules per the 3066/639 definitions.
 
 Why is this vestige of colonialism important in the IETF context?

You seem to be making an incorrect assumption, one which
renders your question meaningless.

 What magic attribute is there to French that provides definitiveness 
 that is absent in English, or Mandarin, or Hindi, all of which are far 
 more significant languages to the world?

No such attribute of the language was claimed.  It is the
attribute of being used in the official ISO lists that
provides the characteristic.
 
 A mandatory French translation to an English definition does not 
 significantly increase the information content, and certainly does not 
 double it.

You are again making incorrect assumptions.  The languages
used in ISO documents are considered separate but equal,
not a mandatory [...] translation of some other language.
That is in fact why ISO is called ISO and not OIN or
OIS -- you might wish to visit the ISO web site for
details.
 
[more nonsense about mandatory translation elided]

  You have not explained how the code came to be embedded
  within the text itself -- surely the author didn't say
  (or write, or sign) this text is in language QZ; most
  likely the language was indicated by name, or by some proxy
  representing the name (such as a locale).
 
 Plane 14.
 
 HTML and other markups.

That provides no explanation of how a *code* came to be
embedded in text -- authors in general do not refer to
language by codes, and codes do not embed themselves by
magic.



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


  This is a situation we do not intend to repeat.
 
 That is precisely what would be repeated, and the problem
 would remain.  CS currently means Serbia and Montenegro,
 and its use in accordance with RFC 3066 has precisely that
 meaning.

And that is a significant problem we wish to remedy as there is some
unknown amount of data or implementations out there that use CS but
with a different meaning intended.


  The usability flaw in treating ISO 639 and ISO 3166 as human-readable
  is evident in the confusion between ja and JP (or is it jp and JA?),
 [...]
  It is not uncommon for users to confuse JA and JP.
 
 Which clearly demonstrates why mere codes in the absence
 of definitions associated with the codes is a pointless
 proposition.

I believe you have confirmed my point, that codes are not meant to be
human readable.

As for your concern regarding definition, it has been clearly pointed
out that codes will not be lacking definitions -- the same definitions
they have today from the same sources (with references made to the same
sources) will still be available.


  Again, not hypothetical at all.
 
 Last time I checked, US didn't mean France, and CN
 didn't mean Canada -- I suggest that you might want to
 brush up on the definition of hypothetical...

The case is hypothetical, but the hypothetical case serves to illustrate
a general scenario, and the general scenario is not hypothetical.



  You didn't use the term display names, but it is clearly implied by
  your reference to bilingual implementations.
 
 Your inference (which you incorrectly claim as my implication)
 is different from my claim. My claim is that under RFC 3066,
 the definitions...

You have failed to quote what you originally wrote which I claimed made
this implication: you spoke not of definitions but of bilingual
applications.



  Definitions in multiple languages are not a requisite to establishing
  the denotation of a coded element.
 
 True but irrelevant to the point.

Oh? Simply because you make this assertion?


 We now have definitions of
 specific types of elements (viz. country and language tags) in
 multiple languages, and the objection is to the unnecessary
 removal of that characteristic.

The definitions we have now will remain, they will continue to be
referenced and available. I do not see how you say they are being
removed?


Peter Constable
Microsoft Corporation



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


 Currently sr-CS has a specific
 meaning under RFC 3066; it has had for some time.

The meaning Serbia and Montenegro was introduced relatively recently
(a little more than a year ago) and was immediately received with alarm by
many in the IT sector. There were vain attempts to get it reversed, and
that failure was an impetus to introduce protection against such changes
in the revision of RFC 3066. I am not aware of CS being used in the IT
sector with the new meaning, though I cannot guarantee that.


Peter Constable
Microsoft Corporation



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-15 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


  By reading both RFC 2047 and RFC 2231, one finds that they assume
  that a language tag must be at most 64 characters long...

  - the shortest charset names are 2 characters long (e.g. IT)
 
 Not all charsets have 2-character names...

In determining the longest language tag permitted, one must identify the
shortest possibilities for all other components. 


  - the minimum encoded-text length is 1 character long
 
 That is strictly only true for text that meets all of the
 following conditions...

Hey, I just said what the EBNF said.



  An encoded-word must contain at least 11 characters that are not part
  of the language tag and have a total length of no more than 75
  characters. Therefore, an upper bound on language tags that can be
  used in an RFC 2047/2231 encoded-word production is 64 characters.
 
 That is a best case upper bound...

I identified it as such.


 The worst case appears to be the charset named
 Extended_UNIX_Code_Fixed_Width_for_Japanese (43 characters)...

 As mentioned, use of an encoded-word
 plus the necessary whitespace around it to represent a
 single character is rather wasteful, so a brief language tag
 is indicated; fortunately ja suffices for text likely to
 be used with that charset.

Of course, the length limitations must be balanced between the charset
tag, the language tag and the encoded-word itself.


  I see no reason why limits must be added as a
  constraint in a revision of RFC 3066.
 
 The primary reason for specifying limits is due to the
 proposed removal of the review/registration process
 which currently limits the length of non-private-use
 tags.

The review/registration process for RFC 3066 registrations does not
impose pre-defined limits that implementers of RFC 3066 can assume in
their parsers.



  It would be a good idea, however, to point out in section 2.1 of the
  draft that some applications of this specification may impose limits
  on the length of accepted language tags, and perhaps to cite RFC 2231
  as an example.
 
 As a general principle, that's fine, however I would point
 out that given the inability of experts to be able to
 accurately point out the limits quickly...  I do
 not think it is sufficient merely to state the fact that
 there are limits, with or without a pointer to RFC 2231 as
 an example.  Some indication of the magnitude of worst-case
 restrictions is at least advisable...

How is it possible to identify what is the worst-case bound assumed in
implementations that are out there?

How is it possible to predict ahead of time what is the worst-case
length for a RFC3066-registered language tag?

Neither is possible. In light of that, I think it best to make sure
implementers of the revised RFC 3066 be reminded that some
implementations may impose limits (whether those implementers be
constructing tags or passing them from one process to another), and for
implementers to incorporate robustness into their implementations so
that they can respond gracefully if an unexpectedly-long tag is
encountered -- after all, no matter what limit could be imposed in a
revision to RFC 3066, there's no way to stop malware from sending bad
data.
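
One way to read "respond gracefully" is to treat an over-long tag as ordinary invalid input rather than as an error condition. A minimal sketch, assuming an application-chosen limit (the value 64 and the function name are illustrative and not taken from any specification):

```python
from typing import Optional

APP_MAX_TAG_LEN = 64  # assumed, application-specific limit

def accept_language_tag(tag: str, limit: int = APP_MAX_TAG_LEN) -> Optional[str]:
    """Return the tag if it fits the local limit, else None (no exception)."""
    if 0 < len(tag) <= limit:
        return tag
    return None

print(accept_language_tag("sr-CS"))      # sr-CS
print(accept_language_tag("x" * 2048))   # None
```

A caller can then fall back to an untagged default instead of failing the whole message.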

(How *do* encoded-word parsers react if a bogus charset or language tag
that's 2k octets long is encountered? The encoded-word spec already
allows for segmenting long strings; could it not also be revised to
allow segmenting for the parameters, which would also make it more
robust?)


Peter Constable
Microsoft Corporation





Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread John Cowan
Deborah Goldsmith scripsit:

 And here's hoping they go to four digits or otherwise extend the scheme 
 instead of recycling when they run out, a non-hypothetical issue if 
 they're already up to 891.

Not much to worry about.  Of 1000 possible codes, currently 232 are
assigned to countries, 32 are assigned to regions, and 10 are retired,
leaving 726 codes yet to be assigned.  I think that provides a comfortable
pad for the future.  It's not obvious on what principles, if any, the
individual codes were assigned.
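
As a quick check of the figures above (the variable names are only for this illustration):

```python
TOTAL_CODES = 1000  # three-digit codes 000 through 999
assigned = {"countries": 232, "regions": 32, "retired": 10}

remaining = TOTAL_CODES - sum(assigned.values())
print(remaining)  # 726
```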

For the fun of it, the retired codes represent Czechoslovakia,
Ethiopia+Eritrea (then known as Ethiopia), East Germany, West Germany,
the Netherlands Antilles, the Pacific Islands Trust Territories (now
split up), the USSR, Yemen, Democratic Yemen, and Yugoslavia.

-- 
Do NOT stray from the path! John Cowan [EMAIL PROTECTED]
--Gandalf   http://www.ccil.org/~cowan



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Addison Phillips [wM]
Mark and I have both worked extensively with time zone issues, so we're aware 
of the potential problems.

RFC 3339 would be an appropriate substitute: its full-date production 
describes the ISO 8601 profile used by the draft.

I would also tend to agree that lack of a timezone would be ambiguous in most 
applications. However, for this use I think that:

  a) the dates indicate the date of accession of each subtag to the registry. 
These dates will all be in the past. Since the registry itself is versioned and 
has its own date record, the question of time zone is probably not important 
because implementations will use their registry date and not an arbitrary date 
to determine compatibility. That is: the dates will all be used in the same 
context with one another.

  b) we can safely assume (or explicitly state) the use of UTC time based on 
the above.
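
For implementers, a date in the RFC 3339 full-date form can be validated with a few lines. This is only a sketch: the function name is an assumption, and treating the result as a UTC calendar date follows point (b) above rather than anything the date string itself carries.

```python
import re
from datetime import date

# Shape of the RFC 3339 full-date production: 4 digits, 2 digits, 2 digits.
FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def parse_registry_date(text: str) -> date:
    """Parse YYYY-MM-DD strictly; raise ValueError on anything else."""
    if not FULL_DATE.fullmatch(text):
        raise ValueError(f"not a full-date: {text!r}")
    # date() itself rejects impossible values such as month 13.
    return date(int(text[0:4]), int(text[5:7]), int(text[8:10]))

print(parse_registry_date("2004-12-14"))  # 2004-12-14
```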

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] Behalf Of Joe Abley
 Sent: 20041213 17:51
 To: Peter Constable
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP
 
 
 
 On 13 Dec 2004, at 18:34, Peter Constable wrote:
 
  3. Re ISO 8601 time/date format: What is used in the registry is dates 
  expressed in the format YYYY-MM-DD. It was agreed that it would be 
  better to identify the format precisely rather than make the generic 
  reference to ISO 8601.
 
 Why not require dates to be formatted as per RFC 3339?
 
 In general, YYYY-MM-DD is ambiguous unless a timezone is specified.
 
 
 Joe
 
 ___
 Ietf-languages mailing list
 [EMAIL PROTECTED]
 http://www.alvestrand.no/mailman/listinfo/ietf-languages




Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Vernon Schryver
 From: John Cowan

 ... 
  For example, I'm
  unhappy about an apparent sentiment that would put ABNF on a lower
  footing than the English text.  I think I'm like most implementors and
  perhaps unlike non-engineers in reversing that precedence.  Whenever
  I read an RFC, I rely first and foremost on the ABNF.  I use the English
  only for hints, and follow the ABNF instead of the English whenever
  there is a conflict.
 
 Then you would be incapable of implementing any programming language compiler,
 or an XML parser, for the specs for these things include literally hundreds
 of constraints that are specified only in technical English and not in the
 BNF.  As far as the BNF is concerned, this is good sound C:
 
   main(argv, argc) {
   float Argv;
   int* Argc;
   print(32);
   }

In contexts other than UNIX applications with modern compilers,
that fragment is perfectly sound, if not something I'd write.  An
example context is before typing of formal args and in what ANSI/ISO
9899-1990 calls a freestanding environment where main() is not
special.  I've suppressed most of the memories, but I seem to recall
that what Microsoft calls threaded WIN32 applications are such
things, or were before the POSIX additions.

Besides, I didn't say that one should ignore the English, but that
implementors give precedence to the ABNF.  When you are writing an RFC
that you hope will be implemented, you MUST remember that programmers
are lazy.  We transliterate the ABNF to build the parser and so implement
the syntax and read the English to figure out and so build the semantics.
As I said, if you must have contradictions between your ABNF and your
English, you must accept the fact that most technical people will
assume your ABNF is right and your English is wrong.  That fact seemed
to me to conflict with statements in this thread, and that suggests a
problem in your working group and your RFC.


Vernon Schryver [EMAIL PROTECTED]



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Peter Constable
 From: Vernon Schryver [EMAIL PROTECTED]
 Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP
 To: [EMAIL PROTECTED]
 Message-ID: [EMAIL PROTECTED]

 Besides, I didn't say that one should ignore the English, but that
 implementors give precedence to the ABNF.  When you are writing an RFC
 that you hope will be implemented, you MUST remember that programmers
 are lazy.  We transliterate the ABNF to build the parser and so implement
 the syntax and read the English to figure out and so build the semantics.
 As I said, if you must have contradictions between your ABNF and your
 English, you must accept the fact that most technical people will
 assume your ABNF is right and your English is wrong.  That fact seemed
 to me to conflict with statements in this thread, and that suggests a
 problem in your working group and your RFC.

This is somewhat moot since the author has indicated the relevant
portion of the ABNF will be revised. In this case, though, the ABNF
could not be said to be in contradiction with the English prose:
anything permitted by the constraints specified in the English prose
would be recognized using the ABNF. 

It is true that there are strings that could be recognized by the ABNF
that would not be permitted by the English prose, but the revision being
made to make the ABNF production in question match what Bruce Lilly
thought it should be does not change that. The only way to write the
ABNF so that it permits no more and no less than what is
specified by the English prose would be to have the production rule
simply enumerate a specific set of terminal strings, which does not seem
to be particularly helpful, especially when the RFC would establish
a machine-readable registry maintained by IANA in which those very
strings are enumerated.
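
The distinction drawn here, a grammar that over-accepts plus a registry that enumerates, can be illustrated with a short sketch. The regular expression transcribes the RFC 3066 syntax (a primary subtag of 1 to 8 letters followed by subtags of 1 to 8 letters or digits); the REGISTERED set is a tiny illustrative stand-in for the IANA registry, not its full contents, and the function names are assumptions.

```python
import re

# RFC 3066 syntax: 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) )
SYNTAX = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Illustrative subset only; a real implementation would load the registry.
REGISTERED = {"i-klingon", "i-navajo", "art-lojban", "zh-min-nan"}

def is_well_formed(tag: str) -> bool:
    """True if the tag matches the grammar, whatever its meaning."""
    return SYNTAX.fullmatch(tag) is not None

def is_registered(tag: str) -> bool:
    """True only if the tag is well-formed and listed in the registry."""
    return is_well_formed(tag) and tag.lower() in REGISTERED

for tag in ("i-klingon", "i-bogus", "toolongprimary-x"):
    print(tag, is_well_formed(tag), is_registered(tag))
```

Here "i-bogus" is accepted by the grammar yet rejected by the lookup, which is the gap between ABNF and prose under discussion.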


Peter Constable



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Randy Presuhn
Hi -

Perhaps it would be useful to consider
http://www.ietf.org/IESG/STATEMENTS/pseudo-code-in-specs.txt

Randy

  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Sent: Tuesday, December 14, 2004 2:16 PM
  Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP

 This is somewhat moot since the author has indicated the relevant
 portion of the ABNF will be revised. In this case, though, the ABNF
 could not be said to be in contradiction with the English prose:
 anything permitted by the constraints specified in the English prose
 would be recognized using the ABNF.

 It is true that there are strings that could be recognized by the ABNF
 that would not be permitted by the English prose, but the revision being
 made to make the ABNF production in question match what Bruce Lilly
 thought it should be does not change that. The only way to write the
 ABNF so that it permits no more and no less than what is
 specified by the English prose would be to have the production rule
 simply enumerate a specific set of terminal strings, which does not seem
 to be particularly helpful, especially when the RFC would establish
 a machine-readable registry maintained by IANA in which those very
 strings are enumerated.
...





Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread John Cowan
Bruce Lilly scripsit:

 There is in fact an ietf-languages list; RFC 3066 and the
 draft under discussion give its submission mailbox as
 [EMAIL PROTECTED], which makes finding the real
 list an exercise since IANA's web site makes no mention
 of any mailing lists.  I made an educated guess that I
 might find the list at alvestrand.no, and indeed the
 list submission mailbox is [EMAIL PROTECTED],

Both addresses seem to work for posting to this list.  Not all
IETF lists have ietf.org or iana.org mailing addresses anyhow;
consider [EMAIL PROTECTED], which is not a W3C mailing list but an IETF one.

 The draft in question apparently seeks to get IANA into the
 business of defining countries (and languages), usurping
 those roles from ISO (as also noted in RFC 1591).

This is doubly incorrect.  To begin with, ISO defines neither countries
nor languages.  UNSD defines country-like objects for its purposes and
assigns them numeric codes, specifying them using English and French
names.  Then ISO assigns alphabetic identifiers to the names.  Languages
are not defined at all; ISO assigns alpha and numeric identifiers to
certain words which it believes to be the names of languages, without
always specifying exactly which language among those so named is meant.
Note the titles of the ISO 3166 and 639 standards.

The proposed registry will merely serve to stabilize the ISO mappings,
making it less likely that they will be gratuitously changed, because
it will not be under the control of an MA with unfettered discretion to
make changes.

-- 
On the Semantic Web, it's too hard to prove John Cowan[EMAIL PROTECTED]
you're not a dog.  --Bill de hOra   http://www.ccil.org/~cowan



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Mark Davis
There is a fundamental misunderstanding on two points.

1. Of course countries go in and out of existence, and change their borders;
nobody disputes that. That is not the stability problem in question; it is
where the meaning of tags changes so drastically as to refer to a completely
different country. One can't willy-nilly change data that has significant
effects on databases all over the world; when someone's birthplace is
indicated by a stored country code, for example, it mustn't suddenly
designate a different country! For more, see
http://www.unicode.org/consortium/positions.html.

2. The fact that the 3066 registry is not in multiple languages (either
currently or in the new draft) has nothing to do with any alleged
discouragement of any language, French included. The names in the registry
are simply to distinguish and identify the subtags, not to provide
recommended localizations.

The registry, and for that matter the ISO 639/3166 standards, are the wrong
place for localization data. The language coverage (only 2!) is a very small
fraction of what is really needed for any real product development -- and
even for those languages that are present, the names used there are not
optimal for user interfaces since they are sometimes not the customary form.
For an example of a data repository that is designed for localization of
language/region names, see http://www.unicode.org/cldr/.

Mark

- Original Message - 
From: Bruce Lilly [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Sunday, December 12, 2004 08:46
Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP


  Date: 2004-12-10 22:37
  From: John Cowan [EMAIL PROTECTED]

 Bruce Lilly scripsit:

  It's not clear to me that the proposal will provide protection
  against the whims of politicians. If the definition of CS as
  a country code changes again under the proposed scheme,
  how is one to determine specifically what some archived
  language-tag referred to at some point in time? I'm not
  particularly concerned about that problem, as I am resigned
  to instability associated with anything specified by politicians
  (and that includes the UN region codes).

 The U.N. Statistics Division are only politicians in the sense
 that IETF WG members are. They are, in fact, statisticians.
 Their track record for stability is considerably longer than the
 IETF's.

I hope that I need not repeat any of the well-known remarks
about statistics.  Nor that I need point to the many uses
by politicians of statistics (and statisticians) for
political purposes.

Moreover, the point is that countries do change, and that use
of country codes (as provided for in RFC 3066 and in the
proposed draft) carries with it the inherent instability
which is characteristic of politics.  A quest for stability
of countries seems Quixotic and oxymoronic.  According to the
principle of stability as that term is used in defense of the
draft, I suppose we're all intended to refer to Malawi as
Rhodesia because that's what it (in part) was called 50 years
ago, or that we're supposed to ignore the breakup of the USSR,
Yugoslavia, etc., the reunification of Germany, etc.

A related problem with the use of country codes in language
tags is that there is not necessarily an inherent relationship
between language and country borders.  The borders of Germany
have changed many, many times.  If one is referring to the
German language as spoken by inhabitants of Alsace, using
country codes would imply that that same language spoken by
the same people would have been tagged at various times as
de-DE and de-FR according to where the France-Germany border
happened to have been determined by politicians of the time.
That strikes me as being a rather silly way to tag language,
but that's the precedent set by RFC 1766.  As far as I can tell,
the draft doesn't really deal with the issue of changing borders
or changing country names -- it merely pretends that these
things don't happen by attempting to declare a snapshot of the
status at some point in time as being valid for all time.

  But if the proposed new registry's description of CS says
  foo and the ISO standard code list says bar, what's
  an implementor supposed to present to a user as *the*
  description associated with CS?

 The former. That's the whole point of having a registry.

But the user has indicated that he speaks French, and the
proposed registry contains a description in English only.
Where is the implementor supposed to get the *official*
translation for display?  N.B. under the current (RFC 3066)
situation, the definitive ISO lists provide an official
description in French.

  One possibility would be two description fields.

 Why two?

There are now two in the ISO lists (and, as noted, in the
UN list).  I have no objection to more, but I object to
a reduction.  The text accompanying the new last call
states:

This specification addresses each of these issues with a simple, elegant
design

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread John Cowan
Bruce Lilly scripsit:

 There is a limited supply of 2-letter codes and the supply
 of 3-digit codes is only slightly greater.  Reassignment of
 codes from such a limited supply is inevitable.  

In the very long run, yes; but even the 75-octet limit probably won't
stand in the *very* long run.  Countries and languages, as opposed
to codes for them, don't come and go like IETF protocols: many of
them have centuries of history, or half a century in the case of the
post-colonialist countries; the events of 1991-93 were historically
anomalous.

 Too late. King Canute commands the tide not to come in, but
 his feet still get wet.  

Canute was making a moral object lesson about the limitations of
kingship, not acting like an idiot.

 But I'm not concerned with translations, but with the
 definitions. And currently the definitions are available
 in French and English.

What of it?  In what case does the provision of a French name
significantly tighten the definition provided by the English
name (or for that matter vice versa)?

 Removing that requirement [for registration] -- as the draft would do
 -- necessitates a specific upper bound on tag length that will work
 with existing core protocols, to replace the reviewer, Area Director,
 and community review process that ensure that current registered tags
 work with those protocols.

Michael, I assume you're ignoring this kerfuffle, and rightly so.
But for the record, have you ever been given cause to take into
account a hard limit in the length of language tags?

-- 
Here lies the Christian,        John Cowan
judge, and poet Peter,          http://www.reutershealth.com
Who broke the laws of God       http://www.ccil.org/~cowan
and man and metre.              [EMAIL PROTECTED]

___
Ietf mailing list
[EMAIL PROTECTED]
https://www1.ietf.org/mailman/listinfo/ietf


RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


  I don't know where the statement accompanying the announcement came from,
 
 According to the New Last Call issued by the IESG Secretary,
 the text is Author's discussion of drivers for this work.
 
  You singled out that one point to comment on as though it were the main
 factor.
 
 I mentioned a matter which was repeatedly indicated as a
 factor for existing implementations and with which I
 strongly disagree.

You have not responded to the point that accessibility of source ISO standards 
is supposed to be a major factor, yet the draft itself clearly indicates 
otherwise.


 [regarding the proposed registry vs. internationally-
 standardized ISO lists for subtag definitions]
  It is certainly the case that only it should be consulted for
 determining what sub-tags are valid with what denotation, which was the
 intent.
 
 That is a problem for existing implementations of RFC 3066
 tags, which can obtain official, internationally agreed
 descriptions of the codes in two languages.

Descriptions (language names) are beyond the scope of RFC 3066. It is a non 
sequitur to claim that this draft creates a problem for existing
implementations of RFC 3066 on this basis.


  By looking in the sub-tag registry. If ISO changed the meaning of US
 to something other than what it is now, its meaning for purposes of use in
 an IETF language tag would not change, because it would remain stable in
 the sub-tag registry. You would be fairly well protected against the whim
 of politicians.
 
 OK, continuing your hypothetical example and its relationship
 to language, suppose that there is another civil war and
 that what now corresponds to US is split into Blue America
 and Red America.  Further suppose that in due course ISO
 assigns some other code to one of those countries and retains
 US for the other, and that that happens after the proposed
 registry is set up with a definition for US and some
 description referring to the old use.

That is a scenario that has been well considered: it would be very bad IT 
practice to redefine a metadata tag US to have a narrower denotation than it 
previously did, as that immediately breaks an unknown amount of existing data. 
If ISO were to make such a change in the meaning of US, then IT 
implementations *absolutely should not* follow suit; the ID US must retain 
its prior, broader meaning.


 Now suppose that one
 wishes to produce an appropriate language tag for the text
 moral values (which clearly has different meaning in Blue
 America (telling the truth, admitting to mistakes, etc.) and
 in Red America (imposing totalitarian control over others)).
 How specifically would the proposed registry handle such a
 change in the meaning of US, and how would the registry
 help differentiate the meaning of a 1990's en-us tag to
 that of the hypothetical time described?

It would leave US with its historic meaning, so that existing data is left
intact. (You wouldn't want a document containing moral values created on the
eve of the civil war by someone supporting the Blue America side of the divide
to suddenly get assigned an interpretation of 'imposing totalitarian control 
over others'.) New identifiers would be assigned for use in IT applications to 
designate the two new countries.


  you already have to look beyond the ISO standards for anything more than
 English and French
 
 But existing RFC 3066 implementations can get official
 descriptions in *both* of those languages; the proposal
 would adversely affect those existing implementations by
 eliminating the French description.
 
 Of course, it is a more serious defect of the proposal
 that it would fail to reflect internationally-agreed
 codes and would fail to keep pace with changes...
 
  it would not be new that you have to look beyond the registry itself to
 decide what human-readable descriptors you should provide in a product.
 
 It would be new that one could not find a standard
 (i.e. official) French-language description in the
 list of codes.

Incorrect. The registry for RFC 3066 did not provide a language/country name in 
*any* language for any ISO 639 or ISO 3166 identifier. Tags registered under 
RFC 3066 included an English-language name and an ASCII-transcription of the 
indigenous name; they did not contain French-language names.

Again, you are trying to impose UI-localization concerns that have always been 
out of scope for the RFC 1766/3066/... sequence of specifications.


   One possibility would be two description fields.  But the
   registry would need a charset closer to ISO-8859-1 than
   to ANSI X3.4 as currently specified.  Or an encoding
   scheme.
 
  Personally, I don't see the value in something like that. Given the
 intent to have a registry that can be machine-readable, changing its
 charset from ANSI X3.4 in order to gain descriptors in just one more
 language is not worth it IMO.
 
 Fine, 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Deborah Goldsmith
And here's hoping they go to four digits or otherwise extend the scheme 
instead of recycling when they run out, a non-hypothetical issue if 
they're already up to 891.

Deborah Goldsmith
Internationalization, Unicode liaison
Apple Computer, Inc.
[EMAIL PROTECTED]
On Dec 13, 2004, at 6:11 AM, John Cowan wrote:
UNSD historically has assigned new numerical codes when new countries
come into existence, and has managed to avoid reusing any of its 3-digit
identifiers



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Mark Crispin
On Sun, 12 Dec 2004, Peter Constable wrote:
That is not at all the aim here wrt stability; rather, the aim is that a
symbolic identifier used for metadata in IT systems not change because
some government on a whim says, We would now prefer to use 'yz' rather
than 'xy' to designate our country.
This point needs to be stressed.
If this registry does not do it, we'll need to create a new one which 
does.

If anything, I am inclined to object to two: to avoid an Anglo-Franco
colonial bias,
Bravo!
If there were to be just two languages, it would need to be Mandarin 
Chinese as primary entry, and English as secondary entry.

either there is one name that is simply a reference name,
or the registry be designed so that it could accommodate names in as
many languages as may be available.
In order to accommodate the Francophiles, we would need first to accommodate
several other languages of greater international prominence than French; 
and by that point the registry would be so unwieldy as to be useless.

Even worse is the matter of coordinating all these various descriptions 
and what happens when (not if) an ambiguity is created because the Lower 
Slobbovian version means something different than the English version?

Among other things, that means that a developer in Lower Slobbovia can't 
use an abridged version of the registry that only has the Lower Slobbovian 
descriptions, because if he is unaware of the other texts he may make an 
unwarranted assumption as to the meaning of that description.

What is done when (not if) international politics rears its ugly head? 
We have numerous instances where the name of a language is official in one 
place, and highly-offensive in another place.

What's more, all of that effort is for naught, since the only thing that 
matters is the tag, a machine-readable token intended to identify a 
language, and not the description.

RFC 3066 *does not at any point* suggest let alone state that
implementations should use ISO 639 language names or ISO 3166 country
names for UI purposes. IMO, you are creating an issue where none exists.
Bravo!
Another point which bears emphasis; these are machine-readable tags for 
the purpose of software, not user interface elements.

IETF language tags are used in a wide variety of applications. The
parties involved in development of this spec (the authors and others)
have examined these issues for the past several years and have arrived
at this architecture.
And have done a fine job at it.
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Mark Crispin
On Sun, 12 Dec 2004, Bruce Lilly wrote:
If by international agreement, 'yz' becomes the designation
for that country, then it is rather silly to stick one's
fingers in one's ears and shout NA-NA-NA-NA-NA I don't want
to hear you.
What is silly is saying that every language tag has to have a date/time 
attribute associated with it so that computer software managing that text 
knows the language of that text.

But that is precisely what you are advocating.
It's rather silly to change that correspondence simply because
a few people are piqued that international agreement has been
reached to change a few 2-letter codes.
It's bad enough that TLDs get recycled.
It is a disaster for language identifiers to get recycled.  Something has 
to make those identifiers unique.  Your notion will force the inclusion of 
a date/time stamp in language tags, to restore the uniqueness that you are 
so excruciatingly eager to abolish.

Never
mind the shortcomings of that particular example; consider
de-DE -- does that mean Germany as it exists today, West
Germany as it existed 25 years ago, Germany as it existed
in the 1930s, the 1900s, ...?
For the 98% case, it does not matter at all.
But it does matter if, one day, DE becomes Denmark.
As far as I can tell, the draft pretends that the meaning
of CS hasn't changed, and would in fact change the meaning
of the currently valid RFC 3066 language tag sr-CS.
No, it restores the previous meaning of sr-CS.
It is very different; under the proposed draft, there is only
an English definition, somebody wishing to provide a French
definition finds that he has none and must resort to an
unofficial translation.
Why is the situation for French different from somebody wishing to
provide a Lower Slobbovian definition?

SO where are the French definitions?
Ask a person who is bilingual in English and French to provide one.
Well, sure. But the name is an important thing by itself.
It is rather pointless to ask a user to indicate the
language of a piece of text by selecting from a list AB, ACE,
ACH,..., ZHA, ZUL, ZUN -- the user doesn't normally refer to
languages by codes. It's quite a different matter to ask the
user to select from Abkhaze, Aceh, Acoli,..., Zhuang (Chuang),
Zoulou, Zuni.
Abkhaze, Aceh, Acoli,..., Zhuang (Chuang), Zoulou, and Zuni are not 
language tags.  So what's your point?

Note that the RFC 3066 specifies a registry that does not include French
language names. I suggest that this issue should be dropped.
Yes, the current IANA registry has that problem for
the non-ISO-based tags only. If the registry is to be
changed to subsume ISO codes as well, that defect should
be remedied.
Why is it a problem?  Why is it a defect?
On the contrary, it is preposterous to suggest that codes
will be attached to text by magic
Here is where you are misled.  Many of these tags are embedded within the 
text itself.  That text may long outlive its author in an archive.

My concern
is the elimination of the French definition in the first place.
Why is this a problem?
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Jo Wilkes
Bruce Lilly [EMAIL PROTECTED] wrote on 2004-12-12 T 18:44:27 -0500

(...)
 RFC 1766 (and 3066) leave you little choice; if you wish
 to indicate a region, you either have to do it with ISO
 639 codes or you have to register a separate tag (no
 separate tag for German as spoken in Alsace exists). Never
 mind the shortcomings of that particular example; consider
 de-DE -- does that mean Germany as it exists today, West
 Germany as it existed 25 years ago, Germany as it existed
 in the 1930s, the 1900s, ...?

As far as I can tell, this is about /language/ tags, not about the
tagging of borders or nationalities.
These last two certainly influence the use of language, but should not
influence the name and tagging of the language itself, unless there is some
forced change of language, which needs to be reflected in tagging.

(...)
 On the contrary, it is preposterous to suggest that codes
 will be attached to text by magic; some human somewhere,
 somehow is going to have to indicate the language to
 something, 

Tagging may also be done by some software instead of a human being. 
Whether you consider that magic is up to you, but I think both ways of 
finding a tag for a given text are possible.

 and it certainly isn't going to be by way of
 a 2- or 3-letter code without some reference to what those
 codes *mean*. 

I beg to differ. Many humans do not know the lists, not even their name, and 
yet they use the codes on a daily basis. They simply recall the codes they've 
seen and found relevant to them, like the TLD ones they are used to; or they 
are told by somebody use this tag when using language A and that tag when 
using language B. This information may be wrong or outdated, of course.


Please do not assume tags will be assigned only by humans who have a recent 
list of the code(s) at hand.

Just my 0.02€.

Best regards,
J. Wilkes




RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


   The point is that under RFC 3066,
   the bilingual ISO language and country code lists are
   considered definitive.
 
  That is nowhere stated or even suggested in RFC 3066.
 
 RFC 3066 section 2.2 states, in part:
 
   - All 2-letter subtags are interpreted according to assignments found
     in ISO standard 639, "Code for the representation of names of
     languages" [ISO 639], or assignments subsequently made by the ISO
     639 part 1 maintenance agency or governing standardization bodies.
 
 and has a similar statement regarding ISO 3166.
 
 interpreted according to assignments found in certainly
 sounds as if the ISO lists are considered definitive for
 their respective categories of subtags, since their
 interpretation is specified as that given in those lists.
 I don't see how the RFC 3066 text can be interpreted
 otherwise.

RFC 3066 indicates that the *interpretation* is determined by the source
ISO standards. You were discussing display names. (Though, now that I've
shown that display names are out of scope, you appear to be attempting
to change things as though you had been discussing definitions.)



Peter Constable
Microsoft Corporation



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-14 Thread Addison Phillips [wM]

 On the contrary, what the authors of a standard intend is not normative.
 As much as possible, every standard must say what it means, because
 what a standard says *is* its technical content.  For example, I'm
 unhappy about an apparent sentiment that would put ABNF on a lower
 footing than the English text.  I think I'm like most implementors and
 perhaps unlike non-engineers in reversing that precedence.  Whenever
 I read an RFC, I rely first and foremost on the ABNF.  I use the English
 only for hints, and follow the ABNF instead of the English whenever
 there is a conflict.

The ABNF is not on a lower footing than the English text. But it is
dependent on the English text in exactly the same way that the ABNF in RFC
3066 was.

I think the suggestion to change the "grandfathered" production is a good
one and will help implementers who start with the ABNF.

I also think, though, that the establishment of a comprehensive (as opposed
to fractional) registry is the real salient point for implementers here. An
implementation of RFC 3066 that follows *only* the ABNF would happily
proliferate garbage tags like "c57-x", not just valid ones. The existence of
a registry in draft-langtags should focus implementers' attention on two
things: the ABNF and the subtags that fit into them. In that regard
draft-langtags will simplify the lives of implementers who do not read the
text (in the same way that having a registry for character encoding
names--"charsets"--does).


 There are a couple other issues that ought to be addressed.

 I think Bruce Lilly started by charging that a potentially disruptive
 document had reached last-call without any review by those concerned
 with related, affected IETF standards.  That sounds like a process
 problem that needs at least 1% as many words as have been spent in
 this mailing list in lawyerly talk such as whether "accounts" is more
 appropriate than "account."

The IETF process is not really my concern. I will note that many IETF and
non-IETF standards folks have participated in the process of developing and
reviewing draft-langtags, though. I don't know if a wider audience should
have been invoked earlier in the process. Mark and I welcome comments and
questions on the technical suitability of our draft. I think that we have
fully and carefully considered the potential impact and, in fact, have
helped to stabilize language tags, not just now but for the future as well.

Peter made the argument that future I-D authors could write a draft that
does whatever they please with regard to language tags. Which is true.
However, draft-langtags lays down a framework that should guide the
activities of these authors and constrain the changes they make in a manner
that is completely compatible with implementations of draft-langtags (not to
mention RFC 3066 and RFC 1766). I think that a guarantee of future
stability---in implementations (including current ones), extensions, and the
tags (data) themselves---is of great benefit to related and/or affected IETF
standards.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Mark Davis
 The ABNF is an expression of the grammar that
describes the set of all valid tags.

No, this is simply incorrect. You cannot expect that any implementation that
simply does the ABNF is conformant. There are a great many constraints on
the tags that are not in the ABNF grammar, that are clearly required in any
reading of the text. Most of these *cannot* be encompassed in any ABNF
grammar. There are a few that could be expressed in the ABNF; some at little
cost, some with a great deal of complication. This is not a technical
problem for the draft.

 as reasonable as the current worst-case of 11 octets.
Also simply untrue. You seem not to be reading all the messages on this
subject. Look at the ABNF for RFC 3066. There is *no* limit in the ABNF
there!


   The syntax of this tag in ABNF [RFC 2234] is:

Language-Tag = Primary-subtag *( "-" Subtag )

Primary-subtag = 1*8ALPHA

Subtag = 1*8(ALPHA / DIGIT)


-- http://www.ietf.org/rfc/rfc3066.txt?number=3066
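
As a rough illustration of the point that the RFC 3066 ABNF carries no
overall length limit, the three productions quoted above can be
transcribed into a regular expression. This is only a sketch: the names
RFC3066_TAG and is_rfc3066_syntax are invented for this example, and the
check covers syntax only, not whether a tag is registered or meaningful.

```python
import re

# Regex transcription of the RFC 3066 ABNF quoted above:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
# RFC3066_TAG is a name made up for this sketch.
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_syntax(tag: str) -> bool:
    """Return True if tag matches the RFC 3066 grammar (syntax only)."""
    return RFC3066_TAG.fullmatch(tag) is not None


# Each subtag is capped at 8 characters, but the number of subtags is not,
# so the grammar alone puts no bound on the total tag length.
long_tag = "en" + "-abcdefgh" * 100

print(is_rfc3066_syntax("sr-CS"))                   # True
print(is_rfc3066_syntax(long_tag), len(long_tag))   # True 902
print(is_rfc3066_syntax("en-abcdefghi"))            # False: 9-character subtag
```

Any practical ceiling on tag length therefore has to come from somewhere
other than this grammar, for example from registration rules or from an
explicit limit stated in prose.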


Mark

- Original Message - 
From: Bruce Lilly [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 20:39
Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP


 RE: New Last Call: 'Tags for Identifying Languages' to BCP
  Date: 2004-12-10 20:03
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED]

 Resuming my comments:

  Specifically, the draft allows, and RFC 3066 disallows:
  subtags more than 8 octets in length
  hyphens which do not separate subtags
  zero-length subtags
  primary tags which are not purely alphabetic
  Curiously, all of those are permitted by the draft ABNF
  production grandfathered...

 The grandfathered production in the current draft is

 grandfathered = ALPHA *(alphanum / "-")

 which does permit the sequences claimed by Bruce (except for
 not-purely-alphabetic primary sub-tags),

No exception.  alphanum is ALPHA / DIGIT.  In plain
English, grandfathered as defined in the draft is a letter
followed by any number of letters, digits, and/or hyphens, in
any order.  And that includes a123-xyz as I initially stated,
and clearly 1, 2, and 3 are digits.

 syntactically; but the set of
 tags available for use is constrained by more than the ABNF syntax
 alone: the acceptable productions for each sub-tag must either be taken
 from one of the source standards or be registered.

So what? The ABNF is an expression of the grammar that
describes the set of all valid tags.  If the grammar permits
y-, a123-xyz, etc. (and it does) then a parser
claiming to parse language tags as defined by that ABNF
must be able to parse such tags.  That is, the ABNF-
specified grammar imposes requirements on parsers.  If
one doesn't intend to impose such requirements, the
ABNF specifying the grammar should be changed
accordingly.
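
The difference in what a parser must accept can be sketched by turning
both grammars into regular expressions and feeding them the same strings.
This is an illustration only: the two pattern names and the helper
accepted_by are invented here, and the patterns are hand transcriptions
of the productions as quoted in this thread, not text taken from either
document.

```python
import re

# grandfathered = ALPHA *(alphanum / "-")      (draft production as quoted)
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
# Language-Tag = 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) )      (RFC 3066)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def accepted_by(tag: str) -> tuple:
    """Return (draft grandfathered accepts?, RFC 3066 accepts?) for tag."""
    return (
        DRAFT_GRANDFATHERED.fullmatch(tag) is not None,
        RFC3066_TAG.fullmatch(tag) is not None,
    )


# Trailing hyphen, digits in the primary subtag, empty subtag,
# an 11-letter subtag, and one tag that both grammars accept.
for tag in ["y-", "a123-xyz", "a--b", "abcdefghijk", "i-klingon"]:
    draft_ok, rfc3066_ok = accepted_by(tag)
    print(f"{tag!r}: draft grandfathered={draft_ok}, RFC 3066={rfc3066_ok}")
```

Only the last string is accepted by both patterns; the first four are
accepted by the grandfathered pattern alone.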

 This is no different
 from RFC 3066, so it is no more of a problem in this specification than
 it was in RFC 3066.

It is a very different grammar from RFC 3066, imposing
very different requirements on parsers.

 It might be that the wording in 2.2 could be tightened up to eliminate
 any possible question regarding the source for grandfathered
 productions.

It's not a matter of wording; the problem is with the ABNF.

 Alternately, there's no reason why the grandfathered production
 shouldn't be composed exactly to match what was used in RFC 3066:

 grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I believe I said as much (though one then needs to look
at reduce/reduce conflicts implied by the revised grammar):

  I see no reason for the ABNF to permit such content as is
  forbidden by RFC 3066; the actual ABNF for what RFC 3066
  permits is contained within 3066, and could have been directly
  incorporated rather than producing a grandfathered
  production which opens up several cans of worms.

 This vastly overstates the problem. There is no can of worms unless it
 exists in tags currently available under RFC 3066.

I referred to the additional requirements imposed on
parsers, as well as the unlimited tag length permitted.

  One defect related to tag length in RFC 3066 is not remedied
  by the draft; indeed the problem is greatly exacerbated...

  Unfortunately, a language- tag's length is unlimited by
  the ABNF in RFC 3066 (due to an unlimited number of subtags)
  and in the draft...

  In particular, tags other than private-use tags with more than
  two subtags require registration under RFC 3066 rules, and it
  is a trivial matter to determine the longest registered tag.
  The draft, however, encourages use of more subtags as well as
  removal of the subtag length upper bound; moreover, it permits
  infinite numbers of subtags without requiring registration of
  the resulting complete tag.

 Bruce states incorrectly that there is no upper bound on the length of
 sub-tags.

Look again at the draft definition of grandfathered -- now
show me where there's a limit in that production on subtag
length.

 His other concern

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread John Cowan
Bruce Lilly scripsit:

 Moreover, the point is that countries do change, and that use
 of country codes (as provided for in RFC 3066 and in the
 proposed draft) carries with it the inherent instability
 which is characteristic of politics.  A quest for stability
 of countries seems Quixotic and oxymoronic.  

Of course countries change, and then the numeric country codes change
as well.  The point is that the alpha codes change for political reasons
when there has been *no* change in the underlying country:  Romania's
3-alpha code changed from ROM to ROU without any change in Romania at all.
The CS case is particularly gratuitous, as its denotation changed from
Czechoslovakia (a no longer existent country) to Serbia and Montenegro
(a newly created country).

 A related problem with the use of country codes in language
 tags is that there is not necessarily an inherent relationship
 between language and country borders.  

Of course not.  But for the most part, variations in orthography
do tend to follow national boundaries, since orthography in many
languages is either de jure or de facto a national matter.

 As far as I can tell,
 the draft doesn't really deal with the issue of changing borders
 or changing country names -- it merely pretends that these
 things don't happen by attempting to declare a snapshot of the
 status at some point in time as being valid for all time.

No, it attempts to freeze the code-to-country mapping at a single
point.  New countries or changes in old countries should involve only the
additions of codes, not the reuse of old codes.
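
The "additions only, no reuse" rule can be pictured as an append-only
table. The sketch below is hypothetical throughout: the class name
FrozenCodeRegistry and its methods are invented for illustration and do
not describe the draft's actual registry format or procedures.

```python
class FrozenCodeRegistry:
    """Hypothetical append-only registry: codes may be added, never reassigned."""

    def __init__(self):
        self._meaning = {}

    def add(self, code: str, meaning: str) -> None:
        existing = self._meaning.get(code)
        if existing is not None and existing != meaning:
            # Reusing an old code for a new denotation is refused.
            raise ValueError(
                f"{code!r} already denotes {existing!r}; assign a new code instead"
            )
        self._meaning[code] = meaning

    def lookup(self, code: str) -> str:
        return self._meaning[code]


registry = FrozenCodeRegistry()
registry.add("CS", "Czechoslovakia")
try:
    registry.add("CS", "Serbia and Montenegro")  # attempted reuse of an old code
except ValueError as err:
    print("rejected:", err)
print(registry.lookup("CS"))  # Czechoslovakia
```

Under this model a tag written against the old meaning keeps that meaning
indefinitely, and a new country simply receives a code that has never
been used before.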

 Where is the implementor supposed to get the *official*
 translation for display?  

I don't know.  Where is the implementor supposed to get the
official German, or Catalan, or Mandarin translations?
Not in the ISO registry, for sure.  To say nothing of the
cases where no official translations exist.

  There are 6000 languages spoken on Earth, of which 
  perhaps 600 have a standard written form.
 
 ISO 639 lists about 650, not precisely 6000.

ISO 639-2 is deliberately incomplete.  The current draft of ISO 639-3,
which is not yet an IS, lists over 7000 languages.

 It might be worthwhile considering the differences in the
 way languages tags are used, by whom they are used, and for
 what purpose.  There may well be a substantial difference
 between use of a tag to represent an obscure dialect of a
 dead language in a research paper vs. tagging a piece of
 text in one of the core Internet protocols such as SMTP.

That count does not include dead languages.  Whether it includes
dialects is a matter of terminology.

-- 
Deshil Holles eamus.  Deshil Holles eamus.  Deshil Holles eamus.
Send us, bright one, light one, Horhorn, quickening, and wombfruit. (3x)
Hoopsa, boyaboy, hoopsa!  Hoopsa, boyaboy, hoopsa!  Hoopsa, boyaboy, hoopsa!
  -- Joyce, Ulysses, Oxen of the Sun   [EMAIL PROTECTED]



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable
Resuming my comments:


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly

[snip]

 Specifically, the draft allows, and RFC 3066 disallows:
subtags more than 8 octets in length
hyphens which do not separate subtags
zero-length subtags
primary tags which are not purely alphabetic
 Curiously, all of those are permitted by the draft ABNF
 production grandfathered...

The grandfathered production in the current draft is 

grandfathered   = ALPHA *(alphanum / "-")

which does permit the sequences claimed by Bruce (except for
not-purely-alphabetic primary sub-tags), syntactically; but the set of
tags available for use is constrained by more than the ABNF syntax
alone: the acceptable productions for each sub-tag must either be taken
from one of the source standards or be registered. This is no different
from RFC 3066, so it is no more of a problem in this specification than
it was in RFC 3066.

It might be that the wording in 2.2 could be tightened up to eliminate
any possible question regarding the source for grandfathered
productions. Maybe it's not as obvious to someone coming to this cold as
it is for us who have been discussing it for the past year.

Alternately, there's no reason why the grandfathered production
shouldn't be composed exactly to match what was used in RFC 3066:

grandfathered = 1*8ALPHA *("-" 1*8alphanum)

So, perhaps there is room for technical improvement, but there are not
any serious problems IMO -- certainly nothing as serious as the tone of
Bruce's message conveyed.


 I see no reason for the ABNF to permit such content as is
 forbidden by RFC 3066; the actual ABNF for what RFC 3066
 permits is contained within 3066, and could have been directly
 incorporated rather than producing a grandfathered
 production which opens up several cans of worms.

This vastly overstates the problem. There is no can of worms unless it
exists in tags currently available under RFC 3066.

 
 One defect related to tag length in RFC 3066 is not remedied
 by the draft; indeed the problem is greatly exacerbated...

 Unfortunately, a language- tag's length is unlimited by
 the ABNF in RFC 3066 (due to an unlimited number of subtags)
 and in the draft...

 In particular, tags other than private-use tags with more than
 two subtags require registration under RFC 3066 rules, and it
 is a trivial matter to determine the longest registered tag.
 The draft, however, encourages use of more subtags as well as
 removal of the subtag length upper bound; moreover, it permits
 infinite numbers of subtags without requiring registration of
 the resulting complete tag.

Bruce states incorrectly that there is no upper bound on the length of
sub-tags. His other concern, on the overall length of complete tags, is
valid, however: in terms of the ABNF syntax for both RFC 3066 and RFC
3066bis, infinite-length productions are possible, but RFC 3066 would
require registration of complete non-private-use tags while RFC 3066bis
does not.

There are three open doors for infinite-length productions in the ABNF
of the current draft:

- unlimited extlang sub-tags
- unlimited variant sub-tags
- the number of possible extensions is limited to 25, but the length of
extensions is unlimited

We could impose some upper limits on these things; e.g.

Language-Tag = ... *8("-" extlang) ... *8("-" variant) ... 1*25("-" extension)
...
extension = singleton 1*8("-" 2*8alphanum)

If we also imposed limits on the length of private-use tags and defined
the grandfathered production in a way that made clear there was an upper
limit for those, then we could end up eliminating an issue that had
existed in RFC 3066.
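
The bounded extension production suggested above can be sketched as a regular expression. This is only an illustration in Python, not text from the draft: the names SINGLETON, EXTENSION and is_bounded_extension are made up here, and the definition of singleton (one ASCII letter other than "x") is an assumption chosen to be consistent with the limit of 25 extensions.

```python
import re

# Assumption: a singleton is one ASCII letter other than "x"/"X"
# (25 letters, case-insensitively).
SINGLETON = r"[A-WY-Za-wy-z]"

# Suggested production: extension = singleton 1*8("-" 2*8alphanum)
EXTENSION = re.compile(SINGLETON + r"(?:-[A-Za-z0-9]{2,8}){1,8}")


def is_bounded_extension(ext: str) -> bool:
    """Return True if ext matches the bounded extension production."""
    return EXTENSION.fullmatch(ext) is not None


print(is_bounded_extension("a-foo-bar"))             # within the bounds
print(is_bounded_extension("a" + "-abcdefgh" * 8))   # longest allowed form
print(is_bounded_extension("a" + "-abcdefgh" * 9))   # one subtag too many
print(is_bounded_extension("x-foo"))                 # "x" is not a singleton here
```

With bounds of this kind, an implementation can size its buffers from the grammar alone rather than from the registry.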

So, I think Bruce has identified a valid issue here. I personally would
not have characterized it as greatly exacerbating, though, as the issue
was present in RFC 3066: private-use tags did not need to be registered
in RFC 3066, so there was no way an implementation could be written with
certain knowledge that tags beyond some given length would not be
encountered.


  The new registry provides a complete,
  easily parseable file which provides the precise contents of
  valid tags for any point in time.
 
 That is the first time I have ever heard ISO 8601 date
 format described as easily parseable.  Perhaps the draft
 authors meant to say that a specific subset of the tortuously
 complex ISO 8601 date format is used, but that is not what
 the draft states...

It seems very clear that the authors intended only a specific subset:
YYYY-MM-DD. This is a minor technical issue that the authors can very
easily remedy.


 I am absolutely shocked that a draft dealing with language
 lacks an Internationalization considerations section as
 recommended by RFC 2277 (a.k.a. BCP 18).

No more or less shocking than for RFC 3066, regarding which I'm not
aware of any complaints.

I don't quite understand what the critique is here: what is there to
internationalize about language tags? They are 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Mark Davis
 Show me a general-use RFC
 3066 language tag which is too long to fit on an RFC
 2822/3282 Content-Language header field line.

Your claim was that RFC 3066bis (the informal name we've been using for the
new draft) permits language tags that are longer than those permitted by RFC
3066. That is clearly false, as many people have pointed out. Any subsequent
niggling that particular *types* of language tags can be longer or not is
not relevant to the conformance implications of the two documents for
language tags. The new draft neither extends nor contracts the maximum
length of language tags conformant to RFC 3066.

Your claim that the RFC 3066 ABNF itself has a restriction in length is also
clearly false. I will quote that again since you seem somehow not to have
seen it:


   The syntax of this tag in ABNF [RFC 2234] is:
Language-Tag = Primary-subtag *( "-" Subtag )
Primary-subtag = 1*8ALPHA
Subtag = 1*8(ALPHA / DIGIT)


Both documents establish many further limitations on the contents of
language tags in the text of each document. Ignoring those stated
limitations will, in both documents, result in nonconformant language tags.
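
To make the point about the ABNF concrete, the three RFC 3066 productions quoted above can be transcribed into a regular expression. This is a rough sketch in Python; the names RFC3066_SYNTAX and is_syntactically_valid are invented for the example, and it checks syntax only, none of the limitations stated in the prose of either document.

```python
import re

# Transcription of the RFC 3066 ABNF quoted above:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_SYNTAX = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_syntactically_valid(tag: str) -> bool:
    """Return True if tag matches the ABNF alone (no registry or prose checks)."""
    return RFC3066_SYNTAX.fullmatch(tag) is not None


# Each subtag is capped at 8 characters, but the number of subtags is not,
# so the grammar by itself accepts tags of any length.
long_tag = "sr-CS-891" + "-boont-gaulish-guoyu" * 3
print(is_syntactically_valid(long_tag), len(long_tag))
print(is_syntactically_valid("en-" + "-".join(["boont"] * 1000)))
print(is_syntactically_valid("toolongprimary-US"))  # primary subtag over 8 letters
```

Passing this check says nothing about conformance; it only shows that the grammar places no ceiling on overall tag length.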


Mark

- Original Message - 
From: Bruce Lilly [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Sunday, December 12, 2004 09:16
Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP


   Date: 2004-12-11 00:52
   From: Mark Davis [EMAIL PROTECTED]
   To: [EMAIL PROTECTED], [EMAIL PROTECTED]
   CC: [EMAIL PROTECTED]
 
   The ABNF is an expression of the grammar that
  describes the set of all valid tags.
 
  No, this is simply incorrect. You cannot expect that any implementation
  that simply does the ABNF is conformant.

 I made no such claim.  I do claim that if the ABNF
 contradicts the normative text, as is the case in
 your draft w.r.t. acceptance of several constructs
 not permitted by RFC 3066 ABNF, that there is an
 error in either the normative text or the ABNF.

  There are a great many constraints on
  the tags that are not in the ABNF grammar, that are clearly required in any
  reading of the text. Most of these *cannot* be encompassed in any ABNF
  grammar.

 If your claim is that the ABNF cannot express a
 grammar consistent with the RFC 3066 ABNF, that
 is clearly false.

  There are a few that could be expressed in the ABNF; some at little
  cost, some with a great deal of complication.

 Are you claiming that it is unduly difficult to
 make the ABNF match RFC 3066's?

  This is not a technical
  problem for the draft.

 It is a problem due to the conflict between the
 ABNF and the text.  It is a problem because it
 opens a loophole for future revisions to formalize
 content which is incompatible with RFC 3066
 implementations.

   as reasonable as the current worst-case of 11 octets.
  Also simply untrue. You seem not to be reading all the messages on this
  subject. Look at the ABNF for RFC 3066. There is *no* limit in the ABNF
  there!

 The draft proposes closing RFC 3066-style registrations.
 Show me a registered RFC 3066 language tag longer than
 11 octets.  Show me a general-use (i.e. not private-use)
 RFC 3066 language tag which is too long to be used in an
 RFC 2047/2231 encoded-word.  Show me a general-use RFC
 3066 language tag which is too long to fit on an RFC
 2822/3282 Content-Language header field line.
 ___
 Ietf-languages mailing list
 [EMAIL PROTECTED]
 http://www.alvestrand.no/mailman/listinfo/ietf-languages



___
Ietf mailing list
[EMAIL PROTECTED]
https://www1.ietf.org/mailman/listinfo/ietf


RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


 The point is that under RFC 3066,
 the bilingual ISO language and country code lists are
 considered definitive.

That is nowhere stated or even suggested in RFC 3066.


Peter Constable
Microsoft Corporation



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Keld Jørn Simonsen
On Mon, Dec 13, 2004 at 01:37:04AM -0800, Mark Crispin wrote:
 
 When I retrieve a file via FTP, HTTP, etc. the time stamp of that file on 
 my computer is the date/time of retrieval, not the date/time of the file 
 on the source.
 
 Unless, of course, both systems are running TOPS-20 and thus use that 
 wonderful XTP mode that copies file metadata.  Now, if you want to mandate 
 that all UNIX and Windows systems be replaced with TOPS-20, I might 
 support that... :-)

Actually some ftp and http transfer programs, incl wget and ncftp keep
the original date stamp.

With Xmas greetings
Keld



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


 Moreover, the point is that countries do change, and that use
 of country codes (as provided for in RFC 3066 and in the
 proposed draft) carries with it the inherent instability
 which is characteristic of politics.  A quest for stability
 of countries seems Quixotic and oxymoronic.  According to the
 principle of stability as that term is used in defense of the
 draft, I suppose we're all intended to refer to Malawi as
 Rhodesia because that's what it (in part) was called 50 years
 ago, or that we're supposed to ignore the breakup of the USSR,
 Yugoslavia, etc., the reunification of Germany, etc.

That is not at all the aim here wrt stability; rather, the aim is that a
symbolic identifier used for metadata in IT systems not change because
some government on a whim says, "We would now prefer to use 'yz' rather
than 'xy' to designate our country."

Sure, there will be changes that we need to deal with; but there's no
reason to subject all implementations, users and data to changes that
are purely cosmetic changes to things that are not designed to be read
by humans.

 
 A related problem with the use of country codes in language
 tags is that there is not necessarily an inherent relationship
 between language and country borders.

That is not what country IDs within a language tag are intended to
suggest. In fact, if there were inherent relationships, we probably
would never have needed to use country IDs in a language tag.


 The borders of Germany
 have changed many, many times.  If one is referring to the
 German language as spoken by inhabitants of Alsace, using
 country codes would imply that that same language spoken by
 the same people would have been tagged at various times as
 de-DE and de-FR according to where the France-Germany border
 happened to have been determined by politicians of the time.
 That strikes me as being a rather silly way to tag language,
 but that's the precedent set by RFC 1766.

I agree that that's a silly way to tag that language; I disagree that
RFC 1766 suggests I should tag it that way. 


 As far as I can tell,
 the draft doesn't really deal with the issue of changing borders
 or changing country names -- it merely pretends that these
 things don't happen by attempting to declare a snapshot of the
 status at some point in time as being valid for all time.

That may be your reading of the situation, but it is not how it is seen
by those of us who have been working on this spec and examining these
issues closely.



 But the user has indicated that he speaks French, and the
 proposed registry contains a description in English only.
 Where is the implementor supposed to get the *official*
 translation for display?  N.B. under the current (RFC 3066)
 situation, the definitive ISO lists provide an official
 description in French.

Neither RFC 1766 nor RFC 3066 has ever presented official translations;
this is no different for RFC 3066bis. Under RFC 3066, one is pointed to
ISO 639-1 and ISO 639-2 to get the alpha-2 and alpha-3 IDs, but it does
not anywhere state that implementors should use the English and French
language names in those ISO standards; exactly the same situation holds
for RFC 3066bis. (Note, btw, that the names listed by ISO 639-1/-2 have
no particular official status; they are normative in those standards
to the extent that they indicate what language variety a given ID
denotes, but they do not claim that the particular form of the language
names has any particular status.) 



   One possibility would be two description fields.
 
  Why two?
 
 There are now two in the ISO lists (and, as noted, in the
 UN list).  I have no objection to more, but I object to
 a reduction.

If anything, I am inclined to object to two: to avoid an Anglo-Franco
colonial bias, either there should be one name that is simply a reference
name, or the registry should be designed so that it could accommodate names in as
many languages as may be available. 

Note that RFC 3066 specifies a registry that does not include French
language names. I suggest that this issue should be dropped.



 I have an implementation which (in accordance with RFC 3066)
 uses the official ISO lists. It has provision for displaying
 ISO 639 language tags with their descriptions in either of the
 two languages supported by the official 639 lists, and likewise
 for the ISO 3166 country codes.  

RFC 3066 *does not at any point* suggest let alone state that
implementations should use ISO 639 language names or ISO 3166 country
names for UI purposes. IMO, you are creating an issue where none exists.


 The specification of the
 draft is *NOT* compatible with that existing implementation
 because it removes the existing functionality of official
 descriptions in French of language and country codes. As a
 result of that incompatibility,  the newly proposed
 specification does not work with (at least that one)
 existing implementation 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread John Cowan
Bruce Lilly scripsit:

 Feh. Whatever. The human-readable stuff that corresponds
 to the code which you say shouldn't be read.   The stuff
 without which codes are meaningless.  The stuff without
 which two communicating parties cannot agree on the meaning
 of XX.

Two communicating parties can unquestionably agree on the meaning of
XX without both English and French definitions.  Either will suffice.
Indeed, if either definition provided a nuance not available to the
other, they would not be interchangeable, and one would have to be
the authentic definition and the other a mere aide-memoire.

-- 
[W]hen I wrote it I was more than a little  John Cowan
febrile with foodpoisoning from an antique carrot   [EMAIL PROTECTED]
that I foolishly ate out of an illjudged faith  www.ccil.org/~cowan
in the benignancy of vegetables.  --And Rosta   www.reutershealth.com



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Mark Davis
 Are you claiming that

 sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu

 is nonconformant per some specification in the draft
 proposal?

Clearly not. But

  x-sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu

is already absolutely conformant with the current RFC 3066. And the current
RFC 3066 clearly permits the registration of something as long as

  sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu

(although of course this particular combination would certainly never get
in).

No use going any further...

There is no use in trying to declare a difference in conformant lengths
between these two documents when one doesn't exist. If you want to do
something productive, you should make a practical suggestion for a change in
the current text of the new draft. If the new draft is to be backward
compatible, then it has to be worded carefully. I haven't thought it through
at length, but it would need to be something like:

- A conformant implementation need not support the storage of language tags
which exceed a specified length. However, such a limitation must be clearly
documented, including the disposition of any longer tags (for example,
whether an error value is generated or the language tag is truncated -- and
if so, how it is to be truncated).
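
What such a documented limitation might look like in an implementation is sketched below. This is a hypothetical Python illustration, not wording proposed for the draft: the names Overflow, TagTooLongError and store_tag are invented, and truncating at subtag boundaries is just one possible policy among several.

```python
from enum import Enum


class Overflow(Enum):
    """Documented disposition of tags that exceed the storage limit."""
    ERROR = "error"
    TRUNCATE = "truncate"


class TagTooLongError(ValueError):
    """Raised when a tag cannot be stored within the limit."""


def store_tag(tag: str, max_len: int, policy: Overflow) -> str:
    """Return the form of tag that would be stored under the given limit."""
    if len(tag) <= max_len:
        return tag
    if policy is Overflow.ERROR:
        raise TagTooLongError(f"tag exceeds {max_len} characters")
    # Truncate by dropping whole trailing subtags, never cutting one in half.
    kept = []
    length = 0
    for subtag in tag.split("-"):
        extra = len(subtag) + (1 if kept else 0)  # account for the hyphen
        if length + extra > max_len:
            break
        kept.append(subtag)
        length += extra
    if not kept:
        raise TagTooLongError("primary subtag alone exceeds the limit")
    return "-".join(kept)


print(store_tag("sr-CS-891-boont-gaulish-guoyu", 11, Overflow.TRUNCATE))
```

Whichever policy is chosen, the point of the suggested wording is that the limit and the disposition of longer tags are written down rather than left implicit.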

Mark

- Original Message - 
From: Bruce Lilly [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Sunday, December 12, 2004 12:20
Subject: Re: New Last Call: 'Tags for Identifying Languages' to BCP


   Date: 2004-12-12 13:00
   From: Mark Davis [EMAIL PROTECTED]
   To: [EMAIL PROTECTED], [EMAIL PROTECTED]
   CC: [EMAIL PROTECTED]

  Your claim that the RFC 3066 ABNF itself has a restriction in length is
  also clearly false. I will quote that again since you seem somehow not to
  have seen it:

 I made no such claim; indeed it was I who pointed out
 that RFC 3066 *theoretically* permits an infinite-
 length tag.  On that basis alone (even if you missed
 the fact that I am an implementor of RFC 3066
 language tags) you can be sure that I am well aware
 of the RFC 3066 ABNF.

  Both documents establish many further limitations on the contents of
  language tags in the text of each document. Ignoring those stated
  limitations will, in both documents, result in nonconformant language
  tags.

 Are you claiming that

 sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu

 is nonconformant per some specification in the draft
 proposal?  It is certainly too long to be used in an
 RFC 2047/2231 encoded-word.  It is much longer than
 any registered RFC 3066 language tag, and the draft
 proposes removing full tag registration procedure
 restrictions as well as decoupling use from registration
 that would combine to permit such an abomination.





RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


  What is silly is saying that every language tag has to have a date/time
  attribute associated with it so that computer software managing that text
  knows the language of that text.
 
 In the specific cases of the core Internet protocols that
 I have mentioned, there *is* a date/time attribute in the
 form of an RFC [2]822 Date field.  If we're talking about
 some file stored on some machine, every OS that I know of
 has a date/time stamp associated with that file.  If you
 have something else in mind, a concrete description and/
 or example might help.

That is not sufficient for many other implementations of RFC 3066. For
instance, an XML document may well be stored in a file system that has
date/time stamps associated with the file; it might also be stored in a
content management system that does not report creation dates when
returning content. And elements from within that XML document may be
returned as the result of an XPath query or a call into a DOM API, and
those surely cannot be assumed to have creation date/time stamps, though
one certainly must assume that they can have RFC 3066 tags as xml:lang
attributes.


 I'm not eager to abolish uniqueness.  There never was
 any guarantee that codes would never change. Both RFCs
 1766 and 3066 specifically mention changes as a fact of
 life.

Some of us consider that fact and the instability particularly of ISO
3166 to be a serious problem. That (not accessibility) was one of the
key reasons for this revision.


   SO where are the French definitions?
 
  Ask a person who is bilingual in English and French to provide one.
 
 That would lack definitiveness which characterizes the
 ISO lists.

You started out this thread by talking about display names, not
definitions; hence Mark's suggestion. Now you have switched to talking
about definitions. The draft clearly indicates where one finds the
definitions:

   o  All 2-character language subtags were defined in the IANA registry
      according to the assignments found in the standard ISO 639...

I.e. the definition is provided in the registry on the basis of what is
defined in ISO 639; hence if what is indicated in the registry is for
any reason insufficient for your purposes, you consult the definitive
source, the ISO standard.



Peter Constable
Microsoft Corporation



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


  The grandfathered production in the current draft is
 
  grandfathered   = ALPHA *(alphanum / "-")
 
  which does permit the sequences claimed by Bruce (except for
  not-purely-alphabetic primary sub-tags),
 
 No exception.  alphanum is ALPHA / DIGIT.

My mistake; again, I had in mind constraints beyond the ABNF.


  syntactically; but the set of
  tags available for use is constrained by more than the ABNF syntax
  alone: the acceptable productions for each sub-tag must either be taken
  from one of the source standards or be registered.
 
 So what? The ABNF is an expression of the grammar that
 describes the set of all valid tags.

It is *part* of the expression of the grammar. Even in RFC 3066 this is the 
case: you know that t-abc is not valid under RFC 3066, but not because that is 
constrained by the ABNF of RFC 3066.
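
The t-abc case can be sketched as a two-stage check: the ABNF first, then a constraint that lives only in the prose. This is a hypothetical Python illustration; the function names are invented, and the second stage encodes just one prose rule as I read RFC 3066 (a single-letter primary subtag is reserved to "i" and "x"), not the full set of constraints.

```python
import string

ALPHA = set(string.ascii_letters)
ALPHANUM = ALPHA | set(string.digits)


def matches_abnf(tag: str) -> bool:
    """Stage 1: the RFC 3066 ABNF only (1*8ALPHA, then 1*8alphanum subtags)."""
    primary, *rest = tag.split("-")
    if not (1 <= len(primary) <= 8 and set(primary) <= ALPHA):
        return False
    return all(1 <= len(s) <= 8 and set(s) <= ALPHANUM for s in rest)


def passes_primary_subtag_rule(tag: str) -> bool:
    """Stage 2: a prose constraint; single-letter primaries must be 'i' or 'x'."""
    primary = tag.split("-")[0].lower()
    return len(primary) != 1 or primary in {"i", "x"}


for tag in ("t-abc", "x-abc", "i-klingon", "en-US"):
    print(tag, matches_abnf(tag), matches_abnf(tag) and passes_primary_subtag_rule(tag))
```

A parser written from stage 1 alone accepts t-abc; only the second stage, which the ABNF cannot see, rejects it.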

I will accept that the ABNF of the draft should be changed to better reflect what 
the form of grandfathered productions can be, which, as I stated in my previous 
message, would be the equivalent of the ABNF of RFC 3066:

grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I think that's an improvement, though technically I don't think it changes 
anything.


 If
 one doesn't intend to impose such requirements, the
 ABNF specifying the grammar should be changed
 accordingly.
 
  This is no different
  from RFC 3066, so it is no more of a problem in this specification than
  it was in RFC 3066.
 
 It is a very different grammar from RFC 3066, imposing
 very different requirements on parsers.

Our disagreement amounts to a basic question of whether parsers should be 
written based on the ABNF alone, or based on the ABNF plus other constraints 
provided in the spec. Clearly, I think anyone writing a parser should consider 
other constraints as well.



   In particular, tags other than private-use tags with more than
   two subtags require registration under RFC 3066 rules, and it
   is a trivial matter to determine the longest registered tag.
   The draft, however, encourages use of more subtags as well as
   removal of the subtag length upper bound; moreover, it permits
   infinite numbers of subtags without requiring registration of
   the resulting complete tag.
 
  Bruce states incorrectly that there is no upper bound on the length of
  sub-tags.
 
 Look again at the draft definition of grandfathered -- now
 show me where there's a limit in that production on subtag
 length.

As mentioned, the limit is imposed by other tight constraints on 
'grandfathered'; you have already identified that the longest registered tag 
under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be 
at most 11 octets in length.



  There are three open doors for infinite-length productions in the ABNF
  of the current draft:
 
  - unlimited extlang sub-tags
  - unlimited variant sub-tags
  - the number of possible extensions is limited to 25
...
  , but the length of
  extensions is unlimited
 
 You have missed several others:
 
 1. privateuse length is unlimited (either tacked on
 after lang etc., or directly as an alternative in
 Language-Tag)

I disregarded this since it is identical to the case for RFC 3066, and you 
were, after all, charging that the draft creates problems that were worse than 
for RFC 3066.


 2. grandfathered, which as already discussed
 permits unlimited length.

But as already stated is very tightly constrained, with a de-facto upper limit 
of 11 (subject to change if new tags are registered before the proposed spec is 
accepted).


  We could impose some upper limits on these things...

 That leaves the extension portions' length at up to
 25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
 of a tag into account!   That's way too long (the RFC 2047
 limit for an encoded-word is 75 octets, including charset tag,
 some text, and some syntactic glue in addition to the language
 tag).
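
The arithmetic above can be reproduced with a few lines of Python. The constants are taken from the bounds suggested earlier in the thread (up to 25 extensions, each a singleton followed by up to 8 subtags of up to 8 characters); the variable names are invented for this sketch.

```python
MAX_EXTENSIONS = 25             # 1*25("-" extension)
MAX_SUBTAGS_PER_EXTENSION = 8   # 1*8("-" 2*8alphanum)
MAX_SUBTAG_LEN = 8              # 2*8alphanum
ENCODED_WORD_LIMIT = 75         # RFC 2047 limit on an encoded-word

# One extension at its longest: leading "-", the singleton,
# then 8 subtags of ("-" plus 8 characters).
per_extension = 1 + 1 + MAX_SUBTAGS_PER_EXTENSION * (1 + MAX_SUBTAG_LEN)
total = MAX_EXTENSIONS * per_extension

print(per_extension)               # 74
print(total)                       # 1850
print(total > ENCODED_WORD_LIMIT)  # True
```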

The problem already exists in RFC 3066. Even apart from private-use tags, 
tomorrow someone could request a registration for a tag that's 87 octets long, 
and there's nothing in RFC 3066 that would prohibit acceptance.


  So, I think Bruce has identified a valid issue here. I personally would
  not have characterized it as greatly exacerbating, though,
 
 IMO, an increase from 11 octets worst-case, which is tolerable
 for constructing RFC 2047/2231 encoded-words, to  1850
 octets, which exceeds by a large margin what can be handled
 in a Content-Language or Accept-Language message header
 field, constitutes greatly exacerbated.

Repeating my previous point, RFC 3066 doesn't stop a registered tag from being 
10^100 octets in length. Of course, all of us know that such a tag wouldn't be 
useful. At some point, we have to engage common sense, even for RFC 3066. The 
draft would allow a tag 

en-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont

(over 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread John Cowan
Peter Constable scripsit:

 My suggestions, then, in response to Bruce Lilly's comments are:

I heartily support all of this, despite the extra burden it imposes
on our esteemed editors, and hope that none of it is in any way
controversial.

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say Gosh!
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
--The Linux-nationale by Greg Baker



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread John Cowan
Peter Constable scripsit:

 The ISO 3166 MA maintains that standard in accordance with the
 identifiers specified by the UN Statistics Division; a change by the UN
 is all the convincing that is required.

Umm, not quite.  The UNSD defines what a country is, and assigns it
a 3-digit code (normative) and a name (informative); the ISO 3166 MA
then specifies 2-letter and 3-letter codes for that name.

 This scenario is not hypothetical; it actually occurred in the case of
 CS. The change was solely under the control of the UN Statistics
 Division; it is not part of their process to consult with developers and
 users of IT systems in general, and they were not consulted in this
 case. They were completely powerless to influence the change, learning
 about it only after the fact.

UNSD had nothing to do with this.  It assigned the hitherto-unused code
891 for the country now called Serbia and Montenegro.  (Yugoslavia
had the code 890, Czechoslovakia the code 200).  This was a reasonable
judgment in the circumstances: the question of when a country has changed
into another country is always fuzzy.  It was the ISO 3166 MA and no
one else who chose to assign the 2-letter code CS to the new country.

UNSD historically has assigned new numerical codes when new countries
come into existence, and has managed to avoid reusing any of its 3-digit
identifiers, which is precisely why those identifiers are being used as
trusted backups in RFC 3066bis for the unstable ISO 3166 identifiers.

 This is a situation we do not intend to repeat.

Agreed, but let's make sure not to blame the innocent.

 It is not uncommon for users to confuse JA and JP. 

*blush*

I've done it myself, and in implementation, not merely in discussion.
Fortunately, the evidence is now buried.

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say Gosh!
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
--The Linux-nationale by Greg Baker



RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


  That is not at all the aim here wrt stability; rather, the aim is that a
  symbolic identifier used for metadata in IT systems not change because
  some government on a whim says, "We would now prefer to use 'yz' rather
  than 'xy' to designate our country."
 
 If by international agreement, 'yz' becomes the designation
 for that country, then it is rather silly to stick one's
 fingers in one's ears and shout "NA-NA-NA-NA-NA I don't want
 to hear you."

That misses the point entirely. The point is that IDs used by political
administrations may change for any number of reasons, and those
administrations may have no qualms with such changes; but in IT
systems, we cannot afford changes that break existing implementations
and data. If for whatever reason ISO and the UN decided that US should
be used to designate the country of France, I doubt you'd expect every
software vendor to update all of their deployed installations to use
fr-US instead of fr-FR, and for every user to go through every data
repository they manage to make such changes in their data.

The people that maintain time zone definitions may have their means for
changing times; that's fine for them. They are not dealing with the same
concerns as we are dealing with. The group here that has focused
specifically on language-tagging issues for several years has evaluated
issues that affect language tags and the impact of changes and has
decided what is best practice for *this* domain, and it is to maintain
stability of data rather than cater to whims of political
administrations.


 Designed or not, country codes *are* read by humans; they
 appear in top-level domain names.  Currently the ISO 639
 2-letter codes mean the same thing as the last component of
 a domain name

I think you mean ISO 3166 2-letter codes.

 and as the second component of a language-tag.
 It's rather silly to change that correspondence simply because
 a few people are piqued that international agreement has been
 reached to change a few 2-letter codes.

The usability flaw in treating ISO 639 and ISO 3166 as human-readable is
evident in the confusion between ja and JP (or is it jp and JA?), and GB
vs UK. As for what is silly, if the UN country ID for Canada changed to
CN (and that for PRC changed to something else), I'm sure it would cause
far greater problems for users to have to change the last two letters in
domain names than for them to keep doing what they always did. In fact,
I would have thought it would create a rather significant problem on the
Internet if such a change were made. (URIs don't come with versioning
dates for domain names, so how would a DNS server know what the cn
meant?)


  Neither RFC 1766 nor RFC 3066 has ever presented official translations;
 
 Both defer to the ISO lists for definitions (not translations)
 of the various codes.

Definitions; not language names for display use.


  this is no different for RFC 3066bis.
 
 It is very different; under the proposed draft, there is only
 an English definition, somebody wishing to provide a French
 definition finds that he has none and must resort to an
 unofficial translation.

The more you press this, the more silly it seems. RFC 3066 does not
anywhere discuss display names; localization data is beyond its scope.
The registry it defines does not give provision for French language
names. The source ISO standards are every bit as accessible as they ever
were, and just as RFC 3066 gave the user no option but to refer to the
source ISO standard, so users should and can continue to do so.

After this response, I will not waste my time any further on this
foolishness.


 I'm willing to postpone the discussion
 (other problems with the proposed registry format dictate
 a broader solution which could easily have provision for
 an arbitrary number of descriptions).

I strongly object to the suggestion that progress on this draft be
delayed to deal with this non issue that caters to implementation issues
that are well beyond the scope of either RFC 3066 or its proposed
replacement.


 No, you are overlooking the fact that a set of codes with
 no corresponding definitions is useless.  RFC 3066 defers
 the code/definition pairs to ISO, which provides multilingual
 definitions. The proposed draft would remove that multilingual
 characteristic.

What if the registry provided no name, just the ID? Then people would
have to refer to the source ISO standard as they did in the past, and we
would be able to specify which ISO IDs were or were not valid. That would
achieve the goal that we had wrt stability while eliminating the concern
that English-only annotations for some reason apparently create for you.
Personally, I think the English annotation is helpful, but it seems that
the real solution you're looking for is to remove any annotation
whatsoever so that the situation is closer to what we have under RFC
3066.



  Display 

RE: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:ietf-languages-
 [EMAIL PROTECTED] On Behalf Of Bruce Lilly


  As mentioned, the limit is imposed by other tight constraints on
 'grandfathered'; you have already identified that the longest registered
 tag under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag
 can be at most 11 octets in length.
 
 But the constraints probably aren't as tight as you
 believe; the draft specifically permits a future
 revision to allow a primary subtag longer than
 8 octets, or not purely alphabetic, etc.

RFC 3066 does not impose any restrictions on what its replacements might
do. This is the case with any specification: a given technical
specification is not a specification of human behaviour and cannot keep
us from revising the spec or replacing it in any way we may choose.


 One would hope that under RFC 3066 rules, that the
 reviewer, a list subscriber, or an Applications Area
 Director would recognize the conflict with RFCs 2047/2231
 and would object.

You have mentioned conflict with RFCs 2047 and 2231. RFC 2047 does not
make reference to language tags. The ABNF of RFC 2231 does not impose
any limit on the length of language tags. RFC 2231 does contain an implicit
length issue in that it updates RFC 2047, allowing language tags within
encoded words, but it does not explicitly identify any upper bound on
the length of language tags. By reading both RFC 2047 and RFC 2231, one
finds that they assume that a language tag must be at most 64 characters
long:

- the maximum length for the encoded-word production is 75 characters
long (not stated in the ABNF of RFC 2047 but rather in the prose)

- encoded-word production of RFC 2047 includes 6 literal characters

- RFC 2231 adds one delimiting character * between the charset and
language tag

- the shortest charset names are 2 characters long (e.g. IT)

- the shortest encoding length is 1 character long

- the minimum encoded-text length is 1 character long

An encoded-word must contain at least 11 characters that are not part of
the language tag and have a total length of no more than 75 characters.
Therefore, an upper bound on language tags that can be used in an RFC
2047/2231 encoded-word production is 64 characters. In many cases, where
the charset tag or encoding is longer, the upper bound on the length of
language tags will be less, but the RFC gives no estimate or indication
of how much less.
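
The arithmetic above can be checked with a short Python sketch. The constant and function names are illustrative only (they do not come from either RFC); the 75-character limit and the component minima are the ones listed above.

```python
# Upper bound on a language tag inside an RFC 2047/2231 encoded-word,
# using the component minima listed above. All names are illustrative.

MAX_ENCODED_WORD = 75                                        # limit from the prose of RFC 2047
LITERALS_2047 = len("=?") + len("?") + len("?") + len("?=")  # 6 literal characters
STAR_2231 = len("*")                                         # charset/language delimiter
MIN_CHARSET = 2                                              # shortest charset names
MIN_ENCODING = 1                                             # one-character encoding
MIN_ENCODED_TEXT = 1                                         # minimum encoded-text


def max_language_tag_length(charset_len=MIN_CHARSET,
                            encoding_len=MIN_ENCODING,
                            text_len=MIN_ENCODED_TEXT):
    """Characters left for the language tag once everything else is counted."""
    overhead = LITERALS_2047 + STAR_2231 + charset_len + encoding_len + text_len
    return MAX_ENCODED_WORD - overhead


def build_encoded_word(charset, language, encoding, text):
    """Assemble an encoded-word in the form =?charset*language?encoding?text?=."""
    return f"=?{charset}*{language}?{encoding}?{text}?="


if __name__ == "__main__":
    print(max_language_tag_length())                # best case: shortest other parts
    print(max_language_tag_length(charset_len=10))  # a longer charset name leaves less room
```

A longer charset name or more encoded text is passed in as a larger argument, which shows directly how the room for the tag shrinks in the common case.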

This is a constraint on an application of RFC 3066; it is not a
constraint on RFC 3066 itself. It is possible that other applications of
RFC 3066 may impose limits that may be longer or shorter than that
imposed by RFC 2047/2231. I see no reason why limits must be added as a
constraint in a revision of RFC 3066. It would be a good idea, however,
to point out in section 2.1 of the draft that some applications of this
specification may impose limits on the length of accepted language tags,
and perhaps to cite RFC 2231 as an example.

My suggestions, then, in response to Bruce Lilly's comments are:

- that we add a note prominently in section 2.1 of the draft explaining
that some applications may impose limits on the lengths of language
tags, and cite RFC 2231 as an example

- that we revise the ABNF for the 'grandfathered' production rule to 

grandfathered = 1*3ALPHA *("-" 1*8alphanum)

- that we add a note in the discussion of extensions stating that, when
a language tag instance is to be used in a specific, known protocol, it
is advisable that the language tag not include extensions not supported
by that protocol (text can be added pointing out the inadvisability of
including unrecognized extensions in the case of protocols that impose
upper limits on the length of strings that may contain a language tag)

- that recommendation 4 in section 2.4.2 be changed to say that
extensions should not be removed except in the case that the language
tag instance is to be inserted into a specific protocol known not to
support the extension

- that the language subtag registration form include an additional field
following #7 (recommended prefixes for variants) asking for a reasonable
estimate and exemplar of the maximum length anticipated for language
tags using the requested variant

- that a requirement on extension RFCs be added in section 3.4 stating
that they must include some explicit discussion of concerns related to
upper bounds on length of language tags using the given extension

- that we do not attempt any other changes to the ABNF to impose an
upper bound on the length of language tags

- that we add a note in section 3.1 indicating that descriptions in
registry entries for ISO 639, ISO 3166 or ISO 15924 identifiers are
intended only to indicate the meaning of that identifier as defined in
the source ISO standard at the time it was added to the registry, and
that the descriptions are not replacements for content of the source
standards themselves

- that we do not need to change the proposed format of the registry to

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Mark Crispin
On Sun, 12 Dec 2004, Bruce Lilly wrote:
 In the specific cases of the core Internet protocols that
 I have mentioned, there *is* a date/time attribute in the
 form of an RFC [2]822 Date field.  If we're talking about
 some file stored on some machine, every OS that I know of
 has a date/time stamp associated with that file.  If you
 have something else in mind, a concrete description and/
 or example might help.

When I retrieve a file via FTP, HTTP, etc. the time stamp of that file on
my computer is the date/time of retrieval, not the date/time of the file
on the source.

Unless, of course, both systems are running TOPS-20 and thus use that
wonderful XTP mode that copies file metadata.  Now, if you want to mandate
that all UNIX and Windows systems be replaced with TOPS-20, I might
support that... :-)

Silliness aside, the file may well have embedded language tags in the text
of the file.  Have you forgotten Plane 14?

 I'm not eager to abolish uniqueness.  There never was
 any guarantee that codes would never change. Both RFCs
 1766 and 3066 specifically mention changes as a fact of
 life.

That's what's now being fixed.

 French is an official language used by the ISO in its
 publications.

Why is this vestige of colonialism important in the IETF context?

 SO where are the French definitions?

Ask a person who is bilingual in English and French to provide one.

 That would lack definitiveness which characterizes the
 ISO lists.

What magic attribute is there to French that provides definitiveness
that is absent in English, or Mandarin, or Hindi, all of which are far
more significant languages to the world?

  Why is it a problem?  Why is it a defect?

 Because it unnecessarily reduces by 50% the information
 content currently available.

A mandatory French translation to an English definition does not
significantly increase the information content, and certainly does not
double it.

The only increase in the information content would be to those individuals
who comprehend French but not English.  This is a very small number of
individuals.

If there is to be a mandatory translation into a second language to
increase information content, then that language should be Mandarin.
Among individuals who do not comprehend English, far more comprehend
Mandarin than comprehend French.

If there is to be a mandatory translation into a third language, that
would probably be Hindi.

 You have not explained how the code came to be embedded
 within the text itself -- surely the author didn't say
 (or write, or sign) "this text is in language QZ"; most
 likely the language was indicated by name, or by some proxy
 representing the name (such as a locale).

Plane 14.
HTML and other markups.
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
___
Ietf mailing list
[EMAIL PROTECTED]
https://www1.ietf.org/mailman/listinfo/ietf


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread John Cowan
Bruce Lilly scripsit:

 If by international agreement, 'yz' becomes the designation
 for that country, then it is rather silly to stick one's
 fingers in one's ears and shout NA-NA-NA-NA-NA I don't want
 to hear you.  

Actually, 'yz' doesn't designate the country in the ISO standard,
as I explained yesterday.  Rather, it designates the *name* of the
country, which is of course subject to change *without* international
agreement.  In RFC 1766/3066, we attempt to use it to designate
the country, which requires some straining of the concept.

 As I have pointed out, politicians change the definitions of time
 zones frequently, and those who have to deal with time zone issues
 have found a way to cope with such change without trying to declare
 international standardization organizations irrelevant.

Ah, but you kick the ball through your own goalposts here.  The
Olson time zone system is excellent -- but it becomes so only by totally
ignoring the customary names of time zones and inventing its own!
(Thus U.S. Eastern time is named America/New_York, e.g.)  The
customary names are carried only as time zone abbreviations such as
EST, which are not unique, are English-only, and most of which are
also made up.  (Countries with a single time zone generally don't
bother with an official name for it, with some obvious exceptions.)

 It's rather silly to change that correspondence simply because
 a few people are piqued that international agreement has been
 reached to change a few 2-letter codes.

Not much of an international agreement, really.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan
Samuel Johnson on playing the violin: "Difficult do you call it, Sir?
I wish it were impossible."



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Peter Constable

Bruce Lilly has posted comments on the IETF list in response to the
last-call announcement for a proposed revision to RFC 3066. His comments
were generally negative, raising a number of concerns. I and others
involved in preparation of the revision have discussed Bruce's concerns
with him, but they were not made available on the IETF list since those
of us other than Bruce were not subscribed to this list. I wish to
briefly summarize the outcome of that discussion for the benefit of
people here.

Some of Bruce's comments were purely editorial (e.g. formatting of
draft); I will not review those.

Bruce's substantive concerns were:

- Accessibility of source ISO standards was referred to in the
announcement as a major reason for the proposed revision, but
accessibility has not been a problem in his experience.

- RFC 3066 directed users to source ISO standards; the proposed revision
would establish a registry that includes all ISO identifiers considered
valid for use in language tags, but the documentation for those
identifiers in this registry does not include both English and French
language / country names.

- The proposed revision makes reference to ISO 8601 time/date format
being used in the registry, which is a complex and not-readily-available
specification.

- The ABNF used in the proposed draft permits many strings that do not
conform with RFC 3066.

- The proposed revision imposes no bounds on the length of tags (same as
RFC 3066), and does not require registration of complete tags (different
from RFC 3066).

- The lack of an Internationalization considerations section as
recommended by RFC 2277 (a.k.a. BCP 18).

As a result of Bruce's comments, those of us contributing to the
development of this revision have suggested certain revisions to which
the authors have indicated openness. As I will explain, these revisions
would provide clarification on various matters, but would not constitute
technical changes in the draft.



1. Re accessibility: it was pointed out that the draft itself does not
identify accessibility of source ISO standards as one of the primary
reasons for the revision. There are some minor accessibility concerns
having to do with uncertainty of the on-going availability of the
relevant ISO code tables, and of change histories for each of the
relevant ISO standards. The proposed changes to the language-tag
registry address these concerns, though there were bigger reasons for
the proposed registry changes, particularly having to do with stability.

2. Re the lack of French descriptions in the registry: it was pointed
out that the registry defined by RFC 3066 did not include French
descriptions, and that the revised registry is not intended to replace
the source ISO standards or make them irrelevant. The meaning of IDs
would still be established from the ISO standards from which they were
drawn, and the proposed revision would continue to make reference to
them. As a result of Bruce's comments, it was suggested that wording be
revised in the draft to make this relationship clearer.

3. Re ISO 8601 time/date format: What is used in the registry is dates
expressed in the format YYYY-MM-DD. It was agreed that it would be
better to identify the format precisely rather than make the generic
reference to ISO 8601.

4. Re the less restrictive ABNF: the one place that had less restrictive
syntax was a production rule that was subject to additional strict
constraints, namely that only certain pre-existing tags registered under
RFC 3066 could fall under that production. A change to the ABNF has been
suggested that would make the ABNF at that point consistent with the
ABNF for RFC 3066. This does not constitute a change having any
technical consequence as there is no resulting change in the set of
valid tags.

5. Re upper bounds on length of tags: It was pointed out that
private-use tags for both RFC 3066 and the proposed revision have no
bounds on their length. The greater concern was for non-private-use
tags. For these, it was pointed out that RFC 3066 also imposes no bounds
on length. Admittedly, though, there is a difference because RFC 3066
requires registration of complete tags, so one can determine at any time
what is the longest valid tag that may be encountered, whereas the
proposed revision requires registration of sub-tags which can then be
combined productively, and one cannot predict with certainty what
combinations may be used. (This, IMO, is the most significant of the
concerns Bruce raised.)

While the proposed revision allows productive combinations of registered
sub-tags, there are some limits on how combinations can be made, as
specified by the ABNF. The ABNF does allow unlimited numbers of certain
elements, specifically three.

One of these (extlang) is defined by the ABNF in anticipation of
possible future extension of the language tag specification to
incorporate mechanisms expected in a new part to ISO 639 that is in
preparation, but

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Vernon Schryver
 From: Peter Constable [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]

 This is a multi-part message in MIME format.

 --===1521567419==
 Content-class: urn:content-classes:message
 Content-Type: multipart/alternative;
   boundary=_=_NextPart_001_01C4E16C.40BF0707

 This is a multi-part message in MIME format.

 --_=_NextPart_001_01C4E16C.40BF0707
 Content-Type: text/plain;
   charset=us-ascii
 Content-Transfer-Encoding: quoted-printable

 Bruce Lilly has posted comments on the IETF list in response to the
 last-call announcement for a proposed revision to RFC 3066. His comments
 were generally negative, raising a number of concerns. I and others
 involved in preparation of the revision have discussed Bruce's concerns
 with him, but they were not made available on the IETF list since those
 of us other than Bruce were not subscribed to this list. I wish to
 briefly summarize the outcome of that discussion for the benefit of
 people here.

 =20
 ...

 In conclusion, I think that some of Bruce's concerns were valid, and
 suggestions for changes have been presented to the authors accordingly.
 I believe all of these changes can be considered to be for clarification
 purposes, rather than technical changes. (No changes affecting the set
 of valid tags have been made.)
 ...


 --_=_NextPart_001_01C4E16C.40BF0707
 Content-Type: text/html;
   charset=us-ascii
 Content-Transfer-Encoding: quoted-printable

 html

 head
 meta http-equiv=3DContent-Type content=3Dtext/html; =
 charset=3Dus-ascii
 meta name=3DGenerator content=3DMicrosoft Word 11 (filtered)
 style
 !--
  /* Font Definitions */
  @font-face
   {font-family:Wingdings;
   panose-1:5 0 0 0 0 0 0 0 0 0;}
 @font-face
   {font-family:SimSun;
   panose-1:2 1 6 0 3 1 1 1 1 1;}
 @font-face


On the contrary, what the authors of a standard intend is not normative.
As much as possible, every standard must say what it means, because
what a standard says *is* its technical content.  For example, I'm
unhappy about an apparent sentiment that would put ABNF on a lower
footing than the English text.  I think I'm like most implementors and
perhaps unlike non-engineers in reversing that precedence.  Whenever
I read an RFC, I rely first and foremost on the ABNF.  I use the English
only for hints, and follow the ABNF instead of the English whenever
there is a conflict.


There are a couple other issues that ought to be addressed.

I think Bruce Lilly started by charging that a potentially disruptive
document had reached last-call without any review by those concerned
with related, affected IETF standards.  That sounds like a process
problem that needs at least 1% as many words as have been spent in
this mailing list in lawyerly talk such as whether "accounts" is more
appropriate than "account".

The other issue is that some of us consider the completely unnecessary
and gratuitous use of duplicate-copy/quoted-printable/HTML email
somewhere among aggressive, offensive, and a security attack.  In
purely text contexts like this mailing list QP/HTML never contributes
to an impression of technical accuracy and relevance of whatever
message it enciphers.  Then there is the use of Microsoft's XML
flavor of HTML mail ...


Vernon Schryver    [EMAIL PROTECTED]



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-13 Thread Joe Abley
On 13 Dec 2004, at 18:34, Peter Constable wrote:
 3. Re ISO 8601 time/date format: What is used in the registry is dates
 expressed in the format YYYY-MM-DD. It was agreed that it would be
 better to identify the format precisely rather than make the generic
 reference to ISO 8601.

Why not require dates to be formatted as per RFC 3339?
In general, YYYY-MM-DD is ambiguous unless a timezone is specified.
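
As a rough illustration of pinning the format down precisely, here is a minimal Python sketch. It assumes the registry dates follow the RFC 3339 full-date form (four-digit year, two-digit month, two-digit day); the function name is hypothetical. The resulting value carries no UTC offset, which is the ambiguity raised above.

```python
import re
from datetime import date

# RFC 3339 full-date: 4DIGIT "-" 2DIGIT "-" 2DIGIT (ASCII digits only).
_FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_registry_date(text):
    """Return a date for a strict YYYY-MM-DD string, or raise ValueError."""
    if not _FULL_DATE.fullmatch(text):
        raise ValueError(f"not in YYYY-MM-DD form: {text!r}")
    year, month, day = (int(part) for part in text.split("-"))
    # date() itself rejects impossible calendar dates such as 2004-02-30.
    return date(year, month, day)


if __name__ == "__main__":
    print(parse_registry_date("2004-12-13"))
```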
Joe


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-10 22:37
  From: John Cowan [EMAIL PROTECTED]
  
 Bruce Lilly scripsit:
 
  It's not clear to me that the proposal will provide protection
  against the whims of politicians.  If the definition of CS as
  a country code changes again under the proposed scheme,
  how is one to determine specifically what some archived
  language-tag referred to at some point in time?  I'm not
  particularly concerned about that problem, as I am resigned
  to instability associated with anything specified by politicians
  (and that includes the UN region codes).
 
 The U.N. Statistics Division are only politicians in the sense
 that IETF WG members are.  They are, in fact, statisticians.
 Their track record for stability is considerably longer than the
 IETF's.

I hope that I need not repeat any of the well-known remarks
about statistics.  Nor that I need point to the many uses
by politicians of statistics (and statisticians) for
political purposes.

Moreover, the point is that countries do change, and that use
of country codes (as provided for in RFC 3066 and in the
proposed draft) carries with it the inherent instability
which is characteristic of politics.  A quest for stability
of countries seems Quixotic and oxymoronic.  According to the
principle of stability as that term is used in defense of the
draft, I suppose we're all intended to refer to Malawi as
Rhodesia because that's what it (in part) was called 50 years
ago, or that we're supposed to ignore the breakup of the USSR,
Yugoslavia, etc., the reunification of Germany, etc.

A related problem with the use of country codes in language
tags is that there is not necessarily an inherent relationship
between language and country borders.  The borders of Germany
have changed many, many times.  If one is referring to the
German language as spoken by inhabitants of Alsace, using
country codes would imply that that same language spoken by
the same people would have been tagged at various times as
de-DE and de-FR according to where the France-Germany border
happened to have been determined by politicians of the time.
That strikes me as being a rather silly way to tag language,
but that's the precedent set by RFC 1766.  As far as I can tell,
the draft doesn't really deal with the issue of changing borders
or changing country names -- it merely pretends that these
things don't happen by attempting to declare a snapshot of the
status at some point in time as being valid for all time.

  But if the proposed new registry's description of CS says
  "foo" and the ISO standard code list says "bar", what's
  an implementor supposed to present to a user as *the*
  description associated with CS?
 
 The former.  That's the whole point of having a registry.

But the user has indicated that he speaks French, and the
proposed registry contains a description in English only.
Where is the implementor supposed to get the *official*
translation for display?  N.B. under the current (RFC 3066)
situation, the definitive ISO lists provide an official
description in French.
 
  One possibility would be two description fields.  
 
 Why two?

There are now two in the ISO lists (and, as noted, in the
UN list).  I have no objection to more, but I object to
a reduction.  The text accompanying the new last call
states:

 This specification addresses each of these issues with a simple, elegant
 design that is compatible with existing language tags and implementations.

and

 One concern that is crucial to acceptance of the new language tag design
 is how it works with existing implementations of RFC 3066 and how existing
 implementations will interact with implementations of the newer language
 tags.

and

 It is important to recognize that all language tags that were valid under
 the existing RFC 3066 will remain valid, with their meanings intact, under
 this specification.

I have an implementation which (in accordance with RFC 3066)
uses the official ISO lists. It has provision for displaying
ISO 639 language tags with their descriptions in either of the
two languages supported by the official 639 lists, and likewise
for the ISO 3166 country codes.  The specification of the
draft is *NOT* compatible with that existing implementation
because it removes the existing functionality of official
descriptions in French of language and country codes. As a
result of that incompatibility,  the newly proposed
specification does not work with (at least that one)
existing implementation (but I agree that that is a crucial
concern).

Language tags remaining valid, I presume that the tag sr-CS
will continue to mean "Serbian as used in Serbia and Montenegro"
(officially equivalent to "Serbe par Serbie et Monténégro") as that
is a valid RFC 3066 language tag and its corresponding meaning...
but I can see no evidence of that in the draft -- indeed it
appears that the draft would change that meaning significantly.

 There are 6000 languages spoken on Earth, of which 
 perhaps 600 have a standard written form.

ISO 639 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
 Date: Sat, 11 Dec 2004 12:14:42 -0800
 From: Randy Presuhn [EMAIL PROTECTED]
 Subject: Re: Ietf-languages Digest, Vol 24, Issue 5
 To: [EMAIL PROTECTED], [EMAIL PROTECTED]
 Message-ID: [EMAIL PROTECTED]
 
 Hi -
 
  From: Bruce Lilly [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Cc: [EMAIL PROTECTED]
  Sent: Friday, December 10, 2004 4:54 PM
  Subject: Re: Ietf-languages Digest, Vol 24, Issue 5
 ...
  Eliminating bilingual descriptions for the language,
  country (and UN region) codes leaves implementors
  in a quandary.
 ...
 
 Huh?  These are language TAGS.  If, for some reason, some implementor
 thought it made sense to display one of these in a localized form (rather
 than just using them to determine what locale, etc. should be used in
 rendering some text) there's no requirement that the English-language
 country names that appear in the registration be used.

That's not the point. The point is that under RFC 3066,
the bilingual ISO language and country code lists are
considered definitive. An implementor can therefore use
those lists (and at least one has) for (e.g.) providing users
with menus (in either language) from which a language
or country code may be selected.  By declaring the ISO
lists no longer definitive, and by providing only
English descriptions of the codes in the proposed
revised registry which would be used instead of the ISO
lists, the draft proposal deprives implementors of
being able to provide that functionality (viz. an
official description in French of codes).

 Indeed, a UI 
 could just as well draw a map as display a name.

That would be awfully difficult for a character-based
UI, and would not be useful for language codes. Nor
would it be helpful for users who lack map-reading
skills, but who recognize "Allemagne" when they see it.



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-11 00:52
  From: Mark Davis [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED]
  
  The ABNF is an expression of the grammar that
 describes the set of all valid tags.
 
 No, this is simply incorrect. You cannot expect that any implementation that
 simply does the ABNF is conformant.

I made no such claim.  I do claim that if the ABNF
contradicts the normative text, as is the case in
your draft w.r.t. acceptance of several constructs
not permitted by RFC 3066 ABNF, that there is an
error in either the normative text or the ABNF.

 There are a great many constraints on 
 the tags that are not in the ABNF grammar, that are clearly required in any
 reading of the text. Most of these *cannot* be encompassed in any ABNF
 grammar.

If your claim is that the ABNF cannot express a
grammar consistent with the RFC 3066 ABNF, that
is clearly false.

 There are a few that could be expressed in the ABNF; some at little 
 cost, some with a great deal of complication.

Are you claiming that it is unduly difficult to
make the ABNF match RFC 3066's?

 This is not a technical 
 problem for the draft.

It is a problem due to the conflict between the
ABNF and the text.  It is a problem because it
opens a loophole for future revisions to formalize
content which is incompatible with RFC 3066
implementations.
 
  as reasonable as the current worst-case of 11 octets.
 Also simply untrue. You seem not to be reading all the messages on this
 subject. Look at the ABNF for RFC 3066. There is *no* limit in the ABNF
 there!

The draft proposes closing RFC 3066-style registrations.
Show me a registered RFC 3066 language tag longer than
11 octets.  Show me a general-use (i.e. not private-use)
RFC 3066 language tag which is too long to be used in an
RFC 2047/2231 encoded-word.  Show me a general-use RFC
3066 language tag which is too long to fit on an RFC
2822/3282 Content-Language header field line.
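
The two halves of this disagreement can be seen side by side in a small Python sketch. The regular expression transcribes the RFC 3066 ABNF; the tag list is only a sample of tags registered under RFC 3066, not the full registry, and the function names are illustrative.

```python
import re

# RFC 3066 ABNF:
#   Language-Tag   = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA
#   Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

# A sample of tags registered under RFC 3066 (NOT the complete registry).
SAMPLE_REGISTERED_TAGS = [
    "art-lojban", "cel-gaulish", "en-GB-oed", "i-klingon", "i-navajo",
    "no-bok", "sgn-BE-fr", "zh-min-nan",
]


def is_rfc3066_syntax(tag):
    """True if the tag matches the RFC 3066 ABNF (syntax only, not registration)."""
    return RFC3066_TAG.fullmatch(tag) is not None


def longest(tags):
    """Return the longest tag in a list."""
    return max(tags, key=len)


if __name__ == "__main__":
    print(longest(SAMPLE_REGISTERED_TAGS), len(longest(SAMPLE_REGISTERED_TAGS)))
    # The grammar alone places no limit on the number of subtags:
    print(is_rfc3066_syntax("en-" + "a-" * 50 + "b"))
```

The grammar accepts arbitrarily long strings, while the registered tags in this sample stay within 11 octets; the bound comes from the registration process rather than from the ABNF.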



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-11 11:53
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]

 Our disagreement amounts to a basic question of whether parsers should be 
 written based on the ABNF alone, or based on the ABNF plus other constraints 
 provided in the spec. Clearly, I think anyone writing a parser should 
 consider other constraints as well.

No, I agree that a parser should take normative text
into account, but I feel that there should be a
reasonable effort made to make the ABNF agree with
that normative text -- otherwise there's little
point in providing ABNF.

 As mentioned, the limit is imposed by other tight constraints on 
 'grandfathered'; you have already identified that the longest registered tag 
 under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be 
 at most 11 octets in length.

But the constraints probably aren't as tight as you
believe; the draft specifically permits a future
revision to allow a primary subtag longer than
8 octets, or not purely alphabetic, etc.

 a de-facto upper limit of 11 (subject to change if new tags are registered 
 before the proposed spec is accepted).

We're agreed on that, for the present draft, but
apparently Mark Davis disagrees.  And I am concerned
about the loophole left for future revisions.

   We could impose some upper limits on these things...
 
  That leaves the extension portions' length at up to
  25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
  of a tag into account!  That's way too long (the RFC 2047
  limit for an encoded-word is 75 octets, including charset tag,
  some text, and some syntactic glue in addition to the language
  tag).
 
 The problem already exists in RFC 3066. Even apart from private-use tags, 
 tomorrow someone could request a registration for a tag that's 87 octets 
 long, and there's nothing in RFC 3066 that would prohibit acceptance.

One would hope that under RFC 3066 rules, that the
reviewer, a list subscriber, or an Applications Area
Director would recognize the conflict with RFCs 2047/2231
and would object.  If indeed that were to happen
literally tomorrow, I am quite sure that an objection
would be made.  The situation is quite different
under the draft proposal, where registration of a
complete tag is not required, and where there are
no upper bounds on length of a tag.

   So, I think Bruce has identified a valid issue here. I personally would
   not have characterized it as greatly exacerbating, though,
  
  IMO, an increase from 11 octets worst-case, which is tolerable
  for constructing RFC 2047/2231 encoded-words, to 1850
  octets, which exceeds by a large margin what can be handled
  in a Content-Language or Accept-Language message header
  field, constitutes "greatly exacerbated".
 
 Repeating my previous point, RFC 3066 doesn't stop a registered tag from 
 being 10^100 octets in length.

RFC 3066 provides a registration mechanism that can be
trusted to prevent that; in particular, the Applications
Area Directors are supposed to look out for issues affecting
the core Internet applications protocols.

 I suggest that wording be added to the draft giving a strong recommendation
 to users that they not use tags the complete length of which exceeds 75 
 characters.

75 octets would be too large for a language-tag used in
an encoded word (perhaps different limits could be
specified for different uses, but one would have to be
careful about implicit re-use between applications). An
encoded-word has the form:
  =?charset*language-tag?encoding?text?=
and is limited to a total of 75 octets. Eliminating the
syntactic glue (7 octets, unbracketed above) leaves a
total of at most 68 octets for text, charset, encoding,
and language-tag.  There are at present two encodings,
specified with 1-octet tags.  Assuming that longer
encoding tags are not required, that leaves 67 octets
for charset, language-tag, and text.  The text must be
at least four octets in order to accommodate B encoded
text, leaving 63 octets at most for charset and
language-tag (ideally, one would prefer to leave more
room than that for text).  It is guaranteed (in theory,
if not in practice) that there will be a charset name
of no more than 40 octets for each charset, but that is
not necessarily the preferred name (there has been some
discussion about possibly reducing that limit). That
leaves about 23 octets for a language-tag as an upper
bound for use in an encoded-word.  Obviously that
hasn't been a problem in practice to date; the longest
registered language tag is less than half that length.
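
As a rough check of the arithmetic above, here is a minimal Python
sketch.  The figures (75, 7, 1, 4, 40) are the ones stated in this
message; the constant and function names are made up for illustration
and do not come from any RFC or library.

```python
# Octet budget for a language-tag inside an RFC 2047/2231 encoded-word:
#   =?charset*language-tag?encoding?text?=
# Figures are those given in the message; names are illustrative only.

ENCODED_WORD_MAX = 75        # total limit on one encoded-word
# Syntactic glue: "=?", "*", "?", "?", "?="
GLUE = len("=?") + len("*") + len("?") + len("?") + len("?=")
ENCODING_TAG = 1             # one-octet encoding tag ("B" or "Q")
MIN_TEXT = 4                 # smallest B-encoded text
CHARSET_NAME_MAX = 40        # worst-case charset name assumed in the message


def language_tag_budget() -> int:
    """Return the octets left for the language-tag in the worst case."""
    remaining = ENCODED_WORD_MAX - GLUE      # charset, tag, encoding, text
    remaining -= ENCODING_TAG                # charset, tag, text
    remaining -= MIN_TEXT                    # charset and tag
    remaining -= CHARSET_NAME_MAX            # tag alone
    return remaining


if __name__ == "__main__":
    print("glue octets:", GLUE)
    print("worst-case language-tag budget:", language_tag_budget())
```

A shorter charset name or a different minimum for the text would of
course shift the result; the sketch only reproduces the worst case
described above.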

  By deferring to the bilingual ISO lists for language and country
  tags, 3066 at least provided a minimal degree of internationalization.
  By explicitly limiting description fields to English and restricting
  the charset to US-ASCII, the draft proposal takes a giant leap
  backwards.
 
 The US-ASCII limitation existed in RFC 3066, so is not new. 

No, I'm talking about the character set of 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Sam Hartman
 Bruce == Bruce Lilly [EMAIL PROTECTED] writes:

 Date: Sat, 11 Dec 2004 12:14:42 -0800 From: Randy Presuhn
 [EMAIL PROTECTED] Subject: Re: Ietf-languages
 Digest, Vol 24, Issue 5 To: [EMAIL PROTECTED],
 [EMAIL PROTECTED] Message-ID:
 [EMAIL PROTECTED]
 
 Hi -
 
  From: Bruce Lilly [EMAIL PROTECTED]  To:
 [EMAIL PROTECTED]  Cc: [EMAIL PROTECTED]  Sent:
 Friday, December 10, 2004 4:54 PM  Subject: Re: Ietf-languages
 Digest, Vol 24, Issue 5 ...   Eliminating bilingual
 descriptions for the language,  country (and UN region) codes
 leaves implementors  in a quandary.  ...
 
 Huh?  These are language TAGS.  If, for some reason, some
 implementor thought it made sense to display one of these in a
 localized form (rather than just using them to determine what
 locale, etc. should be used in rendering some text) there's no
 requirement that the English-language country names that appear
 in the registration be used.

Bruce That's not the point. The point is that under RFC 3066, the
Bruce bilingual ISO language and country code lists are
Bruce considered definitive. An implementor can (and has)
Bruce therefore use those lists for (e.g.) providing users with
Bruce menus (in either language) from which a language or country
Bruce code may be selected.  By declaring the ISO lists no longer
Bruce definitive, and by providing only English descriptions of
Bruce the codes in the proposed revised registry which would be
Bruce used instead of the ISO lists, the draft proposal deprives
Bruce implementors of being able to provide that functionality
Bruce (viz. an official description in French of codes).

Programming lore has the rule of zero, one or infinity; it goes by
many other names but the concept is in part that by the time you need
more than one of something, you'll probably need a lot of that thing.

Language descriptions seem to fit this rule fairly well.  By the time
we need to support multilingual language descriptions, we'll need more
than just English and French.

That means implementers today already have to deal with the fact that
they only have some of the language descriptions they need from
definitive standards.  They will already have to get descriptions for
other languages.

Since they are already using non-definitive language descriptions,
implementers can feel free to take the French descriptions from the
ISO standard for the many cases where the IANA registry and ISO
standard overlap.

Why are two definitive languages better than one definitive language
and one set of descriptions from an ISO standard?


--Sam

___
Ietf mailing list
[EMAIL PROTECTED]
https://www1.ietf.org/mailman/listinfo/ietf


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-11 11:59
  From: JFC (Jefsey) Morfin [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  
 Gentlemen,
 I see several points discussed here which are/are not of the same order and 
 seem to confuse the issue.
 
 1. the discussion creeps from Harald's RFC 3066 to Multilingual Internet. 
 It seems strange to discuss byte oriented details without having first a 
 Multilingual framework telling what is the scope of the discussion and its 
 implications (which are certainly major) on the whole Internet 
 architecture. I submit that an IAB guidance is first necessary. Before 
 going any further a true WG-Multilingualism should be created and open to 
 everyone (a private IETF-Language lists should be an interim situation 
 towards such a WG)

There is in fact an ietf-languages list; RFC 3066 and the
draft under discussion give its submission mailbox as
[EMAIL PROTECTED], which makes finding the real
list an exercise since IANA's web site makes no mention
of any mailing lists.  I made an educated guess that I
might find the list at alvestrand.no, and indeed the
list submission mailbox is [EMAIL PROTECTED],
and the list archive is available at
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Neither RFC 3066 nor the draft provides any instruction
for joining the mailing list, and from the remarks above
it should be clear that IANA's web site provides no clear
clue either.
 
 2. I see quoted RFC 3066bis as a document. The RFC Editor seems to ignore 
 that RFC? Where can I find it?

It is apparently an unofficial term for the Phillips draft
mentioned in the new last call and to which you have
repeated the URI.
 
 3. there are at least four different levels:
 
 - what is Multilingualism vs. vernacularism (there are 6000 human languages 
 but a standard should be able to support non scripted and computer 
 generated and past languages, what may lead to millions of references).

One should then consider different types of tags for
different uses -- a tag for a non-scripted language
makes no sense in an RFC 2047/2231 encoded-word,
which is strictly text.
 
 - vernacular granularity has nothing to do with geography and countries. 

True in general; but can we reverse the precedent
set by RFC 1766?

 The way this inserts into the general digital convergence (is the IANA the 
 proper register?).
[...]
 The same as the IANA is not in the business of defining 
 countries (Jon Postel, RFC 1591) it should not be in the business of 
 defining languages.

The draft in question apparently seeks to get IANA into the
business of defining countries (and languages), usurping
those roles from ISO (as also noted in RFC 1591).

 I also submit that IANA is not the proper place anymore to support such a 
 Register. Experience has shown that IANA (now a function of ICANN) is subject 
 to controversies in this or in parallel real life areas: ccTLD delegation, 
 ccTLD entries in the root file, accepted MINC reaction to the Polish non 
 concerted introduction of Arabic, Russian and Hebraic tables, ICANN 
 strategy for internationalized rather than multilingual TLDs, etc. I also 
 submit that UNESCO, MPEG or other standard/cultural organizations involved 
 in the daily reality (universities, editors, posts, governments, 
 copyrights, WIPO, etc. etc.) are more concerned and may make their own 
 standard prevail after an unnecessary and harassing dispute. It seems that 
 any semantic able to support open sub-tags whatever they originate from, is 
 useful. Going any further would push in favor of a less and less 
 [unilingual or internationalized] network centric market against a 
 market evolution toward user centric [multilingual/multicultural] networked 
 relations [P2P, VoIP, NAT, coreboxes, OPES, etc.].

Good points all, though I do sympathize with the concern about
loss of information about what a tag meant at a given time
due to changes in the ISO lists.  I would support provision for
a definitive time-stamped registry of changes of some sort;
ideally that would be provided by ISO as part of (or a supplement
to) the lists, and I would be quite surprised if ISO were not
receptive to such a suggestion if made appropriately and with
a clear indication of the problems.



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 15:31
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  
  From: [EMAIL PROTECTED] [mailto:ietf-languages-
  [EMAIL PROTECTED] On Behalf Of Bruce Lilly
 
 
  Moreover, the point is that countries do change, and that use
  of country codes (as provided for in RFC 3066 and in the
  proposed draft) carries with it the inherent instability
  which is characteristic of politics. A quest for stability
  of countries seems Quixotic and oxymoronic. According to the
  principle of stability as that term is used in defense of the
  draft, I suppose we're all intended to refer to Malawi as
  Rhodesia because that's what it (in part) was called 50 years
  ago, or that we're supposed to ignore the breakup of the USSR,
  Yugoslavia, etc., the reunification of Germany, etc.
 
 That is not at all the aim here wrt stability; rather, the aim is that a
 symbolic identifier used for metadata in IT systems not change because
 some government on a whim says, We would now prefer to use 'yz' rather
 than 'xy' to designate our country.

If by international agreement, 'yz' becomes the designation
for that country, then it is rather silly to stick one's
fingers in one's ears and shout NA-NA-NA-NA-NA I don't want
to hear you.  A more rational approach would be to say that
before such-and-such a date/time the designation was 'xy' and
after that date/time (until further notice) it is 'yz'. As I
have pointed out, politicians change the definitions of time
zones frequently, and those who have to deal with time zone
issues have found a way to cope with such change without
trying to declare international standardization organizations
irrelevant.

 Sure, there will be changes that we need to deal with; but there's no
 reason to subject all implementations, users and data to changes that
 are purely cosmetic changes to things that are not designed to be read
 by humans.

Designed or not, country codes *are* read by humans; they
appear in top-level domain names.  Currently the ISO 3166
2-letter codes mean the same thing as the last component of
a domain name and as the second component of a language-tag.
It's rather silly to change that correspondence simply because
a few people are piqued that international agreement has been
reached to change a few 2-letter codes.
 
  A related problem with the use of country codes in language
  tags is that there is not necessarily an inherent relationship
  between language and country borders.
 
 That is not what country IDs within a language tag are intended to
 suggest. In fact, if there were inherent relationships, we probably
 would never have needed to use country IDs in a language tag.

I submit that it was never a good idea. Language evolves
over time, even in a given place.

  The borders of Germany
  have changed many, many times. If one is referring to the
  German language as spoken by inhabitants of Alsace, using
  country codes would imply that that same language spoken by
  the same people would have been tagged at various times as
  de-DE and de-FR according to where the France-Germany border
  happened to have been determined by politicians of the time.
  That strikes me as being a rather silly way to tag language,
  but that's the precedent set by RFC 1766.
 
 I agree that that's a silly way to tag that language; I disagree that
 RFC 1766 suggests I should tag it that way. 

RFC 1766 (and 3066) leave you little choice; if you wish
to indicate a region, you either have to do it with ISO
3166 codes or you have to register a separate tag (no
separate tag for German as spoken in Alsace exists). Never
mind the shortcomings of that particular example; consider
de-DE -- does that mean Germany as it exists today, West
Germany as it existed 25 years ago, Germany as it existed
in the 1930s, the 1900s, ...?
 
  As far as I can tell,
  the draft doesn't really deal with the issue of changing borders
  or changing country names -- it merely pretends that these
  things don't happen by attempting to declare a snapshot of the
  status at some point in time as being valid for all time.
 
 That may be your reading of the situation, but it is not how it is seen
 by those of us who have been working on this spec and examining these
 issues closely.

As far as I can tell, the draft pretends that the meaning
of CS hasn't changed, and would in fact change the meaning
of the currently valid RFC 3066 language tag sr-CS.

  But the user has indicated that he speaks French, and the
  proposed registry contains a description in English only.
  Where is the implementor supposed to get the *official*
  translation for display? N.B. under the current (RFC 3066)
  situation, the definitive ISO lists provide an official
  description in French.
 
 Neither RFC 1766 nor RFC 3066 has ever presented official translations;

Both defer to the ISO lists for definitions (not translations)
of the various codes.

 this is no different for RFC 3066bis.

It is very different; 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 15:33
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  
  From: [EMAIL PROTECTED] [mailto:ietf-languages-
  [EMAIL PROTECTED] On Behalf Of Bruce Lilly
 
 
  The point is that under RFC 3066,
  the bilingual ISO language and country code lists are
  considered definitive.
 
 That is nowhere stated or even suggested in RFC 3066.

RFC 3066 section 2.2 states, in part:

   - All 2-letter subtags are interpreted according to assignments found
     in ISO standard 639, "Code for the representation of names of
     languages" [ISO 639], or assignments subsequently made by the ISO
     639 part 1 maintenance agency or governing standardization bodies.

and has a similar statement regarding ISO 3166.

"interpreted according to assignments found in" certainly
sounds as if the ISO lists are considered definitive for
their respective categories of subtags, since their
interpretation is specified as that given in those lists.
I don't see how the RFC 3066 text can be interpreted
otherwise.



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 15:34
  From: John Cowan [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]

 Of course countries change, and then the numeric country codes change
 as well. The point is that the alpha codes change for political reasons
 when there has been *no* change in the underlying country: Romania's
 3-alpha code changed from ROM to ROU without any change in Romania at all.
 The CS case is particularly gratuitous, as its denotation changed from
 Czechoslovakia (a no longer existent country) to Serbia and Montenegro
 (a newly created country).

There is a limited supply of 2-letter codes and the supply
of 3-digit codes is only slightly greater.  Reassignment of
codes from such a limited supply is inevitable.  Better to
deal with the fact of tides than to try to command the tide
not to flow in.

  As far as I can tell,
  the draft doesn't really deal with the issue of changing borders
  or changing country names -- it merely pretends that these
  things don't happen by attempting to declare a snapshot of the
  status at some point in time as being valid for all time.
 
 No, it attempts to freeze the code-to-country mapping at a single
 point. New countries or changes in old countries should involve only the
 additions of codes, not the reuse of old codes.

Too late. King Canute commands the tide not to come in, but
his feet still get wet.  Better to deal with such change
appropriately rather than commanding countries (or
international standards bodies) not to change.
 
 I don't know. Where is the implementor supposed to get the
 official German, or Catalan, or Mandarin translations?
 Not in the ISO registry, for sure. To say nothing of the
 cases where no official translations exist.

But I'm not concerned with translations, but with the
definitions. And currently the definitions are available
in French and English.

  It might be worthwhile considering the differences in the
  way languages tags are used, by whom they are used, and for
  what purpose. There may well be a substantial difference
  between use of a tag to represent an obscure dialect of a
  dead language in a research paper vs. tagging a piece of
  text in one of the core Internet protocols such as SMTP.
 
 That count does not include dead languages. Whether it includes
 dialects is a matter of terminology.

Fine. The point is that the draft provides for language tags
that are so long that they cannot be used with the core
Internet protocols. A tag associated with audio media doesn't
need a means to indicate script or other orthography -- they're
irrelevant for spoken material.  RFC 3066's provision for
registry worked well. Removing that requirement -- as the
draft would do -- necessitates a specific upper bound on
tag length that will work with existing core protocols, to
replace the reviewer, Area Director, and community review
process that ensure that current registered tags work with
those protocols.



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 15:55
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]

 You have not responded to the point that accessibility of source ISO 
 standards is supposed to be a major factor, yet the draft itself clearly 
 indicates otherwise.

The source for the statement claiming accessibility as a
major factor has been indicated to be the author or
authors of the draft.  I can't explain why it says what
it says; I suggest that you direct that question to
the author(s).
 
  That is a problem for existing implementations of RFC 3066
  tags, which can obtain official, internationally agreed
  descriptions of the codes in two languages.
 
 Descriptions (language names) are beyond the scope of RFC 3066. It is a non 
 sequitur to claim that this draft creates a problem for existing 
 implementations of RFC 3066 on this basis.

3066 refers to interpretation of the codes and defers
that interpretation to that given by the ISO lists. One
cannot have an interpretation based on the lists without
the natural language definitions which are paired with
the codes.  It is a fact that those definitions are
available in two languages in the ISO lists, and that
the proposed replacement for the ISO lists would eliminate
one of those languages.
 
  OK, continuing your hypothetical example and its relationship
  to language, suppose that there is another civil war and
  that what now corresponds to US is split into Blue America
  and Red America. Further suppose that in due course ISO
  assigns some other code to one of those countries and retains
  US for the other, and that that happens after the proposed
  registry is set up with a definition for US and some
  description referring to the old use.
 
 That is a scenario that has been well considered: it would be very bad IT 
 practice to redefine a metadata tag US to have a narrower denotation than 
 it previously did, as that immediately breaks an unknown amount of existing 
 data. If ISO were to make such a change in the meaning of US, then IT 
 implementations *absolutely should not* follow suit; the ID US must retain 
 its prior, broader meaning.

So long as it is known what definition of US applied at
the time, there is no problem.  This is dealt with in IT
all the time; EST has had many definitions in terms of
exact offset from UTC, and when it goes into and out of
effect (likewise for other time zones).  Yet we manage to
be able to state with precision the offset and effective
times of EST well into the past, and without declaring
that a single value must hold true for all time. I have
provided a URI to the time zone data; a similar mechanism
could be used to track historical values for ISO language
and country codes.  Given the existence of such proven
technology, there is no need for the incompatible approach
outlined in the draft.
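
To make the time zone analogy concrete, here is a minimal Python
sketch of a date-effective lookup for a code.  Everything in it is
hypothetical: the table layout, the function name, and in particular
the dates, which are placeholders and are not taken from ISO 3166
records.  It only illustrates that one code can carry several meanings,
each with its own effective date.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Hypothetical history table: for each code, a date-sorted list of
# (effective_from, meaning).  A meaning of None marks a period in which
# the code was not assigned.  The dates are illustrative placeholders.
HISTORY = {
    "CS": [
        (date(1974, 1, 1), "Czechoslovakia"),
        (date(1993, 1, 1), None),
        (date(2003, 1, 1), "Serbia and Montenegro"),
    ],
}


def meaning_at(code: str, when: date) -> Optional[str]:
    """Return the meaning of `code` in effect on `when`, or None."""
    entries = HISTORY.get(code.upper(), [])
    starts = [start for start, _ in entries]
    # Index of the last entry whose effective date is <= `when`.
    idx = bisect_right(starts, when)
    return entries[idx - 1][1] if idx else None


if __name__ == "__main__":
    print(meaning_at("CS", date(1980, 6, 1)))
    print(meaning_at("CS", date(2004, 12, 12)))
```

The date passed in would come from whatever timestamp accompanies the
tagged material; that choice is outside the sketch.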



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 17:34
  From: Mark Davis [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED]
  
  Are you claiming that
 
  sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu
 
  is nonconformant per some specification in the draft
  proposal?
 
 Clearly not. But
 
  x-sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu

So what? A private-use tag has to be agreed to by the communicating
parties; in this case they'll find that such an unwieldy tag is
unusable in an encoded-word and will have to agree to use something
more manageable.  That's a problem for the parties involved and
nobody else, since it doesn't affect the rest of us.  That's a
different matter from a public tag that everybody is expected to
be able to use.

 is already absolutely conformant with the current RFC 3066. And the current
 RFC 3066 clearly permits the registration of something as long as
 
  sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu
 
 (although of course this particular combination would certainly never get
 in).

I agree that that would never be registered -- because of the
review process which is part of RFC 3066.  But the draft under
discussion has no mechanism to prevent it, unlike 3066. 



Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 19:20
  From: Mark Crispin [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  
 On Sun, 12 Dec 2004, Bruce Lilly wrote:
  If by international agreement, 'yz' becomes the designation
  for that country, then it is rather silly to stick one's
  fingers in one's ears and shout NA-NA-NA-NA-NA I don't want
  to hear you.
 
 What is silly is saying that every language tag has to have a date/time 
 attribute associated with it so that computer software managing that text 
 knows the language of that text.

In the specific cases of the core Internet protocols that
I have mentioned, there *is* a date/time attribute in the
form of an RFC [2]822 Date field.  If we're talking about
some file stored on some machine, every OS that I know of
has a date/time stamp associated with that file.  If you
have something else in mind, a concrete description and/
or example might help.
 
 It is a disaster for language identifiers to get recycled. Something has 
 to make those identifiers unique. Your notion will force the inclusion of 
 a date/time stamp in language tags, to restore the uniqueness that you are 
 so excruciatingly eager to abolish.

I'm not eager to abolish uniqueness.  There never was
any guarantee that codes would never change. Both RFCs
1766 and 3066 specifically mention changes as a fact of
life.
 
  Never
  mind the shortcomings of that particular example; consider
  de-DE -- does that mean Germany as it exists today, West
  Germany as it existed 25 years ago, Germany as it existed
  in the 1930s, the 1900s, ...?
 
 For the 98% case, it does not matter at all.
 
 But it does matter if, one day, DE becomes Denmark.

In either case, to understand precisely what geographical
area is referred to requires knowing the date to a greater
or lesser degree of accuracy.
 
  As far as I can tell, the draft pretends that the meaning
  of CS hasn't changed, and would in fact change the meaning
  of the currently valid RFC 3066 language tag sr-CS.
 
 No, it restores the previous meaning of sr-CS.

But what of the current meaning under the current
standard (RFC 3066 + ISO 639 + ISO 3166)?  Surely
the draft would change the meaning of that valid
RFC 3066 language-tag.

  It is very different; under the proposed draft, there is only
  an English definition, somebody wishing to provide a French
  definition finds that he has none and must resort to an
  unofficial translation.
 
 Why is the situation for French different from somebody wishing to 
 provide a Lower Slobbobian definition?

French is an official language used by the ISO in its
publications.  Lower Slobbobian is probably about as
meaningful as BLURDYBOOP.
 
  SO where are the French definitions?
 
 Ask a person who is bilingual in English and French to provide one.

That would lack the definitiveness which characterizes the
ISO lists.
 
  Well, sure. But the name is an important thing by itself.
  It is rather pointless to ask a user to indicate the
  language of a piece of text by selecting from a list AB, ACE,
  ACH,..., ZHA, ZUL, ZUN -- the user doesn't normally refer to
  languages by codes. It's quite a different matter to ask the
  user to select from Abkhaze, Aceh, Acoli,..., Zhuang (Chuang),
  Zoulou, Zuni.
 
 Abkhaze, Aceh, Acoli,..., Zhuang (Chuang), Zoulou, and Zuni are not 
 language tags. So what's your point?

They are the human-readable names corresponding to codes.
For interoperability, it is insufficient to label any and
all languages as ZZ with no definition of what ZZ
means. Moreover, it is necessary for two (or more) communicating 
parties to *agree* on the meaning of ZZ; that is done
by assigning the code ZZ to an agreed-upon name.  The
code ZZ is nothing more than shorthand for that agreed-upon
name.  If one produces some text in the BCP 18 sense of text
(spoken, written, signed, etc.), it is useful to indicate
the language of that text; languages are known to humans by
names of languages -- the codes are, as noted, merely
shorthand for those names.  Likewise, somebody presented
with some text may desire or need to know the language of
that text; informing that person that the language has code
QZ is unlikely to mean anything to most people -- only
the name corresponding to the shorthand code is likely to
be meaningful to persons other than those involved in
standardizing the codes.

  Note that the RFC 3066 specifies a registry that does not include French
  language names. I suggest that this issue should be dropped.
  Yes, the current IANA registry has that problem for
  the non-ISO-based tags only. If the registry is to be
  changed to subsume ISO codes as well, that defect should
  be remedied.
 
 Why is it a problem? Why is it a defect?

Because it unnecessarily reduces by 50% the information
content currently available.
 
  On the contrary, it is preposterous to suggest that codes
  will be attached to text by magic
 
 Here is where you are misled. Many of these tags are embedded within the 
 text itself. That text 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-11 10:48
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:ietf-languages-
  [EMAIL PROTECTED] On Behalf Of Bruce Lilly

  My comments are in response to the New Last Call made on
  the ietf-announce list. They are in response to the text which
  accompanied that new last call and the text of
  draft-phillips-langtags-08.txt dated November 2002. The
  specific claim that accessibility has been a problem was made in
  the text accompanying the new last call
 
 I don't know where the statement accompanying the announcement came from,

According to the New Last Call issued by the IESG Secretary,
the text is "Author's discussion of drivers for this work."

 You singled out that one point to comment on as though it were the main 
 factor.

I mentioned a matter which was repeatedly indicated as a
factor for existing implementations and with which I
strongly disagree.

There are points with which I do not necessarily disagree,
and there are points which I have not yet had time to
study in detail, due to the surprise of the announcement
of an impending decision (I do not understand why no
announcement of work on an RFC 3066 replacement was made
to the ietf-822 list, especially as the core Internet
protocols discussed there are affected by this draft),
the shortness of the time before a decision (deadline for
comments was given as 5 Jan 2005), and the impending
holidays.

[regarding the proposed registry vs. internationally-
standardized ISO lists for subtag definitions]
 It is certainly the case that only it should be consulted for determining 
 what sub-tags are valid with what denotation, which was the intent.

That is a problem for existing implementations of RFC 3066
tags, which can obtain official, internationally agreed
descriptions of the codes in two languages.
 

 By looking in the sub-tag registry. If ISO changed the meaning of US to 
 something other than what it is now, its meaning for purposes of use in an 
 IETF language tag would not change, because it would remain stable in the 
 sub-tag registry. You would be fairly well protected against the whim of 
 politicians.

OK, continuing your hypothetical example and its relationship
to language, suppose that there is another civil war and
that what now corresponds to US is split into Blue America
and Red America.  Further suppose that in due course ISO
assigns some other code to one of those countries and retains
US for the other, and that that happens after the proposed
registry is set up with a definition for US and some
description referring to the old use.  Now suppose that one
wishes to produce an appropriate language tag for the text
"moral values" (which clearly has different meaning in Blue
America (telling the truth, admitting to mistakes, etc.) and
in Red America (imposing totalitarian control over others)).
How specifically would the proposed registry handle such a
change in the meaning of US, and how would the registry
help differentiate the meaning of a 1990s en-us tag from
that of the hypothetical time described?

I suspect that it won't help, and I recommend review of
how another artifact of politics (viz. time zones) is
handled by the (unofficial) database of time zones
maintained at ftp://elsie.nci.hih.gov/pub/tzdata2004g.tar.gz.
The format used handles multiple changes in definitions
that went into effect at different times, something that
the proposed registry doesn't appear to handle.
  
   But if the proposed new registry's description of CS says
  foo and the ISO standard code list says bar, what's
  an implementor supposed to present to a user as *the*
  description associated with CS?
 
 The *meaning* of the sub-tag is determined by the sub-tag registry. If you 
 want human-readable descriptors,

The draft says that the proposed registry will contain a
description, in English (only).

 you already have to look beyond the ISO standards for anything more than 
 English and French

But existing RFC 3066 implementations can get official
descriptions in *both* of those languages; the proposal
would adversely affect those existing implementations by
eliminating the French description.

Of course, it is a more serious defect of the proposal
that it would fail to reflect internationally-agreed
codes and would fail to keep pace with changes...

 it would not be new that you have to look beyond the registry itself to 
 decide what human-readable descriptors you should provide in a product.  

It would be new that one could not find a standard
(i.e. official) French-language description in the
list of codes.

  One possibility would be two description fields. But the
  registry would need a charset closer to ISO-8859-1 than
  to ANSI X3.4 as currently specified. Or an encoding
  scheme.
 
 Personally, I don't see the value in something like that. Given the intent to 
 have a registry that can be machine-readable, changing 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 13:00
  From: Mark Davis [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED]

 Your claim that the RFC 3066 ABNF itself has a restriction in length is also
 clearly false. I will quote that again since you seem somehow not to have
 seen it:

I made no such claim; indeed it was I who pointed out
that RFC 3066 *theoretically* permits an infinite-
length tag.  On that basis alone (even if you missed
the fact that I am an implementor of RFC 3066
language tags) you can be sure that I am well aware
of the RFC 3066 ABNF.

 Both documents establish many further limitations on the contents of
 language tags in the text of each document. Ignoring those stated
 limitations will, in both documents, result in nonconformant language tags.

Are you claiming that

sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu-boont-gaulish-guoyu

is nonconformant per some specification in the draft
proposal?  It is certainly too long to be used in an
RFC 2047/2231 encoded-word.  It is much longer than
any registered RFC 3066 language tag, and the draft
proposes removing full tag registration procedure
restrictions as well as decoupling use from registration
that would combine to permit such an abomination.
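
A rough check of the length claim, in Python.  It assumes the RFC 2231
form =?charset*language?encoding?text?= and the RFC 2047 limit of 75
characters per encoded-word, delimiters included; the function names
and the choice of us-ascii as charset are illustrative only.

```python
# RFC 2047 limits an encoded-word to 75 characters, delimiters included.
MAX_ENCODED_WORD = 75

def encoded_word_overhead(charset, language):
    """Characters used by everything except the encoded text itself."""
    # =?charset*language?X?...?=  where X is a one-letter encoding (B or Q)
    return (len("=?") + len(charset) + len("*") + len(language)
            + len("?X?") + len("?="))

def room_for_text(charset, language):
    """Characters left for encoded text; negative means the tag cannot fit."""
    return MAX_ENCODED_WORD - encoded_word_overhead(charset, language)

LONG_TAG = ("sr-CS-891-boont-gaulish-guoyu-boont-gaulish-guoyu"
            "-boont-gaulish-guoyu")
```

Under these assumptions room_for_text("us-ascii", LONG_TAG) is negative,
i.e. the tag alone overruns the encoded-word before any text is added,
while a registered tag such as cel-gaulish leaves room for text.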

___
Ietf mailing list
[EMAIL PROTECTED]
https://www1.ietf.org/mailman/listinfo/ietf


Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-12 Thread Bruce Lilly
  Date: 2004-12-12 20:57
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED], [EMAIL PROTECTED]
  
  From: [EMAIL PROTECTED] [mailto:ietf-languages-
  [EMAIL PROTECTED] On Behalf Of Bruce Lilly
 
 
   That is not at all the aim here wrt stability; rather, the aim is that a
   symbolic identifier used for metadata in IT systems not change because
   some government on a whim says, "We would now prefer to use 'yz' rather
   than 'xy' to designate our country."
  
  If by international agreement, 'yz' becomes the designation
  for that country, then it is rather silly to stick one's
  fingers in one's ears and shout NA-NA-NA-NA-NA I don't want
  to hear you.
 
 That misses the point entirely. The point is that IDs used by political
 administrations may change for any number of reasons, and those
 administrations may have no qualms with such changes;

For such changes to become enshrined in an ISO standard
requires a bit more than a mere whim on the part of one
party; in the case of the particular ISO standards under
discussion, it requires convincing the duly appointed
maintenance authority to make the change.

 but in IT 
 systems, we cannot afford changes that break existing implementations
 and data.

Any implementations that depend on country/language codes
never changing are by definition broken implementations,
since there was never any guarantee that codes would never
change.  Change happens, and IT knows how to cope; it's a
versioning problem, and that's not a particularly difficult
problem.  Now I fully agree that in hindsight the ISO and
its appointed MAs could have provided a better record of
changes.

 If for whatever reason ISO and the UN decided that US should 
 be used to designate the country of France, I doubt you'd expect every
 software vendor to update all of their deployed installations to use
 fr-US instead of fr-FR, and for every user to go through every data
 repository they manage to make such changes in their data.

The only way that would be likely to happen would be if
there were no longer a US *and* if the ISO and UN
representatives of France were to initiate a request for
such a change.  One would presume that they would have
good reason to do so, and could explain said reasons in
order to convince their ISO and UN counterparts to agree
to the change.  Under those hypothetical circumstances, I
can only assume that software vendors who care about such
matters would either agree with the hypothetical reasons
or would have acted to convince those in favor of the change
of reasons to avoid the change.  And while I would not
expect users to retroactively change documents any more than
I would expect coins and paper money to be reissued with old
dates but new designations of country name, I would expect
that as of the agreed-upon effective date of the change that
new documents would be prepared in accordance with the new
standard.  It's difficult to be more precise about such a
wild hypothetical, but consider similar changes made to
time zones...

 The people that maintain time zone definitions may have their means for
 changing times; that's fine for them. They are not dealing with the same
 concerns as we are dealing with.

Sure they are; it's another instance of the same sort of
versioning problem, with the same root causes, viz. items
which are changed (more frequently than some would like)
by politicians.

 The group here that has focused 
 specifically on language-tagging issues for several years has evaluated
 issues that affect language tags and the impact of changes and has
 decided what is best practice for *this* domain, and it is to maintain
 stability of data rather than cater to whims of political
 administrations.

Now that the horses have all run away, you'd better make
sure the stable doors are locked. :-)  There was never
any guarantee of stability of country codes or of language
codes.  Declaring at some time in the future that today's
meaning of sr-CS never meant what it in fact does mean
doesn't create stability; it creates instability -- it
doesn't make the versioning problem go away; it adds yet
a third version to the existing two.
 
  Designed or not, country codes *are* read by humans; they
  appear in top-level domain names. Currently the ISO 639
  2-letter codes mean the same thing as the last component of
  a domain name
 
 I think you mean ISO 3166 2-letter codes.

Yes, my error.
 
  and as the second component of a language-tag.
  It's rather silly to change that correspondence simply because
  a few people are piqued that international agreement has been
  reached to change a few 2-letter codes.
 
 The usability flaw in treating ISO 639 and ISO 3166 as human-readable is
 evident in the confusion between ja and JP (or is it jp and JA?), and GB
 vs UK.

Without looking I can easily tell that jp and uk are country
codes precisely *because* they are well-known as TLDs.

 As for what is silly, if the UN country ID for Canada changed to 
 CN 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-11 Thread Frank Ellermann
Hi, two problems in draft-phillips-langtags-08.txt :

1 - ISO 3166-1 is dead

This memo should not be used in new Internet standards, see
http://www.iab.org/documents/correspondance/2003-09-25-iso-cs-code.html

A reference to some obscure 1998 edition of ISO 3166-1 doesn't 
help, would it include TL ?  What about the numerous dubious
countries in 3166, not the simple cases like CS, EU, or PS,
but RB, RC, FX, EH, BX, SF, or NT ?

The draft is about languages, an appendix listing relevant
country codes copied from an old ISO 3166-1 version (before CS)
should be good enough, and future changes could be handled as
IANA registry.

Where can I find the NH in en-NH ?  It's not in the public list
http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/iso_3166-1_decoding_table.html?printable=true#AA

2 - Fallback

The text explains why en-US-boont matches en-US or en.  But it
does apparently not match en-boont.  That's ugly.  If I'd use
de-CH-1996, then I want it to match de-CH or de-1996 before
a plain de.  (de-1996 = new orthography, de-CH = no ß)

Another example in the draft is fr-Latn-CA.  I've no idea what
other scripts are popular in fr-CA, but maybe fr-CA is somewhat
different from fr-FX, and then I wouldn't want a match with fr
if fr-CA is also available.

A counterexample is sr-Latn-YU, a match with sr-YU or sr won't
help if it's in fact sr-Cyrl-YU or sr-Cyrl.  In that case the
priority script before region is okay.  In other cases like
se-Latn-AX the script is less important than the region.
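
The difference can be sketched in Python.  The first function is
fallback by truncation from the right, which is how the behaviour
described in the draft text is read here; the second is a hypothetical
alternative that keeps the primary subtag and tries every combination
of the remaining subtags, longest first.  Neither function is taken
from the draft.

```python
from itertools import combinations

def truncation_fallbacks(tag):
    """Candidates obtained by dropping subtags from the right, one at a time."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]

def subset_fallbacks(tag):
    """Candidates that keep the primary subtag and any subset of the rest.

    Larger subsets come first; within one size, earlier subtags are
    kept in preference to later ones.
    """
    primary, *rest = tag.split("-")
    candidates = []
    for size in range(len(rest), -1, -1):
        for combo in combinations(rest, size):
            candidates.append("-".join((primary, *combo)))
    return candidates
```

truncation_fallbacks("en-US-boont") never produces en-boont, while
subset_fallbacks("de-CH-1996") offers de-CH and de-1996 before a plain
de.  The fixed ordering inside subset_fallbacks is a simplification:
it cannot express that script matters more than region for
sr-Latn-YU but less for other tags.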

 Bye, Frank





Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-10 Thread Bruce Lilly
On Thu December 9 2004 12:23, [EMAIL PROTECTED] wrote:
 New Last Call: 'Tags for Identifying Languages' to BCP
  Date: 2004-12-08 17:56
  From: The IESG [EMAIL PROTECTED]
  To: IETF-Announce [EMAIL PROTECTED]
  Reply to: [EMAIL PROTECTED]
  
 The IESG has been considering
 
 - 'Tags for Identifying Languages '
  draft-phillips-langtags-08.txt as a BCP
 
 There have been considerable changes to the document since the
 initial last call, and the IESG would like the community to consider
 the changes. In addition, the authors have prepared text describing
 why this mechanism is needed as a replacement for the existing
 procedure; it is included below.
 
 The IESG plans to make a decision in the next few weeks, and solicits
 final comments on this action. Please send any comments to the
 [EMAIL PROTECTED] or [EMAIL PROTECTED] mailing lists by 2005-01-05.
 
 The file can be obtained via
 http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt

I have some comments below.  They should not be construed as
a complete or thorough critique of the draft; they're initial
comments based on a quick review of the draft.

One overall comment; I'm surprised to hear that this was
already at last call -- some notice to mailing lists which are
heavily affected by the proposed changes (e.g. ietf-822)
would have been nice...   Considering the depth and breadth
of the specific issues discussed below, I'm not sure that
surprise is adequate...

 This specification, the proposed successor to RFC 3066, addresses a number of
 issues that implementers of language tags have faced in recent years:
[...]
 * Accessibility of the underlying ISO standards for implementers
[...]
 There are problems with the RFC 3066 definition of generative tags,
 however. The ISO 639 and ISO 3166 standards are not freely available and
 evolve over time.

Accessibility has not been a problem for this implementor (who,
incidentally, was unaware of this draft until the New
Last Call).  ISO 639 language code lists are readily available in
HTML-ized English and French via
http://www.loc.gov/standards/iso639-2/englangn.html
and
http://www.loc.gov/standards/iso639-2/frenchlangn.html
ISO 3166 country code lists are readily available in plain text
in English and French via

http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1-semic.txt
and

http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-fr1-semic.txt

The ISO registered code lists are freely available at the URIs
given above.  This implementor has used those URIs for years
without difficulty.  The ISO standards themselves are not free,
but neither are they required for an implementor to identify
the valid codes -- the free lists suffice for that purpose.
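
As an illustration of how little is needed, here is a Python sketch
that builds a bilingual menu from two such lists.  It assumes each data
line has the form NAME;CODE with a two-letter upper-case code, and that
any other line (header, blank) can be skipped; the sample strings
below, including the header line, are stand-ins and not copies of the
real files.

```python
def parse_semicolon_list(text):
    """Map two-letter code -> name for every line shaped like NAME;CODE."""
    codes = {}
    for line in text.splitlines():
        name, sep, code = line.strip().rpartition(";")
        # Skip headers, blank lines, and anything without a 2-letter code.
        if sep and len(code) == 2 and code.isalpha() and code.isupper():
            codes[code] = name
    return codes

def bilingual_menu(english_text, french_text):
    """Map code -> (English name, French name or None), sorted by code."""
    en = parse_semicolon_list(english_text)
    fr = parse_semicolon_list(french_text)
    return {code: (en[code], fr.get(code)) for code in sorted(en)}

# Stand-in data in the assumed format (not the real file contents).
SAMPLE_EN = "NAME;CODE\n\nFRANCE;FR\nGERMANY;DE\n"
SAMPLE_FR = "NOM;CODE\n\nFRANCE;FR\nALLEMAGNE;DE\n"
```

bilingual_menu(SAMPLE_EN, SAMPLE_FR) pairs each code with its name in
both languages, which is the sort of menu an implementation can build
from the two lists today.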

 The largest change in the specification is that it modifies the structure of
 the language tag registry. Instead of having to obtain lists of codes from 
 five
 separate external standards (not all of which are easily available), the IANA
 registry will maintain a comprehensive list of valid subtags that can be used 
 in
 the generative mechanism in a machine-parseable text format.

Contrary to the implicit claim, the ISO documents mentioned
above comprise two standards (available in two languages each),
not five separate external standards.

The availability of those two definitive standards in bilingual
forms allows implementors to (for example) construct menus of
available language and country code tags in BOTH languages used
in ISO standards.  The draft proposes declaring those standards
effectively irrelevant, being replaced by a single monolingual
(English) IANA registry. While it has become fashionable in
recent years among some factions within the United States
to bash France, the French people, their culture, and their
language, it seems inappropriate to extend such bashing to
technical standards which supposedly apply in an international
context. Especially when dealing with the subject matter of
language itself. The unavailability of the registered value
description in 50% of the languages traditionally used for
international standards publication, including the existing ISO
639 and 3166 codes, is a serious defect in the proposal, and
a departure from the status quo under RFC 3066 (which directly
refers to the bilingual ISO standards as definitive). [N.B. I
am not accusing the draft authors of French-bashing; it's just
that some of us are a bit more sensitive to Anglo-centricity
than others.  And it remains a fact that the draft has no
provision for bilingual descriptions of any subtag fields. (I
note in passing that the UN regional codes newly referenced
by this draft are available in HTML-ized (ostensibly) English
(though I've never seen an A-ring in English text before...)
and French).]

It is claimed that:
 In addition, and very importantly, language tags that are newly
 defined by this specification are compatible with the ABNF syntax, 

Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-10 Thread Valdis . Kletnieks
On Fri, 10 Dec 2004 14:46:52 EST, Bruce Lilly said:

 Accessibility has not been a problem for this implementor (who,
 incidentally, was unaware of this draft until the New
 Last Call).  ISO 639 language code lists are readily available in
 HTML-ized English and French via
   http://www.loc.gov/standards/iso639-2/englangn.html
 and
   http://www.loc.gov/standards/iso639-2/frenchlangn.html
 ISO 3166 country code lists are readily available in plain text
 in English and French via
   
 http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1-semic.txt
 and
   
 http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-fr1-semic.txt
 
 The ISO registered code lists are freely available at the URIs
 given above.  This implementor has used those URIs for years
 without difficulty.  The ISO standards themselves are not free,
 but neither are they required for an implementor to identify
 the valid codes -- the free lists suffice for that purpose.

I'm certainly belaboring the obvious (in that the standards in question
are basically useless unless at least this subset of information is freely
accessible so everybody uses the same values), but is there any statement
from the ISO side that this state of affairs (or equivalent access) is
going to continue for at least the code lists we need?

(I'd not even ask, except this seems to be the month we spend time worrying
about explosive bolts attached to our *own* infrastructure - seems to be a good
time to worry about institutional insanity on the part of a totally separate
standards organization.. ;)




Re: New Last Call: 'Tags for Identifying Languages' to BCP

2004-12-10 Thread Bruce Lilly
 RE: New Last Call: 'Tags for Identifying Languages' to BCP
  Date: 2004-12-10 20:03
  From: Peter Constable [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  CC: [EMAIL PROTECTED]
  
 Resuming my comments:

  Specifically, the draft allows, and RFC 3066 disallows:
   subtags more than 8 octets in length
   hyphens which do not separate subtags
   zero-length subtags
   primary tags which are not purely alphabetic
  Curiously, all of those are permitted by the draft ABNF
  production grandfathered...
 
 The grandfathered production in the current draft is 
 
 grandfathered  = ALPHA *(alphanum / "-")
 
 which does permit the sequences claimed by Bruce (except for
 not-purely-alphabetic primary sub-tags),

No exception.  alphanum is ALPHA / DIGIT.  In plain
English, grandfathered as defined in the draft is a letter
followed by any number of letters, digits, and/or hyphens, in
any order.  And that includes a123-xyz as I initially stated,
and clearly 1, 2, and 3 are digits.

 syntactically; but the set of 
 tags available for use is constrained by more than the ABNF syntax
 alone: the acceptable productions for each sub-tag must either be taken
 from one of the source standards or be registered.

So what? The ABNF is an expression of the grammar that
describes the set of all valid tags.  If the grammar permits
y-, a123-xyz, etc. (and it does) then a parser
claiming to parse language tags as defined by that ABNF
must be able to parse such tags.  That is, the ABNF-
specified grammar imposes requirements on parsers.  If
one doesn't intend to impose such requirements, the
ABNF specifying the grammar should be changed
accordingly.
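
To make the point concrete, here are the two grammars transcribed into
Python regular expressions.  The first follows the draft production as
described above (a letter followed by any number of letters, digits,
and hyphens); the second follows the RFC 3066 syntax (a primary subtag
of 1 to 8 letters, then hyphen-separated subtags of 1 to 8 letters or
digits).  The pattern and function names are illustrative, and the
transcription is a sketch rather than a reference implementation.

```python
import re

# Draft "grandfathered": one letter, then any mix of letters, digits, hyphens.
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# RFC 3066: 1*8ALPHA followed by any number of "-" 1*8(ALPHA / DIGIT).
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

def accepted_by(pattern, tag):
    """True if the whole tag matches the given grammar."""
    return pattern.fullmatch(tag) is not None
```

Strings such as y-, a123-xyz, a--b, and an 11-letter single subtag are
accepted by the first pattern and rejected by the second, whereas
ordinary tags such as en-US or cel-gaulish are accepted by both.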

 This is no different 
 from RFC 3066, so it is no more of a problem in this specification than
 it was in RFC 3066.

It is a very different grammar from RFC 3066, imposing
very different requirements on parsers.
 
 It might be that the wording in 2.2 could be tightened up to eliminate
 any possible question regarding the source for grandfathered
 productions.

It's not a matter of wording; the problem is with the ABNF.

 Alternately, there's no reason why the grandfathered production
 shouldn't be composed exactly to match what was used in RFC 3066:
 
 grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I believe I said as much (though one then needs to look
at reduce/reduce conflicts implied by the revised grammar):
 
  I see no reason for the ABNF to permit such content as is
  forbidden by RFC 3066; the actual ABNF for what RFC 3066
  permits is contained within 3066, and could have been directly
  incorporated rather than producing a grandfathered
  production which opens up several cans of worms.
 
 This vastly overstates the problem. There is no can of worms unless it
 exists in tags currently available under RFC 3066.

I referred to the additional requirements imposed on
parsers, as well as the unlimited tag length permitted.

  One defect related to tag length in RFC 3066 is not remedied
  by the draft; indeed the problem is greatly exacerbated...
 
  Unfortunately, a language-tag's length is unlimited by
  the ABNF in RFC 3066 (due to an unlimited number of subtags)
  and in the draft...
 
  In particular, tags other than private-use tags with more than
  two subtags require registration under RFC 3066 rules, and it
  is a trivial matter to determine the longest registered tag.
  The draft, however, encourages use of more subtags as well as
  removal of the subtag length upper bound; moreover, it permits
  infinite numbers of subtags without requiring registration of
  the resulting complete tag.
 
 Bruce states incorrectly that there is no upper bound on the length of
 sub-tags.

Look again at the draft definition of grandfathered -- now
show me where there's a limit in that production on subtag
length.

 His other concern, on the overall length of complete tags, is 
 valid, however: in terms of the ABNF syntax for both RFC 3066 and RFC
 3066bis, infinite-length productions are possible, but RFC 3066 would
 require registration of complete non-private-use tags while RFC 3066bis
 does not.

Yes, and a quick look at the registry reveals that the longest
tag is 11 octets (cel-gaulish).
 
 There are three open doors for infinite-length productions in the ABNF
 of the current draft:
 
 - unlimited extlang sub-tags
 - unlimited variant sub-tags
 - the number of possible extensions is limited to 25

The ABNF indicates no such limit.

 , but the length of 
 extensions is unlimited

You have missed several others:

1. privateuse length is unlimited (either tacked on
after lang etc., or directly as an alternative in
Language-Tag)

2. grandfathered, which as already discussed
permits unlimited length.

 
 We could impose some upper limits on these things; e.g.
 
 Language-Tag = ... *8("-" extlang) ... *8("-" variant) ... 1*25("-"
 extension)

I think you mean *25("-" extension), not 1*25...

 extension = singleton 1*8("-" 2*8alphanum)

That leaves the extension portions' length at up to
25 * (1