RE: New Last Call: 'Tags for Identifying Languages' to BCP

Peter Constable Mon, 13 Dec 2004 16:08:39 -0800

> From: [EMAIL PROTECTED] [mailto:ietf-languages-
> [EMAIL PROTECTED] On Behalf Of Bruce Lilly



> > The "grandfathered" production in the current draft is
> >
> > grandfathered   = ALPHA *(alphanum / "-")
> >
> > which does permit the sequences claimed by Bruce (except for
> > not-purely-alphabetic primary sub-tags),
> 
> No exception.  "alphanum" is ALPHA / DIGIT.

My mistake; again, I had in mind constraints beyond the ABNF.


> > syntactically; but the set of
> > tags available for use is constrained by more than the ABNF syntax
> > alone: the acceptable productions for each sub-tag must either be taken
> > from one of the source standards or be registered.
> 
> So what? The ABNF is an expression of the grammar that
> describes the set of all valid tags.

It is *part* of the expression of the grammar. Even in RFC 3066 this is the 
case: you know that t-abc is not valid under RFC 3066, but not because that is 
constrained by the ABNF of RFC 3066.

I will accept that the ABNF of the draft should be changed to better reflect what 
the form of grandfathered productions can be, which, as I stated in my previous 
message, would be the equivalent of the ABNF of RFC 3066:

grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I think that's an improvement, though technically I don't think it changes 
anything.
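
For illustration, the two productions can be compared with a rough Python 
sketch. The regular expressions are hand translations of the ABNF quoted 
above, and the sample strings and names are hypothetical, chosen only to show 
where the two forms differ syntactically.

```python
import re

# Draft production quoted above: grandfathered = ALPHA *(alphanum / "-")
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# Proposed replacement: grandfathered = 1*8ALPHA *("-" 1*8alphanum)
RFC3066_STYLE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def accepted_by(pattern, tag):
    """Return True if the whole tag matches the given production."""
    return pattern.fullmatch(tag) is not None


# Hypothetical samples: one registered-style tag and several odd shapes.
samples = ["i-klingon", "en-", "a--b", "a1", "abcdefghijkl"]
for tag in samples:
    print(f"{tag!r}: draft={accepted_by(DRAFT_GRANDFATHERED, tag)}, "
          f"rfc3066_style={accepted_by(RFC3066_STYLE, tag)}")
```

The draft form accepts all five samples; the RFC 3066 form accepts only the 
first. None of the other four is a registered tag, which is why the change 
tightens the syntax without changing the set of usable tags.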


> If
> one doesn't intend to impose such requirements, the
> ABNF specifying the grammar should be changed
> accordingly.
> 
> > This is no different
> > from RFC 3066, so it is no more of a problem in this specification than
> > it was in RFC 3066.
> 
> It is a very different grammar from RFC 3066, imposing
> very different requirements on parsers.

Our disagreement amounts to a basic question of whether parsers should be 
written based on the ABNF alone, or based on the ABNF plus other constraints 
provided in the spec. Clearly, I think anyone writing a parser should consider 
other constraints as well.
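
A minimal sketch of the distinction, using the RFC 3066 syntax and the t-abc 
example from earlier in this message. The code set and registry below are 
tiny illustrative stand-ins, not the real ISO 639 list or IANA registry, and 
only the primary sub-tag rule is checked.

```python
import re

# RFC 3066 syntax: Language-Tag = 1*8ALPHA *("-" 1*8alphanum)
RFC3066_SYNTAX = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

# Illustrative stand-ins only (assumptions, not complete data).
KNOWN_LANGUAGE_CODES = {"en", "de", "fr"}
REGISTERED_TAGS = {"i-klingon", "i-navajo"}


def is_well_formed(tag):
    """Stage 1: does the tag match the ABNF alone?"""
    return RFC3066_SYNTAX.fullmatch(tag) is not None


def is_valid(tag):
    """Stage 2: ABNF plus the constraints stated in the prose of the spec."""
    if not is_well_formed(tag):
        return False
    lowered = tag.lower()
    primary = lowered.split("-", 1)[0]
    if primary == "x":
        # Private use: accepted as long as something follows the "x".
        return "-" in lowered
    if lowered in REGISTERED_TAGS:
        return True
    if len(primary) == 1:
        # Other single-letter primary sub-tags are not assigned.
        return False
    # Later sub-tags are not checked in this sketch.
    return primary in KNOWN_LANGUAGE_CODES


print(is_well_formed("t-abc"), is_valid("t-abc"))
```

A parser built only on the first function accepts t-abc; one that also 
applies the second does not.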



> > > In particular, tags other than private-use tags with more than
> > > two subtags require registration under RFC 3066 rules, and it
> > > is a trivial matter to determine the longest registered tag.
> > > The draft, however, encourages use of more subtags as well as
> > > removal of the subtag length upper bound; moreover, it permits
> > > infinite numbers of subtags without requiring registration of
> > > the resulting complete tag.
> >
> > Bruce states incorrectly that there is no upper bound on the length of
> > sub-tags.
> 
> Look again at the draft definition of "grandfathered" -- now
> show me where there's a limit in that production on subtag
> length.

As mentioned, the limit is imposed by other tight constraints on 
'grandfathered'; you have already identified that the longest registered tag 
under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be 
at most 11 octets in length.



> > There are three open doors for infinite-length productions in the ABNF
> > of the current draft:
> >
> > - unlimited extlang sub-tags
> > - unlimited variant sub-tags
> > - the number of possible extensions is limited to 25, but the length of
> >   extensions is unlimited
> 
> You have missed several others:
> 
> 1. "privateuse" length is unlimited (either tacked on
>     after "lang" etc., or directly as an alternative in
>     "Language-Tag")

I disregarded this since it is identical to the case for RFC 3066, and you 
were, after all, charging that the draft creates problems that were worse than 
for RFC 3066.


> 2. "grandfathered", which as already discussed
>     permits unlimited length.

But, as already stated, it is very tightly constrained, with a de facto upper 
limit of 11 octets (subject to change if new tags are registered before the 
proposed spec is accepted).


> > We could impose some upper limits on these things...

> That leaves the extension portions' length at up to
> 25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
> of a tag into account!   That's way too long (the RFC 2047
> limit for an encoded-word is 75 octets, including charset tag,
> some text, and some syntactic glue in addition to the language
> tag).

The problem already exists in RFC 3066. Even apart from private-use tags, 
tomorrow someone could request a registration for a tag that's 87 octets long, 
and there's nothing in RFC 3066 that would prohibit acceptance.
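
The quoted figure can be checked with a short calculation. The reading of the 
terms is an assumption, since the proposed limits themselves are elided 
above: 25 extensions, each a hyphen plus a singleton plus up to eight 
sub-tags, each sub-tag a hyphen plus up to eight characters.

```python
# Assumed reading of the quoted formula 25 * (1 + 1 + 8 * 9).
EXTENSIONS = 25              # number of possible extensions, as quoted
SUBTAGS_PER_EXTENSION = 8    # assumed cap, inferred from the "8 * 9" term
SUBTAG_OCTETS = 8            # assumed cap on the length of each sub-tag

# hyphen + singleton + sub-tags, each sub-tag preceded by its own hyphen
per_extension = 1 + 1 + SUBTAGS_PER_EXTENSION * (1 + SUBTAG_OCTETS)
total = EXTENSIONS * per_extension

RFC2047_ENCODED_WORD_LIMIT = 75  # limit cited in the quoted text

print(per_extension, total)                # 74 1850
print(total > RFC2047_ENCODED_WORD_LIMIT)  # True
```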


> > So, I think Bruce has identified a valid issue here. I personally would
> > not have characterized it as greatly exacerbating, though,
> 
> IMO, an increase from 11 octets worst-case, which is tolerable
> for constructing RFC 2047/2231 encoded-words, to >> 1850
> octets, which exceeds by a large margin what can be handled
> in a Content-Language or Accept-Language message header
> field, constitutes "greatly exacerbated".

Repeating my previous point, RFC 3066 doesn't stop a registered tag from being 
10^100 octets in length. Of course, all of us know that such a tag wouldn't be 
useful. At some point, we have to engage common sense, even for RFC 3066. The 
draft would allow a tag 

en-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont

(over 75 octets), but common sense tells us it doesn't make sense (and that 
anyone who uses such a thing deserves whatever they get). 

Now, we could try to revise the ABNF to constrain for such things, just as the 
ABNF of RFC 3066 could have been constrained further. It's not easy to express 
common-sense constraints in ABNF, however.

I suggest that wording be added to the draft giving a strong recommendation 
to users that they not use tags the complete length of which exceeds 75 
characters.
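
Such a recommendation is simple for an implementation to check. A minimal 
sketch follows; the function name and the constant are hypothetical, and the 
sample tag is built to have the same shape as the example above.

```python
MAX_RECOMMENDED_OCTETS = 75  # the limit suggested in this message


def exceeds_recommended_length(tag, limit=MAX_RECOMMENDED_OCTETS):
    """Return True if the tag is longer than the recommended limit.

    Tags are US-ASCII, so encoding raises UnicodeEncodeError on anything else.
    """
    return len(tag.encode("ascii")) > limit


# A tag of the same shape as the example above: "en" plus repeated "-boont".
long_tag = "en" + "-boont" * 13

print(len(long_tag), exceeds_recommended_length(long_tag))  # 80 True
print(exceeds_recommended_length("en-US"))                  # False
```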



> > > I am absolutely shocked that a draft dealing with language
> > > lacks an "Internationalization considerations" section as
> > > recommended by RFC 2277 (a.k.a. BCP 18).
> >
> > No more or less shocking than for RFC 3066, regarding which I'm not
> > aware of any complaints.
> 
> By deferring to the bilingual ISO lists for language and country
> tags, 3066 at least provided a minimal degree of internationalization.
> By explicitly limiting description fields to English and restricting
> the charset to US-ASCII, the draft proposal takes a giant leap
> backwards.

The US-ASCII limitation existed in RFC 3066, so is not new. 

On the more general point, I believe you are confusing i18n concerns with 
localization concerns: you are looking for strings to be used in UI for 
different local markets. Apart from charset, RFC 1766, RFC 3066 or RFC 3066bis 
do not have *internationalization* concerns.


> > I don't quite understand what the critique is here: what is there to
> > internationalize about language tags?
> 
> There should probably be a reference (at least informative)
> pointing to BCP 18 and mentioning that the language tags
> defined provide a means of labeling the language of text,

Have you not read the abstract in the draft?

<quote>
   This document describes the structure, content, construction, and
   semantics of language tags for use in cases where it is desirable to
   indicate the language used in an information object.
</quote>


Or the introduction?
<quote>
   One means of indicating the language used is by labeling the
   information content with a language identifier...

   This document specifies an identifier mechanism...
</quote>

How much clearer does it need to be?



> The draft (if/when approved) should also indicate that
> it updates BCP 18, which refers to RFC 1766.

Is this right? This draft is not a replacement for RFC 2277, or an addendum to 
it. RFC 2277 also refers to RFC 1958, which was updated by RFC 3439, but surely 
RFC 3439 doesn't state that it updates BCP 18? (RFC 2277 does have a section 
with significant overlap in topic, though, so perhaps this makes sense. I'm not 
well-enough versed in IETF document process to know.)


> Given the divergence noted above from RFC 3066's use
> of multilingual reference lists, the Internationalization
> considerations section should include a synopsis of the
> approach chosen (viz. to restrict description to English) and
> the rationale for that choice (see BCP 18 section 6).

Again, this is a localization issue, not an internationalization issue. I do 
not consider this necessary or even appropriate.



> > It's
> > true that ALPHA and DIGIT are not defined
> 
> Non-sequitur aside, those terms are defined in RFC 2234.

Of course I meant "not defined *within this document*".


> > >     implications (ISO 8601 date format parsing).
> >
> > As mentioned above, this really is a non-issue.
> 
> It's an issue (esp. in light of the finger pointing regarding
> accessibility to ISO 639/3166).

As has been pointed out, there is no such finger-pointing in the draft.


> Admittedly it can be
> resolved without much difficulty (but then that could
> have been done earlier, couldn't it?).

I think the authors and those of us who have been reviewing thought that the 
intent was quite clearly YYYY-MM-DD, so didn't see a concern. That's why last 
calls are announced to a much wider audience.
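
If the draft states the YYYY-MM-DD form explicitly, a strict check becomes 
trivial. A minimal sketch, assuming that form is the only one intended; the 
function name is hypothetical.

```python
import re
from datetime import date

# Exactly four digits, two digits, two digits, separated by hyphens.
_DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_date_field(text):
    """Parse a YYYY-MM-DD string, rejecting other ISO 8601 variants."""
    match = _DATE_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() itself raises ValueError for impossible dates such as 2004-02-30.
    return date(year, month, day)


print(parse_date_field("2004-12-13"))
```

Other ISO 8601 forms, such as 20041213 or a date with a time part, raise 
ValueError instead of being silently accepted.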


> > > 2. the clear contradiction between the claims about
> > >     ABNF compatibility with RFC 3066 and the factual
> > >     incompatibility of certain provisions in the grammar.
> >
> > The main concern was with the "grandfathered" production, but I've shown
> > that that is a non-issue.
> 
> Again, it is an issue that imposes requirements on language
> tag parsers.  What you've shown is that the ABNF is not
> consistent with what was desired to be expressed, and
> that makes it an issue that needs to be addressed.

Again, I believe the bigger issue is not getting the ABNF to express what was 
desired, but rather whether parsers are written to consider only the ABNF or 
the ABNF plus other specified constraints as well.


> > The maximal length issue exists just as much
> > in RFC 3066 due to private-use tags; it is a technical concern that
> > might be worth reviewing in RFC 3066bis, however; but it is not
> > insurmountable, and not a new problem.
> 
> Private-use carries its own considerable baggage; aside from
> that, the draft proposal increases the length of non-private
> tags that affect both protocol design and implementations
> from a worst case maximum of 11 octets under RFC 3066...

Worst case at present; a month from now it could be arbitrarily larger. But 
I've accepted that it would be an improvement to add constraints on overall 
length.


Peter Constable
Microsoft Corporation

_______________________________________________
Ietf mailing list
[EMAIL PROTECTED]
https://www1.ietf.org/mailman/listinfo/ietf
