Hi -

> From: "Tex Texin" <texte...@xencraft.com>
> To: <l...@ietf.org>; <ietf@ietf.org>
> Sent: Monday, March 02, 2009 1:05 AM
> Subject: [Ltru] draft-ietf-ltru-4645bis-10.txt issue with preferred value for YU
>
> With respect to the proposed update to the Language Subtag Registry
> draft-ietf-ltru-4645bis-10:
>
> I would like to lodge an objection to the deletion of the Preferred-Value for
> language subtag YU.
As ltru co-chair: it's exceedingly late for such an objection - the issue
was discussed at length in the working group over a year ago. A recent
revisiting of the question arrived at the same conclusion.

> This change breaks the equivalence class relation between YU and CS.
> It detrimentally changes the behavior of existing implementations.

As a technical contributor: The main reason that CS does not make sense
as a preferred value for YU is that there is *not* an "equivalence class
relation" between them. There are pieces of what was YU that are not
covered by CS. To treat them as an "equivalence class" ignores
linguistic, geographic, and political reality.

> The loss of the relationship between YU and CS makes documents that were
> believed to be tagged equivalently, to no longer be equivalent.

In my opinion, regarding them as equivalent is an error, since CS and YU
don't encompass the same regions.

> There is also no benefit to this change.

I disagree. The change removes an error.

> To be concrete, assume a user attempts to find documents for languages from
> Yugoslavia.

Language tags do *not* pretend to be able to answer this sort of query.
Using a region subtag (e.g. 'CS') says that the tagged data uses a
specific variety of the primary language, and that the party tagging the
data believes that this distinction is useful. For example, I could tag
this paragraph with 'en' or with 'en-US'. Is that extra distinction
necessary or useful? In this case, no. Consequently, the "retrieving
documents by region subtag" use case, although technically permitted by
RFC 4647, is not realistic, and in many ways contrary to the basic "tag
wisely" principle.

> Using the then current registry data, a query tool noting the preferred
> value relationship, matches either xx-YU and xx-CS.
>
> Another user searches for documents for Serbia.
>
> A query tool using the current registry data noting the preferred value
> relationship, matches either xx-YU and xx-CS.
>
> The results are in some sense accurate and complete, given the history of the
> subtag.

No, they are not.

(1) there is no requirement, much less a guarantee, that the data will
    bear a region subtag at all
(2) there are many bits and pieces of YU not covered by CS - even if
    data always bore a region subtag, the YU->CS mapping would miss all
    the other territory that once belonged to YU.
(3) blindly replacing all YU subtags with CS subtags would in fact
    falsify some data, since the language could well be of a variety
    covered by YU but not by CS.

> After this change in the preferred value relationship, the query
> tool does not know to search for both xx-YU and xx-CS, since the
> registry does not indicate a relationship. Only one or the other
> subtag is used for each query. However, the query results are now
> incomplete since some documents for xx-YU have been tagged with
> the one-time preferred tag of xx-CS.

The relationship cannot be adequately automated with a simple one-way
pointer like "preferred-value". The former YU also encompassed BA, HR,
ME, MK, RS, and SI.

> Comments in the registry are not a solution. Comments are a good
> thing for recording rationale and tangential history. However,
> implementers are not going to go thru and read the comments on any
> or all tags in order to make a correct implementation. They are going
> to implement based on the schema and operate with the data values.

If someone (or something) is applying region subtags, they'd better have
sufficient knowledge of the language varieties to do so meaningfully.
This effectively requires *understanding* those comments and much more.
The Language Subtag Registry does *not* attempt to record all the
information needed to recognize language varieties. Rather, once someone
(or something) has made a distinction, the LSR provides the bits needed
to encode a tag for that language variety.
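
To make the point about the one-way pointer concrete, here is a minimal
Python sketch. It is illustrative only: the names FORMER_YU, FORMER_CS,
PREFERRED_VALUE, canonicalize_region and lost_by_mapping are hypothetical
and are not taken from the Language Subtag Registry or from any
implementation. The YU list is the one given above; the assumption that
CS (Serbia and Montenegro) covered RS and ME is added for the example.

```python
# Illustrative data only; not read from the Language Subtag Registry.
# Regions the former YU encompassed, as listed in this message.
FORMER_YU = {"BA", "HR", "ME", "MK", "RS", "SI"}
# Assumption for this sketch: CS (Serbia and Montenegro) covered RS and ME.
FORMER_CS = {"RS", "ME"}

# A one-way pointer in the style of a Preferred-Value field (hypothetical).
PREFERRED_VALUE = {"YU": "CS"}


def canonicalize_region(region):
    """Follow the one-way pointer, as a naive implementation might."""
    return PREFERRED_VALUE.get(region, region)


def lost_by_mapping(old, new):
    """Return the regions covered by the old subtag but not by its replacement."""
    return old - new


if __name__ == "__main__":
    # The pointer rewrites YU to CS regardless of which variety was meant.
    print(canonicalize_region("YU"))
    # Regions that a blind YU->CS rewrite can no longer represent.
    print(sorted(lost_by_mapping(FORMER_YU, FORMER_CS)))
```

Under these assumptions, data tagged with YU for a variety from BA, HR,
MK, or SI would be mislabeled by the rewrite, which is case (3) above.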

In the particular case of the languages of the former YU, the region
subtags now available (such as BA, HR, ME, MK, RS, and SI) are arguably
far more useful, if someone needs to distinguish regional variations in
their Croatian-language data, than just YU. (It's unclear to me whether
YU would ever have been terribly useful, since it would allow the
distinction of Croatian as spoken there from Croatian spoken somewhere
(where?) else.)

> The registry should stay as it is with respect to YU and retain
> CS as the preferred value.
>
> As CS is now being used as a preferred value, deprecated or not,
> there isn't a compelling motivation to remove the preferred value for YU.

Please, let's look at the actual tagged language data. What corpora out
there have employed YU (correctly) as a subtag? To what extent would
replacing that subtag with 'CS' (rather than with BA, HR, ME, MK, RS, or
SI) be correct, for Serbian, Croatian, or any of the other languages of
that region?

> Please eliminate this needless change that breaks applications
> relying on the relationship between YU and CS.

I would argue rather that an application that relies on an equivalence
relation between YU and CS is already in some sense broken, in the same
way as one assuming that Russia and the Soviet Union are somehow
equivalent.

> tex

Randy
_______________________________________________
Ietf mailing list
Ietf@ietf.org
https://www.ietf.org/mailman/listinfo/ietf