Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
2011/9/1 Karl Williamson pub...@khwilliamson.com: Unicode 6.0 broke UTS #18, which since 1999 has suggested that BELL be the name used in regular expressions for U+0007. In 2003, this was strengthened to "should be used". The breakage occurred by requiring that BELL instead be the name for a different code point. By breaking UTS #18, all implementations of it, including Perl's, were broken, causing real harm to real code and real people. For this reason, Perl has not completely adopted 6.0. Further, UTS #18 encourages implementations to do exactly what Perl did: The ISO names for the control characters may be unfamiliar, ... so it is recommended that they be supplemented with other aliases. For example, for U+0009 the implementation could accept the official name CHARACTER TABULATION, and also the aliases HORIZONTAL TABULATION, HT, and TAB. Thanks then for explaining that. So now such aliases are needed to correct obvious errors. Well, this is not a real correction, but a change for the new short name. The "should" that specified an alias will now be replaced by a "must" with the new alias. This instability should have been explained, as it was not explicit in the PRI. The genesis of this proposal was to prevent the Unicode Consortium from making this kind of mistake again. The language in UTS #18 mentioning the TAB variants also dates to 2003. I think this example makes it clear why more than one alias may be needed per code point. Of course, PRI #202 is not the only mechanism possible to achieve the needed goal of preventing another mishap like BELL. But the consensus in the discussion about it was that it was the easiest route to get there. Given that most of these discussions have occurred offline (not publicly but in private reports to the UTC, or between UTC members working with the closed unicore discussion list, or privately among themselves), I had no idea of these discussions. But now that I'm a UTC member, I hope I will hear these cases earlier... 
Does it justify so many new aliases at the same time? I've not checked the history of all past versions of UAX, UTR, and UTN (or even in the text of chapters of the main UTS)... Are there other cases in those past versions that this PRI should investigate and trace back?
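To make the "more than one alias per code point" point concrete, here is a minimal sketch of how an implementation might resolve names against an alias table. The semicolon-separated `code;alias;label` layout, the label values, the BEL entry, and the helper names (`parse_aliases`, `resolve`, `aliases_for`) are assumptions made for illustration; they are not taken from the draft file under review. Only the four U+0009 aliases come from the UTS #18 text quoted above.

```python
from typing import Dict, List, Tuple

# Illustrative sample only; NOT the contents of the draft NameAliases.txt.
SAMPLE_ALIASES = """\
# code;alias;label  (labels here are hypothetical)
0007;BEL;abbreviation
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
"""

# Formal character names win over aliases. U+1F514 has been named BELL
# since Unicode 6.0, which is why BELL can no longer mean U+0007.
FORMAL_NAMES: Dict[str, int] = {"BELL": 0x1F514}


def parse_aliases(text: str) -> Dict[str, Tuple[int, str]]:
    """Map each alias to (code point, label), skipping comments and blanks."""
    table: Dict[str, Tuple[int, str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        code, alias, label = (field.strip() for field in line.split(";"))
        table[alias] = (int(code, 16), label)
    return table


def resolve(name: str, aliases: Dict[str, Tuple[int, str]]) -> int:
    """Return the code point for a formal name or, failing that, an alias."""
    if name in FORMAL_NAMES:
        return FORMAL_NAMES[name]
    return aliases[name][0]


def aliases_for(code_point: int, aliases: Dict[str, Tuple[int, str]]) -> List[str]:
    """List every alias recorded for one code point, in sorted order."""
    return sorted(a for a, (cp, _) in aliases.items() if cp == code_point)
```

With this table, `resolve("TAB", table)` and `resolve("HT", table)` both give U+0009, while `resolve("BELL", table)` gives U+1F514, so code that still expects BELL to mean U+0007 has to be moved to an alias.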
RE: [OT] Reusing the same property (was: RE: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0)
On Wednesday 31 August 2011, Doug Ewell d...@ewellic.org wrote: Coming back full circle, this is where many of the PUA protests on this list come from -- some folks want to use the Unicode PUA to encode things that are not characters, not even glyphs or symbols, nor anything else remotely resembling the intended scope of the Unicode Standard. Well, some of my ideas for which I am using the Private Use Areas have been banned from being discussed in this mailing list. Yet that is mailing list policy, not policy on what a person may use the Private Use Areas for. The two are not the same. There is a ban on discussing the ideas in this list, yet I am entirely free to use the Private Use Area for assigning meanings within the scope that those meanings are Private Use meanings, to publish those meanings within the scope that those meanings are Private Use meanings, to make and publish fonts, to produce and publish PDFs as I choose, and to continue my research. The intended scope of the Unicode Standard is something that can change with time. Research is about progress. What is in the Unicode Standard should not be constrained to what was intended to be in the Unicode Standard many years ago when the Unicode Standard was started. Certainly, there are some things that cannot be changed, yet not changing those things is not the same as restricting what the Unicode Standard can encode in the future. There have been various technological developments since the Unicode Standard was started and the scope of the Unicode Standard has been enlarged to support those new technologies. For example, the encoding of the emoji and now the encoding-in-progress of the symbols of the Webdings font. In relation to the encoding-in-progress of the symbols of the Webdings font, something I have wondered about is whether the encoding is for the specific Webdings glyphs or whether the encoding is for any representation of the same general concept. 
As a particular example, please consider the character that is accessed by the letter P using the Webdings font. I have seen that glyph used in a GIF attached to an email, along with text suggesting that readers help the environment by not printing the email unless it is considered essential to do so. In another post Doug wrote as follows: I don't know what this means. Private-use tags starting with x- cannot be reliably and algorithmically parsed into subtags (just like all language tags before RFC 4646), but there are no real limits to what language information can be conveyed in them, as you seem to imply; you can write x-navi-as-spoken-in hometree-on-pandora if you like. I am using x-y as a language tag for some of my research. For example, there could be a database table for x-y and en-gb-oed sentences and another database table for x-y and fr sentences. One could then seek matches in the x-y fields in each table so as to find a link from an en-gb-oed sentence to a fr sentence. The method is suitable not only for French; there could be a database table for x-y and any other language tag, as desired. William Overington 1 September 2011
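The table-matching idea described above can be sketched as follows. This is only an illustration under stated assumptions: the x-y values (`Y0001`, `Y0002`), the sample sentences, and the function name `link_via_pivot` are all invented here, and plain Python lists stand in for the database tables.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (x-y field, sentence in some other language tag)

# Hypothetical rows; the x-y values are opaque keys made up for this sketch.
XY_EN_GB_OED: List[Row] = [
    ("Y0001", "Good morning."),
    ("Y0002", "Where is the railway station?"),
]
XY_FR: List[Row] = [
    ("Y0001", "Bonjour."),
    ("Y0002", "Où est la gare ?"),
]


def link_via_pivot(source: List[Row], target: List[Row]) -> List[Tuple[str, str]]:
    """Pair sentences from two tables whose x-y fields match."""
    by_pivot: Dict[str, List[str]] = {}
    for pivot, sentence in target:
        by_pivot.setdefault(pivot, []).append(sentence)
    return [
        (sentence, match)
        for pivot, sentence in source
        for match in by_pivot.get(pivot, [])
    ]
```

Calling `link_via_pivot(XY_EN_GB_OED, XY_FR)` pairs each en-gb-oed sentence with the fr sentence that shares its x-y value; a table for any other language tag could be passed as `target` in the same way.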
Language subtags (was: RE: [OT] Reusing the same property)
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: 2011/8/31 Doug Ewell d...@ewellic.org: Philippe Verdy wrote: the existing BCP 47 implementations, but that would limit the may-be future extension of ISO 639 to longer codes): ISO 639 could immediately say that it will never allocate any language code (of any length) starting by qa..qz. Not possible; 'qu' is already taken for Quechua. And not necessary: 'qaa' through 'qtz' are reserved. I said using the prefixes starting by qa..qt, This was a direct quote from your post of Wed, 31 Aug 2011 21:58:25 +0200 (http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0456.html). But I'll assume it was a typo for qa..qt; you did mention this shorter range in other posts. these prefixes are not supposed to be used alone, there must be additional letters. so this does not apply to qu alone (yes, assigned to the Quechua macrolanguage, or isn't it a language subfamily ?). From here on, I assume you are asking about BCP 47, not about any part of ISO 639. BCP 47 uses the IANA Language Subtag Registry, which uses ISO 639 as a primary source, but adds constraints. As one example, when an alpha-2 code element (from 639-1) exists for a given language, a BCP 47 subtag exists only for that alpha-2 code element and not for the corresponding alpha-3 code element. So for French, you can only use 'fr' for French and not 'fre' (from 639-2/B) or 'fra' (from 639-2/T and 639-3). BCP 47 language subtags do not have prefixes. An ISO-based language subtag is either 2 or 3 letters long. There is no correlation, explicit or implicit, between a 2-letter language subtag and any 3-letter subtags that begin with those same two letters. ISO 639-3 does classify Quechua as a macrolanguage, but that doesn't affect code allocation; macrolanguages are assigned code elements and subtags just like any other language. 
It is often useful to be able to specify, say, Quechua in a tag instead of one of the many specific varieties of Quechua, such as Chimborazo Highland Quichua or Yanahuanca Pasco Quechua; this in fact is why the concept of macrolanguage exists. But I admit that there's an additional caveat: BCP 47 opens all codes with 5 to 8 characters to possible registration in the IANA registry. I have not checked whether there is any registration of language tags starting with qa..qt in the IANA registry, but there's apparently no policy defined to forbid such registration. BCP 47 language subtags of 2 and 3 letters correspond to code elements assigned in some part of ISO 639. ISO 639-1, as stated earlier, has assigned 'qu' to Quechua. This is reflected in the Registry. I don't have a copy of 639-1 and don't know if it reserves 'qa..qt' or any other range. The 639-2 Web site, which lists 639-1 allocations, doesn't mention any such reservation. ISO 639-2 and 639-3 have defined 'qaa' through 'qtz' as Reserved for local use, which is reflected in the Registry as Private use. BCP 47 explains the use of these subtags as an alternative to the x- mechanism. One advantage, as you pointed out elsewhere, is that the resulting tag can be parsed like a normal tag; the region 'ZW' in qaa-ZW explicitly means Zimbabwe. ISO 639-2 has assigned 'que' to Quechua and ISO 639-5 has assigned 'qwe' to Quechuan (family). ISO 639-3 has assigned more than 50 code elements in the non-private range beginning with 'q', many of which (but not all) are for varieties of Quechua. The Registry reflects all of these assignments except 'que' (because Quechua in BCP 47 is 'qu'). 
And your "not necessary" comment does not apply here either: it just assigns the 3-letter codes for local use, not the longer codes, which are only reserved for the 4-letter codes but not assigned for private use (and there's also no provision given in ISO 639 to protect an encoding space for 5-letter codes or longer, as they are now usable for IANA registration). 4-letter language subtags are reserved, and will remain reserved unless and until BCP 47 is updated (via a new RFC) to make use of them. I don't care to speculate on their future allocation or use. If and when language subtags of 5 to 8 letters are registered, there will be no restriction (as far as I can tell) on subtags beginning with 'q' or any other letter or sequence. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
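As a small illustration of the reserved range discussed in this message, the check below treats a language subtag as private use only when it is exactly three letters and falls between 'qaa' and 'qtz'. The function name is hypothetical and the sketch does not consult the IANA Language Subtag Registry; it only encodes the range itself.

```python
import itertools
import string


def is_private_use_language_subtag(subtag: str) -> bool:
    """True for 3-letter subtags in the range 'qaa' through 'qtz'.

    Subtags are case-insensitive, so the comparison is done in lowercase.
    """
    s = subtag.lower()
    return len(s) == 3 and s.isascii() and s.isalpha() and "qaa" <= s <= "qtz"


def count_private_use_language_subtags() -> int:
    """Count the range by brute force over all 3-letter combinations."""
    return sum(
        is_private_use_language_subtag("".join(letters))
        for letters in itertools.product(string.ascii_lowercase, repeat=3)
    )
```

The count comes out at 20 x 26 = 520 subtags (second letter a through t, third letter a through z). Note that 'qu', 'que' and 'qwe' all fall outside the range, which matches the assignments to Quechua and Quechuan described above.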
Re: Language subtags (was: RE: [OT] Reusing the same property)
In fact, my post about qa..qz was a 1-letter error; I only wanted to speak about qa..qt (as I had stated previously and everywhere else in this thread), before anyone wanted to correct me. Well, I know that this is now going off topic, because someone else spoke about www (before that I had only spoken about the general need for any open encoding standard that wants to be universal to assign a private-use area). To which there was a desire to have a larger space than just qa..qt union qaa..qtz, for easier (algorithmic) mapping of local-use codes (both in ISO 639 and BCP 47). I had also wanted to show that the x- prefix in BCP 47 makes the language tag not parsable like generic structured tags (which are also extensible to support locale tags, using extensions such as the one using the u subtag defined by Unicode, mostly for the CLDR, e.g. to encode collation options or other locale conventions). Using the BCP 47 x- prefix does not permit those extensions, because x- BCP 47 language tags have no structure. And that's why I spoke about two alternatives:
- using another singleton letter q (yes, in BCP 47 only), followed by one subtag, to create arbitrary local-use language tags that would still remain parsable and would support the u extension mechanism
- using ranges of codes starting with qa..qt of arbitrary longer lengths (not limited to 2-3 letters as now), which means a change both in ISO 639 (for code allocation) and BCP 47 (to restrict 5 to 8-letter codes that are NOT freely usable for local use, but still open for registration, so that the IANA registry could accept a registration of 5-8 letter codes starting with qa..qt)
I hope this summary correctly represents what I wanted to show, because once again the intent has been misunderstood and some people on this list were assuming things that I did not intend to request. 
In fact I have not requested anything, just spoken about the existing possibilities that would permit an application to use a comfortable space for its local uses, one that can easily remap some unusable codes to a PUA space where it can create aliases that would be recognized automatically by this local application as such (i.e. an alias of the standard language code), easing the interoperability of this application with the rest of the world, even if it needs to use local-use codes. Philippe.
RE: Language subtags (was: RE: [OT] Reusing the same property)
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Well, I know that this is now going off topic, because someone else spoke about www (before that I had only spoken about the general need for any open encoding standard that wants to be universal to assign a private-use area). To which there was a desire to have a larger space than just qa..qt union qaa..qtz, for easier (algorithmic) mapping of local-use codes (both in ISO 639 and BCP 47). 'qa' through 'qt' is not reserved in BCP 47, and as far as I know it is not reserved in 639-1. I've since discovered that ISO 639-6 has assigned more than a hundred code elements in the range 'qaaa' through 'qtzz', and apparently no code elements marked private-use or user-defined, so it looks like the current repertoire of 520 reserved code elements across all parts of ISO 639 will remain as is. I had also wanted to show that the x- prefix in BCP 47 makes the language tag not parsable like generic structured tags (which are also extensible to support locale tags, using extensions such as the one using the u subtag defined by Unicode, mostly for the CLDR, e.g. to encode collation options or other locale conventions). Using the BCP 47 x- prefix does not permit those extensions, because x- BCP 47 language tags have no structure. We know that. 
And that's why I spoke about two alternatives:
- using another singleton letter q (yes, in BCP 47 only), followed by one subtag, to create arbitrary local-use language tags that would still remain parsable and would support the u extension mechanism
- using ranges of codes starting with qa..qt of arbitrary longer lengths (not limited to 2-3 letters as now), which means a change both in ISO 639 (for code allocation) and BCP 47 (to restrict 5 to 8-letter codes that are NOT freely usable for local use, but still open for registration, so that the IANA registry could accept a registration of 5-8 letter codes starting with qa..qt)
I don't see the advantage of either of these mechanisms, compared to using the existing range 'qaa' through 'qtz'. Is there a need for more than 520 private-use language identifiers? I hope this summary correctly represents what I wanted to show, because once again the intent has been misunderstood and some people on this list were assuming things that I did not intend to request. In fact I have not requested anything, just spoken about the existing possibilities that would permit an application to use a comfortable space for its local uses, one that can easily remap some unusable codes to a PUA space where it can create aliases that would be recognized automatically by this local application as such (i.e. an alias of the standard language code), easing the interoperability of this application with the rest of the world, even if it needs to use local-use codes. I don't think I said you were requesting anything. I clarified some details about BCP 47 and ISO 639 code allocation, and addressed some statements that others might misunderstand, such as that ISO 639 code elements could be regarded as prefixes of others. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
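The structural difference between an x- tag and a tag built on 'qaa' through 'qtz' can be shown with a deliberately simplified splitter. This is a sketch, not a conforming BCP 47 parser: it performs no validation, the function name `split_tag` is hypothetical, and `co-phonebk` is used only as a sample Unicode locale extension keyword.

```python
from typing import Dict, List, Tuple


def split_tag(tag: str) -> Tuple[List[str], Dict[str, List[str]], List[str]]:
    """Split a tag into (structured subtags, extensions, private-use subtags).

    Everything after the first 'x' singleton is opaque private use; other
    one-character subtags open an extension that collects what follows.
    Subtags are lowercased because tags are case-insensitive.
    """
    subtags = tag.lower().split("-")
    private: List[str] = []
    if "x" in subtags:
        cut = subtags.index("x")
        private = subtags[cut + 1:]
        subtags = subtags[:cut]

    structured: List[str] = []
    extensions: Dict[str, List[str]] = {}
    current = None  # singleton of the extension being collected, if any
    for subtag in subtags:
        if len(subtag) == 1:
            current = subtag
            extensions[current] = []
        elif current is not None:
            extensions[current].append(subtag)
        else:
            structured.append(subtag)
    return structured, extensions, private
```

Under this sketch, `split_tag("qaa-ZW-u-co-phonebk")` yields a language subtag, a region and a `u` extension, whereas `split_tag("x-navi-u-co-phonebk")` yields only an opaque private-use sequence in which `u` is just another subtag.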
RE: Language subtags (was: RE: [OT] Reusing the same property)
Doug, you are right. In reference to your following statements, I checked my copies of 639-1 and 639-2, and I couldn't find any reference to the reserved codes ('qa..qt' or any other range) in 639-1, whereas 639-2 has the reserved codes as you stated. Regards, Erkki ISO 639-1, as stated earlier, has assigned 'qu' to Quechua. This is reflected in the Registry. I don't have a copy of 639-1 and don't know if it reserves 'qa..qt' or any other range. The 639-2 Web site, which lists 639-1 allocations, doesn't mention any such reservation. ISO 639-2 and 639-3 have defined 'qaa' through 'qtz' as Reserved for local use, which is reflected in the Registry as Private use.
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/31/2011 11:25 PM, Philippe Verdy wrote: 2011/9/1 Karl Williamson pub...@khwilliamson.com: But now that I'm a UTC member, I hope I will hear these cases earlier... Congratulations! Does it justify so many new aliases at the same time? No. I'm firmly with you. I support the requirement for 1 (ONE) alias for control codes because they don't have names, but are used in environments where they need a string identifier other than a code point. (Just like regular characters, but even more so). I also support the requirement for 1 (ONE) short identifier, for all those control AND format characters for which widespread usage of such an abbreviation is customary. (VS-257 does not qualify). Further, I support, on a case-by-case basis, the addition of duplicate aliases for reasons of compatibility. I would expect these compatibility requirements to be documented for each character in some sort of proposal document, not just a list of entries in a draft property file. Finally, I don't support using the name of any standard, ISO or otherwise, as a label in the new status field. It sets the wrong precedent. I've not checked the history of all past versions of UAX, UTR, and UTN (or even in the text of chapters of the main UTS)... Are there other cases in those past versions that this PRI should investigate and trace back? My preference would be to start this new scheme off with a minimum of absolutely 100% required aliases. Anything even remotely doubtful should be removed for further study. A./
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
So now we completely agree. Thanks.
PRI#198 (Unicode 6.1 beta)
2011/9/2 announceme...@unicode.org: Unicode Standard Annex #42, Unicode Character Database in XML, will be updated for Unicode 6.1. The proposed update is now available for general public review and comment. Review period for this issue closes October 24, 2011. Details are on the following web page: http://www.unicode.org/review/pri198/ Thanks a lot for the excellent idea of adding a summary of the feedback received in this PRI (it would in fact be helpful to do this for all existing and future PRIs). It will certainly help focus each discussion by type, grouping people who submit similar comments so that they can also find some form of agreement with each other. It would also reduce the administrative work, allowing people to reference the comments made by others instead of repeating things, or allowing them to compare the various pieces of feedback if they can't all be accepted together (when a choice has to be made). (When I just started reading the L2 archives, only for 2011 for now, it immediately revealed to me the complex task of sorting all the feedback received, or the lack of feedback on some critical subjects; not sorting it greatly inflates the number of items to schedule in a meeting agenda, so much that they may not be studied and discussed with the time each submitted proposal would merit, or some items will be discussed several times, possibly forgetting some items studied in past meetings, and some decisions could be further delayed after some more investigation time.) And it paves the way for defining the meeting agendas in UTC (quickly sorting out things that should be corrected early if they are blocking other items to be scheduled later: it allows a development of the standard with more precise milestones that are easier to complete without creating problems for the next milestones). -- Philippe.
Fwd: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
For your information, here is a copy of a recent post I made privately to an existing UTC member. === I just posted two new pieces of feedback (using the feedback form) related to the proposed alias names. Both are related to recent decisions taken this year (in February) after a PRI... One concerns NBHY, present in UAX #14 since Unicode 5.0.1, and which contains a list of other candidate aliases, after a decision made in: - PRI #105 (Proposed Update to UAX #14: Line Breaking Properties). Another concerns recent decisions related to code point labels. See: - PRI #132 (Code Point Name/Label Options) that closes some options (notably of *not* defining a new character property for referencing characters, as the result of adopting option D...), and the related - PRI #129 (Code Point Labels: Suggested Wording Details) which suggested a naming scheme for short aliases. The consistency of those decisions requires some investigation; at the least, this means that new aliases must all be individually studied and clearly justified. If there's a justification, it should come first from the released texts of the Unicode Standard itself (UTS), or its Standard Annexes (UAX), or the ISO 10646 standard, before any other considerations (including UTR, UTN, ISO standards other than ISO 10646, or any other standards). -- Philippe.