Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-09-01 Thread Philippe Verdy
2011/9/1 Karl Williamson pub...@khwilliamson.com:
 Unicode 6.0 broke UTS #18, which since 1999 has suggested that BELL be the
 name used in regular expressions for U+0007.  In 2003, this was strengthened
 to "should be used".  The breakage occurred by requiring that BELL instead
 be the name for a different code point.  By breaking UTS #18, all
 implementations of it, including Perl's, were broken, causing real harm to
 real code and real people.  For this reason, Perl has not completely adopted
 6.0.

 Further, UTS #18 encourages implementations to do exactly what Perl did:
 "The ISO names for the control characters may be unfamiliar, ... so it is
 recommended that they be supplemented with other aliases. For example, for
 U+0009 the implementation could accept the official name CHARACTER
 TABULATION, and also the aliases HORIZONTAL TABULATION, HT, and TAB."
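
The recommendation quoted above can be sketched in Python as a name resolver that accepts several aliases for one code point. This is only an illustration: the table ALIASES and the function resolve_name are hypothetical names, and the table holds nothing beyond the four U+0009 names from the quoted text.

```python
# Hypothetical alias table: only the U+0009 names quoted above are listed.
ALIASES = {
    "CHARACTER TABULATION": 0x0009,   # the official name, per the quoted text
    "HORIZONTAL TABULATION": 0x0009,
    "HT": 0x0009,
    "TAB": 0x0009,
}

def resolve_name(name: str) -> str:
    """Return the character for a name or alias; raise KeyError if unknown."""
    key = " ".join(name.upper().split())  # ignore case and repeated spaces
    return chr(ALIASES[key])
```

With such a table, an implementation can add a further alias for a code point without renaming or removing the names already in use.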

Thanks for explaining that. So such aliases are now needed to
correct obvious errors. Well, this is not a real correction, but a
change to the new short name. The "should" that specified an alias
will now be replaced by a "must" with the new alias. This instability
should have been explained, as it was not explicit in the PRI.

 The genesis of this proposal was to prevent the Unicode Consortium from
 making this kind of mistake again.  The language in UTS #18 mentioning the
 TAB variants also dates to 2003.  I think this example makes it clear why
 more than one alias may be needed per code point.

 Of course, PRI #202 is not the only mechanism possible to achieve the needed
 goal of preventing another mishap like BELL.  But the consensus in the
 discussion about it was that it was the easiest route to get there.

Given that most of these discussions have occurred offline (not
publicly, but in private reports to the UTC, among UTC members
working on the closed unicore discussion list, or privately between
individuals), I had no idea of these discussions. But now that I'm a
UTC member, I hope I will hear about these cases earlier...

Does it justify so many new aliases at the same time?

I've not checked the history of all past versions of UAX, UTR, and UTN
(or even the text of the chapters of the main UTS)...  Are there other
cases in those past versions that this PRI should investigate and
track back?




RE: [OT] Reusing the same property (was: RE: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0)

2011-09-01 Thread William_J_G Overington
On Wednesday 31 August 2011, Doug Ewell d...@ewellic.org wrote:
  
 Coming back full circle, this is where many of the PUA protests on this list 
 come from -- some folks want to use the Unicode PUA to encode things that are 
 not characters, not even glyphs or symbols, nor anything else remotely 
 resembling the intended scope of the Unicode Standard.
 
Well, some of my ideas for which I am using the Private Use Areas have been 
banned from being discussed in this mailing list.
 
Yet that is mailing list policy, not a policy on what a person may use 
the Private Use Areas for.
 
The two are not the same.
 
There is a ban on discussing the ideas in this list, yet I am entirely free to 
use the Private Use Area to assign meanings and to publish those meanings (in 
both cases within the scope that those meanings are Private Use meanings), to 
make and publish fonts, to produce and publish pdfs as I choose, and to 
continue my research.
 
The intended scope of the Unicode Standard is something that can change with 
time.
 
Research is about progress. What is in the Unicode Standard should not be 
constrained to what was intended to be in the Unicode Standard many years ago 
when the Unicode Standard was started. Certainly, there are some things that 
cannot be changed, yet not changing those things is not the same as restricting 
what the Unicode Standard can encode in the future. There have been various 
technological developments since the Unicode Standard was started and the scope 
of the Unicode Standard has been enlarged to support those new technologies, 
for example with the encoding of the emoji and now the encoding-in-progress of 
the symbols of the Webdings font.
 
In relation to the encoding-in-progress of the symbols of the Webdings font, 
something I have wondered about is whether the encoding is for the specific 
Webdings glyphs or whether the encoding is for any representation of the same 
general concept.
 
As a particular example, please consider the character that is accessed by the 
letter P using the Webdings font. I have seen that glyph used in a gif attached 
to an email along with text suggesting helping the environment by avoiding 
printing the email unless it is considered essential to do so.
  
In another post Doug wrote as follows.
 
 I don't know what this means.  Private-use tags starting with x- cannot be 
 reliably and algorithmically parsed into subtags (just like all language tags 
 before RFC 4646), but there are no real limits to what language information 
 can be conveyed in them, as you seem to imply; you can write 
 "x-navi-as-spoken-in-hometree-on-pandora" if you like.
  
I am using x-y as a language tag for some of my research. For example, there 
could be a database table for x-y and en-gb-oed sentences and another database 
table for x-y and fr sentences. One could then seek matches in the x-y fields 
in each table so as to find a link from an en-gb-oed sentence to a fr sentence. 
The method is suitable not only for French; there could be a database table for 
x-y and any other language tag, as desired.
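
A minimal Python sketch of the matching step described above, assuming each table is a list of (x-y entry, sentence) pairs. The table contents, the pivot keys "S1" and "S2", and the function name link_sentences are all hypothetical placeholders, not actual x-y data.

```python
# Hypothetical rows: (x-y pivot entry, natural-language sentence).
# "S1" and "S2" stand in for whatever the x-y field really contains.
EN_GB_OED_TABLE = [("S1", "Good morning."), ("S2", "Thank you.")]
FR_TABLE = [("S1", "Bonjour."), ("S2", "Merci.")]

def link_sentences(source_rows, target_rows):
    """Map each source sentence to the target sentence sharing its x-y entry."""
    target_by_pivot = {pivot: sentence for pivot, sentence in target_rows}
    return {
        sentence: target_by_pivot[pivot]
        for pivot, sentence in source_rows
        if pivot in target_by_pivot
    }

links = link_sentences(EN_GB_OED_TABLE, FR_TABLE)
```

The same function would serve for a table in any other language, since only the shared x-y field is compared.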
 
William Overington
 
1 September 2011
 







Language subtags (was: RE: [OT] Reusing the same property)

2011-09-01 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 2011/8/31 Doug Ewell d...@ewellic.org:
 Philippe Verdy wrote:
 the existing
 BCP 47 implementations, but that would limit the may-be future
 extension of ISO 639 to longer codes): ISO 639 could immediately say
 that it will never allocate any language code (of any length)
 starting by qa..qz.

 Not possible; 'qu' is already taken for Quechua.  And not necessary:
 'qaa' through 'qtz' are reserved.

 I said using the prefixes starting by qa..qt,

This was a direct quote from your post of Wed, 31 Aug 2011 21:58:25
+0200 (http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0456.html).
 But I'll assume it was a typo for qa..qt; you did mention this
shorter range in other posts.

 these prefixes are not
 supposed to be used alone, there must be additional letters. so this
 does not apply to qu alone (yes, assigned to the Quechua
 macrolanguage, or isn't it a language subfamily ?).

 From here on, I assume you are asking about BCP 47, not about any part
of ISO 639.  BCP 47 uses the IANA Language Subtag Registry, which uses
ISO 639 as a primary source, but adds constraints.  As one example, when
an alpha-2 code element (from 639-1) exists for a given language, a BCP
47 subtag exists only for that alpha-2 code element and not for the
corresponding alpha-3 code element.  So for French, you can only use
'fr' for French and not 'fre' (from 639-2/B) or 'fra' (from 639-2/T and
639-3).
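
The alpha-2 preference can be sketched as a small Python lookup. The mapping below is a hypothetical excerpt limited to codes mentioned in this thread; a real implementation would consult the IANA Language Subtag Registry instead.

```python
# Hypothetical excerpt: alpha-3 code elements whose language also has an
# alpha-2 code element, so only the alpha-2 form is a BCP 47 subtag.
ALPHA3_TO_ALPHA2 = {
    "fre": "fr",  # French, 639-2/B
    "fra": "fr",  # French, 639-2/T and 639-3
    "que": "qu",  # Quechua, 639-2
}

def bcp47_language_subtag(iso_code: str) -> str:
    """Return the alpha-2 subtag when one exists, else the code unchanged."""
    code = iso_code.lower()
    return ALPHA3_TO_ALPHA2.get(code, code)
```

For example, bcp47_language_subtag("fra") gives "fr", while a code with no alpha-2 counterpart in the table is returned as is.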

BCP 47 language subtags do not have prefixes.  An ISO-based language
subtag is either 2 or 3 letters long.  There is no correlation, explicit
or implicit, between a 2-letter language subtag and any 3-letter subtags
that begin with those same two letters.

ISO 639-3 does classify Quechua as a macrolanguage, but that doesn't
affect code allocation; macrolanguages are assigned code elements and
subtags just like any other language.  It is often useful to be able to
specify, say, Quechua in a tag instead of one of the many specific
varieties of Quechua, such as Chimborazo Highland Quichua or Yanahuanca
Pasco Quechua; this in fact is why the concept of macrolanguage
exists.

 But I admit that there's an additional caveat: BCP47 opens all codes
 with 5 to 8 characters to possible registration in the IANA registry.
 I have not checked if there were some registration of language tags
 starting by qa..qt in the IANA registry, but there's apparently no
 policy defined to forbid such registration.

BCP 47 language subtags of 2 and 3 letters correspond to code elements
assigned in some part of ISO 639.

ISO 639-1, as stated earlier, has assigned 'qu' to Quechua.  This is
reflected in the Registry.  I don't have a copy of 639-1 and don't know
if it reserves 'qa..qt' or any other range.  The 639-2 Web site, which
lists 639-1 allocations, doesn't mention any such reservation.

ISO 639-2 and 639-3 have defined 'qaa' through 'qtz' as Reserved for
local use, which is reflected in the Registry as Private use.  BCP 47
explains the use of these subtags as an alternative to the x-
mechanism.  One advantage, as you pointed out elsewhere, is that the
resulting tag can be parsed like a normal tag; the region 'ZW' in
qaa-ZW explicitly means Zimbabwe.
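
As a rough Python sketch of that advantage: a private-use language subtag in the 'qaa' through 'qtz' range can be recognized with a simple pattern, and the rest of the tag is split like any other. The function names are hypothetical, and the splitter is deliberately naive (it handles only a language and an optional 2-letter region, not scripts, variants or extensions).

```python
import re

def is_private_use_language(subtag: str) -> bool:
    """True for 3-letter subtags in the range 'qaa' through 'qtz'."""
    return re.fullmatch(r"q[a-t][a-z]", subtag.lower()) is not None

def split_language_region(tag: str):
    """Naive split of 'language' or 'language-REGION'; nothing else is handled."""
    parts = tag.split("-")
    language = parts[0].lower()
    region = None
    if len(parts) > 1 and re.fullmatch(r"[A-Za-z]{2}", parts[1]):
        region = parts[1].upper()
    return language, region

language, region = split_language_region("qaa-ZW")
print(language, region, is_private_use_language(language))  # prints: qaa ZW True
```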

ISO 639-2 has assigned 'que' to Quechua and ISO 639-5 has assigned 'qwe'
to Quechuan (family).  ISO 639-3 has assigned more than 50 code
elements in the non-private range beginning with 'q', many of which (but
not all) are for varieties of Quechua.  The Registry reflects all of
these assignments except 'que' (because Quechua in BCP 47 is 'qu').

 And your "not necessary" comment does not apply here either: it just
 assigns the 3-letter codes for local use, not the longer codes which
 are only reserved for the 4-letter codes, but not assigned for private
 use (and there's also no provision given in ISO 639 to protect an
 encoding space for 5-letter codes or longer, as they are now usable
 for IANA registration).

4-letter language subtags are reserved, and will remain reserved unless
and until BCP 47 is updated (via a new RFC) to make use of them.  I
don't care to speculate on their future allocation or use.

If and when language subtags of 5 to 8 letters are registered, there
will be no restriction (as far as I can tell) on subtags beginning with
'q' or any other letter or sequence.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: Language subtags (was: RE: [OT] Reusing the same property)

2011-09-01 Thread Philippe Verdy
In fact, my post about qa..qz was a 1-letter error; I only wanted to
speak about qa..qt (as I had stated previously and everywhere else in
this thread), before someone wanted to correct me.

Well, I know that this is now going off topic, because someone else
spoke about www (before that I had only spoken about the general
need for any open encoding standard that wants to be universal to
assign a private-use area).

Added to this was a desire to have a larger space than just qa..qt
union qaa..qtz, for easier (algorithmic) mapping of local-use codes
(both in ISO 639 and BCP 47).

I had also wanted to show that the x- prefix in BCP 47 makes the
language tag not parsable like generic structured tags (which are also
extensible to support locale tags, using extensions such as the one
using the u subtag defined by Unicode, mostly for the CLDR, e.g. to
encode collation options, or other locale conventions). Using the
BCP 47 x- prefix does not permit those extensions, because x- BCP 47
language tags have no structure.
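
The difference can be sketched in Python with a simplified splitter that groups subtags under the singleton that introduces them. The function name split_tag and the sample tags are illustrative assumptions, and the code checks no well-formedness rules at all; it only shows that once the x singleton appears, everything after it is opaque.

```python
def split_tag(tag: str):
    """Return (base subtags, {singleton: subtags}); simplified, no validation."""
    base, extensions = [], {}
    current = None
    for sub in tag.lower().split("-"):
        if current == "x":
            extensions["x"].append(sub)      # private use swallows the rest
        elif len(sub) == 1:
            current = sub                    # a singleton opens a new group
            extensions[current] = []
        elif current is None:
            base.append(sub)                 # language, region, etc.
        else:
            extensions[current].append(sub)
    return base, extensions

structured = split_tag("qaa-ZW-u-co-phonebk")
opaque = split_tag("x-mylang-u-co-phonebk")
```

In the first tag the u group (here a collation option) is recovered as an extension; in the second, the same subtags end up inside the private-use part and carry no defined meaning.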

And that's why I spoke about two alternatives:
- using another singleton letter q (yes, in BCP 47 only), followed by
one subtag, to create arbitrary local-use language tags, that would
still be parsable and would support the u extension mechanism
- using ranges of codes starting by qa..qt of arbitrary longer lengths
(not limited to 2-3 letters as now), which means a change both in ISO
639 (for code allocation) and BCP 47 (to restrict 5 to 8-letter codes
that are NOT freely usable for local-use, but still open for
registration, so that the IANA registry could accept a registration
of 5-8 letter codes starting by qa..qt)

I hope this summary correctly represents what I wanted to show, because
once again the intent has been misunderstood and some people on this
list were assuming things that I did not intend to request.

In fact I have not requested anything, just spoken about the existing
possibilities, that would permit an application to use a comfortable
space for its local uses that can easily remap some unusable codes to
a PUA space where it can create aliases that would be recognized
automatically by this local application as such (i.e. an alias of the
standard language code), easing the interoperability of this
application with the rest of the world, even if it needs to use
local-use codes.

Philippe.


RE: Language subtags (was: RE: [OT] Reusing the same property)

2011-09-01 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 Well, I know that this is now going off topic, because someone else

me

 spoke about www (before that I had only spoken about the general
 need for any open encoding standard that wants to be universal to
 assign a private-use area).
 
 Added to this was a desire to have a larger space than just qa..qt
 union qaa..qtz, for easier (algorithmic) mapping of local-use codes
 (both in ISO 639 and BCP 47).

'qa' through 'qt' is not reserved in BCP 47, and as far as I know it is
not reserved in 639-1.

I've since discovered that ISO 639-6 has assigned more than a hundred
code elements in the range 'qaaa' through 'qtzz', and apparently no code
elements marked private-use or user-defined, so it looks like the
current repertoire of 520 reserved code elements across all parts of ISO
639 will remain as is.
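
The figure of 520 follows from the range 'qaa' through 'qtz': 20 possible second letters (a to t) times 26 possible third letters. A short Python check, with illustrative variable names:

```python
import string

second_letters = string.ascii_lowercase[:20]   # 'a' through 't'
third_letters = string.ascii_lowercase         # 'a' through 'z'

# Every code element from 'qaa' through 'qtz', in alphabetical order.
private_use = ["q" + s + t for s in second_letters for t in third_letters]

print(len(private_use), private_use[0], private_use[-1])  # prints: 520 qaa qtz
```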

 I had also wanted to show that the x- prefix in BCP 47 makes the
 language tag not parsable like generic structured tags (which are also
 extensible to support locale tags, using extensions such as the one
 using the u subtag defined by Unicode, mostly for the CLDR, e.g. to
 encode collation options, or other locale conventions). Using the
 BCP 47 x- prefix does not permit those extensions, because x- BCP 47
 language tags have no structure.

We know that.

 And that's why I spoke about two alternatives:
 - using another singleton letter q (yes, in BCP 47 only), followed by
 one subtag, to create arbitrary local-use language tags, that would
 still be parsable and would support the u extension mechanism
 - using ranges of codes starting by qa..qt of arbitrary longer lengths
 (not limited to 2-3 letters as now), which means a change both in ISO
 639 (for code allocation) and BCP 47 (to restrict 5 to 8-letter codes
 that are NOT freely usable for local-use, but still open for
 registration, so that the IANA registry could accept a registration
 of 5-8 letter codes starting by qa..qt)

I don't see the advantage of either of these mechanisms, compared to
using the existing range 'qaa' through 'qtz'.  Is there a need for more
than 520 private-use language identifiers?

 I hope this summary correctly represents what I wanted to show, because
 once again the intent has been misunderstood and some people on this
 list were assuming things that I did not intend to request.
 
 In fact I have not requested anything, just spoken about the existing
 possibilities, that would permit an application to use a comfortable
 space for its local uses that can easily remap some unusable codes to
 a PUA space where it can create aliases that would be recognized
 automatically by this local application as such (i.e. an alias of the
 standard language code), easing the interoperability of this
 application with the rest of the world, even if it needs to use
 local-use codes.

I don't think I said you were requesting anything.  I clarified some
details about BCP 47 and ISO 639 code allocation, and addressed some
statements that others might misunderstand, such as that ISO 639 code
elements could be regarded as prefixes of others.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






RE: Language subtags (was: RE: [OT] Reusing the same property)

2011-09-01 Thread Erkki I Kolehmainen
Doug, you are right. In reference to your following statements, I checked my 
copies of 639-1 and 639-2, and I couldn't find any reference to the reserved 
codes ('qa..qt' or any other range) in 639-1, whereas 639-2 has the reserved 
codes as you stated.

Regards, Erkki
  
ISO 639-1, as stated earlier, has assigned 'qu' to Quechua.  This is
reflected in the Registry.  I don't have a copy of 639-1 and don't know
if it reserves 'qa..qt' or any other range.  The 639-2 Web site, which
lists 639-1 allocations, doesn't mention any such reservation.

ISO 639-2 and 639-3 have defined 'qaa' through 'qtz' as Reserved for
local use, which is reflected in the Registry as Private use.





Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-09-01 Thread Asmus Freytag

On 8/31/2011 11:25 PM, Philippe Verdy wrote:

2011/9/1 Karl Williamson pub...@khwilliamson.com:
But now that I'm a UTC member, I hope I will hear about these cases earlier...


Congratulations!


Does it justify so many new aliases at the same time?


No. I'm firmly with you. I support the requirement for 1 (ONE) alias for 
control codes because they don't have names, but are used in 
environments where they need a string identifier other than a code point. 
(Just like regular characters, but even more so).


I also support the requirement for 1 (ONE) short identifier, for all 
those control AND format characters for which widespread usage of such 
an abbreviation is customary. (VS-257 does not qualify).


Further, I support, on a case-by-case basis, the addition of duplicate 
aliases for reasons of compatibility. I would expect these 
compatibility requirements to be documented for each character in some sort 
of proposal document, not just a list of entries in a draft property file.


Finally, I don't support using the name of any standard, iso or 
otherwise, as a label in the new status field. It sets the wrong precedent.


I've not checked the history of all past versions of UAX, UTR, and UTN 
(or even the text of the chapters of the main UTS)... Are there other 
cases in those past versions that this PRI should investigate and 
track back?


My preference would be to start this new scheme off with a minimum of 
absolutely 100% required aliases. Anything even remotely doubtful should 
be removed for further study.


A./




Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-09-01 Thread Philippe Verdy
So now we completely agree. Thanks.






PRI#198 (Unicode 6.1 beta)

2011-09-01 Thread Philippe Verdy
2011/9/2  announceme...@unicode.org:
 Unicode Standard Annex #42, Unicode Character Database in XML, will be
 updated for Unicode 6.1. The proposed update is now available for general
 public review and comment.

 Review period for this issue closes October 24, 2011. Details are on the
 following web page: http://www.unicode.org/review/pri198/

Thanks a lot for the excellent idea of adding a summary of the
feedback received so far on this PRI (it would in fact be helpful to do
this for all existing and future PRIs).

It will certainly help focus each discussion by type, grouping
people submitting similar comments so that they also find some form of
agreement with each other. It would also reduce the administrative
work, allowing people to reference the comments made by others instead
of repeating things, or allowing them to compare the various pieces of
feedback if they can't all be accepted together (when a choice has to be made).

(When I just started reading the L2 archives, only for 2011 for now, it
immediately revealed to me the complex task of sorting all the feedback
received, or the lack of feedback on some critical subjects. Not sorting
it greatly inflates the number of items to schedule in a meeting
agenda, so much so that they may not be studied and discussed with the
time each of the submitted proposals would merit, or some items will
be discussed several times, possibly forgetting some items studied in
past meetings, and where some decisions could be further delayed after
some more investigation time.)

And it paves the way for defining the meeting agendas in UTC (rapidly
sorting things that should be corrected early if they are blocking
other items to be scheduled later: it allows a development of the
standard with more precise milestones that are easier to complete
without creating problems for the next milestones).

-- Philippe.



Fwd: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-09-01 Thread Philippe Verdy
For your information, here is a copy of a recent post I sent privately
to an existing UTC member.

===

I just posted two new pieces of feedback (using the feedback form) related to
the proposed alias names. Both are related to recent
decisions taken this year (in February) after a PRI...

One concerns NBHY, present in UAX #14 since Unicode 5.0.1, and which
contains a list of other candidate aliases, after a decision made in:
 - PRI #105 (Proposed Update to UAX #14: Line Breaking Properties).
Another concerns two recent decisions taken, related to code point labels. See:
 - PRI #132 (Code Point Name/Label Options), which closes some options
(notably that of *not* defining a new character property for referencing
characters, as the result of adopting option D...), and the related
 - PRI #129 (Code Point Labels: Suggested Wording Details), which
suggested a naming scheme for short aliases.

The consistency of those decisions requires some investigation, and at
the least this means that new aliases must all be individually studied and
clearly justified. If there's a justification, it should more
importantly come from the released texts of the Unicode Standard itself
(UTS), or its Standard Annexes (UAX), or the ISO 10646 standard,
before any other considerations (including UTR, UTN, ISO standards
other than ISO 10646, or any other standards).

-- Philippe.