Re: searching for PUA characters

2011-08-26 Thread Robert Abel
Hi Lorna,

On 2011/08/25 22:17, Lorna Priest wrote:
 I suppose what I'd like is to be able to identify beginning and ending
 codepoints to search for, such as F130..F32F or something along that
 line. 

You could use jEdit to search within a directory for \p{Co}. This
would match only the range \uE000-\uF8FF, not every PUA character
there is. However, it might be adequate for your job.
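
If a specific sub-range such as F130..F32F is what matters, a small script is another option. The following is only a sketch in Python, assuming the files are UTF-8 text with a .txt extension; the function names and the default bounds are made up for illustration.

```python
from pathlib import Path

def find_in_range(text, first=0xE000, last=0xF8FF):
    """Return (offset, code point) pairs for characters in first..last."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if first <= ord(ch) <= last]

def scan_directory(root, first=0xE000, last=0xF8FF):
    """Yield (path, offset, code point) for every hit under root.

    Assumption: the files of interest are UTF-8 and end in .txt.
    """
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="replace")
        for offset, cp in find_in_range(text, first, last):
            yield path, offset, cp
```

Calling find_in_range(text, 0xF130, 0xF32F) narrows the search to the sub-range mentioned above.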

Regards,

Robert



PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Philippe Verdy
2011/8/26  announceme...@unicode.org:
 The Unicode Technical Committee has posted a new issue for public review and
 comment. Details are on the following web page:

    http://www.unicode.org/review/pri202/

 Review periods for the new items close on October 24, 2011.

 Please see the page for links to discussion and relevant documents. Briefly,
 the new issue is:

 PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

Isn't there an intersection between NameAliases.txt proposed in
PRI202, and the informational table defined for UTR #25 at
http://www.unicode.org/Public/math/revision-12/MathClassEx-12.txt
which also lists other name aliases for other standards ?

Couldn't there be a way to merge those lists ?

It would have the advantage of suppressing those names from the
proposed table for UTR #25 (characters used in Mathematical
notations).

In the merged name aliases table, we could as well include :
- SGML/HTML/XML character entity names (and some standardized synonyms) ?
- Postscript names (from AGL), also used in the name table of
TrueType/OpenType fonts
- possibly even their Postscript numeric ids (the first 256 names
from the AGL list are not even stored in fonts, where they are bound
only by string id).
- other names from candidate standards ?

Do names defined in NameAliases.txt have to be globally unique across
all supported standards (each one being assigned a specific value for
the new type field added in NameAliases.txt)? For me it's just
enough that they are unambiguous within the context of the standard
where they are looked up to find their UCS codepoints. Not all these
names have to be supported simultaneously.

As well, the name aliases should support named character sequences for
these other standards.

-- Philippe.




Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Ken Whistler

On 8/26/2011 3:13 PM, Philippe Verdy wrote:

Isn't there an intersection between NameAliases.txt proposed in
PRI202, and the informational table defined for UTR #25 at
http://www.unicode.org/Public/math/revision-12/MathClassEx-12.txt
which also lists other name aliases for other standards ?


No.



Couldn't there be a way to merge those lists ?


No, there isn't. They have completely different statuses.
NameAliases.txt is a normative part of the versioned UCD
and is used as part of the definition of the normative namespace
for Unicode character names. MathClassEx.txt is not part of
the UCD, has no normative status for the Unicode Standard, and
is associated with a UTR whose versioning is not synchronized
with the Unicode Standard.



It would have the advantage of suppressing those names from the
proposed table for UTR #25 (characters used in Mathematical
notations).


Which would be a disadvantage, actually, because it would remove them from
the context where they are useful.



In the merged name aliases table, we could as well include :


"we could as well include..." are dangerous words here. Going encyclopedic
is *completely* at odds with the normative intention of NameAliases.txt.


- SGML/HTML/XML character entity names (and some standardized synonyms) ?
- Postscript names (from AGL), also used in the name table of
TrueType/OpenType fonts
- possibly even their Postscript numeric ids (the first 256 names
from the AGL list are not even stored in fonts, where they are bound
only by string id).
- other names from candidate standards ?


No to all of those.



Do names defined in NameAliases.txt have to be globally unique across
all supported standards (each one being assigned a specific value for
the new type field added in NameAliases.txt)?


They have to be globally unique within the Unicode namespace, which is the
whole point.


  For me it's just
enough that they are unambiguous within the context of the standard
where they are looked up to find their UCS codepoints. Not all these
names have to be supported simultaneously.


That is a misunderstanding of the current use of the file, as well as of the
proposed extension to the file.



As well, the name aliases should support named character sequences for
these other standards.


No they should not.

--Ken




Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Philippe Verdy
2011/8/27 Ken Whistler k...@sybase.com:
 On 8/26/2011 3:13 PM, Philippe Verdy wrote:

 Isn't there an intersection between NameAliases.txt proposed in
 PRI202, and the informational table defined for UTR #25 at
 http://www.unicode.org/Public/math/revision-12/MathClassEx-12.txt
 which also lists other name aliases for other standards ?

 No.


 Couldn't there be a way to merge those lists ?

 No, there isn't. They have completely different statuses.
 NameAliases.txt is a normative part of the versioned UCD
 and is used as part of the definition of the normative namespace
 for Unicode character names. MathClassEx.txt is not part of
 the UCD, has no normative status for the Unicode Standard, and
 is associated with a UTR whose versioning is not synchronized
 with the Unicode Standard.


 It would have the advantage of suppressing those names from the
 proposed table for UTR #25 (characters used in Mathematical
 notations).

 Which would be a disadvantage, actually, because it would remove them from
 the context where they are useful.


 In the merged name aliases table, we could as well include :

 "we could as well include..." are dangerous words here. Going encyclopedic
 is *completely* at odds with the normative intention of NameAliases.txt.

Your statement then contradicts what PRI 202 says:
"the intent is to add various standard and de facto aliases for
control characters, which have no names defined for them in the
Unicode Standard, as well as various character abbreviations which are
in widespread use."

It explicitly links the Unicode standard with others, at least by
reference. If these aliases are to be ALL unique in the UCS namespace,
this means that it will permanently link those standards to the UCS.

Maybe it will be good for other standards that are now stable (or
frozen and kept for historical reasons, this is the case of the
standard Postscript namespace, frozen now in the AGL and in the
PostScript's standardEncoding, for use in TrueType, OpenType, and
PDF).

Yes I admit that the Postscript namespace is a bit different: it is
glyph-based rather than character-based, which also means that several
UCS characters may map by default to the same glyph name. But one of
those characters is still considered as the main one (for example the
space glyph name is normally mapped from U+0020, and from U+00A0,
but the first one is usually used by default when performing the
reverse mapping, if there's no other disambiguating context).
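
A rough sketch of that many-to-one mapping and its default reverse lookup, in Python. The small table holds only the example given above plus one extra entry; it is not a copy of the AGL, and the names GLYPH_NAMES, glyph_name and reverse_lookup are invented for the illustration.

```python
# Illustrative subset only: two UCS code points sharing one glyph name.
# Order matters: the first code point listed for a name is the default.
GLYPH_NAMES = [
    (0x0020, "space"),
    (0x00A0, "space"),
    (0x0041, "A"),
]

def glyph_name(code_point):
    """Forward mapping: code point -> glyph name (None if unmapped)."""
    for cp, name in GLYPH_NAMES:
        if cp == code_point:
            return name
    return None

def reverse_lookup(name):
    """Reverse mapping: glyph name -> default code point (first listed)."""
    for cp, glyph in GLYPH_NAMES:
        if glyph == name:
            return cp
    return None
```

Both U+0020 and U+00A0 resolve to "space" here, while the reverse lookup of "space" returns U+0020 because it is listed first.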

A similar case occurs with the GSM standard encoding (that does not
make, for example, distinctions between LATIN CAPITAL LETTER A,
CYRILLIC CAPITAL LETTER A, and GREEK CAPITAL LETTER ALPHA), as well as
in many legacy encodings that were also glyph-based and defined with
something else than a chart of representative glyphs (found in the
/MAPPINGS subdirectory, a sister to the /UNIDATA directory used by
the UCD).

Then why do you think, in the PRI 202 that some standards would have
their character names becoming part of the UCS namespace ? They could
remain as well informative, and we could have another informative
datafile (in the MAPPINGS subdirectory) to reference those standards
only informatively, without introducing them in the UCD...

For example, the ISO 6429 names proposed for addition don't have to be
a normative part of the UCD; they could remain informational as well,
defined outside of it. They are not (and should not be) needed to
conformingly implement the UCS and Unicode algorithms, unless the
Unicode standard really wants to permanently bind the ISO 6429
standard, possibly against the intent of the authors of this standard.
Was there such formal request from the ISO standard maintainers, and
an agreed policy ?

-- Philippe.



Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Ken Whistler

On 8/26/2011 5:01 PM, Philippe Verdy wrote:

"we could as well include..." are dangerous words here. Going encyclopedic
is *completely* at odds with the normative intention of NameAliases.txt.

Your statement then contradicts what PRI 202 says:
"the intent is to add various standard and de facto aliases for
control characters, which have no names defined for them in the
Unicode Standard, as well as various character abbreviations which are
in widespread use."


No, it does not, because you have conveniently omitted the next paragraph of
the PRI, which explains the context of use:

"Because NameAliases.txt is used as part of the input which enforces
name uniqueness for the Unicode character namespace, adding aliases for
control codes and commonly used abbreviations for characters would
prevent accidental name collisions in the future for character name
matches in implementations such as regular expressions."





It explicitly links the Unicode standard with others, at least by
reference.


No, it does not.


If these aliases are to be ALL unique in the UCS namespace,
this means that it will permanently link those standards to the UCS.


No, it will not. Only ISO 6429, which is *already* de facto linked to
the UCS for aliases for C0 and C1 control codes.



Maybe it will be good for other standards that are now stable (or
frozen and kept for historical reasons, this is the case of the
standard Postscript namespace, frozen now in the AGL and in the
PostScript's standardEncoding, for use in TrueType, OpenType, and
PDF).


Well, conceivably it could be good for some other standard, but it would
certainly not be good for the Unicode Standard to pollute the unique
namespace with an encyclopedic listing of names of arbitrary entities.



Yes I admit that the Postscript namespace is a bit different: it is
glyph-based rather than character-based, which also means that several
UCS characters may map by default to the same glyph name.


And I think we can stop right there. The problems are manifest.


Then why do you think, in the PRI 202 that some standards would have
their character names becoming part of the UCS namespace ?


Because by *definition* adding an entry to NameAliases.txt adds it to
the Unicode namespace. That is how the file is designed.


They could
remain as well informative, and we could have another informative
datafile (in the MAPPINGS subdirectory) to reference those standards
only informatively, without introducing them in the UCD...


That is out of scope for this PRI, which is specifically about additions
to NameAliases.txt, to prevent the possibility of future name collisions
such as U+1F514 BELL with the ISO 6429 control function name BELL.
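
To make the collision concrete, here is a minimal sketch in Python of a single namespace that refuses a second binding for the same name. The class and its methods are hypothetical, not anything the UCD tooling provides; only the BELL example comes from the discussion above.

```python
class NameCollision(Exception):
    """Raised when a name is already bound to a different code point."""

class Namespace:
    """One shared namespace for character names and aliases (illustrative)."""

    def __init__(self):
        self._by_name = {}

    def add(self, code_point, name):
        """Bind name to code_point, rejecting a clash with another code point."""
        existing = self._by_name.get(name)
        if existing is not None and existing != code_point:
            raise NameCollision(
                "%s is already U+%04X, cannot also be U+%04X"
                % (name, existing, code_point))
        self._by_name[name] = code_point

    def lookup(self, name):
        """Return the single code point bound to name (KeyError if unknown)."""
        return self._by_name[name]
```

With this sketch, adding "BELL" for U+1F514 and then trying to add "BELL" for U+0007 raises NameCollision, which is the kind of clash a name match in a regex cannot tolerate.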



For example, the ISO 6429 names proposed for addition don't have to be
a normative part of the UCD; they could remain informational as well,
defined outside of it.


No, they need to become a normative part of the Unicode namespace. That
is *precisely* the problem that the PRI is addressing.


They are not (and should not be) needed to
conformingly implement the UCS and Unicode algorithms, unless the
Unicode standard really wants to permanently bind the ISO 6429
standard, possibly against the intent of the authors of this standard.


It has *nothing* to do with the intent of the authors of ISO 6429. It
has to do with the implementation requirements of users of the Unicode
Standard, and in particular for regex. Perl and other regex users do not
want a name match in a Unicode regex expression to be ambiguous.


Was there such formal request from the ISO standard maintainers, and
an agreed policy ?


It has nothing to do with ISO standard maintainers.

And yes, there was a formal request to do something about this problem,
but it came from one of the maintainers of Perl.

--Ken




Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Benjamin M Scarborough
Are name aliases exempted from the normal character naming conventions? I ask 
because four of the entries have words that begin with numbers.

008E;SINGLE-SHIFT 2;control
008F;SINGLE-SHIFT 3;control
0091;PRIVATE USE 1;control
0092;PRIVATE USE 2;control
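
For anyone who wants to check the rest of the draft file the same way, a quick sketch in Python. It assumes the semicolon-separated code;alias;type layout of the lines above and treats spaces and hyphens as word separators; the function name is made up.

```python
ENTRIES = """\
008E;SINGLE-SHIFT 2;control
008F;SINGLE-SHIFT 3;control
0091;PRIVATE USE 1;control
0092;PRIVATE USE 2;control
"""

def digit_initial_aliases(data):
    """Return (code point, alias) pairs where some word starts with a digit."""
    hits = []
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split(";")
        code, alias = fields[0].strip(), fields[1].strip()
        # Assumption: words are separated by spaces or hyphens.
        words = alias.replace("-", " ").split()
        if any(word[0].isdigit() for word in words):
            hits.append((int(code, 16), alias))
    return hits
```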

—Ben Scarborough




Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Asmus Freytag
I agree with Ken that Philippe's suggestion of conflating the 
annotations for mathematical use with formal Unicode name aliases is a 
non-starter. The former exist to help mathematicians identify symbols in 
Unicode, when they know their name from entity lists. The latter are 
designed to allow programmers to support identifiers that match existing 
usage -- mainly for characters for which there currently is not any well 
defined ID, or for characters for which their abbreviated name is their 
de-facto name.


In a limited number of cases, that would lead to multiple aliases for 
the same character. The ideal is, as always, to have single identifiers 
per character, where possible. In a few exceptional cases, allowing 
alternate IDs via the NameAlias technique is of such overwhelming
practical use as to support an exception.


Aliases come from the same namespace as character names, and must be 
unique, so that they can be used to unambiguously identify a character. 
They are intended to be used in programmatic interfaces, for example 
regular expressions. Adding redundant identifiers comes at a cost: all 
implementations have to rev their name tables, and using recently added 
aliases might not be portable until all implementations have caught up. 
That's why proposals to add additional aliases to any *existing* 
character should have to pass a really high bar. (I find the rationale 
for this initial expansion well thought out and defensible - leaving 
the control codes unnamed in 10646 has proven problematic to implementers).


There's no strict limit to *informative* aliases for characters, nor is 
there a uniqueness requirement. If there are important real world 
designations under which certain characters are known, they could be 
documented with informative aliases. These informative aliases are then 
available to user interface designers who wish to support a search for 
character by name feature. Unlike the case for program source code, 
such interfaces can handle multiple hits for the same name - by 
presenting a list, for example.
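
A small sketch of that difference, in Python. The alias table is invented purely for illustration (it is not taken from any Unicode data file), as are the function names; the point is only that a search interface can return every match instead of requiring a unique one.

```python
from collections import defaultdict

# Hypothetical informative aliases: one name may point at several characters.
INFORMATIVE = [
    ("dash", 0x2013),
    ("dash", 0x2014),
    ("tick", 0x2713),
]

def build_index(pairs):
    """Map each lower-cased alias to the list of code points that carry it."""
    index = defaultdict(list)
    for name, cp in pairs:
        index[name.lower()].append(cp)
    return index

def search(index, name):
    """Return every match, so a UI can present the whole list to the user."""
    return list(index.get(name.lower(), []))
```

A lookup used in program source code would have to reject the two-hit case; a search box can simply show both candidates.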


Ultimately, even in this case, some annotations are better presented in 
special purpose files than informative records in the nameslist. That 
was done for mathematics. If there are other fields where there were 
established conventions for naming symbols, perhaps someone could 
provide an analogous list - but it should have no bearing on the PRI 
under consideration.


A./



Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Philippe Verdy
2011/8/27 Asmus Freytag asm...@ix.netcom.com:
 I agree with Ken that Philippe's suggestion of conflating the annotations
 for mathematical use with formal Unicode name aliases is a non-starter.

Yes, but why then add ISO 6429 alias names? What makes ISO 6429 a
better choice than another ISO standard, that you want to reject as a
non-starter option in the normative UCS namespace?

And why drop some naming rules for some of the proposed alias names,
if this namespace also has normative rules? If you want consistency,
those aliases could as well be informative only, and not part of the
UCS namespace, avoiding some of its restrictions, i.e. not defined in
the UCD itself but in a separate database.

And you did not reply to the question about the stability of the
related standard using these aliases, compared to the stability
requirement for the UCS namespace: if there's no such stability, the
normative reference in the UCD will remain only informative for the
other standard, creating possible future conflicts.



Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Philippe Verdy
2011/8/27 Ken Whistler k...@sybase.com:
 Was there such formal request from the ISO standard maintainers, and
 an agreed policy ?

 It has nothing to do with ISO standard maintainers.

 And yes, there was a formal request to do something about this problem,
 but it came from one of the maintainers of Perl.

You just replied to another question. If the request came from
maintainers of Perl, they absolutely don't need the *normative*
reference to the ISO 6429 (or any other standard than the Unicode
standard itself).

All they want is just *completeness* of the namespace, and possibly
unambiguous interpretation of these names (for example to allow
reference by a more correct name in regular expressions that would
need to match, for example, parts of these names to create coherent
subsets, which are for now incoherent due to past naming errors that
can't be corrected and for which the only solution is to add aliases).