Re: searching for PUA characters
Hi Lorna,

On 2011/08/25 22:17, Lorna Priest wrote:
> I suppose what I'd like is to be able to identify beginning and ending codepoints to search for, such as F130..F32F or something along that line.

You could use jEdit to search within a directory for \p{Co}. This would match the range \uE000-\uF8FF only, not every PUA character there is. However, it might be adequate for your job.

Regards,
Robert
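For the "beginning and ending codepoints" style of search asked about above, a minimal sketch in Python (not jEdit) might look like the following. The function name `find_in_ranges` and the sample string are made up for illustration; the three Private Use ranges are the ones defined by the Unicode Standard.

```python
# Private Use ranges defined by the Unicode Standard.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]


def find_in_ranges(text, ranges):
    """Return (offset, code point) pairs for characters inside any range.

    `find_in_ranges` is a hypothetical helper, not part of any library.
    """
    hits = []
    for offset, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ranges):
            hits.append((offset, cp))
    return hits


# Illustrative sample text containing a few private-use characters.
sample = "a\uf130b\uf32fc\ue000d\U000f0000"

# Search only the sub-range mentioned in the question, F130..F32F.
for offset, cp in find_in_ranges(sample, [(0xF130, 0xF32F)]):
    print(f"offset {offset}: U+{cp:04X}")
```

Passing `PUA_RANGES` instead of the single sub-range would report every private-use character, including those outside the BMP.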
PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
2011/8/26 announceme...@unicode.org:
> The Unicode Technical Committee has posted a new issue for public review and comment. Details are on the following web page: http://www.unicode.org/review/pri202/
>
> Review periods for the new items close on October 24, 2011. Please see the page for links to discussion and relevant documents. Briefly, the new issue is:
>
> PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

Isn't there an intersection between NameAliases.txt proposed in PRI202, and the informational table defined for UTR #25 at http://www.unicode.org/Public/math/revision-12/MathClassEx-12.txt which also lists other name aliases for other standards? Couldn't there be a way to merge those lists? It would have the advantage of suppressing those names from the proposed table for UTR #25 (characters used in Mathematical notations).

In the merged name aliases table, we could as well include:
- SGML/HTML/XML character entity names (and some standardized synonyms)?
- PostScript names (from AGL), also used in the name table of TrueType/OpenType fonts
- possibly even their PostScript numeric IDs (the 256 first names from the AGL list are not even stored in fonts, where they are bound only by string ID).
- other names from candidate standards?

Do names defined in NameAliases.txt have to be globally unique across all supported standards (each one being assigned a specific value for the new type field added in NameAliases.txt)? For me it's just enough that they are unambiguous within the context of the standard where they are looked up to find their UCS codepoints. Not all these names have to be supported simultaneously.

As well, the name aliases should support named character sequences for these other standards.

-- Philippe.
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/26/2011 3:13 PM, Philippe Verdy wrote:
> Isn't there an intersection between NameAliases.txt proposed in PRI202, and the informational table defined for UTR #25 at http://www.unicode.org/Public/math/revision-12/MathClassEx-12.txt which also lists other name aliases for other standards?

No.

> Couldn't there be a way to merge those lists?

No, there isn't. They have completely different statuses.

NameAliases.txt is a normative part of the versioned UCD and is used as part of the definition of the normative namespace for Unicode character names.

MathClassEx.txt is not part of the UCD, has no normative status for the Unicode Standard, and is associated with a UTR whose versioning is not synchronized with the Unicode Standard.

> It would have the advantage of suppressing those names from the proposed table for UTR #25 (characters used in Mathematical notations).

Which would be a disadvantage, actually, because it would remove them from the context where they are useful.

> In the merged name aliases table, we could as well include:

"we could as well include..." are dangerous words here. Going encyclopedic is *completely* at odds with the normative intention of NameAliases.txt.

> - SGML/HTML/XML character entity names (and some standardized synonyms)?
> - PostScript names (from AGL), also used in the name table of TrueType/OpenType fonts
> - possibly even their PostScript numeric IDs (the 256 first names from the AGL list are not even stored in fonts, where they are bound only by string ID).
> - other names from candidate standards?

No to all of those.

> Do names defined in NameAliases.txt have to be globally unique across all supported standards (each one being assigned a specific value for the new type field added in NameAliases.txt)?

They have to be globally unique within the Unicode namespace, which is the whole point.

> For me it's just enough that they are unambiguous within the context of the standard where they are looked up to find their UCS codepoints. Not all these names have to be supported simultaneously.

That is a misunderstanding of the current use of the file, as well as of the proposed extension to the file.

> As well, the name aliases should support named character sequences for these other standards.

No they should not.

--Ken
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
2011/8/27 Ken Whistler k...@sybase.com:
> On 8/26/2011 3:13 PM, Philippe Verdy wrote:
>> Isn't there an intersection between NameAliases.txt proposed in PRI202, and the informational table defined for UTR #25 at http://www.unicode.org/Public/math/revision-12/MathClassEx-12.txt which also lists other name aliases for other standards?
>
> No.
>
>> Couldn't there be a way to merge those lists?
>
> No, there isn't. They have completely different statuses.
>
> NameAliases.txt is a normative part of the versioned UCD and is used as part of the definition of the normative namespace for Unicode character names.
>
> MathClassEx.txt is not part of the UCD, has no normative status for the Unicode Standard, and is associated with a UTR whose versioning is not synchronized with the Unicode Standard.
>
>> It would have the advantage of suppressing those names from the proposed table for UTR #25 (characters used in Mathematical notations).
>
> Which would be a disadvantage, actually, because it would remove them from the context where they are useful.
>
>> In the merged name aliases table, we could as well include:
>
> "we could as well include..." are dangerous words here. Going encyclopedic is *completely* at odds with the normative intention of NameAliases.txt.

Your statement then contradicts what PRI 202 says: the intent is to add various standard and de facto aliases for control characters, which have no names defined for them in the Unicode Standard, as well as various character abbreviations which are in widespread use.

It explicitly links the Unicode standard with others, at least by reference. If these aliases are to be ALL unique in the UCS namespace, this means that it will permanently link those standards to the UCS. Maybe it will be good for other standards that are now stable (or frozen and kept for historical reasons; this is the case of the standard PostScript namespace, frozen now in the AGL and in the PostScript's standardEncoding, for use in TrueType, OpenType, and PDF).
Yes, I admit that the PostScript namespace is a bit different: it is glyph-based rather than character-based, which also means that several UCS characters may map by default to the same glyph name. But one of those characters is still considered the main one (for example, the space glyph name is normally mapped from U+0020 and from U+00A0, but the first one is usually used by default when performing the reverse mapping, if there's no other disambiguating context).

A similar case occurs with the GSM standard encoding (which does not, for example, distinguish between LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER A, and GREEK CAPITAL LETTER ALPHA), as well as in many legacy encodings that were also glyph-based and defined with something other than a chart of representative glyphs (found in the /MAPPINGS subdirectory, a sister to the /UNIDATA directory used by the UCD).

Then why do you think, in PRI 202, that some standards would have their character names become part of the UCS namespace? They could as well remain informative, and we could have another informative datafile (in the MAPPINGS subdirectory) to reference those standards only informatively, without introducing them in the UCD...

For example, the proposed ISO 6429 names don't have to be a normative part of the UCD; they could remain informational as well, defined outside of it. They are not (and should not be) needed to conformingly implement the UCS and Unicode algorithms, unless the Unicode standard really wants to permanently bind the ISO 6429 standard, possibly against the intent of the authors of this standard.

Was there such a formal request from the ISO standard maintainers, and an agreed policy?

-- Philippe.
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
On 8/26/2011 5:01 PM, Philippe Verdy wrote:
>> "we could as well include..." are dangerous words here. Going encyclopedic is *completely* at odds with the normative intention of NameAliases.txt.
>
> Your statement then contradicts what PRI 202 says: the intent is to add various standard and de facto aliases for control characters, which have no names defined for them in the Unicode Standard, as well as various character abbreviations which are in widespread use.

No, it does not, because you have conveniently omitted the next paragraph of the PRI, which explains the context of use:

"Because NameAliases.txt is used as part of the input which enforces name uniqueness for the Unicode character namespace, adding aliases for control codes and commonly used abbreviations for characters would prevent accidental name collisions in the future for character name matches in implementations such as regular expressions."

> It explicitly links the Unicode standard with others, at least by reference.

No, it does not.

> If these aliases are to be ALL unique in the UCS namespace, this means that it will permanently link those standards to the UCS.

No, it will not. Only ISO 6429, which is *already* de facto linked to the UCS for aliases for C0 and C1 control codes.

> Maybe it will be good for other standards that are now stable (or frozen and kept for historical reasons; this is the case of the standard PostScript namespace, frozen now in the AGL and in the PostScript's standardEncoding, for use in TrueType, OpenType, and PDF).

Well, conceivably it could be good for some other standard, but it would certainly not be good for the Unicode Standard to pollute the unique namespace with an encyclopedic listing of names of arbitrary entities.

> Yes, I admit that the PostScript namespace is a bit different: it is glyph-based rather than character-based, which also means that several UCS characters may map by default to the same glyph name.

And I think we can stop right there. The problems are manifest.
> Then why do you think, in PRI 202, that some standards would have their character names become part of the UCS namespace?

Because by *definition* adding an entry to NameAliases.txt adds it to the Unicode namespace. That is how the file is designed.

> They could as well remain informative, and we could have another informative datafile (in the MAPPINGS subdirectory) to reference those standards only informatively, without introducing them in the UCD...

That is out of scope for this PRI, which is specifically about additions to NameAliases.txt, to prevent the possibility of future name collisions such as U+1F514 BELL with the ISO 6429 control function name BELL.

> For example, the proposed ISO 6429 names don't have to be a normative part of the UCD; they could remain informational as well, defined outside of it.

No, they need to become a normative part of the Unicode namespace. That is *precisely* the problem that the PRI is addressing.

> They are not (and should not be) needed to conformingly implement the UCS and Unicode algorithms, unless the Unicode standard really wants to permanently bind the ISO 6429 standard, possibly against the intent of the authors of this standard.

It has *nothing* to do with the intent of the authors of ISO 6429. It has to do with the implementation requirements of users of the Unicode Standard, and in particular for regex. Perl and other regex users do not want a name match in a Unicode regular expression to be ambiguous.

> Was there such a formal request from the ISO standard maintainers, and an agreed policy?

It has nothing to do with ISO standard maintainers. And yes, there was a formal request to do something about this problem, but it came from one of the maintainers of Perl.

--Ken
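The collision described above (U+1F514 BELL versus the ISO 6429 control function name BELL) can be illustrated with a small sketch. It assumes a single table shared by character names and aliases; the function `build_namespace` is hypothetical and is not how the UCD tooling or Perl actually enforces uniqueness, and it ignores loose name matching.

```python
def build_namespace(entries):
    """Map each name to one code point, rejecting any name claimed twice.

    `build_namespace` is a hypothetical helper for illustration only.
    """
    namespace = {}
    for code_point, name in entries:
        if name in namespace and namespace[name] != code_point:
            raise ValueError(
                f"{name!r} already names U+{namespace[name]:04X}, "
                f"cannot also name U+{code_point:04X}"
            )
        namespace[name] = code_point
    return namespace


# Both uses of BELL come from the message above: the character name of
# U+1F514 and the ISO 6429 control function name for U+0007.
entries = [
    (0x1F514, "BELL"),
    (0x0007, "BELL"),
]

try:
    build_namespace(entries)
except ValueError as err:
    print(err)
```

Reserving a control-code alias in the same table before any character name can claim it is what makes a lookup by name, for example in a regular expression, resolve to exactly one code point.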
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
Are name aliases exempted from the normal character naming conventions? I ask because four of the entries have words that begin with numbers.

008E;SINGLE-SHIFT 2;control
008F;SINGLE-SHIFT 3;control
0091;PRIVATE USE 1;control
0092;PRIVATE USE 2;control

-- Ben Scarborough
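A check like the one described above could be sketched as follows. It assumes the three-field `code;alias;type` layout shown in the quoted entries and treats a "word" as a space-separated token, which is an interpretation of the naming convention rather than its official wording; the function names are made up for illustration.

```python
# The four entries quoted in the message above.
ENTRIES = """\
008E;SINGLE-SHIFT 2;control
008F;SINGLE-SHIFT 3;control
0091;PRIVATE USE 1;control
0092;PRIVATE USE 2;control
"""


def words_starting_with_digit(alias):
    """Return the space-separated words of an alias that begin with a digit."""
    return [word for word in alias.split(" ") if word[:1].isdigit()]


def flag_entries(text):
    """Return (code, alias, offending words) for each entry with such a word."""
    flagged = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, alias, _kind = line.split(";")
        bad = words_starting_with_digit(alias)
        if bad:
            flagged.append((code, alias, bad))
    return flagged


for code, alias, bad in flag_entries(ENTRIES):
    print(f"{code}: {alias} (words starting with a digit: {bad})")
```

All four quoted entries are flagged, which matches the count given in the message.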
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
I agree with Ken that Philippe's suggestion of conflating the annotations for mathematical use with formal Unicode name aliases is a non-starter.

The former exist to help mathematicians identify symbols in Unicode when they know their name from entity lists.

The latter are designed to allow programmers to support identifiers that match existing usage -- mainly for characters for which there currently is not any well-defined ID, or for characters for which their abbreviated name is their de facto name. In a limited number of cases, that would lead to multiple aliases for the same character.

The ideal is, as always, to have a single identifier per character, where possible. In a few exceptional cases, allowing alternate IDs via the NameAlias technique is of such overwhelming practical use as to justify an exception.

Aliases come from the same namespace as character names, and must be unique, so that they can be used to unambiguously identify a character. They are intended to be used in programmatic interfaces, for example regular expressions.

Adding redundant identifiers comes at a cost: all implementations have to rev their name tables, and using recently added aliases might not be portable until all implementations have caught up. That's why proposals to add additional aliases to any *existing* character should have to pass a really high bar. (I find the rationale for this initial expansion well thought out and defensible - leaving the control codes unnamed in 10646 has proven problematic to implementers.)

There's no strict limit to *informative* aliases for characters, nor is there a uniqueness requirement. If there are important real-world designations under which certain characters are known, they could be documented with informative aliases. These informative aliases are then available to user interface designers who wish to support a "search for character by name" feature. Unlike the case for program source code, such interfaces can handle multiple hits for the same name - by presenting a list, for example.

Ultimately, even in this case, some annotations are better presented in special-purpose files than as informative records in the names list. That was done for mathematics. If there are other fields where there are established conventions for naming symbols, perhaps someone could provide an analogous list - but it should have no bearing on the PRI under consideration.

A./
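The point about informative aliases tolerating multiple hits can be sketched as a search index that maps one label to a list of code points. The helper names are hypothetical, and the only data used is the "space" example mentioned earlier in the thread (a glyph name mapped from both U+0020 and U+00A0).

```python
from collections import defaultdict


def build_search_index(pairs):
    """Build a label -> list of code points index; duplicate labels are fine.

    `build_search_index` and `search` are hypothetical helpers.
    """
    index = defaultdict(list)
    for label, code_point in pairs:
        index[label.casefold()].append(code_point)
    return index


def search(index, label):
    """Return every code point known under the label (possibly none)."""
    return list(index.get(label.casefold(), []))


# Informative labels need not be unique: one label, two characters.
informative = [("space", 0x0020), ("space", 0x00A0)]
index = build_search_index(informative)

for code_point in search(index, "SPACE"):
    print(f"U+{code_point:04X}")
```

A user interface can present both results as a list, whereas a normative name lookup in source code or a regular expression needs exactly one answer.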
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
2011/8/27 Asmus Freytag asm...@ix.netcom.com:
> I agree with Ken that Philippe's suggestion of conflating the annotations for mathematical use with formal Unicode name aliases is a non-starter.

Yes, but why then add ISO 6429 alias names? What makes ISO 6429 a better choice than another ISO standard, which you want to reject as a non-starter option in the normative UCS namespace? And why drop some naming rules for some of the proposed alias names, if this namespace also has normative rules?

If you want consistency, those aliases could as well be informative only, and not part of the UCS namespace, avoiding some of its restrictions, i.e. not defined in the UCD itself but in a separate database.

And you did not reply to the question about the stability of the related standard using these aliases, compared to the stability requirement for the UCS namespace: if there's no such stability, the normative reference in the UCD will remain only informative for the other standard, creating possible future conflicts.
Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0
2011/8/27 Ken Whistler k...@sybase.com:
>> Was there such a formal request from the ISO standard maintainers, and an agreed policy?
>
> It has nothing to do with ISO standard maintainers. And yes, there was a formal request to do something about this problem, but it came from one of the maintainers of Perl.

You just replied to another question. If the request came from maintainers of Perl, they absolutely don't need the *normative* reference to ISO 6429 (or to any standard other than the Unicode standard itself). All they want is *completeness* of the namespace, and possibly unambiguous interpretation of these names (for example, to allow reference by a more correct name in regular expressions that would need to match parts of these names to create coherent subsets, which are for now incoherent due to past naming errors that can't be corrected and for which the only solution is to add aliases).