Re: Canonical block names: spaces vs. underscores
Note that the Block property is an artifact of how the committee organizes the encoding of characters. It is not very useful for processing. For that, the Script property, Script_Extensions, and others are normally much better. markus
Re: Canonical block names: spaces vs. underscores
Mathias Bynens wrote:

> Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.

I don't see the problem here. The loose-matching rule is well-defined and not complicated, either visually or algorithmically; and if Mathias has an implementation up on GitHub, he should be able to use it wherever it's needed.

-- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
Re: Canonical block names: spaces vs. underscores
2016-05-26 20:48 GMT+02:00 Mathias Bynens:

> > On 26 May 2016, at 20:07, Ken Whistler wrote:
>
> Perhaps the “Note:” in the commented header in `Blocks.txt` could be extended to point out that the ~~canonical block names~~, nay, ++preferred block aliases++ are listed in `PropertyValueAliases.txt`? That would’ve been enough to avoid the question that spawned this thread.

I'd say that the "preferred block aliases" should be stable and always in the first entry. The last entry should be the unabbreviated version preferred for display. It need not be stable: it may change over time, and applications are free to use better display names, including translations. This last entry should be the one best suited to US English in a *technical* glossary, and preferably the one used in Unicode documentation and proposals; it may differ for British English or for vernacular names. For reference, though, the first entry should not change.

Note also that the first entry in the property aliases is not necessarily the most abbreviated one. There may be other aliases in the middle of the list that use shorter names, provided they don't conflict with others. There may also be special aliases used for specific lookups that match some pattern with known prefixes or suffixes (e.g. Hangul syllable types), so that another specification for that specific usage could simply drop those implied prefixes or suffixes and use even shorter aliases internally than the listed ones.

The rules for looking up aliases in PropertyAliases should be independent of the property type:

- Capitalization should be preserved, with lookups always case-sensitive, even if the listed values for a property type currently use only ASCII capital letters or only ASCII lowercase letters. The capitalization may need to be distinguished in some future version of the standard (without having to use a broken orthography to distinguish values), and we should not be using a slow UCA collator to match entries.
- Only underscores and spaces should be considered equivalent, and there will NEVER be special entries using leading or trailing underscores, pairs of underscores, or pairs of whitespace characters (all aliases are assumed to be trimmable and compressible, as in XML or HTML by default). Applications may then choose the "canonicalization" form they prefer (with underscores, or with spaces).
- Some "camelCased" bijective transform could suppress spaces and underscores, provided that the transform includes an "escaping" mechanism for case distinctions. Alternatively, we could list conforming "camelCased" aliases, from which lowercase-only aliases with ASCII hyphens could be inferred, also with a bijective transform, for use in CSS selectors.
- However, some programming languages (e.g. BASIC) make no case distinction for identifiers, and there is no easy escaping mechanism without separators such as underscores, which should also not be used in leading or trailing positions. Others give the letter case of the initial a special meaning, e.g. several AI languages use it to distinguish variables from atoms; there the escaping mechanism may need to prepend a leading underscore or some common prefix.
Re: Canonical block names: spaces vs. underscores
> On 26 May 2016, at 20:07, Ken Whistler wrote:
>
> Well, let's take an example. The entry in Blocks.txt for the Arabic Presentation Forms-A block is:
>
> FB50..FDFF; Arabic Presentation Forms-A
>
> The entry for that block in PropertyValueAliases.txt is:
>
> blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A
>
> So then which would it be? Should Blocks.txt be changed to the long preferred alias:
>
> FB50..FDFF; Arabic_Presentation_Forms_A
>
> or to the abbreviated preferred alias:
>
> FB50..FDFF; Arabic_PF_A
>
> which would be more consistent with the XML attribute and with most regex usage?

This sounds like a strawman argument (?). The long preferred alias definitely seems more suitable for a “canonical” name.

> I suppose a proposal to the UTC to further modify the UCD handling of block names could change this situation. But I'm not convinced that we shouldn't just leave things as they stand -- for stability. And then live with the complications required for scripts or other parsing algorithms that actually need to deal with Blocks.txt to either parse out block ranges (its main function) or to get usable block names (its subsidiary function).

Perhaps the “Note:” in the commented header in `Blocks.txt` could be extended to point out that the ~~canonical block names~~, nay, ++preferred block aliases++ are listed in `PropertyValueAliases.txt`? That would’ve been enough to avoid the question that spawned this thread.
Re: Canonical block names: spaces vs. underscores
2016-05-26 20:07 GMT+02:00 Ken Whistler:

> Well, let's take an example. The entry in Blocks.txt for the Arabic Presentation Forms-A block is:
>
> FB50..FDFF; Arabic Presentation Forms-A
>
> The entry for that block in PropertyValueAliases.txt is:
>
> blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A
>
> So then which would it be? Should Blocks.txt be changed to the long preferred alias:
>
> FB50..FDFF; Arabic_Presentation_Forms_A
>
> or to the abbreviated preferred alias:
>
> FB50..FDFF; Arabic_PF_A

I think that this would break parsers that expect the alias used in Blocks.txt to be directly "readable" with spaces.

My opinion is to keep Blocks.txt untouched (with spaces) as the *normative* block name: it has been part of the core standard for a long time and is in sync with the ISO standard. But we could add this normative value (with spaces) to PropertyValueAliases.txt (which ISO 10646 neither has nor needs in its standard):

blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A ; Arabic Presentation Forms-A

The other solution would be to *add* the abbreviated preferred alias in Blocks.txt:

FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A

But this could break existing Blocks.txt parsers, whereas parsers should not break when they find new aliases in PropertyValueAliases.txt.

Another solution would be to explain properly that, to look up values in PropertyValueAliases.txt, you can replace the spaces in block names with underscores, or treat underscores and spaces in the *middle* of values as equivalent (so that when aliases are rendered visually, the listed aliases can also be displayed with spaces instead of underscores).

However, it must be clear that these aliases are case-sensitive by default ("Arabic_Presentation_Forms_A" is not the same as "Arabic_presentation_forms_A" but is the same as "Arabic Presentation_Forms A"), unless the block name property is normatively declared case-insensitive (in which case the following are also aliases: "arabic_pf_a", "arabic pf a"). But adding case insensitivity has a cost, which is much higher than *only* allowing basic replacement of spaces and underscores. That replacement will work provided there are no "special" aliases starting with underscores or using pairs of underscores; I doubt ISO will use pairs of spaces in block names, which are supposed to be trimmed, with whitespace in the middle compressed.

Removing or replacing the space-separated words in block names in the UCD would break compatibility and synchronization with the ISO standard, which lists them with spaces.
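
The narrower rule described in this message (letter case preserved, spaces and underscores interchangeable, ends trimmed, runs compressed) might be sketched in Python as follows. This only illustrates the proposal made here; it is not a rule from UAX #44, and `strict_key` is a hypothetical name chosen for the example.

```python
def strict_key(name: str) -> str:
    """Case-sensitive lookup key in which underscores and spaces are equivalent.

    Hypothetical helper illustrating the proposal in this message.
    Hyphens and letter case are deliberately left untouched.
    """
    # Treat underscores as spaces, then let split() trim both ends and
    # collapse runs of whitespace before re-joining with single underscores.
    return "_".join(name.replace("_", " ").split())


# The three aliases quoted from PropertyValueAliases.txt in this thread.
aliases = ["Arabic_PF_A", "Arabic_Presentation_Forms_A", "Arabic_Presentation_Forms-A"]
index = {strict_key(alias): alias for alias in aliases}

print(strict_key("Arabic Presentation_Forms A") in index)  # True: only spaces/underscores differ
print(strict_key("Arabic_presentation_forms_A") in index)  # False: letter case differs
print(strict_key("Arabic Presentation Forms-A") in index)  # True: matches the hyphenated alias
```

Under this rule the Blocks.txt spelling is found only because the hyphenated third alias happens to be listed; the hyphen itself is never folded.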
Re: Canonical block names: spaces vs. underscores
On 5/26/2016 10:05 AM, Mathias Bynens wrote:

> > On 26 May 2016, at 17:47, Mark Davis ☕️ wrote:
> >
> > The canonical property and property value formats are in the *Alias* files.
>
> Thanks for confirming!

Well, not quite... See below.

> Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.

There's always a chance, I guess. But if we did so, we'd end up having to just invent some other more-or-less ad hoc property: Block_Name_Usable_For_Display, with the values we already have in the Blocks.txt file. Or we would have to change the format to include the block short alias as an additional field in the file, which would have its own maintenance and consistency issues. Or we would be introducing a historical inconsistency in the UCD between versions, which would *complicate* certain other scripts that parse the UCD.

> > On 26 May 2016, at 18:03, Ken Whistler wrote:
> >
> > […] "canonical block name" is not a defined term in the standard.
>
> I didn’t mean to imply it was – it’s just an English word. I meant “canonical” as in “without loose matching applied”.

Ah, but "canonical" is a very freighted word in Unicode parlance. There are 58 instances of the word "canonical" in the current version of UAX #44, Unicode Character Database. Every one of them is a term of art, and none of them means what you mean there. ;-)

What are actually in PropertyValueAliases.txt are "preferred aliases" (one "abbreviated", and one "long"), plus a few "other aliases" for various compatibility reasons.

UAX #42 follows suit. The block property is represented by the blk attribute, and the enumerated values of the blk attribute:

http://www.unicode.org/reports/tr42/#w1aac13c13c19b1

use the *abbreviated* "preferred aliases" from PropertyValueAliases.txt.

> > For enumerated properties, and especially for catalog properties such as Block and Script, the value of the property may be multi-word, and the best form to use in one context might not be exactly (as in binary string equality exact) the same as in another.
>
> That makes sense, but shouldn’t it be consistent throughout the Unicode database text files?

Well, let's take an example. The entry in Blocks.txt for the Arabic Presentation Forms-A block is:

FB50..FDFF; Arabic Presentation Forms-A

The entry for that block in PropertyValueAliases.txt is:

blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A

So then which would it be? Should Blocks.txt be changed to the long preferred alias:

FB50..FDFF; Arabic_Presentation_Forms_A

or to the abbreviated preferred alias:

FB50..FDFF; Arabic_PF_A

which would be more consistent with the XML attribute and with most regex usage? If the latter, you would end up with systematically less identifiable labels in Blocks.txt, which would make it a bit more obscure for other uses, and which would also then create ambiguities about what might be the "best" or "preferred" label for blocks for an API returning a block name -- which certainly wouldn't be the abbreviated "preferred alias".

I suppose a proposal to the UTC to further modify the UCD handling of block names could change this situation. But I'm not convinced that we shouldn't just leave things as they stand -- for stability. And then live with the complications required for scripts or other parsing algorithms that actually need to deal with Blocks.txt to either parse out block ranges (its main function) or to get usable block names (its subsidiary function).

--Ken
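
As a rough illustration of the "complications" a parsing script has to live with, here is a minimal Python sketch that reads the two line formats quoted above and joins them on a folded key. The sample data is limited to the Arabic Presentation Forms-A lines from this message; the function names and the simplified folding (lowercase, with spaces, underscores and hyphens dropped) are assumptions made for the example, not anything defined by the UCD.

```python
import re

# Sample lines copied from the two file formats quoted in this thread.
BLOCKS_SAMPLE = """\
# Start Code..End Code; Block Name
FB50..FDFF; Arabic Presentation Forms-A
"""

ALIASES_SAMPLE = """\
blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A
"""


def fold(value):
    """Simplified folding: lowercase and drop whitespace, underscores, hyphens."""
    return re.sub(r"[\s_-]+", "", value).lower()


def strip_comment(line):
    """Remove a trailing '#' comment and surrounding whitespace."""
    return line.split("#", 1)[0].strip()


def parse_blocks(text):
    """Return a list of (start, end, display_name) from Blocks.txt-style text."""
    blocks = []
    for raw in text.splitlines():
        line = strip_comment(raw)
        if not line:
            continue
        range_field, name = (field.strip() for field in line.split(";", 1))
        start, end = (int(cp, 16) for cp in range_field.split(".."))
        blocks.append((start, end, name))
    return blocks


def parse_block_aliases(text):
    """Map the folded form of every blk alias to (abbreviated, long)."""
    table = {}
    for raw in text.splitlines():
        fields = [field.strip() for field in strip_comment(raw).split(";")]
        if len(fields) < 3 or fields[0] != "blk":
            continue
        short, long_name = fields[1], fields[2]
        for alias in fields[1:]:
            table[fold(alias)] = (short, long_name)
    return table


aliases = parse_block_aliases(ALIASES_SAMPLE)
for start, end, name in parse_blocks(BLOCKS_SAMPLE):
    short, long_name = aliases[fold(name)]
    print(f"{start:04X}..{end:04X}  {name!r} -> {short}, {long_name}")
```

With this arrangement Blocks.txt keeps supplying the ranges and the display name, while the abbreviated and long preferred aliases come from PropertyValueAliases.txt, so neither file needs to change.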
Re: Canonical block names: spaces vs. underscores
> On 26 May 2016, at 17:47, Mark Davis ☕️ wrote:
>
> The canonical property and property value formats are in the *Alias* files.

Thanks for confirming! Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.

> On 26 May 2016, at 18:03, Ken Whistler wrote:
>
> […] "canonical block name" is not a defined term in the standard.

I didn’t mean to imply it was – it’s just an English word. I meant “canonical” as in “without loose matching applied”.

> See the matching rules in UAX #44:
>
> http://www.unicode.org/reports/tr44/#Matching_Rules
>
> and in particular, the matching rule for symbolic values, which applies in this case:
>
> http://www.unicode.org/reports/tr44/#UAX44-LM3

I know about loose matching, having recently implemented it (https://github.com/mathiasbynens/unicode-loose-match).

> For enumerated properties, and especially for catalog properties such as Block and Script, the value of the property may be multi-word, and the best form to use in one context might not be exactly (as in binary string equality exact) the same as in another.

That makes sense, but shouldn’t it be consistent throughout the Unicode database text files?
Re: Canonical block names: spaces vs. underscores
On 5/26/2016 1:17 AM, Mathias Bynens wrote:

> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`. However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space. Which is it? If proper canonical block names

Well, first of all, "canonical block name" is not a defined term in the standard. Unlike normalization of Unicode strings, there is no "normalization" of property values that defines a particular form as *the* canonical form to which other strings normalize.

> use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that? If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?

See the matching rules in UAX #44:

http://www.unicode.org/reports/tr44/#Matching_Rules

and in particular, the matching rule for symbolic values, which applies in this case:

http://www.unicode.org/reports/tr44/#UAX44-LM3

For enumerated properties, and especially for catalog properties such as Block and Script, the value of the property may be multi-word, and the best form to use in one context might not be exactly (as in binary string equality exact) the same as in another.

For Blocks.txt, all block names are given with spaces and with the casing conventions that would be most consistent with returning values for a block name in an API.

The property values used in PropertyValueAliases.txt, on the other hand, are systematically turned into forms that are more identifier friendly, as the typical context of use for those values is in regex expressions and the like.

There are invariant rules in place that guarantee that any new property values for properties subject to the Loose Matching Rule #3 noted above are always unique in their namespace, given the application of that matching rule.

--Ken
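
For readers who want to see what the rule cited above amounts to, UAX44-LM3 says to ignore case, whitespace, underscores, hyphens, and any initial prefix string "is". A minimal Python sketch of that comparison follows; `lm3_key` and `loose_equal` are hypothetical names, and the UAX itself remains the authority on edge cases around the "is" prefix.

```python
import re


def lm3_key(value: str) -> str:
    """Fold a symbolic value roughly as UAX44-LM3 describes (a sketch only)."""
    # Drop whitespace, underscores and hyphens, then ignore case.
    key = re.sub(r"[\s_-]+", "", value).lower()
    # Ignore an initial "is" prefix. Applied to both sides of a comparison,
    # this keeps the comparison symmetric.
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_equal(a: str, b: str) -> bool:
    """True when two spellings denote the same symbolic value under the sketch."""
    return lm3_key(a) == lm3_key(b)


# The spellings asked about in this thread.
print(loose_equal("Cyrillic Supplement", "Cyrillic_Supplement"))
print(loose_equal("Superscripts and Subscripts", "Superscripts_And_Subscripts"))
print(loose_equal("Arabic Presentation Forms-A", "Arabic_Presentation_Forms_A"))
```

All three comparisons print `True`, which is why both spellings count as the same property value.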
Re: Canonical block names: spaces vs. underscores
Mathias Bynens wrote:

> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
>
> Which is it?

It's both:

http://www.unicode.org/reports/tr44/#Matching_Symbolic

-- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
Re: Canonical block names: spaces vs. underscores
The canonical property and property value formats are in the *Alias* files.

{phone}

On May 26, 2016 06:57, "Mathias Bynens" wrote:

> > On 26 May 2016, at 10:17, Mathias Bynens wrote:
> >
> > `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
> >
> > However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
> >
> > Which is it?
> >
> > If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
> > If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?
>
> Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas `PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in addition to the underscores, the case of the `A` changed as well. Which is the canonical name?
>
> The same goes for other blocks with “and” in the name, e.g. `Miscellaneous Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc.
RE: Emoji for subdivision flags
Peter Constable replied to Karl Williamson:

>>> Now that UTR #52 has been suspended, are any *specific* alternative plans for representing subdivision flags being bandied about?
>>
>> What I'd like to know is how does one find out about such decisions in a timely manner?
>
> Watch for UTC minutes to be posted?

Apparently the key is to look at this list [1], which is up to date, and not this one [2], which isn't.

The relevant minutes are at [3]. Search for "Issue 321" and in particular look through the review comments at [4] to find out what happened to the original scope and intent of PDUTS #52.

[1] http://www.unicode.org/L2/meetings/utc-meetings.html
[2] http://www.unicode.org/consortium/utc-minutes.html
[3] http://www.unicode.org/L2/L2016/16121.htm
[4] http://www.unicode.org/review/pri321/feedback.html

-- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
Re: Canonical block names: spaces vs. underscores
> On 26 May 2016, at 10:17, Mathias Bynens wrote:
>
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
>
> Which is it?
>
> If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?

Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas `PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in addition to the underscores, the case of the `A` changed as well. Which is the canonical name?

The same goes for other blocks with “and” in the name, e.g. `Miscellaneous Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc.
Canonical block names: spaces vs. underscores
`Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.

However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.

Which is it?

If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?