Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Markus Scherer
Note that the Block property is an artifact of how the committee organizes
the encoding of characters. It is not very useful for processing. For that,
the Script property, Script_Extensions, and others are normally much better.

markus


Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Doug Ewell
Mathias Bynens wrote:

> Any chance the canonical names can be used in `Blocks.txt` as well,
> for consistency? This would simplify scripts that parse the Unicode
> database text files. 

I don't see the problem here. The loose-matching rule is well-defined
and not complicated, either visually or algorithmically; and if Mathias
has an implementation up on GitHub, he should be able to use it wherever
it's needed.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Philippe Verdy
2016-05-26 20:48 GMT+02:00 Mathias Bynens :

>
> > On 26 May 2016, at 20:07, Ken Whistler  wrote:
>
> Perhaps the “Note:” in the commented header in `Blocks.txt` could be
> extended to point out that the ~~canonical block names~~, nay, ++preferred
> block aliases++ are listed in `PropertyValueAliases.txt`? That would’ve
> been enough to avoid the question that spawned this thread.
>

I'd say that the "preferred block aliases" should be stable and always in
the first entry.

And the last entry should be the preferred, unabbreviated version for
display. It need not be stable: it may change over time, and applications
are free to use better display names, including translations. This last
entry should be the form best suited to US English in a *technical*
glossary, and preferably the one used in Unicode documentation and
proposals; it may differ for British English or for vernacular names. For
reference, though, the 1st entry should not change.

Note also that the 1st entry in property aliases is not necessarily the
most abbreviated one: there may be other aliases in the middle of the list
using shorter names, provided that they don't conflict with others; or
special aliases used for specific lookups matching some pattern with known
prefixes/suffixes (e.g. Hangul syllable types), so that another
specification specific to this usage could simply drop those implied
prefixes/suffixes, using even shorter aliases internally than the listed
aliases.

The rules for looking up aliases in PropertyAliases should be independent
of the property type:
- capitalization should be preserved (with lookups always case-sensitive,
even if the listed values for a property type are currently using only
ASCII capital letters, or only ASCII lowercase letters): the capitalization
form may need to be distinguished in some future of the standard (without
having to use a broken orthography to distinguish them), and we should not
be using a slow UCA collator to match entries.
- only underscores/spaces should be considered equivalent, and there will
NEVER be special entries using leading or trailing underscores, or pairs of
underscores, or pairs of whitespaces (all aliases are assumed to be
trimmable and compressible, like in XML or HTML by default): applications
may then choose the "canonicalization" form they prefer (with underscores,
or with spaces)
- some "camelCased" bijective transform could suppress spaces/underscores,
provided that the transform includes an "escaping" mechanism for case
distinctions; but alternatively we could also list conforming "camelCased"
aliases (from which lowercase-only aliases with ASCII hyphens could be
inferred for use in CSS selectors also with a bijective transform)
- however some programming languages (e.g. BASIC) do not have any case
distinction for identifiers (and there's no easy escaping mechanism without
using separators like underscores, which should also not be used in leading
or trailing positions), or use lettercase (of the initial) for special
meaning (e.g. in several AI languages to distinguish variables and atoms:
the escaping mechanism may need to prepend a leading underscore or some
common prefix).
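
To make the constraints in the list above concrete, here is a minimal Python sketch, assuming that "separator" means only the ASCII space and the underscore. The names `is_well_formed`, `with_underscores` and `with_spaces` are hypothetical, chosen for this illustration; nothing like them is defined by the UCD.

```python
import re

# Hypothetical check for the rules listed above: words are separated by
# exactly one space or underscore, with no leading, trailing, or doubled
# separators. Case is left untouched (lookups stay case-sensitive).
_WELL_FORMED = re.compile(r"[^ _]+(?:[ _][^ _]+)*")


def is_well_formed(alias: str) -> bool:
    """Return True if the alias is already trimmed and compressed."""
    return _WELL_FORMED.fullmatch(alias) is not None


def with_underscores(alias: str) -> str:
    """Render the alias in the identifier-style form."""
    return alias.replace(" ", "_")


def with_spaces(alias: str) -> str:
    """Render the alias in the display-style form."""
    return alias.replace("_", " ")


print(with_underscores("Cyrillic Supplement"))
print(with_spaces("Cyrillic_Supplement"))
```

Because every well-formed alias uses single separators only, an application can pick either rendering as its own "canonicalization" form and convert between the two without losing information.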


Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens

> On 26 May 2016, at 20:07, Ken Whistler  wrote:
> 
> Well, let's take an example. The entry in Blocks.txt for the Arabic 
> Presentation Forms-A block is:
> 
> FB50..FDFF; Arabic Presentation Forms-A
> 
> The entry for that block in PropertyValueAliases.txt is:
> 
> blk; Arabic_PF_A  ; Arabic_Presentation_Forms_A  ; 
> Arabic_Presentation_Forms-A
> 
> So then which would it be? Should Blocks.txt be changed to the long preferred 
> alias:
> 
> FB50..FDFF; Arabic_Presentation_Forms_A
> 
> or to the abbreviated preferred alias:
> 
> FB50..FDFF; Arabic_PF_A
> 
> which would be more consistent with the XML attribute and with most regex 
> usage?

This sounds like a strawman argument (?). The long preferred alias definitely 
seems more suitable for a ‘canonical’ name.

> I suppose a proposal to the UTC to further modify the UCD handling of block 
> names
> could change this situation. But I'm not convinced that we shouldn't just 
> leave
> things as they stand -- for stability. And then live with the complications 
> required
> for scripts or other parsing algorithms that actually need to deal with 
> Blocks.txt to
> either parse out block ranges (its main function) or to get usable block names
> (its subsidiary function).

Perhaps the “Note:” in the commented header in `Blocks.txt` could be extended 
to point out that the ~~canonical block names~~, nay, ++preferred block 
aliases++ are listed in `PropertyValueAliases.txt`? That would’ve been enough 
to avoid the question that spawned this thread.


Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Philippe Verdy
2016-05-26 20:07 GMT+02:00 Ken Whistler :

> Well, let's take an example. The entry in Blocks.txt for the Arabic
> Presentation Forms-A block is:
>
> FB50..FDFF; Arabic Presentation Forms-A
>
> The entry for that block in PropertyValueAliases.txt is:
>
> blk; Arabic_PF_A  ; Arabic_Presentation_Forms_A  ;
> Arabic_Presentation_Forms-A
>
> So then which would it be? Should Blocks.txt be changed to the long
> preferred alias:
>
> FB50..FDFF; Arabic_Presentation_Forms_A
>
> or to the abbreviated preferred alias:
>
> FB50..FDFF; Arabic_PF_A
>

I think that this would break parsers that expect the name used in
Blocks.txt to be directly "readable" with spaces. My opinion is to keep
Blocks.txt untouched (with spaces), as it has been part of the core standard
for too long (and is in sync with the ISO standard) as the *normative* block
name.

But we could add this normative value (with spaces) into
PropertyValueAliases.txt (that ISO 10646 does not have or need in its
standard):

blk; Arabic_PF_A  ; Arabic_Presentation_Forms_A  ; Arabic_Presentation_Forms-A ; Arabic Presentation Forms-A

The other solution would be to *add* the abbreviated preferred alias in
Blocks.txt:

FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A

But this could break existing Blocks.txt parsers, whereas parsers should not
choke on finding new aliases in PropertyValueAliases.txt.

Another solution would be to properly explain that, to look up values in
PropertyValueAliases.txt, you can search it by replacing spaces in block
names with underscores, or make sure that underscores and spaces in the
*middle* of values are considered equivalent (so that even if they are
rendered visually, we can also display the listed aliases using spaces
instead of underscores).

However, it must be clear that these aliases are case-sensitive by default
("Arabic_Presentation_Forms_A" is not the same as
"Arabic_presentation_forms_A" but is the same as "Arabic Presentation_Forms
A"), unless the block names property is normatively said to be
case-insensitive (in that case the following are also aliases:
"arabic_pf_a", "arabic pf a"). But adding case insensitivity has a cost,
which is much higher than *only* allowing basic replacements of spaces and
underscores. This will work provided that there are no "special" aliases
starting with underscores or using pairs of underscores; I doubt ISO will
use pairs of spaces in block names, which are supposed to be trimmed, with
whitespace in the middle compressed.
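
The comparison described in the paragraph above could be sketched roughly as follows in Python. This is only an illustration of the space/underscore equivalence with case sensitivity as the default; the names `fold_separators` and `same_alias`, and the optional `ignore_case` switch, are assumptions made for the example and do not correspond to anything in the UCD.

```python
import re

# Runs of ASCII spaces and underscores are treated as one separator.
_SEPARATORS = re.compile(r"[ _]+")


def fold_separators(alias: str) -> str:
    """Trim the alias, then collapse separator runs to one underscore."""
    return _SEPARATORS.sub("_", alias.strip(" _"))


def same_alias(a: str, b: str, ignore_case: bool = False) -> bool:
    """Compare two aliases; case-sensitive unless ignore_case is set."""
    ka, kb = fold_separators(a), fold_separators(b)
    if ignore_case:
        # Hypothetical opt-in, for the "normatively case-insensitive" case.
        ka, kb = ka.casefold(), kb.casefold()
    return ka == kb


print(same_alias("Arabic_Presentation_Forms_A", "Arabic Presentation_Forms A"))
print(same_alias("Arabic_Presentation_Forms_A", "Arabic_presentation_forms_A"))
```

The default path only touches two ASCII characters, which is the cheap behaviour argued for above; the case-insensitive path is kept separate so that its extra cost is paid only when asked for.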

Removing or replacing the space-separated words in block names in the UCD
would break compatibility and synchronization with the ISO standard,
which lists them with spaces.


Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Ken Whistler


On 5/26/2016 10:05 AM, Mathias Bynens wrote:

>> On 26 May 2016, at 17:47, Mark Davis ☕️  wrote:
>>
>> The canonical property and property value formats are in the *Alias* files.
>
> Thanks for confirming!


Well, not quite... See below.



> Any chance the canonical names can be used in `Blocks.txt` as well, for
> consistency? This would simplify scripts that parse the Unicode database
> text files.


There's always a chance, I guess. But if we did so, we'd end up having to
just invent some other more-or-less ad hoc property:
Block_Name_Usable_For_Display, with the values we already have in the
Blocks.txt file. Or we would have to change the format to include the block
short alias as an additional field in the file, which would have its own
maintenance and consistency issues. Or we would be introducing a historical
inconsistency in the UCD between versions, which would *complicate* certain
other scripts that parse the UCD.





>> On 26 May 2016, at 18:03, Ken Whistler  wrote:
>>
>> […] "canonical block name" is not a defined term in the standard.
>
> I didn’t mean to imply it was — it’s just an English word. I meant
> “canonical” as in “without loose matching applied”.


Ah, but "canonical" is a very freighted word in Unicode parlance. There are
58 instances of the word "canonical" in the current version of UAX #44,
Unicode Character Database. Every one of them is a term of art, and none of
them means what you mean there. ;-)


What are actually in PropertyValueAliases.txt are "preferred aliases" (one
"abbreviated", and one "long"), plus a few "other aliases" for various
compatibility reasons.


UAX #42 follows suit. The block property is represented by the blk
attribute, and the enumerated values of the blk attribute:

http://www.unicode.org/reports/tr42/#w1aac13c13c19b1

use the *abbreviated* "preferred aliases" from PropertyValueAliases.txt.
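
For readers who want to pull the abbreviated and long preferred aliases out of a PropertyValueAliases.txt entry, a small Python sketch may help. It assumes the semicolon-separated layout shown in this thread for `blk` entries (property, abbreviated alias, long alias, then any other aliases) and does not attempt to handle any property whose lines are laid out differently. The names `ValueAliases` and `parse_alias_line` are invented for the example.

```python
from typing import NamedTuple, Optional, Tuple


class ValueAliases(NamedTuple):
    prop: str                # e.g. "blk"
    short: str               # abbreviated preferred alias
    long: str                # long preferred alias
    others: Tuple[str, ...]  # any further aliases on the line


def parse_alias_line(line: str) -> Optional[ValueAliases]:
    """Parse one entry; return None for comments and blank lines."""
    data = line.split("#", 1)[0].strip()  # drop trailing comments
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 3:
        return None  # not in the assumed layout
    return ValueAliases(fields[0], fields[1], fields[2], tuple(fields[3:]))


entry = parse_alias_line(
    "blk; Arabic_PF_A  ; Arabic_Presentation_Forms_A  ; Arabic_Presentation_Forms-A"
)
print(entry.short, entry.long, entry.others)
```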




>> For enumerated properties, and especially for catalog properties such as
>> Block and Script, the value of the property may be multi-word, and the
>> best form to use in one context might not be exactly (as in binary string
>> equality exact) the same as in another.
>
> That makes sense, but shouldn’t it be consistent throughout the Unicode
> database text files?


Well, let's take an example. The entry in Blocks.txt for the Arabic 
Presentation Forms-A block is:


FB50..FDFF; Arabic Presentation Forms-A

The entry for that block in PropertyValueAliases.txt is:

blk; Arabic_PF_A  ; Arabic_Presentation_Forms_A  ; Arabic_Presentation_Forms-A


So then which would it be? Should Blocks.txt be changed to the long 
preferred alias:


FB50..FDFF; Arabic_Presentation_Forms_A

or to the abbreviated preferred alias:

FB50..FDFF; Arabic_PF_A

which would be more consistent with the XML attribute and with most regex
usage? If the latter, you would end up with systematically less
identifiable labels in Blocks.txt, which would make it a bit more obscure
for other uses, and which would also then create ambiguities about what
might be the "best" or "preferred" label for blocks for an API returning a
block name -- which certainly wouldn't be the abbreviated "preferred
alias".


I suppose a proposal to the UTC to further modify the UCD handling of block
names could change this situation. But I'm not convinced that we shouldn't
just leave things as they stand -- for stability. And then live with the
complications required for scripts or other parsing algorithms that
actually need to deal with Blocks.txt to either parse out block ranges (its
main function) or to get usable block names (its subsidiary function).
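
Both functions of Blocks.txt mentioned above can be sketched in a few lines of Python. The example assumes data lines of the form shown earlier in this message (`FB50..FDFF; Arabic Presentation Forms-A`) and `#` comments; `parse_blocks`, `block_name` and the two-line `SAMPLE` excerpt are illustrative names and data, not part of any published tool.

```python
import bisect
import re
from typing import List, Optional, Tuple

# One data line looks like: "FB50..FDFF; Arabic Presentation Forms-A"
_LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+?)\s*$")

Block = Tuple[int, int, str]  # (first code point, last code point, name)


def parse_blocks(text: str) -> List[Block]:
    """Main function of the file: extract the block ranges."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # skip comments and blanks
        if not line:
            continue
        m = _LINE.match(line)
        if m:
            blocks.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return sorted(blocks)


def block_name(blocks: List[Block], cp: int) -> Optional[str]:
    """Subsidiary function: a usable block name for a code point."""
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return None  # code point is outside every listed block


SAMPLE = """\
# Blocks.txt (two-line excerpt for illustration)
0500..052F; Cyrillic Supplement
FB50..FDFF; Arabic Presentation Forms-A
"""

blocks = parse_blocks(SAMPLE)
print(block_name(blocks, 0xFB50))
```

The name returned is the display-style one, with spaces, exactly as it appears in the file; mapping it to an alias is a separate step.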

--Ken








Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens

> On 26 May 2016, at 17:47, Mark Davis ☕️  wrote:
> 
> The canonical property and property value formats are in the *Alias* files.

Thanks for confirming!

Any chance the canonical names can be used in `Blocks.txt` as well, for 
consistency? This would simplify scripts that parse the Unicode database text 
files.

> On 26 May 2016, at 18:03, Ken Whistler  wrote:
> 
> […] "canonical block name" is not a defined term in the standard.

I didn’t mean to imply it was — it’s just an English word. I meant “canonical” 
as in “without loose matching applied”.

> See the matching rules in UAX #44:
> 
> http://www.unicode.org/reports/tr44/#Matching_Rules
> 
> and in particular, the matching rule for symbolic values, which applies in 
> this case:
> 
> http://www.unicode.org/reports/tr44/#UAX44-LM3

I know about loose matching, having recently implemented it 
(https://github.com/mathiasbynens/unicode-loose-match).

> For enumerated properties, and especially for catalog properties such as 
> Block and Script,
> the value of the property may be multi-word, and the best form to use in one 
> context might
> not be exactly (as in binary string equality exact) the same as in another.

That makes sense, but shouldn’t it be consistent throughout the Unicode 
database text files?


Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Ken Whistler



On 5/26/2016 1:17 AM, Mathias Bynens wrote:

> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks
> such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt`
> (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this
> block as `Cyrillic_Supplement`, with an underscore instead of a space.
>
> Which is it?
>
> If proper canonical block names


Well, first of all, "canonical block name" is not a defined term in the
standard. Unlike normalization of Unicode strings, there is no
"normalization" of property values that defines a particular form as *the*
canonical form to which other strings normalize.



> use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt`
> reflect that?
> If proper canonical block names use underscores instead of spaces, why
> doesn’t `Blocks.txt` reflect that?





See the matching rules in UAX #44:

http://www.unicode.org/reports/tr44/#Matching_Rules

and in particular, the matching rule for symbolic values, which applies in
this case:


http://www.unicode.org/reports/tr44/#UAX44-LM3

For enumerated properties, and especially for catalog properties such as
Block and Script, the value of the property may be multi-word, and the best
form to use in one context might not be exactly (as in binary string
equality exact) the same as in another.

For Blocks.txt, all block names are given with spaces and with the casing
conventions that would be most consistent with returning values for a block
name in an API. The property values used in PropertyValueAliases.txt, on
the other hand, are systematically turned into forms that are more
identifier friendly, as the typical context of use for those values is in
regular expressions and the like.

There are invariant rules in place that guarantee that any new property
values for properties subject to the Loose Matching Rule #3 noted above are
always unique in their namespace, given the application of that matching
rule.
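
As a rough illustration of the symbolic-value matching rule (UAX44-LM3) cited above, the following Python sketch folds a value by ignoring case, whitespace, underscores, hyphens, and an initial "is" prefix. The function names `loose_key` and `loose_match` are made up for this example, and the sketch is not taken from any implementation mentioned in this thread.

```python
import re

# Characters the rule says to ignore: whitespace, "_" and "-".
_IGNORED = re.compile(r"[\s_\-]+")


def loose_key(value: str) -> str:
    """Fold a symbolic value for comparison (sketch of UAX44-LM3)."""
    key = _IGNORED.sub("", value).lower()
    # The rule also ignores an initial "is" prefix. It is stripped from
    # both sides of a comparison, so the folding stays consistent.
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_match(a: str, b: str) -> bool:
    """True if the two values are equivalent under the folding above."""
    return loose_key(a) == loose_key(b)


print(loose_match("Cyrillic Supplement", "Cyrillic_Supplement"))
print(loose_match("Superscripts and Subscripts", "Superscripts_And_Subscripts"))
```

Under this folding the spelling in Blocks.txt and the spelling in PropertyValueAliases.txt compare equal, including the "and" versus "And" difference.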

--Ken





Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Doug Ewell
Mathias Bynens wrote:

> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists
> blocks such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt`
> (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to
> this block as `Cyrillic_Supplement`, with an underscore instead of a
> space.
>
> Which is it?

It's both:

http://www.unicode.org/reports/tr44/#Matching_Symbolic

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mark Davis ☕️
The canonical property and property value formats are in the *Alias* files.

{phone}
On May 26, 2016 06:57, "Mathias Bynens"  wrote:

>
> > On 26 May 2016, at 10:17, Mathias Bynens  wrote:
> >
> > `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists
> blocks such as `Cyrillic Supplement`.
> >
> > However, `PropertyValueAliases.txt` (
> http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to
> this block as `Cyrillic_Supplement`, with an underscore instead of a space.
> >
> > Which is it?
> >
> > If proper canonical block names use spaces instead of underscores, why
> doesn’t `PropertyValueAliases.txt` reflect that?
> > If proper canonical block names use underscores instead of spaces, why
> doesn’t `Blocks.txt` reflect that?
> >
>
> Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas
> `PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in
> addition to the underscores, the case of the `A` changed as well. Which is
> the canonical name?
>
> The same goes for other blocks with “and” in the name, e.g. `Miscellaneous
> Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc.
>


RE: Emoji for subdivision flags

2016-05-26 Thread Doug Ewell
Peter Constable replied to Karl Williamson:

>>> Now that UTR #52 has been suspended, are any *specific* alternative
>>> plans for representing subdivision flags being bandied about?
>>
>> What I'd like to know is how does one find out about such decisions
>> in a timely manner?
>
> Watch for UTC minutes to be posted?

Apparently the key is to look at this list [1], which is up to date, and
not this one [2], which isn't.

The relevant minutes are at [3]. Search for "Issue 321" and in
particular look through the review comments at [4] to find out what
happened to the original scope and intent of PDUTS #52.

[1] http://www.unicode.org/L2/meetings/utc-meetings.html
[2] http://www.unicode.org/consortium/utc-minutes.html
[3] http://www.unicode.org/L2/L2016/16121.htm
[4] http://www.unicode.org/review/pri321/feedback.html

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens

> On 26 May 2016, at 10:17, Mathias Bynens  wrote:
> 
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such 
> as `Cyrillic Supplement`.
> 
> However, `PropertyValueAliases.txt` 
> (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this 
> block as `Cyrillic_Supplement`, with an underscore instead of a space.
> 
> Which is it?
> 
> If proper canonical block names use spaces instead of underscores, why 
> doesn’t `PropertyValueAliases.txt` reflect that? 
> If proper canonical block names use underscores instead of spaces, why 
> doesn’t `Blocks.txt` reflect that?
> 

Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas 
`PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in 
addition to the underscores, the case of the `A` changed as well. Which is the 
canonical name?

The same goes for other blocks with “and” in the name, e.g. `Miscellaneous 
Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc.


Canonical block names: spaces vs. underscores

2016-05-26 Thread Mathias Bynens
`Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such 
as `Cyrillic Supplement`.

However, `PropertyValueAliases.txt` 
(http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this 
block as `Cyrillic_Supplement`, with an underscore instead of a space.

Which is it?

If proper canonical block names use spaces instead of underscores, why doesn’t 
`PropertyValueAliases.txt` reflect that? 
If proper canonical block names use underscores instead of spaces, why doesn’t 
`Blocks.txt` reflect that?