Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens

> On 7 Jun 2016, at 00:39, Nova Patch  wrote:
> 
> […] Based on my past research for Unicode Regular Expression Engines at 
> IUC38, I suspect that there might not be any regex engine that actually 
> supports syntax like Script=IsGreek as described in UAX44-LM3! If anybody 
> knows otherwise, I’d love to hear about it.

This seems like a cut-and-dried case of reality not matching the specification, 
which is not helpful in any way. The sensible thing to do is to update the 
specification accordingly, as proposed.


Re: 72 New Emoji Characters

2016-06-06 Thread Oren Watson
I see this in the list of new emoji:
   GOAL NET
• marksmanship, sport shooting, hunting
This is incorrect: a goal net would be for football or hockey, not
marksmanship.

On Mon, Jun 6, 2016 at 3:19 PM,  wrote:

> The 72 new emoji characters for Unicode 9.0 are now final, and listed in
> Emoji Recently Added. They include 7 faces, 7 people, 7 hand gestures,
> 14 plants/animals, 18 food emoji, 12 sports emoji, and a few others. The
> corresponding documentation in *UTR #51 Unicode Emoji, Version 3.0* has
> also been updated, with additional guidelines for implementers and the
> new versions of the emoji data files. These should appear on smart phones
> and other devices that support emoji once vendors have a chance to update
> them.
>
> Four of the new emoji are added to complete gender pairs. Work has already
> begun on the Version 4.0 of Unicode Emoji, with a focus on further
> enhancing gender representation, and targeted to appear in the near future.
>
> The new emoji characters will soon be available for adoption, helping
> support projects to improve language support.
>
> http://blog.unicode.org/2016/06/72-new-emoji-characters.html
>
>


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Nova Patch
On Monday, 6 June 2016, Doug Ewell wrote:
>
> Mathias Bynens wrote:
>
> > The `is` prefix doesn’t provide any functionality that would otherwise
> > be unavailable. It doesn’t add any value, yet causes incompatibility,
> > author confusion, and it increases implementation complexity.
>
> I don't see any evidence that it adds no value. Support for existing
> implementations is value.

Markus has now confirmed that ICU doesn’t support this syntax and I can
confirm that even Perl, which probably supports the most different ways to
write the same regex, doesn’t support any form of the `is` prefix for
property values when the property name is provided.

$ perl -Mutf8 -E 'say "π" =~ /\p{Script=Greek}/'
1
$ perl -Mutf8 -E 'say "π" =~ /\p{Script=IsGreek}/'
Can't find Unicode property definition "Script=IsGreek" at -e line 1.
$ perl -Mutf8 -E 'say "π" =~ /\p{Script=Is_Greek}/'
Can't find Unicode property definition "Script=Is_Greek" at -e line 1.

Although Perl does optionally support the `is` prefix for property names
and standalone property values:

$ perl -Mutf8 -E 'say "π" =~ /\p{IsScript=Greek}/'
1
$ perl -Mutf8 -E 'say "π" =~ /\p{IsGreek}/'
1

However, this syntax is notoriously inconsistent among different regex
engines. Perl’s specific rules are documented in *perluniprops* (
http://perldoc.perl.org/perluniprops.html): \p{Is_*} (case- and
underscore-insensitive) is a synonym for \p{*}, which explains the above
behavior. Based on my past research for *Unicode Regular Expression
Engines* at IUC38, I suspect that there might not be any regex engine that
actually supports syntax like Script=IsGreek as described in UAX44-LM3! If
anybody knows otherwise, I’d love to hear about it.

Nova


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Asmus Freytag (c)

  
  
On 6/6/2016 9:09 AM, Markus Scherer wrote:

> Interesting discussion!
>
> ICU does not support "is" nor "in" prefixes. I wasn't even aware that
> UAX #44 loose matching prescribes "is". ICU just implements what
> Property[Value]Aliases.txt say:
>
> # Loose matching should be applied to all property names and property values, with
> # the exception of String Property values. With loose matching of property names and
> # values, the case distinctions, whitespace, hyphens, and '_' are ignored.
>
> The prefixes seem gratuitous and confusing. For example, if I read
> UAX44-LM3 right, it would allow [:isscript=isgreek:].
>
> We do support just [:Greek:] for scripts and [:L:] for general categories.
>
> I would rather not add support for the prefixes in ICU.
>
> markus

There is a difference between guaranteeing that "is" is not the leading
part of a property value alias and supporting a match. I agree that
requiring (or suggesting) such a thing is questionable (esp. in light of
what ICU does). However, making sure that those who follow that convention
can continue to do so with future aliases *is* reasonable.
A./

  



Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Markus Scherer
Interesting discussion!

ICU does not support "is" nor "in" prefixes. I wasn't even aware that UAX
#44 loose matching prescribes "is". ICU just implements what
Property[Value]Aliases.txt say:

# Loose matching should be applied to all property names and property values, with
# the exception of String Property values. With loose matching of property names and
# values, the case distinctions, whitespace, hyphens, and '_' are ignored.


The prefixes seem gratuitous and confusing. For example, if I
read UAX44-LM3 right, it would allow [:isscript=isgreek:].
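As an illustration only (this is not ICU code), here is a minimal Python sketch of the rule quoted from the data-file header above: fold case and drop whitespace, hyphens, and underscores, and nothing else. The function name `loose_key` is made up for this example.

```python
import re

def loose_key(name: str) -> str:
    """Fold a property name or value as the quoted header comment describes:
    ignore case distinctions, whitespace, hyphens, and '_'."""
    return re.sub(r"[\s_-]+", "", name).lower()

# Spelling variants collapse to the same key...
print(loose_key("Script_Extensions") == loose_key("script extensions"))  # True
# ...but an "is" prefix is left alone under this rule.
print(loose_key("Is_Greek") == loose_key("Greek"))  # False
```

Under this narrower rule, something like [:isscript=isgreek:] simply fails to resolve, because neither "isscript" nor "isgreek" is a key in the alias tables.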

We do support just [:Greek:] for scripts and [:L:] for general categories.

I would rather not add support for the prefixes in ICU.

markus


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
> 
>> The `is` prefix doesn’t provide any functionality that would otherwise
>> be unavailable. It doesn’t add any value, yet causes incompatibility,
>> author confusion, and it increases implementation complexity.
> 
> I don't see any evidence that it adds no value. Support for existing
> implementations is value.

It adds no value because it doesn’t enable any new functionality.
I agree support for existing implementations would have some value, but given 
that existing implementations disagree on the properties for which they support 
`is`, that is not going to happen anyway. It’s impossible to be compatible with 
all those different implementations at the same time.


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens

> On 6 Jun 2016, at 18:04, Ken Whistler  wrote:
> 
> UAX #44 doesn't *require* any regex engine to include this "is prefix" 
> handling.

Are you referring to the fact that the first paragraph on  
http://unicode.org/reports/tr44/#Matching_Rules uses “strongly recommended” and 
“should” instead of “required” and “must”?

> What UAX #44 does is recommend that all property and property value aliases 
> be correctly recognized, and then specifies a clear statement (in UAX44-LM3) 
> of the loose matching rule for recognizing the various forms of those aliases 
> that could be considered equivalent. I don't think messing with that rule 
> statement (which has been in place since 2010) would be helpful.

Why not? What I had in mind was adding a small sentence like:

> For compatibility reasons, implementations may optionally support any initial 
> prefix string "is".

This wouldn’t be a breaking change in any way, and it would enable new 
implementations that aim to follow UAX44 to do so without having to support 
`is`, and it would solve the problem everywhere the matching rules get applied 
rather than just for regular expressions.
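A rough Python sketch of what "optionally support" could look like in an implementation, assuming a hypothetical `resolve_script` helper with an `allow_is_prefix` switch that is off by default. The alias table is a tiny hand-written excerpt for illustration; a real one would be built from PropertyValueAliases.txt.

```python
import re

# Tiny illustrative alias table (short and long aliases for two scripts).
SCRIPT_ALIASES = {"Grek": "Greek", "Greek": "Greek",
                  "Latn": "Latin", "Latin": "Latin"}

def fold(s: str) -> str:
    """Ignore case, whitespace, hyphens, and underscores."""
    return re.sub(r"[\s_-]+", "", s).lower()

TABLE = {fold(alias): name for alias, name in SCRIPT_ALIASES.items()}

def resolve_script(token: str, allow_is_prefix: bool = False):
    """Return the canonical script name, or None if the token is unknown."""
    key = fold(token)
    if key in TABLE:
        return TABLE[key]
    # Opt-in compatibility path: retry once without a leading "is".
    if allow_is_prefix and key.startswith("is") and key[2:] in TABLE:
        return TABLE[key[2:]]
    return None

print(resolve_script("GREEK"))                            # Greek
print(resolve_script("Is_Greek"))                         # None
print(resolve_script("Is_Greek", allow_is_prefix=True))   # Greek
```

Trying the exact key first and the stripped key second is one possible design; it keeps the default path identical for engines that never enable the prefix.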

> I think the target of concern here is wrong. 

Not sure I agree. It seems to me the `is` prefix is problematic (for the same 
reasons) wherever it’s used, whether that’s in regular expressions or not.

> The target instead should be in UTS #18, which happily, has a proposed update 
> available for comment right now:
> 
> http://www.unicode.org/review/pri325/
> 
> The relevant point is:
> 
> http://www.unicode.org/reports/tr18/tr18-18.html#RL1.2
> 
> That is the conformance part that requires that conformant Unicode regex 
> implementations "must follow the Matching rules from [UAX44]".

Thanks for the pointer! I will submit my feedback there as well. It seems more 
awkward / difficult to add an exception there rather than just slightly 
tweaking the UAX44-LM3 text as suggested above, though.


Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Doug Ewell
Mathias Bynens wrote:

> Looking at implementations in the wild, Steven Levithan found
> (https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062)
> that some regex flavors use `Is` for scripts, some for blocks, some
> for scripts and blocks, some for neither. Since some script and block
> names collide, this causes problems, especially when porting regexes
> across flavors. 

Are script names and block names expected to share a common namespace?
If they don't, then there is no collision.

LM3 says to ignore initial (and non-final) "is" for all property aliases
and property value aliases, not just Script and Block values. There will
be a lot of "collisions" if you take all of those into consideration.
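Whether names "collide" depends on whether a bare token is looked up per property or in one pooled namespace. The Python sketch below uses small hand-picked samples of script and block names (not the full data from Scripts.txt and Blocks.txt), and the helper names are invented for illustration.

```python
import re

# Hand-picked samples only; not the complete lists.
SCRIPTS = {"Latin", "Greek", "Cyrillic", "Armenian", "Hebrew", "Arabic", "Thai"}
BLOCKS = {"Basic Latin", "Greek and Coptic", "Cyrillic", "Armenian",
          "Hebrew", "Arabic", "Thai"}

def loose_key(name: str) -> str:
    """Fold case/whitespace/hyphens/underscores and drop one initial "is"."""
    key = re.sub(r"[\s_-]+", "", name).lower()
    return key[2:] if key.startswith("is") and len(key) > 2 else key

script_keys = {loose_key(s) for s in SCRIPTS}
block_keys = {loose_key(b) for b in BLOCKS}

def candidates(token: str) -> list:
    """Properties a bare token could refer to if namespaces are pooled."""
    key = loose_key(token)
    found = []
    if key in script_keys:
        found.append("Script")
    if key in block_keys:
        found.append("Block")
    return found

print(sorted(script_keys & block_keys))
# ['arabic', 'armenian', 'cyrillic', 'hebrew', 'thai']
print(candidates("IsCyrillic"))  # ['Script', 'Block']
print(candidates("Is_Greek"))    # ['Script']
```

With an explicit property name (Script=Cyrillic versus Block=Cyrillic) each lookup stays inside its own namespace and the ambiguity disappears; it only shows up for bare tokens such as IsCyrillic.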

> The `is` prefix doesn’t provide any functionality that would otherwise
> be unavailable. It doesn’t add any value, yet causes incompatibility,
> author confusion, and it increases implementation complexity.

I don't see any evidence that it adds no value. Support for existing
implementations is value.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Ken Whistler


On 6/6/2016 12:58 AM, Mathias Bynens wrote:

> Backwards compatibility seems to be the only good reason to continue
> supporting the `is` prefix *for existing implementations*, such as the
> one in Perl. But why is it still a requirement for new engines to support
> it as part of UAX44-LM3?
>
> I’d like to propose changing UAX44-LM3 to make supporting the `is` prefix
> optional for new implementations.



I think the target of concern here is wrong. UAX #44 doesn't *require* 
any regex engine to include this "is prefix" handling. What UAX #44 does 
is recommend that all property and property value aliases be correctly 
recognized, and then specifies a clear statement (in UAX44-LM3) of the 
loose matching rule for recognizing the various forms of those aliases 
that could be considered equivalent. I don't think messing with that 
rule statement (which has been in place since 2010) would be helpful.


The target instead should be in UTS #18, which happily, has a proposed 
update available for comment right now:


http://www.unicode.org/review/pri325/

The relevant point is:

http://www.unicode.org/reports/tr18/tr18-18.html#RL1.2

That is the conformance part that requires that conformant Unicode regex 
implementations "must follow the Matching rules from [UAX44]".


If you are seeking indulgences for new engine implementations, that 
seems like the correct point to be adding clarifications and exceptions. 
Note that the following text in that section already includes wording 
about exceptions and compatibility issues. There is also a following 
section specifically about regex for the Script and Script Extensions 
properties that seems like it would be the appropriate place to talk 
about the Greek/IsGreek issue as pertains to regex support.


I would suggest you make specific suggestions about the text of UTS #18 
as part of the ongoing public review for the proposed update of that 
specification.


--Ken



Re: UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread srivas sinnathurai
Thanks Ashley.

> 
> On 06 June 2016 at 08:58 Mathias Bynens  wrote:
> 
> 
> http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:
> 
> > For loose matching of symbolic values, an initial prefix string "is" is
> > ignored. […] Ignoring any initial "is" on a symbolic value during loose
> > matching is likely to produce the best results in application areas such
> > as regex. Removal of an initial "is" string for a loose matching
> > comparison only needs to be done once for a symbolic value, and need not
> > be tested recursively. There are no property aliases or property value
> > aliases of the form "isisisisistooconvoluted" defined just to test
> > implementation edge cases.
> 
> UAX44 provides the reason for the existence of this “feature”:
> 
> > The reason for this is that APIs returning property values are often
> > named using the convention of prefixing "is" (or "Is" or "Is_", and so
> > forth) to a property value.
> 
> That seems like a rather weak argument. Specifically applying this to
> UTS18 (Unicode regular expressions):
> 
> > "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"
> 
> If there is already a way to match all symbols in the Greek script (not
> counting the use of aliases and other loose matching requirements), i.e.
> `Script=Greek` — what good does it do to add support for yet another one?
> 
> Looking at implementations in the wild, Steven Levithan found
> (https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062)
> that some regex flavors use `Is` for scripts, some for blocks, some for
> scripts and blocks, some for neither. Since some script and block names
> collide, this causes problems, especially when porting regexes across flavors.
> 
> The `is` prefix doesn’t provide any functionality that would otherwise be
> unavailable. It doesn’t add any value, yet causes incompatibility, author
> confusion, and it increases implementation complexity. UAX 44 includes two
> entire paragraphs pointing out that last part:
> 
> > Removal of an initial "is" string for a loose matching comparison only
> > needs to be done once for a symbolic value, and need not be tested
> > recursively. There are no property aliases or property value aliases of
> > the form "isisisisistooconvoluted" defined just to test implementation
> > edge cases.
> >
> > Existing and future property aliases and property value aliases are
> > guaranteed to be unique within their relevant namespaces, even if an
> > initial prefix string "is" is ignored. The existing cases of note for
> > aliases that do start with "is" are: dt=Iso
> > (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value
> > alias does not cause any problem, because there is no contrasting value
> > alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is
> > the entire property value alias, and is not a prefix. There is no null
> > value for the Line_Break property for it to contrast with, but
> > implementations of loose matching should be careful of this edge case,
> > so that "lb=IS" is not misinterpreted as matching a null value.
> 
> 
> Backwards compatibility seems to be the only good reason to continue
> supporting the `is` prefix *for existing implementations*, such as the one in
> Perl. But why is it still a requirement for new engines to support it as part
> of UAX44-LM3?
> 
> I’d like to propose changing UAX44-LM3 to make supporting the `is` prefix
> optional for new implementations.
> 
> 


UAX44: loose matching of symbolic values and the `is` prefix

2016-06-06 Thread Mathias Bynens
http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:

> For loose matching of symbolic values, an initial prefix string "is" is 
> ignored. […] Ignoring any initial "is" on a symbolic value during loose 
> matching is likely to produce the best results in application areas such as 
> regex. Removal of an initial "is" string for a loose matching comparison only 
> needs to be done once for a symbolic value, and need not be tested 
> recursively. There are no property aliases or property value aliases of the 
> form "isisisisistooconvoluted" defined just to test implementation edge cases.

UAX44 provides the reason for the existence of this “feature”:

> The reason for this is that APIs returning property values are often named 
> using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to a 
> property value.

That seems like a rather weak argument. Specifically applying this to UTS18 
(Unicode regular expressions):

> "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"

If there is already a way to match all symbols in the Greek script (not 
counting the use of aliases and other loose matching requirements), i.e. 
`Script=Greek` — what good does it do to add support for yet another one?

Looking at implementations in the wild, Steven Levithan found 
(https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062)
that some regex flavors use `Is` for scripts, some for blocks, some for 
scripts and blocks, some for neither. Since some script and block names 
collide, this causes problems, especially when porting regexes across flavors.

The `is` prefix doesn’t provide any functionality that would otherwise be 
unavailable. It doesn’t add any value, yet causes incompatibility, author 
confusion, and it increases implementation complexity. UAX 44 includes two 
entire paragraphs pointing out that last part:

> Removal of an initial "is" string for a loose matching comparison only needs 
> to be done once for a symbolic value, and need not be tested recursively. 
> There are no property aliases or property value aliases of the form 
> "isisisisistooconvoluted" defined just to test implementation edge cases.
> 
> Existing and future property aliases and property value aliases are 
> guaranteed to be unique within their relevant namespaces, even if an initial 
> prefix string "is" is ignored. The existing cases of note for aliases that do 
> start with "is" are: dt=Iso (Decomposition_Type=Isolated) and lb=IS. The 
> Decomposition_Type value alias does not cause any problem, because there is 
> no contrasting value alias dt=o (Decomposition_Type=olated). For lb=IS, note 
> that the "IS" is the entire property value alias, and is not a prefix. There 
> is no null value for the Line_Break property for it to contrast with, but 
> implementations of loose matching should be careful of this edge case, so 
> that "lb=IS" is not misinterpreted as matching a null value.
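The edge cases in the quoted paragraphs can be made concrete with a short Python sketch. It is one possible reading of the LM3 text, not a reference implementation, and `lm3_key` is a name invented here. Both the stored alias and the user-supplied value are assumed to be passed through the same function before comparison.

```python
import re

def lm3_key(value: str) -> str:
    """Loose-match key for a symbolic value, following the quoted LM3 text."""
    key = re.sub(r"[\s_-]+", "", value).lower()
    # Strip a single initial "is", but never reduce the value to the empty
    # string, so that lb=IS is not read as a null value.
    if key.startswith("is") and len(key) > 2:
        key = key[2:]
    return key

print(lm3_key("Is_Greek") == lm3_key("greek"))  # True
print(lm3_key("Isolated"))                      # olated
print(lm3_key("IS"))                            # is
print(lm3_key("isisGreek"))                     # isgreek
```

"Isolated" folds to "olated" on both sides of the comparison, so it still matches itself; "IS" is kept whole; and the prefix is removed once rather than recursively.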


Backwards compatibility seems to be the only good reason to continue supporting 
the `is` prefix *for existing implementations*, such as the one in Perl. But 
why is it still a requirement for new engines to support it as part of 
UAX44-LM3?

I’d like to propose changing UAX44-LM3 to make supporting the `is` prefix 
optional for new implementations.