Re: Possible bug in formal grammar for extended grapheme cluster

2017-12-18 Thread Mark Davis ☕️ via Unicode
If you look back at http://www.unicode.org/reports/tr29/tr29-27.html#GB8a
(2015), the rule was simply not to break sequences of RI characters.

We changed that in http://www.unicode.org/reports/tr29/tr29-29.html#GB12
(2016) to only group pairs. Unfortunately, the (informative) table
http://www.unicode.org/reports/tr29/tr29-31.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
was not updated after 2015 to keep pace with the rule changes, so that
update is still to do.
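
For anyone following along, the difference between the two rule sets can be
sketched roughly as below. This is only an illustration for strings made up
entirely of Regional Indicator characters (U+1F1E6..U+1F1FF); the names
is_ri, split_ri_run and pairs_only are invented for this sketch and do not
come from UAX #29 or from any library.

```python
from typing import List

# Regional Indicator Symbol Letters A..Z
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF


def is_ri(ch: str) -> bool:
    """True if ch is a Regional Indicator code point."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_ri_run(text: str, pairs_only: bool) -> List[str]:
    """Split a pure RI run into clusters (hypothetical helper).

    pairs_only=False: never break between RI characters (the 2015 behaviour).
    pairs_only=True:  break after every second RI character (the 2016 behaviour).
    """
    if not all(is_ri(ch) for ch in text):
        raise ValueError("this sketch only handles pure RI runs")
    if not text:
        return []
    if not pairs_only:
        return [text]
    # Python strings index by code point, so a slice of 2 is one RI pair.
    return [text[i:i + 2] for i in range(0, len(text), 2)]


gb_eu = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"
print(len(split_ri_run(gb_eu, pairs_only=False)))  # 1
print(len(split_ri_run(gb_eu, pairs_only=True)))   # 2
```

With an odd number of RI characters, the pairs-only mode leaves the last one
as a cluster of its own.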



Mark 

On Mon, Dec 18, 2017 at 10:59 AM, Andre Schappo via Unicode <
unicode@unicode.org> wrote:

> Ah! That explains why
>
> pcre2grep -u '^\X{1}$'
>
> matches with
>
> 🇬🇧
> 🇩🇪🇫🇷
> 🇨🇳🇮🇹🇲🇾
> 🇪🇸🇦🇺🇷🇺🇳🇱🇯🇵
>
> ...etc...
>
> André Schappo
>
> On 17 Dec 2017, at 17:17, Mark Davis ☕️ via Unicode wrote:
>
> Thanks for the feedback. You're correct about this; that is a holdover
> from an earlier version of the document when there was a more basic
> treatment of RI sequences.
>
> There is already an action to modify these. There is a placeholder review
> note about that just above
>
> http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
>
> (scroll up just a bit).
>
> Mark
>
> On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode <
> unicode@unicode.org> wrote:
>
>> Hi,
>>
>> It’s possible I’m missing something, but the formal grammar/regular
>> expression given for extended grapheme clusters appears to have a bug
>> in it.
>>
>> The bug is here:
>>
>> RI-Sequence := Regional_Indicator+
>>
>> If the formal grammar is intended to exactly match the rules given in
>> the “Grapheme Cluster Boundary Rules” section below it as-is, then
>> this should be
>>
>> RI-Sequence := Regional_Indicator Regional_Indicator
>>
>> since as given it would cause any number of RI characters to coalesce
>> into a single grapheme cluster, instead of pairs of characters. That
>> is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
>> grapheme cluster instead of the correct two.
>>
>> --
>> dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
>>    we do these things not because they are easy,  +49 159 03847809
>>   but because we thought they were going to be easy
>>   — ‘The Programmers’ Credo’, Maciej Cegłowski
>>
>>
>>
>
> 🌏 🌍 🌎
> André Schappo
> https://schappo.blogspot.co.uk
> https://twitter.com/andreschappo
> https://weibo.com/andreschappo
> https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization


Re: Possible bug in formal grammar for extended grapheme cluster

2017-12-18 Thread Andre Schappo via Unicode
Ah! That explains why

pcre2grep -u '^\X{1}$'

matches with

🇬🇧
🇩🇪🇫🇷
🇨🇳🇮🇹🇲🇾
🇪🇸🇦🇺🇷🇺🇳🇱🇯🇵

...etc...

André Schappo

On 17 Dec 2017, at 17:17, Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:

Thanks for the feedback. You're correct about this; that is a holdover from an 
earlier version of the document when there was a more basic treatment of RI 
sequences.

There is already an action to modify these. There is a placeholder review note 
about that just above

http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters

(scroll up just a bit).

Mark

On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode <unicode@unicode.org> wrote:
Hi,

It’s possible I’m missing something, but the formal grammar/regular
expression given for extended grapheme clusters appears to have a bug
in it.


The bug is here:

RI-Sequence := Regional_Indicator+

If the formal grammar is intended to exactly match the rules given in
the “Grapheme Cluster Boundary Rules” section below it as-is, then
this should be

RI-Sequence := Regional_Indicator Regional_Indicator

since as given it would cause any number of RI characters to coalesce
into a single grapheme cluster, instead of pairs of characters. That
is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
grapheme cluster instead of the correct two.
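
The two candidate productions can be compared with a small sketch using
Python's re module. It assumes the Regional_Indicator class can be written
out as the explicit code point range U+1F1E6..U+1F1FF, since the standard
library re module has no \X and no Grapheme_Cluster_Break property support.
The names RI, RI_PLUS and RI_PAIR are made up for this sketch.

```python
import re

# Regional_Indicator written as an explicit range (assumption of this sketch)
RI = "[\U0001F1E6-\U0001F1FF]"

RI_PLUS = re.compile(f"{RI}+")     # RI-Sequence := Regional_Indicator+
RI_PAIR = re.compile(f"{RI}{RI}")  # RI-Sequence := Regional_Indicator Regional_Indicator

# U+1F1EC U+1F1E7 U+1F1EA U+1F1FA, the example from the report above
text = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"

plus_matches = RI_PLUS.findall(text)
pair_matches = RI_PAIR.findall(text)

print(len(plus_matches))  # 1: all four characters coalesce
print(len(pair_matches))  # 2: one match per pair
```

This covers only the RI-Sequence production in isolation, not the rest of
the extended grapheme cluster grammar.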

--
dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
   we do these things not because they are easy,  +49 159 03847809
  but because we thought they were going to be easy
  — ‘The Programmers’ Credo’, Maciej Cegłowski




🌏 🌍 🌎
André Schappo
https://schappo.blogspot.co.uk
https://twitter.com/andreschappo
https://weibo.com/andreschappo
https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization







Re: Possible bug in formal grammar for extended grapheme cluster

2017-12-17 Thread Mark Davis ☕️ via Unicode
Thanks for the feedback. You're correct about this; that is a holdover from
an earlier version of the document when there was a more basic treatment of
RI sequences.

There is already an action to modify these. There is a placeholder review
note about that just above

http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters

(scroll up just a bit).

Mark

On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode <
unicode@unicode.org> wrote:

> Hi,
>
> It’s possible I’m missing something, but the formal grammar/regular
> expression given for extended grapheme clusters appears to have a bug
> in it.
>
> The bug is here:
>
> RI-Sequence := Regional_Indicator+
>
> If the formal grammar is intended to exactly match the rules given in
> the “Grapheme Cluster Boundary Rules” section below it as-is, then
> this should be
>
> RI-Sequence := Regional_Indicator Regional_Indicator
>
> since as given it would cause any number of RI characters to coalesce
> into a single grapheme cluster, instead of pairs of characters. That
> is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
> grapheme cluster instead of the correct two.
>
> --
> dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
>    we do these things not because they are easy,  +49 159 03847809
>   but because we thought they were going to be easy
>   — ‘The Programmers’ Credo’, Maciej Cegłowski
>
>
>