Quick update: Manish pointed out that I'd misstated one of the rules; it should be:

skin-sequence = $E_Base $Extend* $E_Modifier ;

With that change, the test passes. (Thanks Manish!)

Mark

On Wed, Jan 3, 2018 at 10:16 AM, Mark Davis ☕️ <[email protected]> wrote:

> I had a UTC action to adjust
> http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
> to update the regex, and to make other necessary changes to the
> surrounding text.
>
> Here is what I've come up with for an EBNF formulation. The $x are the GCB
> properties.
>
> cluster = crlf | $Control | precore* core postcore* ;
>
> crlf = $CR $LF ;
>
> precore = $Prepend ;
>
> postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] );
>
> core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence
>        | [^$Control $CR $LF] );
>
> hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;
>
> ri-sequence = $RI $RI ;
>
> skin-sequence = $E_Base $E_Modifier ;
>
> xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} )
>                   (?: $Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;
>
> virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;
>
> I have tools to turn that into a (lovely) regex:
>
> \p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}])(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])*
>
> (It is a bit shorter if some more property names/values are abbreviated.)
>
> I then tested against the current test file: GraphemeBreakTest.txt. There
> is one outlying failure with that test file:
>
> 813) ☝̈🏻
> hex: 261D 0308 1F3FB
> test: [0, 4]
> ebnf: [0, 2, 4]
>
> I believe that is a problem with the test rather than the BNF, but I need
> to track it down in any event.
>
> A regex is much easier for many applications to use than the current rule
> syntax, so I'm going to see if the other segmentations could be
> reformulated as EBNFs (ideally corresponding to regular grammars, or in
> the worst case, to PEGs).
>
> Feedback is welcome.
>
> Mark
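
As a rough illustration of why the corrected skin-sequence rule resolves case 813, here is a minimal Python sketch. It is not the tooling mentioned in the message, and it rests on several assumptions: the character classes are hypothetical stand-ins that cover only the three code points in that test case (Python's `re` has no `\p{gcb=...}` support), the grammar is reduced to "skin-sequence or any single character, followed by $Extend*", and the reported offsets are taken to be UTF-16 code units, which is consistent with 1 + 1 + 2 = 4.

```python
import re

# Hypothetical stand-ins for GCB property classes. Real code would need full
# Unicode property data; these cover only the code points in test case 813.
E_BASE = "\u261D"                      # U+261D only
EXTEND = "\u0308"                      # U+0308 COMBINING DIAERESIS only
E_MODIFIER = "\U0001F3FB-\U0001F3FF"   # emoji skin-tone modifiers

# Rule as first posted:  skin-sequence = $E_Base $E_Modifier ;
OLD_SKIN = f"[{E_BASE}][{E_MODIFIER}]"
# Corrected rule:        skin-sequence = $E_Base $Extend* $E_Modifier ;
NEW_SKIN = f"[{E_BASE}][{EXTEND}]*[{E_MODIFIER}]"


def cluster_regex(skin):
    # Reduced grammar (an assumption): core is a skin-sequence or any single
    # character, and postcore is just $Extend.
    return re.compile(f"(?:{skin}|.)[{EXTEND}]*", re.DOTALL)


def boundaries(text, pattern):
    """Return cluster boundaries as UTF-16 code unit offsets."""
    offsets = [0]
    for match in pattern.finditer(text):
        units = len(match.group().encode("utf-16-le")) // 2
        offsets.append(offsets[-1] + units)
    return offsets


sample = "\u261D\u0308\U0001F3FB"      # hex: 261D 0308 1F3FB
print(boundaries(sample, cluster_regex(OLD_SKIN)))  # [0, 2, 4]
print(boundaries(sample, cluster_regex(NEW_SKIN)))  # [0, 4]
```

With the rule as first posted, the $Extend character between the base and the modifier prevents the skin-sequence from matching, so the modifier ends up in its own cluster. Allowing $Extend* inside the sequence keeps all three code points together, which agrees with the test file's expected [0, 4].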

