On 11/26/2015 4:29 AM, Philippe Verdy wrote:
2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-...@ix.netcom.com>:

    On 11/26/2015 3:08 AM, Philippe Verdy wrote:
    The related definition for extended grapheme clusters says:

    ( CRLF
    | *Prepend* *( RI-sequence | Hangul-Syllable | !Control )
             ( Grapheme_Extend | *SpacingMark* )*
    | . )

    However I do not understand why it may include only one
    Hangul-Syllable when applying prepended concatenation marks. And
    if the definition excludes whitespace, nothing prevents it from
    extending to arbitrary sequences of
    letters/digits/symbols/punctuation (this could span very long
    sequences of sinograms, or other letters from scripts that do not
    use whitespace as word separators). Even in the Latin script it
    would extend to the punctuation signs that may follow any word,
    or to an entire mathematical formula such as "1+2*3" but not "sin
    x"...

    White space is clearly NOT part of a grapheme cluster, so I don't see
    what your issue is?


No, whitespace is a grapheme cluster on its own, matching ( . )

The issue is the overlong extended grapheme cluster that occurs after any Prepend because of ( Grapheme_Extend | *SpacingMark* )*. But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we ignore the rare RI-sequences, which are still short) and will not match the whole sequence of digits or letters intended by the prepended concatenation marks, but only the first one.

Prepend in front of an RI-Sequence is really a "defective" cluster in terms of the Arabic number sign's definition. So, one thing the Grapheme cluster specification should be clear about is that it does not describe the breaks in formatting runs needed to implement these characters.

Also, for editing (a common use of grapheme clusters) running these together with any following characters is not very useful in my opinion. So, perhaps much of the "Prepend" is a bug after all?

    BTW, if after careful analysis you think there is a mistake, you
    should probably raise a bug on this.


For now the proposal only speaks about enumerating the prepended characters with a newly defined property; it still does not address which sequences of graphemes they apply to. As these sequences are specific to each prepended character, I don't see how the new property will help if we need to special-case each one of these characters: we still need a custom algorithm (possibly tailored by locale) for breaking clusters using them.

Correct - I wouldn't call that an "algorithm" -- it's the formatting behavior for that code point (some of them are similar; as I said, I see three patterns: following digit, digit run, and word run).
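
To make the three patterns concrete, here is a rough Python sketch of how the extent of a format run after a sign could be computed. The function name, the pattern names, and the use of str.isdigit() / str.isalnum() as stand-ins for "digit" and "word character" are assumptions made for illustration only; nothing here comes from the proposal or from UAX #29, and which pattern applies to which code point is left open.

```python
def format_run_end(text, sign_index, pattern):
    """Return the index just past the run that a sign at sign_index spans.

    Hypothetical helper: pattern is one of "following_digit",
    "digit_run" or "word_run", mirroring the three behaviours above.
    """
    i = sign_index + 1
    if pattern == "following_digit":
        # The sign covers at most one digit.
        return i + 1 if i < len(text) and text[i].isdigit() else i
    if pattern == "digit_run":
        # The sign covers every consecutive digit.
        while i < len(text) and text[i].isdigit():
            i += 1
        return i
    if pattern == "word_run":
        # The sign covers consecutive letters and digits.
        while i < len(text) and text[i].isalnum():
            i += 1
        return i
    raise ValueError("unknown pattern: " + pattern)


# ARABIC NUMBER SIGN, ARABIC-INDIC DIGIT ONE, DIGIT TWO, space, "x"
sample = "\u0600\u0661\u0662 x"
print(format_run_end(sample, 0, "following_digit"))
print(format_run_end(sample, 0, "digit_run"))
```

The point of the sketch is only that the run length depends on the code point's own formatting behavior, not on grapheme cluster boundaries.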

With the definition given above, the extended grapheme clusters will break after each letter/digit/punctuation and
 <ARABIC NUMBER SIGN, ARABIC DIGIT ONE, ARABIC DIGIT TWO>
will still break into
<ARABIC NUMBER SIGN, ARABIC DIGIT ONE> separated from <ARABIC DIGIT TWO>
The proposed new property does not change this: how can we really extend the sequence of digits so that the number sign will span all of them? Use CGJ or explicit sequence delimiters?
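
The break described above can be reproduced with a toy matcher for the quoted rule. This is a minimal sketch, not a conforming UAX #29 implementation: the PREPEND set is a hand-picked assumption standing in for the proposed property, Grapheme_Extend and SpacingMark are approximated by the general categories Mn/Me/Mc, Control by Cc/Zl/Zp, and RI-sequences and Hangul syllables are not treated specially.

```python
import unicodedata

# Assumption: a stand-in for the proposed Prepend class (illustrative only).
PREPEND = {0x0600, 0x0601, 0x0602, 0x0603, 0x0604, 0x0605, 0x06DD, 0x070F}


def is_prepend(ch):
    return ord(ch) in PREPEND


def is_control(ch):
    # Rough approximation of the Control class.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")


def is_extend(ch):
    # Rough approximation of Grapheme_Extend | SpacingMark.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")


def toy_clusters(text):
    """Split text following: CRLF | Prepend* !Control (Extend|SpacingMark)* | ."""
    clusters = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if text.startswith("\r\n", i):
            i += 2
        else:
            j = i
            while j < n and is_prepend(text[j]):
                j += 1
            if j < n and not is_control(text[j]):
                j += 1  # exactly one base character
                while j < n and is_extend(text[j]):
                    j += 1
                i = j
            else:
                i = start + 1  # fallback alternative: any single character
        clusters.append(text[start:i])
    return clusters


# ARABIC NUMBER SIGN, ARABIC-INDIC DIGIT ONE, ARABIC-INDIC DIGIT TWO
for cluster in toy_clusters("\u0600\u0661\u0662"):
    print([unicodedata.name(c) for c in cluster])
```

Under these assumptions the sign clusters with the first digit only, and the second digit ends up in a cluster of its own, because the rule admits exactly one base character after the Prepend run.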

Correct, that gives an incorrect specification - we need an actual specification for the format runs.

A./
