Re: New Character Property for Prepended Concatenation Marks

Asmus Freytag (t) Thu, 26 Nov 2015 05:02:47 -0800

On 11/26/2015 4:29 AM, Philippe Verdy wrote:

2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-...@ix.netcom.com>:
    On 11/26/2015 3:08 AM, Philippe Verdy wrote:
    The related definition for extended grapheme clusters says:

    ( CRLF
    | *Prepend* *( RI-sequence | Hangul-Syllable | !Control )
             ( Grapheme_Extend | *SpacingMark* )*
    | . )

    However I do not understand why it may include only one
    Hangul-Syllable when applying prepended concatenation marks. And
    if the definition excludes whitespace, nothing prevents it from
    extending to arbitrary sequences of
    letters/digits/symbols/punctuation (this could span very long
    sequences of sinograms, or other letters from scripts that do not
    use whitespace as a word separator). Even in the Latin script it
    would extend to the punctuation signs that may follow any word,
    or to an entire mathematical formula such as "1+2*3" but not "sin
    x"...
    White space is clearly NOT part of a grapheme cluster, so I don't see
    what your issue is?


No, whitespace is a grapheme cluster on its own, matching (.)
The issue is the overlong extended grapheme cluster after any Prepend, which occurs because of ( Grapheme_Extend | *SpacingMark* )*. But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we ignore the rare RI-sequences, which are still short) and will not match the sequences of digits or letters intended by the prepended concatenation marks, but only one.

Prepend in front of an RI-Sequence is really a "defective" cluster in terms of the Arabic number sign's definition. So, one thing the Grapheme cluster specification should be clear about is that it does not describe the breaks in formatting runs needed to implement these characters.

Also, for editing (a common use of grapheme clusters) running these together with any following characters is not very useful in my opinion. So, perhaps much of the "Prepend" is a bug after all?

    BTW, if after careful analysis you think there is a mistake, you
    should probably raise a bug on this.
For now the proposal only speaks about listing the prepended characters with a newly defined property; it still does not address the sequences of graphemes over which they apply. As these sequences are specific to each prepended character, I don't see how the new property will help if we need to specialize each one of these characters: we still need a custom algorithm (possibly tailored by locale) for breaking clusters using them.

correct - I wouldn't call that an "algorithm" -- it's the formatting behavior for that code point (some of them are similar; as I said, I see three patterns: following digit, digit run and word run).
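
To make the three patterns concrete, here is a rough Python sketch of how a format run could be computed once a pattern has been chosen for a mark. It is only an illustration: the function name `format_run_end` and the pattern labels are made up here, no real code point is assigned to any pattern, and "digit" and "word" are approximated with General_Category values rather than taken from any specification.

```python
import unicodedata


def format_run_end(text, mark_index, pattern):
    """Return the index just past the run spanned by the mark at mark_index.

    `pattern` is one of the hypothetical labels "following_digit",
    "digit_run" or "word_run"; which mark uses which pattern is NOT
    decided here and would have to come from an actual specification.
    """
    def is_digit(ch):
        # Approximation: decimal digits only (General_Category Nd).
        return unicodedata.category(ch) == "Nd"

    def is_word_char(ch):
        # Approximation: letters plus combining marks.
        return unicodedata.category(ch)[0] in ("L", "M")

    i = mark_index + 1
    n = len(text)
    if pattern == "following_digit":
        # At most one digit directly after the mark.
        return i + 1 if i < n and is_digit(text[i]) else i
    if pattern == "digit_run":
        # The maximal run of digits after the mark.
        while i < n and is_digit(text[i]):
            i += 1
        return i
    if pattern == "word_run":
        # The maximal run of letters and marks after the mark.
        while i < n and is_word_char(text[i]):
            i += 1
        return i
    raise ValueError("unknown pattern: " + pattern)
```

For a string made of ARABIC NUMBER SIGN followed by two Arabic-Indic digits, "following_digit" covers only the first digit while "digit_run" covers both, which is exactly the distinction the grapheme cluster rule cannot express.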

With the definition given above, the extended grapheme clusters will break after each letter/digit/punctuation and
<ARABIC NUMBER SIGN, ARABIC DIGIT ONE, ARABIC DIGIT TWO>
will still break into
<ARABIC NUMBER SIGN, ARABIC DIGIT ONE> separated from <ARABIC DIGIT TWO>
The proposed new property does not change this: how can we really extend the sequence of digits so that the number sign will span all of them? Use CGJ or explicit sequence delimiters?
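
The break described above can be reproduced with a small Python sketch of the quoted rule. It is a simplification under stated assumptions: only U+0600 ARABIC NUMBER SIGN is treated as Prepend, the RI-sequence and Hangul-Syllable alternatives are left out, and Control, Grapheme_Extend and SpacingMark are approximated from General_Category values. The names `PREPEND` and `clusters` are chosen here for illustration.

```python
import unicodedata

# Assumption for illustration: only ARABIC NUMBER SIGN counts as Prepend.
PREPEND = {0x0600}


def is_control(ch):
    # Rough stand-in for the Control class.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")


def is_extend_or_spacing(ch):
    # Rough stand-in for ( Grapheme_Extend | SpacingMark ).
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")


def clusters(text):
    """Split text with a simplified form of the quoted cluster rule."""
    out = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if text.startswith("\r\n", i):
            i += 2  # CRLF
        else:
            j = i
            while j < n and ord(text[j]) in PREPEND:
                j += 1  # Prepend*
            if j < n and not is_control(text[j]):
                j += 1  # exactly one !Control base
                while j < n and is_extend_or_spacing(text[j]):
                    j += 1  # ( Grapheme_Extend | SpacingMark )*
                i = j
            else:
                i += 1  # the "." alternative: one code point
        out.append(text[start:i])
    return out


number = "\u0600\u0661\u0662"  # NUMBER SIGN, DIGIT ONE, DIGIT TWO
print([[hex(ord(c)) for c in cluster] for cluster in clusters(number)])
```

Because the rule admits exactly one base after the Prepend characters, the second digit always starts a new cluster, however the Prepend set itself is defined.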

correct, gives an incorrect specification - we need an actual specification for the format runs.

A./
