On 11/26/2015 4:29 AM, Philippe Verdy wrote:
2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-...@ix.netcom.com>:
On 11/26/2015 3:08 AM, Philippe Verdy wrote:
The related definition for extended grapheme clusters says:
( CRLF
| Prepend* ( RI-sequence | Hangul-Syllable | !Control )
( Grapheme_Extend | SpacingMark )*
| . )
However I do not understand why it may include only one
Hangul-Syllable when applying prepended concatenation marks. And
if the definition excludes whitespace, nothing prevents it from
extending to arbitrary sequences of
letters/digits/symbols/punctuation (this could span very long
sequences of sinograms, or other letters from scripts that do not
use whitespace as a word separator). Even in the Latin script it
would extend to the punctuation signs that may follow any word,
or to an entire mathematical formula such as "1+2*3" but not "sin
x"...
White space is clearly NOT part of a grapheme cluster, so I don't see
what your issue is?
No, whitespace is a grapheme cluster on its own, matching (.)
The issue is the overlong extended grapheme cluster that occurs after
any Prepend because of ( Grapheme_Extend | SpacingMark )*
But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we
ignore the rare RI-sequences, which are still short) and will
not match the sequences of digits or letters intended by the prepended
concatenation marks, but only one of them.
Prepend in front of an RI-Sequence is really a "defective" cluster in
terms of the Arabic number sign's definition. So, one thing the Grapheme
cluster specification should be clear about is that it does not describe
the breaks in formatting runs needed to implement these characters.
Also, for editing (a common use of grapheme clusters) running these
together with any following characters is not very useful in my opinion.
So, perhaps much of the "Prepend" is a bug after all?
BTW, if after careful analysis you think there is a mistake, you
should probably raise a bug on this.
For now the proposal only speaks about enumerating the prepended
characters with a newly defined property; it still does not
address which sequences of graphemes they apply to. As
these sequences are specific to each prepended character, I don't see
how the new property will help if we need to specialize each one of
these characters: we still need a custom algorithm (possibly tailored by
locale) for breaking clusters using them.
correct - I wouldn't call that an "algorithm" -- it's the formatting
behavior for that code point (some of them are similar; as I said, I see
three patterns: following digit, digit run and word run).
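
To make the three patterns concrete, here is a rough sketch in Python. It
is only an illustration: the names Scope, is_digit, is_word and scope_end
are hypothetical, the character tests are crude approximations based on
General_Category, and nothing here says which mark uses which pattern.

```python
import unicodedata
from enum import Enum


class Scope(Enum):
    """Hypothetical labels for the three patterns named above."""
    FOLLOWING_DIGIT = 1
    DIGIT_RUN = 2
    WORD_RUN = 3


def is_digit(ch):
    # Assumption: "digit" means General_Category Nd.
    return unicodedata.category(ch) == "Nd"


def is_word(ch):
    # Assumption: a "word" character is any letter, mark or number.
    return unicodedata.category(ch)[0] in "LMN"


def scope_end(text, mark_index, scope):
    """Return the index just past the run that the mark at mark_index covers."""
    i = mark_index + 1
    if scope is Scope.FOLLOWING_DIGIT:
        # At most one digit directly after the mark.
        return i + 1 if i < len(text) and is_digit(text[i]) else i
    accepts = is_digit if scope is Scope.DIGIT_RUN else is_word
    while i < len(text) and accepts(text[i]):
        i += 1
    return i
```

For the string U+0600 U+0661 U+0662 U+0663 followed by a space, the
"following digit" scope ends after the first digit, while the "digit run"
and "word run" scopes both end at the space.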
With the definition given above, the extended grapheme clusters will
break after each letter/digit/punctuation and
<ARABIC NUMBER SIGN, ARABIC DIGIT ONE, ARABIC DIGIT TWO>
will still break into
<ARABIC NUMBER SIGN, ARABIC DIGIT ONE> separated from <ARABIC DIGIT TWO>
The proposed new property does not change this: how can we really
extend the sequence of digits so that the number sign will span all of
them? Use CGJ or explicit sequence delimiters?
correct, that gives an incorrect specification - we need an actual
specification for the format runs.
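
The break discussed above can be reproduced with a small Python sketch of
the quoted rule. This is a simplification under stated assumptions, not a
conformant segmenter: the PREPEND set is a hand-picked guess at the
characters the proposal would list, the Extend, SpacingMark and Control
classes are approximated from General_Category, and RI-sequences and
Hangul syllables are left out entirely. The names gcb_class and clusters
are made up for this example.

```python
import unicodedata

# Assumption: an illustrative subset of prepended concatenation marks.
# The real list would come from the Unicode data files.
PREPEND = {0x0600, 0x0601, 0x0602, 0x0603, 0x0604, 0x0605, 0x06DD, 0x070F}


def gcb_class(ch):
    """Crude stand-in for Grapheme_Cluster_Break, from General_Category."""
    if ord(ch) in PREPEND:
        return "Prepend"
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me"):
        return "Extend"
    if cat == "Mc":
        return "SpacingMark"
    if cat in ("Cc", "Cf", "Zl", "Zp"):
        return "Control"
    return "Other"


def clusters(text):
    """Split text using: CRLF | Prepend* !Control (Extend | SpacingMark)* | ."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text.startswith("\r\n", i):
            out.append("\r\n")
            i += 2
            continue
        j = i
        while j < n and gcb_class(text[j]) == "Prepend":
            j += 1
        if j < n and gcb_class(text[j]) != "Control":
            j += 1  # exactly ONE base character, never a run
            while j < n and gcb_class(text[j]) in ("Extend", "SpacingMark"):
                j += 1
        elif j == i:
            j += 1  # lone control character: the "." alternative
        # Otherwise a Prepend run met a control or the end of the text;
        # the last Prepend then acts as the !Control base.
        out.append(text[i:j])
        i = j
    return out
```

With this sketch, clusters("\u0600\u0661\u0662") gives two clusters, the
number sign with the first digit and then the second digit alone, because
the rule admits only one base character after the Prepend run.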
A./