On 2019-01-27 11:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 27 Jan 2019 19:57:37 +0000
> James Kass via Unicode <unicode@unicode.org> wrote:
>
>> On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
>>> In my original post, I asked if a language-specific tailoring of
>>> the text segmentation algorithm was the solution but no one here
>>> has agreed so far.
>>
>> If there are likely to be many languages requiring exceptions to the
>> segmentation algorithm wrt U+2019, then perhaps it would be better to
>> establish conventions using ZWJ/ZWNJ and adjust the algorithm
>> accordingly so that it would work across languages.  (Rather than
>> requiring additional and open-ended language-specific tailorings.)
>> (I inserted several combinations of ZWJ/ZWNJ into James Tauber's
>> example, but couldn't improve the segmentation in LibreOffice,
>> although it was possible to make it worse.)
>
> If you look at TR29, you will see that ZWJ should only affect word
> boundaries for emoji.  ZWNJ shall have no effect.  What you want is a
> control that joins words, but we don't have that.
>
> Richard.


(https://unicode.org/reports/tr29/)
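
Richard's reading can be checked against an actual UAX #29 implementation.  A quick sketch in Python with PyICU (assuming the PyICU package is available; putting the joiner right after U+2019 is just one of the combinations one might try):

import icu

def word_segments(text):
    # Segment text at UAX #29 word boundaries via ICU's BreakIterator.
    # Offsets are UTF-16 code units, which match Python str indices here
    # because everything in the sample is in the BMP.
    bi = icu.BreakIterator.createWordInstance(icu.Locale("en"))
    bi.setText(text)
    bounds = [0] + list(bi)   # iterating the BreakIterator yields boundaries
    return [text[s:e] for s, e in zip(bounds, bounds[1:])]

base = "γένοιτ\u2019 ἄν"      # James Tauber's example, U+2019 after the tau
for label, joiner in [("plain", ""), ("+ZWJ", "\u200D"), ("+ZWNJ", "\u200C")]:
    print(label, word_segments(base.replace("\u2019", "\u2019" + joiner)))

If I'm reading TR29 right, rule WB4 just skips both of them here (ZWJ only matters before emoji, via WB3c), so a conformant implementation shouldn't move any boundary in any of the three variants.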

It’s been said that the text segmentation rules seem over-complicated and are probably non-trivial to implement properly.  I tried your suggestion of WORD JOINER U+2060 after the tau ( γένοιτ⁠’ ἄν , with the invisible U+2060 between the tau and the apostrophe), but it only added yet another word break in LibreOffice.
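
For comparison, the same check with WORD JOINER, reusing word_segments from the sketch above:

# U+2060 between the tau and U+2019, as in the LibreOffice test above.
# TR29 gives U+2060 Word_Break=Format, and rule WB4 skips Format characters,
# so a conformant segmenter should report the same boundaries for both lines;
# any extra break would be LibreOffice's own doing.
print(word_segments("γένοιτ\u2019 ἄν"))
print(word_segments("γένοιτ\u2060\u2019 ἄν"))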

The problem may stem from the fact that WORD JOINER is specified to be treated as though it were ZERO WIDTH NO-BREAK SPACE (U+FEFF) in its word-joining role.  IOW its name and description both say *space*, and an implementation that takes that at face value classifies it as a space, and a space indicates a word break.  That doesn’t seem right.

Instead of treating WORD JOINER as a SPACE, why not treat it as a WORD JOINER?  Doing so could prevent a lot of undesirable string segmentation, and it might also minimize future language-specific tailorings and ease the burden on implementers.
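
In the meantime, a pragmatic workaround is post-processing rather than tailoring the algorithm: U+2019 is Word_Break=MidNumLet, so it stays word-internal only when a letter follows it, which means an elision apostrophe before a space always splits off on its own.  A hypothetical sketch (my own helper names, again building on word_segments above):

def attach_elision(segments):
    # Fold a lone U+2019 back onto the word before it (Greek elision).
    # A band-aid in application code, not a tailoring of UAX #29 itself.
    merged = []
    for seg in segments:
        if seg == "\u2019" and merged and merged[-1][-1:].isalpha():
            merged[-1] += seg
        else:
            merged.append(seg)
    return merged

print(attach_elision(word_segments("γένοιτ\u2019 ἄν")))
# hopefully ['γένοιτ’', ' ', 'ἄν'] rather than ['γένοιτ', '’', ' ', 'ἄν']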
