On 26/11/2004 23:24, Doug Ewell wrote:

...

Most "break opportunities" are between words, a concept often indicated
by an ordinary space (U+0020).  So you wouldn't generally have to
precede *every* combination of NBSP+combining mark with ZWSP "to ensure
a break opportunity," only those combinations preceded by a character
other than U+0020 that might inhibit the break.  For example, if you
wanted to ensure a break opportunity following U+2014 EM DASH, you would
probably use the ZWSP, but you don't have to use it everywhere.


As I understand it (I asked for confirmation of this but have not received it), the current version of UAX #14 gives no break opportunity between SPACE and NBSP, because rule LB11b precedes rule LB12, although a note says "Many existing implementations reverse the order of precedence between rules LB11b and LB12." A proposed update to UAX #14 would have the effect of reversing these rules (except for WJ). But until this change has been accepted and fully implemented, surely I need to use the ZWSP. Indeed, to be safe I will always need the ZWSP, as I can never be sure that the update has been implemented.
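To make the "always use the ZWSP" policy concrete, here is a minimal sketch in Python. The function name add_break_opportunities is hypothetical. The sketch assumes the sequence of interest is simply NBSP followed by a combining mark (detected by a non-zero canonical combining class), and it does not try to handle an RLM placed before the NBSP.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE
NBSP = "\u00a0"  # NO-BREAK SPACE


def add_break_opportunities(text: str) -> str:
    """Insert ZWSP before every NBSP + combining mark not already preceded by one.

    Hypothetical helper: it applies the ZWSP unconditionally, without
    guessing which ordering of LB11b and LB12 a renderer implements.
    """
    out = []
    for i, ch in enumerate(text):
        # Is the next character a combining mark (non-zero combining class)?
        mark_follows = (
            i + 1 < len(text) and unicodedata.combining(text[i + 1]) != 0
        )
        # No ZWSP is added at the very start of the text or after an existing ZWSP.
        if ch == NBSP and mark_follows and out and out[-1] != ZWSP:
            out.append(ZWSP)
        out.append(ch)
    return "".join(out)
```

Running the function twice on the same text should give the same result as running it once, so it can be applied safely to text that already contains some ZWSPs.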



I also wonder whether the RLM is needed for a construction that is expected to occur amid a sea of Hebrew. U+00A0 is of type CS, which is weak directional, meaning its directionality is dictated by that of surrounding characters. If the surrounding characters are Hebrew (RTL), the RLM seems redundant (though of course not "forbidden").
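The bidi classes mentioned here can be checked against the Unicode Character Database, for instance with Python's standard unicodedata module. This is only a lookup sketch; the variable name bidi_classes and the choice of sample characters are illustrative assumptions.

```python
import unicodedata

# Sample characters: NBSP, RLM, HEBREW LETTER ALEF, LATIN CAPITAL LETTER A.
samples = (0x00A0, 0x200F, 0x05D0, 0x0041)

# Map each code point to its Bidi_Class value as reported by the UCD.
bidi_classes = {cp: unicodedata.bidirectional(chr(cp)) for cp in samples}

for cp, bidi_class in bidi_classes.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {bidi_class}")
```

NBSP is reported as CS, a weak type, while RLM and the Hebrew letter are R and the Latin letter is L. So whether the NBSP resolves as RTL depends on its neighbours, which is why the surrounding text matters.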


The point here is that individual Hebrew words and short phrases are often embedded within LTR text, which may be some kind of markup. I don't want to see Hebrew words being garbled because markup has been added, or because they have been quoted in an otherwise LTR document. So again the safest thing is to use the RLM in every case, and to keep it with the rest of the word e.g. when copying and pasting.

In fact this apparently leads to a small problem with text boundaries. If I understand UAX #29 correctly, in the combination <SPACE, RLM, X>, where X is any character which might form part of a word (including NBSP), the word boundary will be between RLM (as with any other format character) and X, not between SPACE and RLM. Is that correct? Or are both word boundaries? If so, this seems undesirable. In such a situation, RLM affects what follows, not what precedes, and so the word (or other text) boundary should be only before RLM. Is this perhaps a change which should be made to UAX #29? My proposal would be to add rules for certain format characters (RLM, LRM, LRO, RLO, LRE, RLE, perhaps others?) which prevent a word break after these characters and before any ALetter or Numeric. But for PDF the rule should perhaps prevent a word break before it.
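As a rough illustration of the proposed rules only, and not of UAX #29 as it stands, the check could be sketched in Python as below. The function name proposed_no_break is hypothetical. str.isalpha() and str.isdecimal() are crude stand-ins for the ALetter and Numeric word-break properties, and the reading of the PDF rule (no break between a letter or digit and a following PDF) is an assumption.

```python
# Directional format characters named in the proposal.
DIRECTIONAL_FORMATS = {
    "\u200e",  # LRM, LEFT-TO-RIGHT MARK
    "\u200f",  # RLM, RIGHT-TO-LEFT MARK
    "\u202a",  # LRE, LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RLE, RIGHT-TO-LEFT EMBEDDING
    "\u202d",  # LRO, LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RLO, RIGHT-TO-LEFT OVERRIDE
}
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def is_word_char(ch: str) -> bool:
    # Approximation only: the real ALetter and Numeric properties differ.
    return ch.isalpha() or ch.isdecimal()


def proposed_no_break(before: str, after: str) -> bool:
    """Return True if the proposed rules would forbid a word break
    between the adjacent characters `before` and `after`."""
    # No break after a directional format character and before a word character.
    if before in DIRECTIONAL_FORMATS and is_word_char(after):
        return True
    # Assumed reading of the PDF rule: no break between a word character and PDF.
    if after == PDF and is_word_char(before):
        return True
    return False
```

Under this sketch, <SPACE, RLM, ALEF> has no forbidden break between SPACE and RLM, but the break between RLM and ALEF is forbidden, so the RLM stays attached to the word that follows it.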

Perhaps this discussion should be moved to the bidi list?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




