On Thu, Jul 02, 2015, Richard Wordingham  wrote:

> On Thu, 2 Jul 2015 10:37:17 +0200 (CEST)
> Marcel Schneider  wrote:
> 
> > (because it is
> > sufficient to simply type the words one after each other without
> > anything between, to get them as *one* word)
> 
> This only applies where it is traditional to separate words, a habit
> the Romans got out of and the Irish revived.

IMHO the case is a bit different in handwritten or engraved text vs word 
processing.

> Unicode Word Boundary Rule WB4 (in UAX #29 'Unicode Text
> Segmentation') decrees that U+2060 and U+FEFF be ignored in
> word-boundary determination except that newline breaks before them and
> that inserting them between CR and LF creates an extra word
> boundary.

Looking at the set of existing format characters (Cf), ZWSP, ZWNBSP and WJ 
stand apart from the rest of the group in that they are used to detect word 
boundaries in cases like whole-word search and spell checking. (They indicate 
word boundaries.) This is why, in practice, they are remapped to another 
category, a tailoring expressly allowed by UAX #29. So in fact, rule WB4 
scarcely ever (say, *never*) applies to them. One can verify this by following 
the hints given at the very beginning of UAX #29.
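
To make the difference concrete, here is a minimal sketch in Python. It is a toy segmenter, not the UAX #29 algorithm: the names split_words and boundary_chars are invented for this illustration, and "word" simply means a run of alphanumeric characters. By default it skips Cf characters, roughly as WB4 would; characters passed in boundary_chars are instead treated as word boundaries, which is the kind of remapping I mean.

```python
import unicodedata

# The three characters discussed above; all have General_Category Cf.
ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"

def split_words(text, boundary_chars=frozenset()):
    """Toy word splitter (hypothetical helper, not UAX #29 itself).

    Runs of alphanumeric characters form words. Cf characters are
    skipped, loosely mimicking WB4, unless they are listed in
    boundary_chars, in which case they end the current word.
    """
    words, current = [], []
    for ch in text:
        if ch in boundary_chars:
            is_word_char = False      # tailored: acts as a boundary
        elif unicodedata.category(ch) == "Cf":
            continue                  # default: ignored entirely
        else:
            is_word_char = ch.isalnum()
        if is_word_char:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Default behaviour: WJ is invisible to the splitter.
print(split_words("foo" + WJ + "bar"))
# Tailored behaviour: WJ separates two words.
print(split_words("foo" + WJ + "bar", boundary_chars={WJ}))
```

Which of the two behaviours a given application implements is exactly the point under discussion; the sketch only shows that both are easy to express.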

I believe that the UAXes, like the Standard as a whole, are not here to decree, 
as Richard puts it, but to promote knowledge and to share a number of useful 
rules, given in accordance with practice and real needs. Perhaps some sentences 
could be rewritten for clarity, so that they match reality even more closely.

Perhaps, too, we should reconsider what we are talking about when using the 
expression “word boundary”. It is a bit ambiguous because UIs are designed to 
meet different needs, and because in English, the apostrophe is often part of 
the word it occurs in. If I'm right, U+2019 or U+02BC in _month’s_ is 
expected to indicate a word boundary, and a search for the whole word _month_ 
will succeed, while _won’t_ in the UAX #29 example is *one* word, and 
searching for a supposed word _won_ makes no sense (and will fail). However, 
both are selected as a whole by Shift+Ctrl+LEFT/RIGHT ARROW.
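
As an illustration only, here is a small Python sketch of a whole-word search built on the standard re module. It does not implement UAX #29; the function name whole_word_found is invented, and encoding _month’s_ with U+2019 and _won’t_ with U+02BC is an assumption made for the example. Python's \w matches U+02BC (a modifier letter) but not U+2019 (punctuation), so the outcome depends on which character encodes the apostrophe.

```python
import re

def whole_word_found(text, word):
    """Return True if `word` occurs in `text` with no word character
    (as defined by Python's \\w) directly before or after it.
    Hypothetical helper for illustration, not a UAX #29 search."""
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text) is not None

# Assumption: possessive written with U+2019 RIGHT SINGLE QUOTATION MARK.
print(whole_word_found("last month\u2019s thread", "month"))  # True

# Assumption: contraction written with U+02BC MODIFIER LETTER APOSTROPHE.
print(whole_word_found("I won\u02bct do it", "won"))          # False
```

With U+2019 in _won’t_ the same search would find _won_, which is where a naive search and the UAX #29 default rules part ways.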



[For the archive: Please refer to last month’s thread _A new take on the 
English apostrophe in Unicode_. On the difference between quick cursor movement 
and double-click selection vs "whole word" search, please refer to my previous 
e-mails.]

Definitely, word boundaries are found with a whole word search (see UAX #29, 
again).


Marcel
