On Mon, Jun 30, 2015, Richard Wordingham  wrote:

> On Sat, 27 Jun 2015 17:48:41 +0200 (CEST)
> Marcel Schneider  wrote:
> 
> > On Fri, Jun 26, Richard Wordingham wrote:
> > > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote:
> 
> >>> Still in French, the letter apostrophe, when used as current
> >>> apostrophe, prevents the following word from being identified as a
> >>> word because of the missing word boundary and, subsequently,
> >>> prevents the autoexpand from working. This can be fixed by adding
> >>> a word joiner after the apostrophe, thanks to an autocorrect entry
> >>> that replaces U+02BC inserted by default in typographic mode, with
> >>> U+02BC U+2060.
> 
> >> No, this doesn't work. While the primary purpose of U+2060 is to
> >> prevent line breaks, it is also used to overrule word boundary
> >> detectors in scriptio continua. (It works quite well for
> >> spell-checking Thai in LibreOffice). Its name implies to me that it
> >> is intended to prevent a word boundary being deduced, through the
> >> strong correlation between word boundaries and line break opportunities.
> >> There doesn't seem to be a code for 'zero-width word boundary at
> >> which lines should not normally be broken'.
> 
> > Well, I extrapolated from U+FEFF, which works fine for me, even in
> > this particular context.
> 
> Does the tool misinterpret U+FEFF between Thai characters as a word
> boundary? Incidentally, which tool are you talking of?

I tested on Microsoft Word 2010 Starter running on Windows 7 Starter, on a 
netbook. Since this software is based on the full versions, its interpretation 
of U+FEFF must be the standard behavior. I tested in Latin script. You may wish 
to redo the tests: open a new document, type two words, replace the space 
between them with whatever character whose word-boundary behavior is to be 
checked, and search for one of the two words with the 'whole word' option 
enabled. If there is no match, the test character indicates the absence of a 
word boundary; if there is a match, it indicates the presence of one.
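
For anyone who would rather script such a check, here is a minimal sketch using 
Python's standard re module. Its word-boundary rules are not the same as Word's 
'whole word' search (nor as UAX #29 segmentation), so treat it only as an 
illustration of the method, not as a statement about Word's behavior:

    import re

    def has_word_boundary(separator: str) -> bool:
        # Join two words with the candidate separator and run a
        # whole-word search for the first one, mimicking the manual
        # test described above.
        text = "foo" + separator + "bar"
        return re.search(r"\bfoo\b", text) is not None

    for name, ch in [("SPACE", "\u0020"),
                     ("WORD JOINER", "\u2060"),
                     ("ZERO WIDTH NO-BREAK SPACE", "\ufeff"),
                     ("MODIFIER LETTER APOSTROPHE", "\u02bc")]:
        print(name, has_word_boundary(ch))

In this particular engine both U+2060 and U+FEFF yield a boundary while U+02BC 
does not; other implementations may well differ, which is the point of testing 
each tool individually.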

> >> No, this doesn't work.

Right. The letter apostrophe cannot trigger an autocorrect entry for itself. I 
must keep U+0027 as the character that is actually typed, and have it replaced 
with U+02BC U+FEFF so that autocorrect/autoexpand keeps working for what 
follows, or better still, with U+FEFF U+02BC U+FEFF to clarify the word 
boundaries. Where there is no autoexpand, weʼll input the apostrophe as U+0027 
and the single quotes as U+2018 and U+2019, then replace all U+0027 with 
U+02BC. In the Windows Notepad this works, presumably because the close quote 
is not in an equivalence class with the straight apostrophe, so the replacement 
turns the U+0027s into U+02BC and leaves the U+2019s alone.
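
To make the second half concrete, here is a small sketch of that batch 
replacement (the sample string is made up for illustration): every straight 
apostrophe U+0027 becomes U+02BC, while the typographic quotes U+2018/U+2019, 
being distinct code points, are left untouched, which is what the Notepad 
replacement relies on:

    def replace_apostrophes(text: str) -> str:
        # U+0027 -> U+02BC; U+2018 and U+2019 stay as they are.
        return text.replace("\u0027", "\u02bc")

    sample = "l'heure \u2018exacte\u2019 n'importe pas"
    print(replace_apostrophes(sample))
    # lʼheure ‘exacte’ nʼimporte pas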


Given the instability of U+FEFF, but also of U+00A0, as I wrote to Peter 
Constable a few moments ago, it seems as if we were unfortunately reaching the 
limits of text encoding. The purpose of the encoding design was, if Iʼm well 
informed, to get readable text files and to allow users to mark them up for 
local printing or PDF conversion. Other usages must have been left out of 
scope, because today you cannot exchange and process plain text files as one 
might wish. As soon as you must use plain text as a raw material for 
publishing, convert British English quotation marks to US English quotation 
marks, do searches that include single quotes, input text (especially with 
leading apostrophes) on keyboards with legacy drivers, and perhaps a few more 
things, there seems to be no solution other than to use workarounds, 
hand-process, and look up and correct or convert the instances one by one.
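
As an example of why the quotation-mark conversion resists blind automation, 
here is a naive sketch in Python (the regular expression is my own assumption, 
not an established rule): it turns every U+2018 ... U+2019 pair into U+201C ... 
U+201D, and it goes wrong as soon as a U+2019 inside the quoted span is really 
an apostrophe, which is exactly where hand-checking comes in:

    import re

    def single_to_double_quotes(text: str) -> str:
        # Rewrites ‘...’ pairs as “...”; an apostrophe encoded as
        # U+2019 inside the span would end the match too early.
        return re.sub(r"\u2018([^\u2018\u2019]*)\u2019",
                      "\u201c\\1\u201d", text)

    print(single_to_double_quotes("He said \u2018wait here\u2019 and left."))
    # He said “wait here” and left.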

The nice thing about this is that you become a craftsman again, that you get in 
touch with the text, and you may feel like a linotypist or a lead typesetter 
who takes care of every detail. As a result, the professions of proofreader, 
typesetter and typographer will not disappear (as was feared), and good 
craftsmanship will keep thriving.

Another side effect is that the need to hand-process text files lowers the 
appeal of copying other peopleʼs work. Itʼs even harder when copying text from 
a PDF file. Sometimes you get whole paragraphs in ready-to-use plain text 
(leaving aside the NBSPs), and sometimes (e.g. from TUS) it all comes in small 
pieces and you need to delete a lot of spurious line breaks, as well as 
case-transform the character identifiers because their uppercase look was just 
small-caps formatting. Finally you may prefer to provide links to the content, 
but unfortunately there seems to be no way to copy bookmarks, so you have to 
browse the contents yourself and are likely to learn much more along the way.

If all this was the goal, letʼs say it loud. Then this was a good idea. Very 
good.

Regards,
Marcel Schneider
