On 11/27/2015 5:42 PM, Martin J. Dürst wrote:
On 2015/11/28 04:55, Plug Gulp wrote:
The Unicode standard 8.0 states in chapter 23, section titled
"Cursive Connection and Ligatures" (printed page #814, PDF page #850)
that:
"The zero width joiner and non-joiner characters are designed for use
in plain text; they should not be used where higher-level ligation and
cursive control is available. (See Unicode Technical Report #20,
“Unicode in XML and Other Markup Languages,” for more information.)"
I went through TR#20 and did not find any mention that ZWJ and ZWNJ
are not suitable for use with markup languages. On the contrary, ZWJ
and ZWNJ are listed in TR#20 under section 4 titled "Format Characters
Suitable for Use with Markup".
So are ZWJ and ZWNJ characters suitable for use with markup languages
such as HTML and XML?
They are indeed suitable for use with markup languages. They are
so suitable that they are already provided as entities in RFC
2070, which is now historic, and from there on through HTML 4.0
and onwards. Please see
http://tools.ietf.org/html/rfc2070#section-4.2.
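To make the entity point concrete, here is a minimal sketch in Python.
It assumes the reader has Python 3 at hand and relies on the standard
library's html module, which resolves named references from the HTML5
table; that choice of tool is mine and is not something RFC 2070
prescribes. It only shows that the named references and the numeric
ones decode to the same two code points.

```python
import html

# Named references for the two joiner controls, as written in HTML source.
ZWJ = html.unescape("&zwj;")    # U+200D ZERO WIDTH JOINER
ZWNJ = html.unescape("&zwnj;")  # U+200C ZERO WIDTH NON-JOINER

# Numeric character references decode to the same characters.
same_zwj = html.unescape("&#x200D;") == ZWJ
same_zwnj = html.unescape("&#x200C;") == ZWNJ

print(f"U+{ord(ZWJ):04X} U+{ord(ZWNJ):04X}", same_zwj, same_zwnj)
```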
I'm not sure why Unicode 8.0 has the text it has; at the least,
this should be toned down somewhat to say "they may be replaced by
higher-level ligation and cursive control mechanisms if
available".
Thanks for finding this!
The main reason for this is that these characters apply at a
single point; creating markup such as <zwj/> and
<zwnj/> would not give any advantages over
&zwj;/&zwnj;.
Markup is at its best when it can be applied to nested spans of
text. It is not inconceivable that something like
<do_not_ligate_inside>...
</do_not_ligate_inside> could occasionally be useful, but I
have difficulty imagining a use case off the top of my head.
I'll file a bug report with the content of this email.
Good.
The whole thing started with the mistaken idea that ligation was
independent of orthography.
It is not, though the examples are going away (as modern usage is
being adapted to make it "easier" on software).
When you typeset Fraktur (which, according to Unicode, is merely a
different style of rendering Latin text), the rules forbid ligating
"st" across a word boundary inside a compound noun.
However, there are many examples of compound words that have the
same letters, but different internal division. So, you need to know
where the intended division is and then mark it with zwnj.
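As a rough illustration of marking the division, here is a small
Python sketch. The example word "Wachstube" (Wach+stube, guard room,
versus Wachs+tube, wax tube) and the helper name mark_boundary are my
own choices for illustration; the split position has to come from
whoever knows the intended reading, since the letters alone cannot
tell you.

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER


def mark_boundary(word: str, split_at: int) -> str:
    """Insert ZWNJ at index split_at (hypothetical helper).

    split_at is the compound boundary chosen by the author; it is
    not derived from the spelling.
    """
    return word[:split_at] + ZWNJ + word[split_at:]


# Wachs+tube: "s" and "t" belong to different components, so the
# boundary between them is marked to block the ligature.
wax_tube = mark_boundary("Wachstube", 5)

# Wach+stube: "st" sits inside one component, so it is left unmarked
# and a font is free to ligate it.
guard_room = "Wachstube"

# The two strings differ only by the invisible ZWNJ.
print(len(guard_room), len(wax_tube), wax_tube.replace(ZWNJ, "") == guard_room)
```

The point of the sketch is only that the distinction lives in the text
itself, which is why a plain-text character rather than a span of
markup does the job here.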
That was definitely not appreciated from the outset, because in
English ligation does not intersect with orthography.
That's my take on the root cause of all this strong language on the
use of zwj / zwnj.
A,/