Re: Yerushala(y)im - or Biblical Hebrew

Kenneth Whistler Fri, 25 Jul 2003 18:19:50 -0700

Ted continued:

> If I recall correctly, the suggestion for using CGJ for yerushala(y)im was
> to encode it as: <...lamed, patah, cgj, hiriq, final mem>. Also, I seem to
> recall that this gave some people heartburn because CGJ was not intended to
> join two combining characters. What if this case were encoded as: <...lamed,
> patah, cgj, zwnbs, hiriq, final mem>? (Please forgive me if this is what had
> been proposed all along.)
> 
> As I understand it from reading the description of CGJ (and ignoring for the
> moment that zwnbs has no visible glyph and is general category Cf), this is
> exactly what CGJ was designed for: treat the two base characters on either
> side of the CGJ as a single grapheme for the purpose of placing combining
> characters. This approach uses zero width no-break space to represent the
> "missing letter" interpretation of the two vowels pointed out by Jony
> Rosenne. Normalization wouldn't destroy the ordering of the vowels, and
> Hebrew-aware software could be written to do all this more-or-less
> transparently and automatically.


Hmm. Some further clarifications are in order, since the documentation
for both of these characters has not quite caught up to the UTC
decisions regarding them. A lot of work went into the Unicode 4.0
documentation on these, and the Unicode 4.0 chapters will be posted
online very soon -- at which point it would be helpful if everyone
concerned about this issue took the time to read the latest on
these characters in particular.

First, about ZWNBS (U+FEFF). Because of the confusing overlap of
functionality of U+FEFF as the BOM (byte order mark) in the
Unicode encoding schemes and as what its name, ZERO WIDTH NO-BREAK
SPACE implies, the UTC (as of Unicode 3.2) standardized a separate
character, U+2060 WORD JOINER. That character is described
in UAX #14, Line Breaking Properties:
http://www.unicode.org/reports/tr14/
U+2060 is "the preferred choice for an invisible character to keep
other characters together that would otherwise be split across
the line at a direct break." U+FEFF retains that semantic, for
backwards compatibility, but its preferred use is as the byte
order mark only.

So whether or not a line break format control character is
relevant to the Biblical Hebrew vowel problem (and I don't think
it is, actually), one should be talking about use of U+2060 WORD
JOINER (WJ), rather than U+FEFF ZWNBS in any such new context.
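
For anyone who wants to check the two characters side by side, here is
a minimal sketch using Python's standard unicodedata module. It assumes
a Python build whose Unicode data is 3.2 or later, so that U+2060 is
known; the describe() helper is just an illustrative name, not part of
any library.

```python
import codecs
import unicodedata

ZWNBS = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE, preferred use: BOM only
WJ = "\u2060"     # WORD JOINER, preferred for keeping characters together

def describe(ch):
    """Return (code point, character name, General Category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in (ZWNBS, WJ):
    print(describe(ch))

# U+FEFF is the character whose encoded form serves as the byte order mark.
print(ZWNBS.encode("utf-16-be") == codecs.BOM_UTF16_BE)
print(ZWNBS.encode("utf-8") == codecs.BOM_UTF8)
```

Both characters report General Category Cf, which is part of why
neither one is a natural fit inside a combining character sequence.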

Second, there is U+034F COMBINING GRAPHEME JOINER (CGJ) itself.
The impetus for encoding the CGJ at all was to have a
plain text means of distinguishing, for example, an "ie"
sequence that weights as two units for collation and an "ie"
sequence that weights as a single unit for collation.

During the debate about such an addition, the entity was called
various things, but the moniker "GRAPHEME JOINER" caught on
in the committee and stuck. There was also debate about
an equal and opposite "GRAPHEME NON-JOINER", on the principle
that inserting a GNJ inside, e.g., a "ch" weighted as a unit,
so as to force it to be treated as two units, would be the more
normal requirement in collation. However, the committee did
not develop consensus that that was a required *character*,
in part because insertion of *any* delimiting character in that
context could be taken as having that effect or be tailored
in collation to weight as desired to distinguish it from
the digraphic unit, for example.

The "COMBINING" became part of the CGJ's name when it
became clear that the character should be given the
General Category Mn, making it a combining mark, rather
than General Category Cf to make it a format control.

During this debate, high hopes were also placed on the
COMBINING GRAPHEME JOINER as being the magic bullet for all kinds
of things: it could "glue together" a pair of accents so
that they would render side-by-side instead of using the
default accent placement rules. It could also "glue together"
sequences of characters into a "grapheme cluster", so that
the grapheme cluster would become the target of an
enclosing combining mark -- that would resolve the problem
of how to get an enclosing circle to circle an arbitrary
number, rather than just a single digit, for example.

In the end, however, the inconsistent and troubling 
implications of this attempt at getting the Unicode
Standard further involved in the monkey business of trying
to be a glyph description language, rather than a character
encoding, caused many second thoughts. And the UTC formally
backed away from all those silver bullet aspects of CGJ.
In Unicode 4.0, CGJ has been stripped of all interpretation
except as an invisible mark which can be used to tailor
collation (and searching), so as to distinguish digraphic units
from sequences of the same characters.

If you look at UAX #29, Text Boundaries, now, and in particular,
Section 3, Grapheme Cluster Boundaries, you will see that
CGJ has nothing to do with the definition of such boundaries.
While it has the Grapheme_Link property (as do all the
Indic viramas), Grapheme_Link is no longer even mentioned
in UAX #29, and Grapheme_Link is nowhere else used, not even
in a derived property.

So the shorthand interpretation of CGJ currently is "invisible
target for collation tailoring of neighboring characters into
a digraphic unit." Even calling it by its formal name,
COMBINING GRAPHEME JOINER, immediately conjures up the wrong
connotations, so it is better to just use the CGJ acronym and
not spell it out. Or think of CGJ as standing for "Collation kluGJe",
if you wish. ;-)

Now when you say:

> If I recall correctly, the suggestion for using CGJ for yerushala(y)im was
> to encode it as: <...lamed, patah, cgj, hiriq, final mem>. Also, I seem to
> recall that this gave some people heartburn because CGJ was not intended to
> join two combining characters.

If people are getting "heartburn" because CGJ is not intended
to join two combining characters, the problem they are having
is the result of a misunderstanding of the intent here.

It is *true* that the CGJ is no longer intended to "join two
combining characters", although people tried for a while to
see if it would work to "glue together two combining characters"
for different rendering.

But the point of the CGJ proposal with respect to Biblical Hebrew
is *not* to somehow sneak back around to interpreting the CGJ
as gluing two combining characters together. Instead, it
turns out that the CGJ, whose interpretation has been whittled
down to being almost nothing, has the appropriate set of
character *properties* to serve to block canonical reordering
of a combining character sequence. The important things are
that it is a) invisible, b) a combining mark, and c) has
combining class zero. To serve the purpose of blocking
the canonical ordering, it doesn't have to *do* anything but
just sit there with its properties as defined. It doesn't
"join" anything, and it doesn't have anything to do with
the "grapheme" status of the resulting sequence.
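
The blocking effect can be seen directly with a normalizer. The sketch
below uses Python's standard unicodedata module; the variable names are
illustrative only, and the sequences are just the tail end of the word
(lamed, the two vowels, final mem), not a claim about how any
particular text is or should be encoded.

```python
import unicodedata

LAMED = "\u05DC"      # HEBREW LETTER LAMED
PATAH = "\u05B7"      # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"      # HEBREW POINT HIRIQ, combining class 14
FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0

def names(s):
    """List the character names in s, in order."""
    return [unicodedata.name(c) for c in s]

plain = LAMED + PATAH + HIRIQ + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# Without CGJ, canonical ordering sorts the two vowels by combining
# class, so hiriq (14) moves in front of patah (17).
print(names(unicodedata.normalize("NFD", plain)))

# With CGJ in between, the class-zero mark separates the two vowels,
# so there is nothing to reorder and the sequence comes back unchanged.
print(names(unicodedata.normalize("NFD", with_cgj)))
print(unicodedata.normalize("NFC", with_cgj) == with_cgj)
```

Nothing here depends on the CGJ being interpreted in any way. The
normalizer only looks at combining classes.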

The only other Unicode characters with those properties are
the variation selectors, but those characters *do* have
cooccurrence constraints that prevent them from following
a combining mark (at least in a legally interpretable
way). That leaves the CGJ as the *only* Unicode character
which has the desired properties and which has no constraints
against occurrence in the middle of a combining character
sequence.
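
As a rough check on the property comparison, the following sketch
prints General Category and combining class for the CGJ, two of the
variation selectors, and the WORD JOINER, again using Python's standard
unicodedata module. The props() helper is an illustrative name. Note
that "invisible" is not something unicodedata can report, so that part
of the comparison has to be taken from the standard itself.

```python
import unicodedata

CANDIDATES = {
    "CGJ": "\u034F",   # COMBINING GRAPHEME JOINER
    "VS1": "\uFE00",   # VARIATION SELECTOR-1
    "FVS1": "\u180B",  # MONGOLIAN FREE VARIATION SELECTOR ONE
    "WJ": "\u2060",    # WORD JOINER, shown for contrast
}

def props(ch):
    """Return (General Category, canonical combining class) for ch."""
    return (unicodedata.category(ch), unicodedata.combining(ch))

for label, ch in CANDIDATES.items():
    category, ccc = props(ch)
    print(f"{label:5} U+{ord(ch):04X} gc={category} ccc={ccc}")
```

The CGJ and the variation selectors come out as combining marks (Mn)
with combining class zero, while the WORD JOINER is a format control
(Cf) rather than a combining mark.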

Another way of thinking of this is that in addition to CGJ
being the "Collation kluGJe", it can be interpreted as
the "Canonical Gradient Jigger", if we simply acknowledge
the fact that, given its current properties, if it occurs
in the relevant sequences of combining marks, it already
has the effect of jiggering the canonical gradients to
produce just the distinctions desired. ;-)

> Of course, zwnbs is not a base character. If using zwnbs is a problem
> (because it has no visible glyph and/or because it has category Cf), then
> perhaps what is needed is another character (perhaps a new one) that has no
> width or visible glyph but can be treated as a base character (category Lo).
> That may be needed anyway, since some of the boundary definitions have
> special rules for zwnbs.

There is no need for an invisible base character here. That
*would* be going further than is necessary to solve the
problem, and would create arguments about the actual content
of the text -- are we encoding an inherent consonant here or
not? Why go there, when the problem is simply to represent
the text as shown and then let commentators and phonologists
argue about whether the yod is "really" there or not?

> Ted
> 
> P.S. It's two p's but only one d.  :)

Sorry. Anticipatory doubling, I guess...

--Ken
