Thomas Widmann continued:

> [EMAIL PROTECTED] writes:
> 
> > > >Yes, I think you're right that an annotation is best -- but only
> > > >if EMPTY SET is indeed the right character.  I'm increasingly of
> > > >the opinion that a different character might be needed.
> > > 
> > > I would disagree.
> > 
> > As would I.
> 
> Oh dear, if you both disagree with me, my chances of getting through
> with this look slim indeed... :-)

O.k., I've finally read the thread, and it's time for another
linguist to chime in.

I absolutely concur with Peter, Michael, and Lukas that U+2205 EMPTY SET
is the correct and intended character for representing null
morphemes and other linguistic "zeroes" in technical
linguistic notation.

In response to an earlier comment in the thread, I also agree that
the annotation in the names list for U+2205 should be updated
(for a future version -- it's too late for Unicode 4.0) to
indicate this explicitly, so that we won't have to revisit this
issue a few years down the road.

> 
> But I'm wondering why.
> 
> I think we all agree on the following:
> 
> - Ø [LATIN CAPITAL LETTER O WITH STROKE] and ø [LATIN SMALL LETTER O
>   WITH STROKE] are both ruled out as their semantics is totally wrong.
> 
> - 0 [DIGIT ZERO] is also ruled out because it looks wrong in most
>   fonts (and one might argue that the semantics isn't exactly right,
>   either).
> 
> - ∅ [EMPTY SET] is the best choice if a single character has to be
>   chosen from the current Unicode repertoire.

All correct.

> 
> - But while ∅ [EMPTY SET] is normally just as wide as it is tall (it's
>   really just a circle with a stroke), the null symbol as used in
>   linguistics frequently looks more like 0 [DIGIT ZERO] with an added
>   stroke.  (But many variations exist, including ∅ [EMPTY SET], ø
>   [LATIN SMALL LETTER O WITH STROKE] and other symbols, most of which
>   can be explained by typesetters and word-processing programs that
>   didn't know what they were doing.)

Yes. And Pullum's discussion of this explicitly calls out the
problem with confusable glyphs and notes that it has been
a persistent typesetting problem in linguistics:

  "Mentioning the null sign here allows us to stress that it is
   distinct from all four of the following visually rather similar
   characters: Phi [ U+0278 ], Barred O [ U+0275 ], Slashed O
   [ U+00F8 ], and Theta [ U+03D1, but showing the straight bar
   glyph variant ]. Typesetting errors in connection with these
   symbols are unfortunately fairly common."
   
Pullum's representative glyph for the "null sign" is, as
Thomas notes, of narrow aspect: essentially the slashed-zero
glyph often seen in typeset linguistic works. The alternative
glyph, cited in Dinnsen (1974), is the Symbol font glyph (0xC6),
with the large round circle and a solidus overlay at a 45-degree
angle. Which of these shows up in any given piece of linguistic
typography is often a matter of what the compositor had
available.

Speaking in linguistic terms, what we have here are two
graphemes, with an etic overlap in the actual glyphs used
for display:

U+0030 DIGIT ZERO
    common glyphs: zero without slash, zero with slash, zero with dot
       (where the added slash or dot is usually an ad hoc
        device to minimize confusion with the letter O)
        
U+2205 EMPTY SET
    common glyphs: circle with 45 degree slash (PostScript symbol font),
                   zero with slash

So if you just approach the problem graphically, you get an
overlap, and there are glyphs which cannot be distinguished.
But the *range* of acceptable glyphs for the two *characters*
is distinct. A "zero with dot" glyph would never be appropriate
for U+2205, for example.
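
For concreteness, here is a small Python sketch (mine, not anything
from the thread) that asks the Unicode Character Database about the
confusable characters discussed above. Whatever the fonts do, the
encoded characters remain distinct, with distinct names, categories,
and behavior:

  import unicodedata

  # DIGIT ZERO, EMPTY SET, and the look-alikes Pullum warns about.
  # Distinct characters, even where fonts render them near-identically.
  for cp in (0x0030, 0x2205, 0x00F8, 0x0275, 0x0278, 0x03D1):
      ch = chr(cp)
      print(f"U+{cp:04X} {unicodedata.name(ch)} "
            f"(category {unicodedata.category(ch)})")

  # Output:
  #   U+0030 DIGIT ZERO (category Nd)
  #   U+2205 EMPTY SET (category Sm)
  #   U+00F8 LATIN SMALL LETTER O WITH STROKE (category Ll)
  #   U+0275 LATIN SMALL LETTER BARRED O (category Ll)
  #   U+0278 LATIN SMALL LETTER PHI (category Ll)
  #   U+03D1 GREEK THETA SYMBOL (category Ll)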

As Peter pointed out, linguists have also grown used to the
narrow glyph for their linguistic zero as the result of
many years of typewriter and/or daisywheel printer practice
of typesetting this symbol as <0, BACKSPACE, />, when nothing
better was available.
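
Incidentally, the old overstrike trick has a rough plain-text
analogue today: DIGIT ZERO followed by a combining overlay. A small
sketch (again Python, again mine) just to show the mechanics; this
is a rendering fallback, not a recommended representation:

  import unicodedata

  # DIGIT ZERO + U+0338 COMBINING LONG SOLIDUS OVERLAY is the modern
  # counterpart of the typewriter's <0, BACKSPACE, />. How it renders
  # is entirely font-dependent.
  slashed_zero = "0\u0338"
  empty_set = "\u2205"
  print(slashed_zero, empty_set)

  # The two-code-point sequence is *not* canonically equivalent to
  # U+2205 (EMPTY SET has no decomposition), so the two spellings
  # stay distinct under normalization, searching, and matching.
  assert unicodedata.normalize("NFC", slashed_zero) != empty_set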
                  
> 
> - Furthermore, semantically an empty set is not really the same thing
>   as a null symbol.  (They both represent 'nothing', but so does 0
>   [DIGIT ZERO] and possibly other Unicode characters as well.)

True enough. Linguists (and logicians and mathematicians) are
very adept at discovering and representing many different kinds
of 'nothing'. For linguists, in particular, the 'nothing's of
interest are usually significant positions in structural
patterns whose surface manifestation is no sound (or no
written form). The significance is in the systematic contrast
with a 'something'.

However, for *character encoding* it is inappropriate to start
trying to establish a distinct encoded character for each
possible semantic distinction that could be associated with
a concept of zero or nothing. The appropriate approach is to
examine the written forms and typographical conventions and ask
which distinctions they actually maintain. And the net result,
I believe, is that there are two *characters*, with a somewhat
confusing overlap in their glyphic representation. (See above.)

The fact that the EMPTY SET symbol gets used in many different
ways in different disciplines, including linguistics, no
more requires the encoding of additional characters than does the
fact that U+0023 NUMBER SIGN (#) is also used conventionally
in linguistics as a symbol having nothing to do with numbers --
it indicates boundaries in phonology and morphology, instead.
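
To make the boundary usage concrete, here is an illustrative (and
entirely hypothetical) deletion rule as a linguist might encode it
in plain text, with U+2205 as the linguistic zero and U+0023 as the
word boundary:

  # Hypothetical rule: "t becomes zero before a word boundary".
  # U+2205 EMPTY SET is the linguistic zero; U+0023 NUMBER SIGN
  # marks the boundary. Neither use has anything to do with numbers.
  rule = "t \u2192 \u2205 / _\u0023"   # displays as: t → ∅ / _#
  print(rule)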

In the case of the "linguistic zero", the discussion is
further muddled by the terminology per se. Phonologists
and morphologists often talk about "zeroes" in their
analyses -- fully aware that these "zeroes" have nothing
to do with numeric values. And when their work then gets
typeset with "slashed-zero" glyphs -- possibly even by their
explicit preference -- the situation can get even more
confused. But this would not be helped, for linguists or
anyone else, by trying to introduce yet another character
for NULL SYMBOL, whose only glyph would be the "slashed-zero"
glyph. That would just make the visual overlap problem worse
without helping at all in preserving the text distinctions
required.

> If you agree with all of the above, I'm wondering what the argument is
> against a new Unicode character, called NULL or NULL SYMBOL.

Just provided.

> Surely
> if it looks different from any existing character and has a
> well-defined meaning also not covered, there must be a good case for
> adding it...?

Nope.

--Ken (as his linguist avatar)

