Thomas Widmann continued:

> [EMAIL PROTECTED] writes:
>
> > > > Yes, I think you're right that an annotation is best -- but only
> > > > if EMPTY SET is indeed the right character. I'm increasingly of
> > > > the opinion that a different character might be needed.
> > >
> > > I would disagree.
> >
> > As would I.
>
> Oh dear, if you both disagree with me, my chances of getting through
> with this look slim indeed... :-)
O.k., I've finally read the thread, and it's time for another linguist
to chime in.

I absolutely concur with Peter, Michael, and Lukas that U+2205 EMPTY SET
is the correct and intended character for representing null morphemes and
other linguistic "zeroes" in technical linguistic notation. In response
to an earlier comment in the thread, I also agree that the annotation in
the names list for U+2205 should be updated (for a future version -- it's
too late for Unicode 4.0) to indicate this explicitly, so that we won't
have to revisit this issue a few years down the road.

> But I'm wondering why.
>
> I think we all agree on the following:
>
> - Ø [LATIN CAPITAL LETTER O WITH STROKE] and ø [LATIN SMALL LETTER O
> WITH STROKE] are both ruled out, as their semantics is totally wrong.
>
> - 0 [DIGIT ZERO] is also ruled out because it looks wrong in most
> fonts (and one might argue that the semantics isn't exactly right,
> either).
>
> - ∅ [EMPTY SET] is the best choice if a single character has to be
> chosen from the current Unicode repertoire.

All correct.

> - But while ∅ [EMPTY SET] is normally just as wide as it is tall (it's
> really just a circle with a stroke), the null symbol as used in
> linguistics frequently looks more like 0 [DIGIT ZERO] with an added
> stroke. (But many variations exist, including ∅ [EMPTY SET], ø
> [LATIN SMALL LETTER O WITH STROKE] and other symbols, most of which
> can be explained by typesetters and word-processing programs that
> didn't know what they were doing.)

Yes. And Pullum's discussion of this explicitly calls out the problem of
confusable glyphs and notes that it has been a persistent typesetting
problem in linguistics:

  "Mentioning the null sign here allows us to stress that it is distinct
  from all four of the following visually rather similar characters:
  Phi [U+0278], Barred O [U+0275], Slashed O [U+00F8], and Theta
  [U+03D1, but showing the straight-bar glyph variant]. Typesetting
  errors in connection with these symbols are unfortunately fairly
  common."

Pullum's representative glyph for the "null sign" is, as Thomas notes, of
narrow aspect: it is essentially the slashed-zero glyph often seen in
typeset linguistic work. The alternative glyph, cited in Dinnsen (1974),
is the Symbol font glyph (0xC6), with the large round circle and a
solidus overlay at a 45-degree angle. Which of the two shows up in any
given piece of linguistic typesetting is often simply a matter of what
the compositor had available.

Speaking in linguistic terms, what we have here is two graphemes with an
etic overlap in the actual glyphs used to display them:

  U+0030 DIGIT ZERO
    common glyphs: zero without slash, zero with slash, zero with dot
    (the added slash or dot is usually an ad hoc device to minimize
    confusion with the letter O)

  U+2205 EMPTY SET
    common glyphs: circle with 45-degree slash (PostScript Symbol font),
    zero with slash

So if you approach the problem purely graphically, you get an overlap,
and there are glyphs which cannot be distinguished. But the *range* of
acceptable glyphs for the two *characters* is distinct. A "zero with dot"
glyph would never be appropriate for U+2205, for example.

As Peter pointed out, linguists have also grown used to the narrow glyph
for their linguistic zero as a result of many years of typewriter and/or
daisywheel printer practice of rendering this symbol as
<0, BACKSPACE, /> when nothing better was available.
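To make the character-vs-glyph point concrete, here is a minimal Python
sketch (just an illustration; the "sheep-∅" gloss is an invented example,
not taken from any of the sources above) of the two code points and of
the old typewriter composition:

    import unicodedata

    digit_zero = "\u0030"   # DIGIT ZERO
    empty_set = "\u2205"    # EMPTY SET -- the recommended "linguistic zero"

    # The same slashed-zero glyph may render for both characters in some
    # fonts, but the encoded characters (and their acceptable glyph
    # ranges) remain distinct.
    for ch in (digit_zero, empty_set):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  {ch}")

    # A null plural morpheme marked with EMPTY SET (invented example):
    print("sheep-\u2205")

    # The old typewriter/daisywheel workaround: zero, backspace, slash.
    print(repr("0\b/"))

Any Python 3 will run this; the point is only that one on-screen glyph
can correspond to either code point, while the backspace trick encodes
neither character at all.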
> - Furthermore, semantically an empty set is not really the same thing
> as a null symbol. (They both represent 'nothing', but so does 0
> [DIGIT ZERO] and possibly other Unicode characters as well.)

True enough. Linguists (and logicians and mathematicians) are very adept
at discovering and representing many different kinds of 'nothing'. For
linguists in particular, the 'nothing's of interest are usually
significant positions in structural patterns whose surface manifestation
is no sound (or no written form). The significance lies in the systematic
contrast with a 'something'.

However, for *character encoding* it is inappropriate to start trying to
establish a distinct encoded character for each possible semantic
distinction that could be associated with a concept of zero or nothing.
The appropriate approach is to examine the written forms and
typographical conventions and ask which distinctions they actually make.
And the net result, I believe, is the conclusion that there are two
*characters*, with a somewhat confusing overlap in their glyphic
representation. (See above.)

The fact that the EMPTY SET symbol gets used in many different ways in
different disciplines, including linguistics, no more requires the
encoding of additional characters than does the fact that U+0023 NUMBER
SIGN (#) is also used conventionally in linguistics as a symbol having
nothing to do with numbers -- it indicates boundaries in phonology and
morphology instead.

In the case of the "linguistic zero", the discussion is further muddled
by the terminology per se. Phonologists and morphologists often talk
about "zeroes" in their analyses -- fully aware that these "zeroes" have
nothing to do with numeric values. And when their work then gets typeset
with "slashed-zero" glyphs -- possibly even by their explicit
preference -- the situation gets even more confused. But none of this
would be helped, for linguists or anyone else, by introducing yet another
character, NULL SYMBOL, whose only glyph would be the "slashed-zero"
glyph. That would just make the visual overlap problem worse without
helping at all to preserve the text distinctions required.

> If you agree with all of the above, I'm wondering what the argument is
> against a new Unicode character, called NULL or NULL SYMBOL.

Just provided.

> Surely if it looks different from any existing character and has a
> well-defined meaning also not covered, there must be a good case for
> adding it...?

Nope.

--Ken (as his linguist avatar)