Re: New Charakter Proposal
William, note the smiley. Ken's suggestion was a tongue in the hollow skull's cheek.

Yes, a 2-character sequence is less likely to occur, but it is still a possibility, so your proposal doesn't actually fix the problem. The usual workaround is for a convention that uses characters with special semantics (i.e. metacharacters) to have an escape mechanism to indicate when a metacharacter is not to be treated as such. So perhaps two skull-and-crossbones in a row would be used to nullify the special meaning of one such character and together represent a single printable character.

Of course, the consortium could assign another character for the special purpose, but there are so many special purposes that would then require character assignments that it would become difficult for an application to take them all into account. It is better to let higher-level protocols take over where such abilities are needed or desired.

As for the influence of posting a suggestion for character usage, I think you have made your point now; perhaps we don't need to keep restating it. Others have suggested this list is not a good place to post and suggest conventions for individual or non-standard use, since this is a list for standardization and subscribes to a standardization process. The Charman list was created for the alternative process. However, that suggestion doesn't seem to have had any influence... ;-)

tex

William Overington wrote:
>
> Kenneth Whistler wrote the following.
>
> >I think Markus's suggestion is correct. If you want to do
> >something like this internally to a process, use a noncharacter
> >code point for it. If you want to have visible display of this
> >kind of error handling for conversion, then simply declare a
> >convention for the use of an already existing character.
> >My suggestion would be: U+2620. ;-) Then get people to share
> >your convention.
>
> I find this suggestion curious, particularly coming as it does from an
> officer of the Unicode Corporation.
> The U2600.pdf file has U+2620 under Warning signs and has = poison in its
> description.
>
> Suppose for example that the source document encoded in UTF-8 is a document
> about chemicals found around the house and that the U+2620 character is used
> to indicate those which are poisonous. If U+2620 is also used to include in
> visible form an indication of an error found during decoding, then finding a
> U+2620 character in the decoded document would lead to an ambiguous
> situation.
>
> One solution would be for the Unicode Consortium to encode an otherwise
> unused character especially for the purpose.
>
> If, however, the way forward is for an individual to declare a convention,
> then I suggest that a sequence of at least two characters, the first being a
> base character and the one or more others being combining items, be used so
> as to produce an otherwise highly unlikely sequence of characters.
>
> For example, the character U+0304 COMBINING MACRON could be a good choice,
> as it could be used to indicate a Boolean "not" condition with a character
> which is otherwise unlikely to carry an accent.
>
> As to which character to use for the base character, I am undecided;
> however, it should, in my opinion, not be U+2620, as that is a warning sign
> meaning poison and could lead to confusion if looking at a document.
>
> The advantage of a two-character sequence is that a special piece of
> software may be used to parse all incoming documents. Only occurrences of
> the otherwise highly unlikely sequence will be regarded as indicating a
> conversion problem with the encoding. If either of the two characters used
> for the sequence is encountered other than with the rest of the sequence,
> then it will not indicate the special effect.
>
> In my comet circumflex system I use a three-character detection sequence.
> This means that in order to enter the markup universe all three characters
> of the sequence need to be present in sequence. Thus, a piece of software
> can scan all incoming text messages, even those which are not designed to
> fit in with the comet circumflex system, and not indicate a comet circumflex
> message if, say, a U+2604 COMET character arrives as part of a message.
>
> Using a two- or three-character sequence which is otherwise highly unlikely
> to occur is, in my opinion, a good way to indicate the presence of a special
> feature, as it allows one to monitor all text files for the special feature
> without causing undesired responses on text files which have been prepared
> without any regard to the special feature.
>
> I feel that the influence of posting a suggestion in this mailing list is
> often greatly underestimated. If you do post a suggested two or three
> character sequence for the purpose that you seek, perhaps, if you wish,
> after further discussion in this group, my feeling is that that sequence may
> well become well known and accepted for the purpose very quickly, simply
> because where there is a need for such a sequence then, in the absence of
> any good reason not to do so, people will often happily use the suggested
> format.
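The escape mechanism described above (two metacharacters in a row standing for one literal character) can be sketched in a few lines of Python. The choice of U+2620 as the metacharacter and of U+FFFD as the visible error mark are illustrative assumptions for this sketch, not part of any actual convention:

```python
SKULL = "\u2620"  # U+2620 SKULL AND CROSSBONES, the hypothetical metacharacter

def escape(text: str) -> str:
    """Double every literal metacharacter before error marks are inserted."""
    return text.replace(SKULL, SKULL * 2)

def unescape(text: str) -> str:
    """Collapse doubled metacharacters back to one literal character.

    Any remaining single occurrence is a genuine error marker and is
    surfaced here as U+FFFD REPLACEMENT CHARACTER.
    """
    out = []
    i = 0
    while i < len(text):
        if text[i] == SKULL:
            if i + 1 < len(text) and text[i + 1] == SKULL:
                out.append(SKULL)      # escaped: two in a row mean one literal
                i += 2
            else:
                out.append("\ufffd")   # unescaped: an error marker
                i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

A document about household poisons can then contain literal U+2620 characters without ambiguity, at the cost of the round-trip escaping step.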
Re: New Charakter Proposal
Kenneth Whistler wrote the following.

>I think Markus's suggestion is correct. If you want to do
>something like this internally to a process, use a noncharacter
>code point for it. If you want to have visible display of this
>kind of error handling for conversion, then simply declare a
>convention for the use of an already existing character.
>My suggestion would be: U+2620. ;-) Then get people to share
>your convention.

I find this suggestion curious, particularly coming as it does from an officer of the Unicode Corporation.

The U2600.pdf file has U+2620 under Warning signs and has = poison in its description.

Suppose, for example, that the source document encoded in UTF-8 is a document about chemicals found around the house and that the U+2620 character is used to indicate those which are poisonous. If U+2620 is also used to include in visible form an indication of an error found during decoding, then finding a U+2620 character in the decoded document would lead to an ambiguous situation.

One solution would be for the Unicode Consortium to encode an otherwise unused character especially for the purpose.

If, however, the way forward is for an individual to declare a convention, then I suggest that a sequence of at least two characters be used, the first being a base character and the one or more others being combining items, so as to produce an otherwise highly unlikely sequence of characters.

For example, the character U+0304 COMBINING MACRON could be a good choice, as it could be used to indicate a Boolean "not" condition with a character which is otherwise unlikely to carry an accent.

As to which character to use for the base character, I am undecided; however, it should, in my opinion, not be U+2620, as that is a warning sign meaning poison and could lead to confusion when looking at a document.

The advantage of a two-character sequence is that a special piece of software may be used to parse all incoming documents. Only occurrences of the otherwise highly unlikely sequence will be regarded as indicating a conversion problem with the encoding. If either of the two characters used for the sequence is encountered other than with the rest of the sequence, then it will not indicate the special effect.

In my comet circumflex system I use a three-character detection sequence. This means that in order to enter the markup universe, all three characters of the sequence need to be present in sequence. Thus, a piece of software can scan all incoming text messages, even those which are not designed to fit in with the comet circumflex system, and not indicate a comet circumflex message if, say, a U+2604 COMET character arrives as part of a message.

Using a two- or three-character sequence which is otherwise highly unlikely to occur is, in my opinion, a good way to indicate the presence of a special feature, as it allows one to monitor all text files for the special feature without causing undesired responses on text files which have been prepared without any regard to the special feature.

I feel that the influence of posting a suggestion in this mailing list is often greatly underestimated. If you do post a suggested two- or three-character sequence for the purpose that you seek (perhaps, if you wish, after further discussion in this group), my feeling is that that sequence may well become well known and accepted for the purpose very quickly, simply because where there is a need for such a sequence then, in the absence of any good reason not to do so, people will often happily use the suggested format.

William Overington

1 November 2002
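The "highly unlikely sequence" idea above amounts to a scanner that reacts only when the complete marker appears, and never to its constituent characters alone. A minimal sketch; the particular three-character marker used here (COMET plus two combining marks) is purely illustrative, not the actual comet circumflex sequence:

```python
# Hypothetical detection marker: U+2604 COMET + U+0302 COMBINING
# CIRCUMFLEX ACCENT + U+0304 COMBINING MACRON.  Chosen only because
# this exact combination is vanishingly unlikely in ordinary text.
MARKER = "\u2604\u0302\u0304"

def find_markers(text: str) -> list[int]:
    """Return the start index of every complete marker occurrence.

    A lone COMET (or any proper subset of the marker) is ignored,
    so the scanner can safely be run over arbitrary incoming text.
    """
    hits = []
    start = 0
    while (i := text.find(MARKER, start)) != -1:
        hits.append(i)
        start = i + len(MARKER)
    return hits
```

A message that merely mentions a comet, such as an astronomy article containing U+2604 on its own, produces no hits, which is exactly the property argued for above.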
Re: Character identities
In Unicode, code point U+0308 is assigned to COMBINING DIAERESIS. There are a number of precomposed forms with diaeresis. Let's take one of these, ü:

1. The diaeresis may mean separate pronunciation of the u, indicating it is not merged with the preceding or following letter but is pronounced distinctly, as in the classical Greek name Peirithoüs or Spanish antigüedad. Similarly in Catalan. It is identified with the Greek dialytika of the same meaning, which is indeed the ultimate known origin of the symbol.

2. The diaeresis indicates umlaut modification of u, as in German über, a use also found in Finnish, Turkish, Pinyin Chinese Romanization and in many other languages.

3. In Magyar it indicates a sound like French eu.

4. In IPA it indicates u with a centralized pronunciation.

There may be other phonic interpretations. Of these uses, only for the second (and possibly the third) might combining superscript e be used instead of the diaeresis. The second certainly represents the most common use of ü today, but not the only one.

Unicode encodes the character COMBINING DIAERESIS, not a generic UMLAUT MARKER which might take various forms. It itself provides no way of distinguishing between uses of diaeresis. All the above uses might occur in German text, or Swedish text, or Finnish text, or any text which might introduce personal names or geographical names or particular words or phrases from various languages outside the main language of the text. The same applies for ä and ö. Indeed, individual words with vowels and umlaut marker, whether represented with a COMBINING DIAERESIS or COMBINING LATIN SMALL LETTER E or a following e, may appear in text in any language because of the use of technical vocabulary, e.g. Sehnsüchte, or in personal or place names.

Now any use of diaeresis meaning umlaut in any language might, it seems to me, reasonably be replaced by superscript e meaning umlaut. But it is incorrect to replace diaeresis used for any other purpose by superscript e.
In straight, plain Unicode, if you want to use diaeresis for umlaut, use diaeresis. If you want to use combining superscript e to indicate umlaut, use COMBINING LATIN SMALL LETTER E. Leave any other occurrences of diaeresis alone. This is the only possibility at the plain-text level, and the most robust way of choosing between diaeresis and superscript e at any level.

Given a higher protocol, we can do more. We might, as suggested, have a font which uses superscript e instead of diaeresis, at least for the combinations with the base characters a, o, or u and in place of the diaeresis symbol itself. If we have another, generally identical font with a true diaeresis instead, we can switch between fonts as necessary, depending on whether diaeresis is used for umlaut or not, or whether in particular cases we wish to use one or the other symbol for umlaut. Switching between such alternate fonts has long been a standby when fancy typography is required. Yet I don't see any advantage in switching between fonts over switching between the Unicode characters COMBINING DIAERESIS and COMBINING LATIN SMALL LETTER E. And it makes us dependent on a particular set of fonts. That is probably not good. :-(

A better solution might be an intelligent font that recognizes some kinds of tagging and allows us to turn on different glyphs for diaeresis according to the tagging, one of these glyphs being a superscript e. So we tag words and phrases. And, magically, if that particular font works properly, we see diaeresis where we want diaeresis and superscript e where we want superscript e. But it is not evident that tagging for this purpose is any easier than entering the different Unicode characters from the beginning. And we are again dependent on the intelligence of a particular font. Of course, we might expect there will soon be many such intelligent fonts. It is less likely that they will all work exactly the same and understand exactly the same tags in the same way. And we are restricted to such intelligent fonts as understand a particular system of tagging, rather than using almost any font. :-(

We might propose introducing a tag or indicator of some kind at some level to indicate that a diaeresis has umlaut function, but such a tag or indicator would probably only be used when a user wanted to use a superscript e, in which case it is not clear that using it would have any advantage over actually entering COMBINING LATIN SMALL LETTER E. :-(

We might go to a still higher level of protocol, to a routine or plugin in an application, or a new style feature added to HTML or XML, which allows diaeresis replacement. Just as Microsoft Word and some other programs now allow capitalization and small capitalization as an effect, though the underlying text is still actually in upper and lower case, so we might show a diaeresis as a superscript e, though in fact at the plain-text level the text has a diaeresis.
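The plain-text distinction argued for above is easy to verify: COMBINING DIAERESIS (U+0308) and COMBINING LATIN SMALL LETTER E (U+0364) are separate characters, and normalization composes the former into precomposed ü but never maps one mark to the other. A quick check in Python (how U+0364 is rendered is, as discussed, entirely up to the font):

```python
import unicodedata

umlaut_diaeresis = "u\u0308"   # u + COMBINING DIAERESIS
umlaut_super_e   = "u\u0364"   # u + COMBINING LATIN SMALL LETTER E

# NFC composes u + U+0308 into the precomposed U+00FC (ü) ...
assert unicodedata.normalize("NFC", umlaut_diaeresis) == "\u00fc"

# ... but u + U+0364 has no precomposed form and stays two characters:
assert unicodedata.normalize("NFC", umlaut_super_e) == "u\u0364"

# And no normalization form turns one mark into the other:
assert unicodedata.normalize("NFKD", "\u00fc") == "u\u0308"
```

So whichever character the author enters survives all normalization, which is what makes it the robust plain-text way of choosing between the two.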
Re: Character identities
(After sending this inadvertently to Dominikus only, here it is for the list also...)

On 2002.10.30, 16:26, Dominikus Scherkl <[EMAIL PROTECTED]> wrote:

> A font representing my mothers handwriting (german only :-) would
> render "u" as "u with breve above" to distinguish it from the
> representation of "n". I don't know how my mother would write a text
> containing an "u with breve above",

FWIW, I've seen the handwriting of an elderly German Esperantist, and he does exactly that: he puts breves above each and every "u", both those which have one and those which don't -- slightly confusing...

On the brink of off-topic-ness, something of that sort is done in handwritten Cyrillic (at least in the Russian tradition): the "triple wave" of a lower-case "t" is distinguished from the "triple wave" of a lower-case "shch" (*) by means of a stroke above the former and a stroke below the latter.

(*) Not that I'm an enthusiast of this transliteration...

-- . António MARTINS-Tuválkin, | ()| <[EMAIL PROTECTED]> || R. Laureano de Oliveira, 64 r/c esq. | PT-1885-050 MOSCAVIDE (LRS) Não me invejo de quem tem | +351 917 511 549 carros, parelhas e montes | http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe | http://pagina.de/bandeiras/ a água em todas as fontes |
Re: [OT] Göthe (was: Re: RE: Character identities)
At 08:32 31.10.2002 -0800, Doug Ewell wrote:

Adam Twardoch wrote:
>> Should an English language font render ö as oe, so that Göthe
>> appears automatically in the more normal English form Goethe?
>
> If you refer to Johann Wolfgang von Goethe, his name is *not* spelled
> with an "ö" anyway.

Somebody thinks so:
http://www.transkription.de/gb_seiten/beispiele/goethe.htm

Both forms are permissible and used, even though Goethe is today by far the more frequent version -- remember that there was no standardized German orthography before the late 19th century, and that the idea that a person's name has exactly one spelling is a fairly young one in Europe. Taking such facts into account for matching purposes is a good idea, but changing the version for rendering is not.

Best regards,
Marc

*
Marc Wilhelm Küster
Saphor GmbH
Fronländer 22
D-72072 Tübingen

Tel.: (+49) / (0)7472 / 949 100
Fax: (+49) / (0)7472 / 949 114
Re: Tiberian Hebrew font situation
[EMAIL PROTECTED] scripsit:

> I've been told by a respected and experienced Hebrew
> font maker that it is IMPOSSIBLE to get all the Tiberian
> Hebrew marks on 1 font under the Unicode system.

I am no font designer, but certainly having dots appear in different sizes/places depending on the base character is very straightforward in all modern Unicode-loving font technologies.

> I attempted to re-use other Semitic dots first, only
> going into Euro, left-to-right blocks where it was
> unavoidable.

Fair enough, though combining marks have no inherent script or directionality.

--
One art / There is      John Cowan <[EMAIL PROTECTED]>
No less / No more       http://www.reutershealth.com
All things / To do      http://www.ccil.org/~cowan
With sparks / Galore    -- Douglas Hofstadter
[OT] Göthe (was: Re: RE: Character identities)
Adam Twardoch wrote:

>> Should an English language font render ö as oe, so that Göthe
>> appears automatically in the more normal English form Goethe?
>
> If you refer to Johann Wolfgang von Goethe, his name is *not* spelled
> with an "ö" anyway.

Somebody thinks so:
http://www.transkription.de/gb_seiten/beispiele/goethe.htm

-Doug Ewell
 Fullerton, California
RE: Character identities
Let me take a few comparable examples:

1. Some (I think font makers) argued a few years ago that the Lithuanian i-dot-circumflex was just a glyph variant (Lithuanian-specific) of i-circumflex, along with a few other similar characters. Still, the Unicode standard does not now regard those as glyph variants (anymore, if it ever did), and embodies the fact that the Lithuanian i-dot-circumflex is a different character in its casing rules (see SpecialCasing.txt). There are special rules for inserting (when lowercasing) or removing (when uppercasing) dot-aboves on i-s and I-s for Lithuanian. I can only conclude that it would be wrong even for a Lithuanian-specific font to display an i-circumflex character as an i-dot-circumflex glyph, even though an i-circumflex glyph is never used for Lithuanian.

2. The Khmer script got allocated a "KHMER SIGN BEYYAL". It stands (stood...) for "any abbreviation of the Khmer correspondence to etc."; there are at least four different abbreviations, much like "etc", "etc.", "&c", "et c.", ... It would be up to the font maker to decide exactly which abbreviation, and it would vary by font. However, it is now targeted for deprecation for precisely that reason: it is *not* the font (maker) that should decide which abbreviation convention to use in a document, it is the *"author"* of the document who should decide. Just as for the Latin script, the author decides how to abbreviate "et cetera". The way of abbreviating should stay the same *regardless of font*. Note that the font may be chosen at a much later time, and not from any wish to change abbreviation convention. One may want to keep that convention the same throughout a document even when using several different fonts in it, without having to carefully consider abbreviation conventions when choosing fonts.

3. Marco would even allow (by default; I cannot get away from that caveat, since some (not all) font technologies do what they do) displaying ROMAN NUMERAL ONE THOUSAND C D (U+2180) as an M, and it would be up to the font designer. While the glyphs are informative, this glyphic substitution definitely goes too far. If the author chose to use U+2180, a glyph having at least some similarity to the sample glyph should be shown, unless and until someone makes a (permanent or transient) explicit character change.

4. Some people write è instead of é (I claim they cannot spell...). So is it up to a font designer to display é as è if the font is made for a context where many people do not make the distinction? Can a correctly spelled name (say) be turned into an apparent misspelling just by choosing such a font? And that would be a Unicode font?

5. I can't leave out the ö vs. ø case; these are just different ways of writing "the same" letter, and it is not the case that ø is used instead of ö for any 7-bit reasons. It is conventional in Norway and Denmark to use ø for ö in any Swedish name (or word) containing it. The same goes for ä vs. æ. Why shouldn't this one be up to the font makers too? If a font is made purely for Norwegian, why not display ö as ø, as is the convention? This is *exactly* the same situation as with ä vs. a^e.

I say, let the *"author"* decide in all these cases, and let that decision stand, *regardless of font changes*. [There is an implicit qualification there, but I'm tired of writing it.]

> Kent Karlsson wrote:
> > > I insist that you can talk about character-to-character
> > > mappings only when
> > > the so-called "backing store" is affected in some way.
> >
> > No, why? It is perfectly permissible to do the equivalent
> > of "print(to_upper(mystring))" without changing the backing
> > store ("mystring" in the pseudocode); to_upper here would
> > return a NEW string without changing the argument.
> And that, conceptually, is a character-to-glyph mapping.

Now I have lost you. How can it be that? The "print" part, yes. But not the to_upper part; that is a character-to-character mapping, inserted between the "backing store" and the mapping of characters to glyphs. It is still an (apparent) character-to-character mapping even if it is not stored in the "backing store".

> In my mind, you are so much into the OpenType architecture,
> and so much used to the concept that glyphization is what a font
> "does", that you can't view the big picture.

Now I have lost you again. Some fonts (in some font technologies) do more than "pure" glyphization. This is why I have been putting in caveats, since many people seem to think that all fonts *only* do glyphization, which is not the case. But to be general I was referring to such mappings regardless of whether that is built into some font (using character code points or, as in OT/AAT, glyph indices) or (better) is external to the font. I was trying to use general formulations, but I cannot avoid having
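Both threads of this exchange can be sketched together: a simplified version of the Lithuanian rule from SpecialCasing.txt cited in example 1, written as a character-to-character mapping that returns a new string and leaves the "backing store" untouched, exactly as in the print(to_upper(mystring)) pseudocode. This is a toy approximation, not the real More_Above condition:

```python
import unicodedata

def lt_lower(text: str) -> str:
    """Rough sketch of Lithuanian lowercasing from SpecialCasing.txt.

    When "I" (or "J") is followed by a mark rendered above (combining
    class 230), lowercasing inserts COMBINING DOT ABOVE (U+0307) so the
    soft dot remains visible under the accent.  The real rule uses the
    More_Above condition; this check is deliberately simplified.
    """
    out = []
    for i, ch in enumerate(text):
        if ch in "IJ" and i + 1 < len(text) and \
                unicodedata.combining(text[i + 1]) == 230:
            out.append(ch.lower() + "\u0307")
        else:
            out.append(ch.lower())
    return "".join(out)

# A character-to-character mapping that never touches its argument:
mystring = "I\u0302"                       # I + COMBINING CIRCUMFLEX
assert lt_lower(mystring) == "i\u0307\u0302"
assert mystring == "I\u0302"               # the "backing store" is unchanged
```

The mapping exists only in the transient return value, which is the point being made above: character-to-character does not imply a change to the backing store.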
RE: New Charakter Proposal
Hello.

Markus Scherer wrote:
> Chances are nearly 100% that overlong UTF-8 was a
> spoofing attempt, or the result of something other than a
> UTF-8 encoder.

Correct. This is exactly my topic. Wouldn't it be nice to have a standardized way to indicate that an attack on the message has occurred, without hiding the contained information from the user? The way we do this now is to pop up some alert box, but that does not remain in the text. And using any unassigned or forbidden codepoint (as you suggested) would keep its meaning only for the application which converted the text (in our case a small tool decoding encrypted messages, which will never see the text again). And leaving any other mark in the text is at least non-standard, so most Unicode tools can't use it (whereas broad tool support is exactly our goal). But OK, it is not that important. It would only be nice.

Best regards.
--
Dominikus Scherkl
[EMAIL PROTECTED]
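The overlong-UTF-8 scenario discussed in this thread can be made concrete: a decoder that checks whether each multi-byte sequence used the minimum number of bytes for its code point, and surfaces overlong forms visibly as U+FFFD instead of silently accepting or dropping them. A minimal sketch only, not a full UTF-8 validator (the surrogate-range check, among others, is omitted):

```python
def decode_utf8_flagging_overlong(data: bytes) -> str:
    """Decode UTF-8, replacing overlong or malformed sequences with U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # plain ASCII byte
            out.append(chr(b))
            i += 1
            continue
        # Determine expected length and initial bits from the lead byte.
        if b >> 5 == 0b110:
            n, cp = 2, b & 0x1F
        elif b >> 4 == 0b1110:
            n, cp = 3, b & 0x0F
        elif b >> 3 == 0b11110:
            n, cp = 4, b & 0x07
        else:                              # stray continuation or invalid lead
            out.append("\ufffd")
            i += 1
            continue
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any(c >> 6 != 0b10 for c in tail):
            out.append("\ufffd")           # truncated or bad continuation byte
            i += 1
            continue
        for c in tail:
            cp = (cp << 6) | (c & 0x3F)
        minimum = (0x80, 0x800, 0x10000)[n - 2]  # smallest cp needing n bytes
        if cp < minimum or cp > 0x10FFFF:
            out.append("\ufffd")           # overlong or out of range: flag it
        else:
            out.append(chr(cp))            # surrogate check omitted in sketch
        i += n
    return "".join(out)
```

The classic example is C0 AF, an overlong encoding of "/" once used in directory-traversal spoofing; with this decoder it survives in the text as a visible U+FFFD rather than disappearing into an alert box.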