Re: New Charakter Proposal

2002-10-31 Thread Tex Texin
William,

Note the smiley. Ken's suggestion was tongue in the hollow skull's
cheek.

Yes, a two-character sequence is less likely to occur, but it is still
possible, so your proposal doesn't actually fix the problem. The usual
workaround is for a convention that uses characters with special
semantics (i.e., metacharacters) to have an escape mechanism indicating
when a metacharacter is not to be treated as such. So perhaps two SKULL
AND CROSSBONES in a row will be used to nullify the special meaning of
one such character and together represent a single printable character.
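The escape-by-doubling convention described above can be sketched in a few lines. This is purely illustrative: the marker choice (U+2620) follows Ken's joke, and the function names and the tuple-based output are my own invention, not part of any standard or of the proposal under discussion.

```python
# Illustrative sketch of the doubling convention: a lone U+2620 marks
# a decoding error; a doubled U+2620 escapes a literal skull-and-
# crossbones character. Names are hypothetical.

MARKER = "\u2620"  # SKULL AND CROSSBONES

def escape_literals(text: str) -> str:
    """Double every literal marker so it is not read as an error flag."""
    return text.replace(MARKER, MARKER * 2)

def classify(text: str):
    """Yield (kind, char) pairs: 'literal' for ordinary characters and
    doubled markers, 'error' for a lone marker."""
    i = 0
    while i < len(text):
        if text[i] == MARKER:
            if i + 1 < len(text) and text[i + 1] == MARKER:
                yield ("literal", MARKER)  # doubled: escaped literal
                i += 2
                continue
            yield ("error", MARKER)        # lone: conversion-error flag
            i += 1
            continue
        yield ("literal", text[i])
        i += 1
```

The round trip preserves a document that legitimately contains the marker, which is exactly what a bare one-character convention cannot do.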

Of course, the consortium could assign another character for the special
purpose, but there are so many special purposes that would then require
character assignments that it would become difficult for an application
to take them all into account. It is better to let higher-level
protocols take over where such abilities are needed or desired.

As for the influence of posting a suggestion for character usage, I
think you have made your point by now; perhaps we don't need to keep
restating it. Others have suggested that this list is not a good place
to post and promote conventions for individual or non-standard use,
since it is a list for standardization and subscribes to a
standardization process. The Charman list was created for the
alternative process. However, that suggestion doesn't seem to have had
any influence...

;-)

tex


William Overington wrote:
> 
> Kenneth Whistler wrote the following.
> 
> >I think Markus's suggestion is correct. If you want to do
> >something like this internally to a process, use a noncharacter
> >code point for it. If you want to have visible display of this
> >kind of error handling for conversion, then simply declare a
> >convention for the use of an already existing character.
> >My suggestion would be: U+2620. ;-) Then get people to share
> >your convention.
> [...]

Re: New Charakter Proposal

2002-10-31 Thread William Overington
Kenneth Whistler wrote the following.

>I think Markus's suggestion is correct. If you want to do
>something like this internally to a process, use a noncharacter
>code point for it. If you want to have visible display of this
>kind of error handling for conversion, then simply declare a
>convention for the use of an already existing character.
>My suggestion would be: U+2620. ;-) Then get people to share
>your convention.

I find this suggestion curious, particularly coming as it does from an
officer of the Unicode Consortium.

The U2600.pdf code chart lists U+2620 under "Warning signs", with
"= poison" in its description.

Suppose for example that the source document encoded in UTF-8 is a document
about chemicals found around the house and that the U+2620 character is used
to indicate those which are poisonous.  If U+2620 is also used to include in
visible form an indication of an error found during decoding, then finding a
U+2620 character in the decoded document would lead to an ambiguous
situation.

One solution would be for the Unicode Consortium to encode an otherwise
unused character especially for the purpose.

If, however, the way forward is for an individual to declare a
convention, then I suggest using a sequence of at least two characters,
the first a base character and the one or more others combining marks,
so as to produce an otherwise highly unlikely sequence of characters.

For example, the character U+0304 COMBINING MACRON could be a good choice,
as it could be used to indicate a Boolean "not" condition with a character
which is otherwise unlikely to carry an accent.

As to which character to use for the base character, I am undecided, however
it should, in my opinion, not be U+2620 as that is a warning sign meaning
poison and could lead to confusion if looking at a document.

The advantage of a two-character sequence is that a special piece of
software may be used to parse all incoming documents.  Only occurrences
of the otherwise highly unlikely sequence will be regarded as indicating
a conversion problem with the encoding.  If either of the two characters
used for the sequence is encountered without the rest of the sequence,
then it will not indicate the special effect.

In my comet circumflex system I use a three-character detection
sequence.  This means that, in order to enter the markup universe, all
three characters of the sequence must be present in order.  Thus a piece
of software can scan all incoming text messages, even those not designed
to fit in with the comet circumflex system, and will not signal a comet
circumflex message if, say, a U+2604 COMET character arrives on its own
as part of a message.
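The detection idea above amounts to a substring test: only the complete multi-character sequence triggers the special behaviour, and a stray member of the sequence on its own does not. The actual comet circumflex sequence is not given in the text, so the one used in this sketch (COMET, CIRCUMFLEX ACCENT, COMET) is purely hypothetical.

```python
# Sketch: trigger only on the full, otherwise highly unlikely sequence.
# The sequence itself is a made-up stand-in, not the real one.

DETECTION_SEQUENCE = "\u2604^\u2604"  # COMET, ^, COMET (hypothetical)

def contains_marker(message: str) -> bool:
    """True only if the full detection sequence is present."""
    return DETECTION_SEQUENCE in message
```

A lone U+2604 COMET in ordinary text never triggers, since the full sequence is absent, which is the property William is after.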

Using a two or three character sequence which is otherwise highly unlikely
to occur is, in my opinion, a good way to indicate the presence of a special
feature as it allows one to monitor all text files for the special feature
without causing undesired responses on text files which have been prepared
without any regard to the special feature.

I feel that the influence of posting a suggestion in this mailing list
is often greatly underestimated.  If, perhaps after further discussion
in this group, you post a suggested two- or three-character sequence for
the purpose that you seek, my feeling is that that sequence may well
become well known and accepted for the purpose very quickly, simply
because, where there is a need for such a sequence, people will often
happily use the suggested format in the absence of any good reason not
to do so.

William Overington

1 November 2002



Re: Character identities

2002-10-31 Thread Jim Allan




In Unicode, code point U+0308 is assigned to COMBINING DIAERESIS.
There are a number of precomposed forms with diaeresis.

Let's take one of these, ü:

1. The diaeresis may mean separate pronunciation of the u, indicating
that it is not merged with a preceding or following letter but is
pronounced distinctly, as in the classical Greek name Peirithoüs or
Spanish antigüedad. Similarly in Catalan. It is identified with the
Greek dialytika of the same meaning, which is indeed the ultimate known
origin of the symbol.

2. The diaeresis indicates umlaut modification of u, as in German über,
a use also found in Finnish, Turkish, Pinyin Chinese romanization and
many other languages.

3. In Magyar it indicates a sound like French eu.

4. In IPA it indicates u with a centralized pronunciation.

There may be other phonetic interpretations.

Of these uses, only in the second (and possibly the third) might a
combining superscript e be used instead of the diaeresis. The second
certainly represents the most common use of ü today, but not the only
one.

Unicode encodes the character COMBINING DIAERESIS, not a generic UMLAUT
MARKER which might take various forms. It itself provides no way of
distinguishing between uses of the diaeresis.

All the above uses might occur in German text, or Swedish text, or Finnish
text or any text which might introduce personal names or geographical names
or particular words or phrases from various languages outside the main language
of the text. The same applies for ä
and ö.

Indeed, individual words with a vowel and umlaut marker, whether
represented by a COMBINING DIAERESIS, a COMBINING LATIN SMALL LETTER E,
or a following e, may appear in text in any language through technical
vocabulary (e.g. Sehnsüchte) or through personal or place names.

Now any use of diaeresis meaning umlaut in any language might, it seems to
me, be reasonably replaced by superscript e meaning umlaut. But it is incorrect
to replace diaeresis used for any other purpose by superscript e.

In straight, plain Unicode, if you want to use diaeresis for umlaut, use
diaeresis.  If you want to use combining superscript e to indicate
umlaut, use COMBINING LATIN SMALL LETTER E.  Leave any other occurrences
of umlaut alone.  This is the only possibility at the plain-text level,
and the most robust way of choosing between diaeresis and superscript e
at any level.
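Jim's rule, replacing the diaeresis only where the author has decided it means umlaut, can be sketched in Python. The function name and the per-word calling convention are my own; the key point it encodes is that deciding *which* diaereses are umlauts cannot be automated, so that decision must come from the caller.

```python
import unicodedata

# Sketch of the plain-text substitution discussed above: decompose,
# then swap COMBINING DIAERESIS (U+0308) for COMBINING LATIN SMALL
# LETTER E (U+0364). To be applied only to words the author has
# identified as carrying umlauts -- never blindly to running text.

def umlaut_to_superscript_e(word: str) -> str:
    """Replace every diaeresis in a word known to carry umlauts."""
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed.replace("\u0308", "\u0364")
```

Applied to "über" this yields u + U+0364 + "ber"; applied to Spanish "antigüedad" it would be wrong, which is exactly why the choice belongs to the author, not the font.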

Given a higher protocol, we can do more. We might, as suggested, have a
font which uses superscript e instead of diaeresis, at least for the
combination characters with the base characters a, o, or u and in place
of the diaeresis symbol itself. If we have another generally identical
font with a true diaeresis instead, we can switch between fonts as
necessary, depending on whether diaeresis is used for umlaut or not, or
whether in particular cases we wish to use one or the other symbol for
umlaut.

Switching between such alternate fonts has long been a standby when
fancy typography is required.

Yet I don't see any advantage to switching between fonts over switching
between the Unicode characters COMBINING DIAERESIS and COMBINING LATIN
SMALL LETTER E.  And it makes us dependent on a particular set of fonts.
That is probably not good.  :-(

A better solution might be an intelligent font that recognizes some kind
of tagging and allows us to turn on different glyphs for diaeresis
according to the tagging, one of these glyphs being a superscript e. So
we tag words and phrases. And, magically, if that particular font works
properly, we see diaeresis where we want diaeresis and superscript e
where we want superscript e.

But it is not evident that tagging for this purpose is any easier than
entering the different Unicode characters from the beginning. And we are
again dependent on the intelligence of a particular font. Of course, we
might expect that there will soon be many such intelligent fonts. It is
less likely that they will all work exactly the same and understand
exactly the same tags in the same way. And we are restricted to such
intelligent fonts as understand a particular system of tagging, rather
than being able to use almost any font.  :-(

We might propose introducing a tag or indicator of some kind at some level
to indicate a diaeresis has umlaut function, but such a tag or indicator would
probably only be used when a user wanted to use a superscript e, in
which case it is not clear that using it would have any advantage over actually
entering COMBINING LATIN SMALL LETTER E. :-(

We might go to a still higher level of protocol: a routine or plugin in
an application, or a new style feature added to HTML or XML, which
allows diaeresis replacement. Just as Microsoft Word and some other
programs now allow capitalization and small capitalization as an effect,
though the underlying text is still actually in upper and lower case, so
we might show a diaeresis as a superscript e, though in fact at the
plain-text level the text has a diaeresis.

Re: Character identities

2002-10-31 Thread Anto'nio Martins-Tuva'lkin
(After inadvertently sending this to Dominikus only, here it is
for the list also...) On 2002.10.30, 16:26, Dominikus Scherkl
<[EMAIL PROTECTED]> wrote:

> A font representing my mothers handwriting (german only :-) would
> render "u" as "u with breve above" to distinguish it from the
> representation of "n". I don't know how my mother would write a text
> containing an "u with breve above",

FWIW, I've seen the handwriting of an elderly German Esperantist, and
he does exactly that: he puts breves above each and every "u", both on
those which have it and on those which don't -- slightly confusing...

On the brink of off-topic-ness, something of that sort is done in
handwritten Cyrillic (at least in the Russian tradition): the "triple
wave" of a lower-case "t" is distinguished from the "triple wave" of a
lower-case "shch" (*) by means of a stroke above the former and a stroke
below the latter.

(*) Not that I'm an enthusiast of this transliteration...

--   .
António MARTINS-Tuválkin,   |  ()|
<[EMAIL PROTECTED]>   ||
R. Laureano de Oliveira, 64 r/c esq. |
PT-1885-050 MOSCAVIDE (LRS)  Não me invejo de quem tem   |
+351 917 511 549 carros, parelhas e montes   |
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe   |
http://pagina.de/bandeiras/  a água em todas as fontes   |





Re: [OT] Göthe (was: Re: RE: Character identities)

2002-10-31 Thread Marc Wilhelm Küster
At 08:32 31.10.2002 -0800, Doug Ewell wrote:

Adam Twardoch  wrote:

>> Should an English language font render ö as oe,  so that Göthe
>> appears automatically in the more normal English form Goethe?
>
> If you refer to Johann Wolfgang von Goethe, his name is *not* spelled
> with an "ö" anyway.

Somebody thinks so:

http://www.transkription.de/gb_seiten/beispiele/goethe.htm


Both forms are permissible and used, even though Goethe is today by far the 
more frequent version -- remember that there was no standardized German 
orthography before the late 19th century and that the idea that a person's 
name has exactly one spelling is a fairly young idea in Europe.

Taking such facts into account for matching purposes is a good idea, but
changing the version for rendering is not.

Best regards,

Marc



*
Marc Wilhelm Küster
Saphor GmbH

Fronländer 22
D-72072 Tübingen

Tel.: (+49) / (0)7472 / 949 100
Fax: (+49) / (0)7472 / 949 114




Re: Tiberian Hebrew font situation

2002-10-31 Thread John Cowan
[EMAIL PROTECTED] scripsit:

> I've been told by a respected and experienced Hebrew
> font maker that it is IMPOSSIBLE to get all the Tiberian
> Hebrew marks on 1 font under the Unicode system.  

I am no font designer, but certainly having dots appear in different
sizes/places depending on the base character is very straightforward
in all modern Unicode-loving font technologies.

> I attempted to re-use other Semitic dots first, only
> going into Euro, left-to-right blocks where it was
> unavoidable.

Fair enough, though combining marks have no inherent script or directionality.

-- 
One art / There is  John Cowan <[EMAIL PROTECTED]>
No less / No more   http://www.reutershealth.com
All things / To do  http://www.ccil.org/~cowan
With sparks / Galore -- Douglas Hofstadter




[OT] Göthe (was: Re: RE: Character identities)

2002-10-31 Thread Doug Ewell
Adam Twardoch  wrote:

>> Should an English language font render ö as oe,  so that Göthe
>> appears automatically in the more normal English form Goethe?
>
> If you refer to Johann Wolfgang von Goethe, his name is *not* spelled
> with an "ö" anyway.

Somebody thinks so:

http://www.transkription.de/gb_seiten/beispiele/goethe.htm

-Doug Ewell
 Fullerton, California





RE: Character identities

2002-10-31 Thread Kent Karlsson

Let me take a few comparable examples:

1. Some (I think font makers) a few years ago argued
   that the Lithuanian i-dot-circumflex was just a
   glyph variant (Lithuanian specific) of i-circumflex,
   and a few other similar characters.

   Still, the Unicode standard now does not regard those as
   glyph variants (anymore, if it ever did), and its casing
   rules (see SpecialCasing.txt) embody the fact that the
   Lithuanian i-dot-circumflex is a different character.
   There are special rules for inserting (when lowercasing)
   or removing (when uppercasing) the dot above on i-s and I-s
   for Lithuanian.  I can only conclude that it would be
   wrong even for a Lithuanian-specific font to display an
   i-circumflex character as an i-dot-circumflex glyph,
   even though an i-circumflex glyph is never used for
   Lithuanian.

2. The Khmer script got allocated a "KHMER SIGN BEYYAL".
   It stands (stood...) for "any abbreviation of the Khmer
   equivalent of etc."; there are at least four different
   abbreviations, much like "etc", "etc.", "&c", "et c.", ...
   It would be up to the font maker to decide exactly which
   abbreviation, and it would vary by font.

   However, it is now targeted for deprecation for precisely
   that reason: it is *not* the font (maker) that should
   decide which abbreviation convention to use in a document;
   it is the *"author"* of the document who should decide.
   Just as for the Latin script, the author decides how to
   abbreviate "et cetera". The way of abbreviating should stay
   the same *regardless of font*. Note that the font may be
   chosen at a much later time, and not out of a wish to
   change abbreviation convention. One may want that
   convention to be the same throughout a document even when
   using several different fonts in it, without having to
   carefully consider abbreviation conventions when choosing
   fonts.

3. Marco would even allow (by default; I cannot get away
   from that caveat since some (not all) font technologies
   do what they do) displaying the ROMAN NUMERAL ONE THOUSAND
   C D (U+2180) as an M, and it would be up to the font
   designer. While the glyphs are informative, this glyphic
   substitution definitely goes too far.  If the author
   chose to use U+2180, a glyph having at least some
   similarity to the sample glyph should be shown, unless
   and until someone makes a (permanent or transient)
   explicit character change.

4. Some people write è instead of é (I claim they cannot
   spell...).  So is it up to a font designer to display
   é as è if the font is made for a context where many
   people do not make the distinction?  Can a correctly
   spelled name (say) be turned into an apparent misspelling
   just by choosing such a font?  And that would be a Unicode
   font?

5. I can't leave out ö vs. ø; these are just different
   ways of writing "the same" letter, and it is not
   the case that ø is used instead of ö for any
   7-bit reasons. It is conventional in Norway and Denmark
   to use ø for ö in any Swedish name (or word)
   containing it.  The same goes for ä vs. æ.
   Why shouldn't this one be up to the font makers too?
   If the font is made purely for Norwegian, why not
   display ö as ø, as is the convention?  This is
   *exactly* the same situation as with ä vs. a^e.

I say, let the *"author"* decide in all these cases, and
let that decision stand, *regardless of font changes*.
[There is an implicit qualification there, but I'm
tired of writing it.]


> Kent Karlsson wrote:
> > > I insist that you can talk about character-to-character 
> > > mappings only when
> > > the so-called "backing store" is affected in some way.
> > 
> > No, why?  It is perfectly permissible to do the equivalent
> > of "print(to_upper(mystring))" without changing the backing
> > store ("mystring" in the pseudocode); to_upper here would
> > return a NEW string without changing the argument.
> 
> And that, conceptually, is a character-to-glyph mapping.

Now I have lost you.  How can it be that?  The "print"
part, yes. But not the to_upper part; that is a
character-to-character mapping, inserted between the
"backing store" and "mapping characters to glyphs".
It is still an (apparent) character-to-character
mapping even if it is not stored in the "backing store".
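Kent's pseudocode can be made concrete in a couple of lines of Python: the case mapping produces a new string on the way to display, and the backing store is untouched throughout.

```python
# to_upper is a character-to-character mapping applied for display;
# Python's str.upper() returns a new string and never mutates its
# argument, so the "backing store" is unchanged.

backing_store = "Göthe"
displayed = backing_store.upper()  # new string, not an in-place change

# The original text is intact; only the displayed form differs.
```

This separation, a character-to-character mapping layered between the backing store and the character-to-glyph mapping, is exactly the distinction Kent is drawing.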

> In my mind, you are so much into the OpenType architecture, 
> and so much used
> to the concept that glyphization is what a font "does", that 
> you can't view the big picture.

Now I have lost you again.  Some fonts (in some font
technologies) do more than "pure" glyphization. This
is why I have been putting in caveats, since many people
seem to think that all fonts *only* do glyphization,
which is not the case.

But to be general, I was referring to such mappings regardless
of whether they are built into some font (using character code
points or, as in OT/AAT, glyph indices) or (better) are external
to the font.

I was trying to use general formulations, but I cannot
avoid having

RE: New Charakter Proposal

2002-10-31 Thread Dominikus Scherkl
Hello.

Markus Scherer wrote:
> Chances are nearly 100% that overlong UTF-8 was a 
> spoofing attempt, or the result of something other than a 
> UTF-8 encoder.
Correct. This is exactly my topic.
Wouldn't it be nice to have a standardized way to indicate
that an attack on the message has occurred, without hiding
the contained information from the user?
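For what it's worth, the overlong-UTF-8 case Markus mentions is easy to demonstrate: C0 AF is an overlong two-byte encoding of "/", and a conformant strict decoder (Python's, in this sketch) rejects it rather than decoding it. The point where the rejection surfaces is where an application could splice in whatever visible in-text marker it adopts.

```python
# C0 AF is the classic overlong encoding of "/" (U+002F). A strict,
# conformant UTF-8 decoder must reject it instead of decoding it.

overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False  # rejected, as required -- hook the marker here
```

Any in-text marker inserted at that point would, of course, be an application convention rather than anything standardized, which is precisely the gap Dominikus is describing.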

The way we do this at present is to pop up some alert box, but
that does not remain in the text.
And using any unassigned or forbidden code point (as you
suggested) would keep its meaning only for the application
which converted the text (in our case a small tool decoding
encrypted messages - which will never see the text again).
And leaving any other mark in the text is at least
non-standard, so most Unicode tools can't use it (and having
them use it is exactly our goal).

But ok, it is not that important. Would only be nice.

Best regards.
-- 
Dominikus Scherkl
[EMAIL PROTECTED]