Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-13 Thread David Hopwood


David Hopwood wrote:
>   "French uses the accents acute ( ́), grave ( ̀), circumflex ( ̑),
>diaeresis ( ̈), and cedilla ( ̧)."

Oops, that's not a circumflex. I'll blame the tiny font size and lack of
character name feedback in Character Map.

(Feature request for Unibook: double-clicking on a character appends
it to the clipboard.)

-- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip








Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-13 Thread David Hopwood


Dan Oscarsson wrote:
> From: David Hopwood <[EMAIL PROTECTED]>
> 
> >For all of these characters, use as a spacing diacritic is actually much
> >less common than any of the other uses listed above. Even when they are used
> >to represent accents, it is usually as a fallback representation of a combining
> >accent, not as a true spacing accent.
> >
> >So, there would have been no practical problem with disunifying spacing
> >circumflex, grave, and tilde from the above US-ASCII characters, so that the
> >preferred representation of all spacing diacritics would have been the
> >combining diacritic applied to U+0020.
> 
> Apart from the problems Kenneth Whistler mentioned.

I'm not sure which post you're referring to. Possibly message-id
<[EMAIL PROTECTED]>? That post argued that characters
in the US-ASCII range should not have decompositions (which I entirely agree
with), but it did not give any arguments against disunifying spacing circumflex,
grave and tilde from U+005E (caret), U+0060 (backtick), and U+007E (middle tilde).

> You would get the same problems with the ISO 8859-1 spacing accents, but
> fewer people use them than those in ASCII.

None of the spacing accents are commonly used, at least in Latin scripts.
Perhaps we have different ideas of what a spacing accent is? To me, the
following sentence contains spacing accents:

  "French uses the accents acute ( ́), grave ( ̀), circumflex ( ̑),
   diaeresis ( ̈), and cedilla ( ̧)."

I.e. almost the only use of spacing accents is to talk about accents.
That's a perfectly reasonable thing to do, and it should be supported, but
something that is used so infrequently is not going to cause many complaints
regardless of how the encoding is handled.

A caveat to this, as I said above, is that spacing accent characters are
sometimes used as a *fallback* when a charset has no representation for a
particular composite abstract character. But that is never necessary in
Unicode; it should only occur in Unicode text as a result of conversion
from some other charset.

> One problem is that some characters can be used as an accent and as
> a normal base character,

No, there are no Unicode characters that can be used both as a combining
accent and as a base character. The ISO-Latin standards were ambiguous about
this, but Unicode is not.
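This is easy to check programmatically: Unicode assigns every character a
fixed combining class and general category, so a character is either a
combining mark or a base/spacing character, never both. A small sketch in
Python, assuming its standard unicodedata module (which exposes the Unicode
Character Database):

```python
import unicodedata

# COMBINING ACUTE ACCENT (U+0301): combining class 230, category Mn
# (nonspacing mark) -- always a combining character.
assert unicodedata.combining('\u0301') == 230
assert unicodedata.category('\u0301') == 'Mn'

# US-ASCII CIRCUMFLEX ACCENT (U+005E): combining class 0, category Sk
# (modifier symbol) -- always a spacing base character, never combining.
assert unicodedata.combining('\u005e') == 0
assert unicodedata.category('\u005e') == 'Sk'
```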

> and some characters for which Unicode defines a decomposition are not
> composed characters in some countries.

Huh? I don't understand what you're trying to say here at all. What
do countries have to do with this?

> So in some contexts it is wrong to decompose some characters that
> could be OK to decompose in others.

I assume you're talking only about compatibility decompositions (by
definition, canonically equivalent strings are always semantically
equivalent in all contexts). If so then I agree; IMHO it really only makes
sense to use the NFKC/D decompositions to convert Unicode text containing
compatibility characters to marked-up text, not as a normalisation form as
such.
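The distinction is visible in Python's unicodedata module, for instance:
canonical normalization (NFD) leaves compatibility characters intact, while
the NFKC/D forms flatten them and lose the formatting distinction -- which
is why they are better suited to conversion into marked-up text than to use
as a general-purpose normalization form:

```python
import unicodedata

# NFD leaves SUPERSCRIPT TWO (U+00B2) alone: its decomposition is only
# a compatibility mapping, not a canonical one.
assert unicodedata.normalize('NFD', '\u00b2') == '\u00b2'

# NFKD flattens it to plain '2', discarding the superscript formatting
# (which markup such as <sup>2</sup> could preserve instead).
assert unicodedata.normalize('NFKD', '\u00b2') == '2'

# Likewise LATIN SMALL LIGATURE FI (U+FB01) flattens to 'fi' under NFKC.
assert unicodedata.normalize('NFKC', '\ufb01') == 'fi'
```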

> That is one reason I prefer NFC, as it does not decompose characters.

NFC does decompose some characters (those in the "composition exclusions"
list).
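The singleton decompositions are one example: they are composition
exclusions, so even NFC replaces them. A quick illustration in Python,
assuming the standard unicodedata module:

```python
import unicodedata

# OHM SIGN (U+2126) has a singleton canonical decomposition to GREEK
# CAPITAL LETTER OMEGA (U+03A9). Singletons are composition exclusions,
# so NFC decomposes (replaces) the character rather than leaving it alone.
assert unicodedata.normalize('NFC', '\u2126') == '\u03a9'

# Likewise ANGSTROM SIGN (U+212B) normalizes to LATIN CAPITAL LETTER A
# WITH RING ABOVE (U+00C5), even under NFC.
assert unicodedata.normalize('NFC', '\u212b') == '\u00c5'
```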





Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-12 Thread Dan Oscarsson


From: David Hopwood <[EMAIL PROTECTED]>

>For all of these characters, use as a spacing diacritic is actually much
>less common than any of the other uses listed above. Even when they are used
>to represent accents, it is usually as a fallback representation of a combining
>accent, not as a true spacing accent.
>
>So, there would have been no practical problem with disunifying spacing
>circumflex, grave, and tilde from the above US-ASCII characters, so that the
>preferred representation of all spacing diacritics would have been the
>combining diacritic applied to U+0020.

Apart from the problems Kenneth Whistler mentioned.
You would get the same problems with the ISO 8859-1 spacing accents, but
fewer people use them than those in ASCII.
One problem is that some characters can be used both as an accent and as
a normal base character, and some characters for which Unicode defines
a decomposition are not composed characters in some countries.
So in some contexts it is wrong to decompose some characters that could
be OK to decompose in others.
That is one reason I prefer NFC, as it does not decompose characters.

>

>> For a lot of text handling precomposed characters are much easier to
>> handle, especially when the combining character comes after instead of
>> before the base character.
>
>I thought you said approximately the opposite in relation to T.61 above :-)
>
Sorry, got the last part wrong in my haste. I meant it is easier when
the combining character comes before the base character.

   Dan




Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-11 Thread David Starner

At 07:14 PM 7/11/02 -0700, Doug Ewell wrote:
>Programming languages, notably C and its offspring, have appropriated
>these characters for their own purposes.  You can't really blame "users"
>for that.

I'm not sure you can blame anyone for that. If you're going to waste
keys on my keyboard for ^, ~ and `, and waste 3 out of 94 graphical
characters in ASCII for them, I'm going to find a use. Next time you
design a universal character set, take more care. (-;

>Except, of course, for any additional user confusion that might have
>arisen from encoding three more lookalike "spoof buddies."

The spoofing problem already exists; adding a few more, with valid
reasons, really isn't going to change anything. Note that ~ and arguably
^ don't spoof SPACING COMBINING TILDE and SPACING COMBINING CIRCUMFLEX.
(Why did Latin-1 and Unicode persist in encoding more of these, BTW?
/Character backspace spacing mark/ was illegal in both of them, and I've
never seen them used for what was overtly their purpose.)





Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-11 Thread Kenneth Whistler

Dan Oscarsson said:

> NFD should not be an extension of ASCII. There are several spacing
> accents in ASCII that should be decomposed just like the spacing
> accents in ISO 8859-1 are decomposed.
> All or none of the spacing accents should be decomposed.

In addition to the usage clarifications made by John Cowan and
David Hopwood, I should point out a little history here.

As of Unicode 2.0, some compatibility decompositions were still
provided for U+005E CIRCUMFLEX ACCENT, U+005F LOW LINE, and
U+0060 GRAVE ACCENT, along the lines suggested by Dan. However,
when normalization forms were being established and standardized
in the Unicode 3.0 time frame, it became obvious that these
particular compatibility decompositions would lead to trouble.

Any Unicode normalization form that would not leave ASCII values
unchanged would have been DOA (dead on arrival), because of its
potential impact on widely used syntax characters in countless
formal syntaxes. The equating of U+005F LOW LINE with a combining
low line applied to a SPACE was particularly problematical, since
LOW LINE is so widely accepted as an element of identifiers.

Because of these complications, the 3 compatibility decompositions
were withdrawn by the UTC (unanimously, if I recall correctly),
*before* the normalization forms were finally standardized.

Consistency in treatment would be nice, but consistency in
treatment of the multiply ambiguous ASCII characters of this ilk
is impossible at this point. And it would have been very, very, very
bad for normalization to have allowed these three, in particular,
to have decompositions.

--Ken




Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-11 Thread Doug Ewell

David Hopwood  wrote:

> OTOH, there can be more than one way to represent composites that
> include two or more diacritics in different combining classes (e.g.
> ). Technically, that would mean that
> strict byte-for-byte round-tripping of X -> NFD -> X would not be
> guaranteed in every case (unless X also requires that all data is
> normalised). This doesn't apply to T.61, but it does apply to other
> standards such as TIS620 (ISO-Latin-11 / Thai), which have combining
> marks in more than one class.

As you mentioned, this does not apply to T.61 or ISO 6937, because they
do not permit multiple diacritics to be applied to a single base
character.

> Users have basically ignored (if they are even aware of) any
> admonitions from standards institutions to treat U+005E, U+0060 or
> U+007E as spacing accents, and continued to use them for the purposes
> listed below:

Programming languages, notably C and its offspring, have appropriated
these characters for their own purposes.  You can't really blame "users"
for that.

> So, there would have been no practical problem with disunifying
> spacing circumflex, grave, and tilde from the above US-ASCII
> characters, so that the preferred representation of all spacing
> diacritics would have been the combining diacritic applied to U+0020.

Except, of course, for any additional user confusion that might have
arisen from encoding three more lookalike "spoof buddies."  Unicode is
already taking a lot of heat on the IDN list for not unifying all
"lookalike" pairs.

-Doug Ewell
 Fullerton, California





Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-11 Thread David Hopwood


Dan Oscarsson wrote:
> From David Hopwood <[EMAIL PROTECTED]>
> 
> >The only difficulty would have been if a pre-existing standard had supported
> >both precomposed and decomposed encodings of the same combining mark. I don't
> >think there are any such standards (other than Unicode as it is now), are
> >there?
> 
> Yes. T.61 is still in use.

See  for a mapping table for ISO 6937,
which is apparently a superset of T.61 (according to
).

> It uses combining accents.

As John Cowan pointed out, neither T.61, nor the other suggested counterexample
of ISO 5426-1980 or ANSEL, have more than one way to represent a composite with a
single diacritic.

OTOH, there can be more than one way to represent composites that include
two or more diacritics in different combining classes (e.g. ). Technically,
that would mean that strict byte-for-byte round-tripping of X -> NFD -> X
would not be guaranteed in every case (unless X also requires that all data
is normalised). This doesn't apply to T.61, but it does apply to other
standards such as TIS620 (ISO-Latin-11 / Thai), which have combining marks
in more than one class.
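The reason round-tripping can fail is Unicode's canonical ordering: under
NFD, marks with distinct combining classes are sorted by class, so two
different input orders collapse to one and the original order is not
recoverable. A minimal sketch in Python:

```python
import unicodedata

# U+0323 COMBINING DOT BELOW has combining class 220; U+0301 COMBINING
# ACUTE ACCENT has class 230. NFD sorts marks of different classes by
# class, so both input orders normalize to the same sequence.
a = 'q\u0301\u0323'   # acute first
b = 'q\u0323\u0301'   # dot below first
assert unicodedata.normalize('NFD', a) == unicodedata.normalize('NFD', b)

# Both normalize to dot-below-first (lower class sorts first), so the
# acute-first input cannot be reconstructed byte-for-byte.
assert unicodedata.normalize('NFD', a) == 'q\u0323\u0301'
```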

I very much doubt, though, that this would be considered a significant practical
problem; there would still be no semantic information lost by round-tripping.
Also some of the languages that it might otherwise affect aren't typically
encoded using legacy charsets with combining characters (e.g. the most common
legacy encoding for Vietnamese is VISCII, which has only precomposed characters).

> One place where it is used is in X.500.

Incidentally, AFAICT most recent implementations of X.509 etc. don't bother to
convert correctly to and from T.61 :-(

> It also has the nice property that the combining accent
> comes before the base character, making it easier to parse.

That's debatable. I don't think it matters one way or the other as long as only
one ordering is used.

> >(Obviously, an NFD-only Unicode would not have been an extension of ISO-8859-1.
> >That wouldn't have been much of a loss; it would still have been an extension
> >of US-ASCII.)
> 
> NFD should not be an extension of ASCII. There are several spacing
> accents in ASCII that should be decomposed just like the spacing accents in
> ISO 8859-1 are decomposed.
> All or none of the spacing accents should be decomposed.

Users have basically ignored (if they are even aware of) any admonitions from
standards institutions to treat U+005E, U+0060 or U+007E as spacing accents,
and continued to use them for the purposes listed below:

Character                   Common uses

U+005E CIRCUMFLEX ACCENT    to indicate superscripts; 'to the power of';
                            'exclusive-or' in some programming languages

U+0060 GRAVE ACCENT         opening single quote; 'backtick' in some
                            programming & shell languages

U+007E TILDE                as prefix: 'not' in programming languages; home
                            directories in Unix; symbol for 'approximately'
                            as suffix: backup filenames in Unix
                            (preferred glyph is middle tilde, which is not
                            the same as a spacing tilde accent anyway)

For all of these characters, use as a spacing diacritic is actually much
less common than any of the other uses listed above. Even when they are used
to represent accents, it is usually as a fallback representation of a combining
accent, not as a true spacing accent.

So, there would have been no practical problem with disunifying spacing
circumflex, grave, and tilde from the above US-ASCII characters, so that the
preferred representation of all spacing diacritics would have been the
combining diacritic applied to U+0020.
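As it happens, Unicode already equates the Latin-1 spacing accents with
SPACE plus a combining accent, though only as compatibility decompositions.
This can be seen in Python's unicodedata module:

```python
import unicodedata

# U+00B4 ACUTE ACCENT has the compatibility decomposition <compat> 0020 0301,
# i.e. SPACE followed by COMBINING ACUTE ACCENT.
assert unicodedata.normalize('NFKD', '\u00b4') == ' \u0301'

# But it is only a *compatibility* mapping: canonical NFD leaves the
# spacing accent alone, so the two forms are not canonically equivalent.
assert unicodedata.normalize('NFD', '\u00b4') == '\u00b4'
```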

> I could ask why are not precomposed characters preferred to be used, if
> they exist?

They are, in HTML, XML, etc.

> For a lot of text handling precomposed characters are much easier to
> handle, especially when the combining character comes after instead of
> before the base character.

I thought you said approximately the opposite in relation to T.61 above :-)


Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-11 Thread Dan Oscarsson


From: David Hopwood <[EMAIL PROTECTED]>


>The only difficulty would have been if a pre-existing standard had supported
>both precomposed and decomposed encodings of the same combining mark. I don't
>think there are any such standards (other than Unicode as it is now), are
>there?

Yes. T.61 is still in use. It uses combining accents. One place where it
is used is in X.500. It also has the nice property that the combining
accent comes before the base character, making it easier to parse.


>
>(Obviously, an NFD-only Unicode would not have been an extension of ISO-8859-1.
>That wouldn't have been much of a loss; it would still have been an extension
>of US-ASCII.)

NFD should not be an extension of ASCII. There are several spacing accents
in ASCII that should be decomposed, just like the spacing accents in
ISO 8859-1 are decomposed. All or none of the spacing accents should be
decomposed.


I could ask: why are precomposed characters not preferred, if they exist?
For a lot of text handling, precomposed characters are much easier to
handle, especially when the combining character comes after instead of
before the base character.

   Dan




Re: *Why* are precomposed characters required for "backward compatibility"?

2002-07-09 Thread Kenneth Whistler

David Hopwood wrote:

> Marco Cimarosti wrote:

> > BTW, they always sold me that precomposed accented letters exist in Unicode
> > only because of backward compatibility with existing standards.
> 
> I don't get that argument. It is not difficult to round-trip convert between
> NFD and a non-Unicode standard that uses precomposed characters. Round-trip
> convertability of strings does not imply round-trip convertability of
> individual characters, and I don't see why the latter would be necessary.

Because while it is conceptually not difficult to roundtrip convert between
legacy accented Latin characters and Unicode NFD combining character sequences,
in practice many Unicode implementations would never have gotten off the
ground if they had had to start with combining character sequences for
all Latin letters, including, in particular, the 8859 repertoires. And
the character mapping tables are considerably more complex, in practice, if
they must map 1-n, n-1, rather than 1-1. Right now, a Latin-1 to Unicode
mapping table is trivial, but if Latin-1 had not been covered with a set
of precomposed characters, the mapping would *not* have been trivial, and that
would have been a significant barrier to early Unicode adoption. And people
would *still* be complaining -- vigorously -- about the performance hit
and maintenance complexity of interoperating with 8859 and common PC
code pages.
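The 1-1 vs. 1-n difference Ken describes is easy to see: with precomposed
characters, each Latin-1 code maps to exactly one Unicode character, while
an NFD-only design would expand every accented letter. For instance, in
Python:

```python
import unicodedata

s = 'caf\u00e9'                        # 4 characters, all in Latin-1
d = unicodedata.normalize('NFD', s)    # U+00E9 expands to 'e' + U+0301
assert len(s) == 4 and len(d) == 5
assert d == 'cafe\u0301'

# String-level round-tripping still works, even though the mapping for
# the individual character U+00E9 is 1-to-2 -- which is Hopwood's point
# above, and Ken's point about table complexity rather than correctness.
assert unicodedata.normalize('NFC', d) == s
```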

> The only difficulty would have been if a pre-existing standard had supported
> both precomposed and decomposed encodings of the same combining mark. I don't
(s/mark/character/)
> think there are any such standards (other than Unicode as it is now), are
> there?

Not to my knowledge.

> 
> (Obviously, an NFD-only Unicode would not have been an extension of ISO-8859-1.
> That wouldn't have been much of a loss; it would still have been an extension
> of US-ASCII.)
> 
> > If this compatibility issue didn't exist, Unicode would be like NFD.
> 
> And would have been much simpler and better for it, IMHO.

It would have been better, in some respects, to treat Latin like the
complex script it is, and to end up with the same kind of clean,
by-the-principles encoding that Unicode has for Devanagari, essentially
free of equivalences and normalization difficulties. But it took years
for major platforms to get up to speed on complex script rendering,
including the relatively simple but elusive prospect of dynamic
application of diacritics to Latin letters (and/or mapping of
combining character sequences to preformed complex glyphs).

And despite the vigorous advocacy by some factions of early Unicoders
to have a consistent, decomposed Latin representation in Unicode, there
were some rather hard-headed decisions made early on (1989) that that approach 
would cripple what was then an experimental encoding. The inclusion of large
numbers of precomposed Latin letters as encoded characters was the
price for the participation of IBM, Microsoft, and the Unix vendors,
and was also the price for the possibility of alignment of Unicode with
an ISO international standard. Without paying those prices, Unicode
would not exist today, in my opinion.

--Ken





*Why* are precomposed characters required for "backward compatibility"?

2002-07-09 Thread David Hopwood


Marco Cimarosti wrote:
> Theodore H. Smith wrote:
> > [...] If I didn't know what a composite was, I'd guess it was the same
> > thing as a combining sequence.
> >
> > However, the two are meant to be different, so it can't be the same.
> 
> They are meant to have exactly the same meaning, appearance and behavior.
> The difference is only inside the computer's memory, and should be invisible
> to users.
> 
> The purpose of the normalization algorithm above is to get rid of this
> useless difference:
> 
> - Normalization Form D (NFD) turns any precomposed accented letter into a
> letter + accent sequence.
> 
> - Normalization Form C (NFC) turns any letter + accent sequence into a
> precomposed accented letter, if one exists.
> 
> BTW, they always sold me that precomposed accented letters exist in Unicode
> only because of backward compatibility with existing standards.

I don't get that argument. It is not difficult to round-trip convert between
NFD and a non-Unicode standard that uses precomposed characters. Round-trip
convertability of strings does not imply round-trip convertability of
individual characters, and I don't see why the latter would be necessary.

The only difficulty would have been if a pre-existing standard had supported
both precomposed and decomposed encodings of the same combining mark. I don't
think there are any such standards (other than Unicode as it is now), are
there?

(Obviously, an NFD-only Unicode would not have been an extension of ISO-8859-1.
That wouldn't have been much of a loss; it would still have been an extension
of US-ASCII.)

> If this compatibility issue didn't exist, Unicode would be like NFD.

And would have been much simpler and better for it, IMHO.
