date:20031217

Re: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread Doug Ewell

Anto'nio Martins-Tuva'lkin antonio at tuvalkin dot web dot pt wrote:

 Every language, whose speaking community ever contacted others, does
 it.  , f.i., is the Chuvash name for neighbouring 
 , which is probably still known in English as Gorky, a clumsy
 transcription of the 1934-1991 name .

No, it's Nizhniy Novgorod to me.

I don't think I'll respond to the rest of Anto'nio's charming and
respectful post.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Doug Ewell

Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 Well Outlook 2000 is unable to represent any e with ogonek and trema
 of your example. So, despite they are canonically equivalent, they are
 rendered differently:

Everything rendered perfectly over here, on Windows 95 and Outlook
Express 5 (and Uniscribe).  You might try switching to Lucida Sans
Unicode, if you have it.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Peter Jacobi

[EMAIL PROTECTED] wrote:

[...]
 Note that ß (sharp s) casefolds to ss, and ſ (long s) casefolds to s. So
 straße, straſse, and strasse also both map to the same (strasse)
 subname.
[...]

According to my Duden, sharp-s doesn't uppercase to SS when it is in
a name. So 'Großmann' and 'Grossmann' should get distinct domains,
where available.
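
The two positions can be put side by side in a short Python sketch. It assumes Python 3's built-in str.casefold() and str.upper(), which apply the default (locale-independent) Unicode full case folding and case mapping. The helper name same_subname is made up for this illustration and is not any registry's actual matching rule.

```python
# Default Unicode case folding as exposed by Python 3's str.casefold().
# same_subname is a hypothetical helper, not a real IDN or registry API.

def same_subname(a: str, b: str) -> bool:
    """Return True when two labels are identical after full case folding."""
    return a.casefold() == b.casefold()

# "\u017f" is LATIN SMALL LETTER LONG S.
labels = ["straße", "stra\u017fse", "strasse"]
folded = {label.casefold() for label in labels}
print(folded)

# A name written with sharp s collides with its ss spelling under folding,
# and default uppercasing does not preserve the distinction either.
print(same_subname("Großmann", "Grossmann"))
print("Großmann".upper())
```

Under the default folding all three spellings collapse to one label, so keeping 'Großmann' and 'Grossmann' distinct would need a rule outside plain case folding.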

BTW, the whole thread on IDN domain names which can be mistaken seems
rather pointless. It is an old problem, already explored by registering
misspellings or names with and without a hyphen. If there is a possibility
of confusion, then there is a possibility of a lawsuit, and the older rights
and larger legal department will win. AFAIK mircosoft.com was killed this
way (whereas rnicrosoft.com is being tolerated, strange).

Regards,
Peter Jacobi


Stability of scientific names, was Stability of WG2

2003-12-17 Thread Curtis Clark

on 2003-12-16 15:27 Peter Kirk wrote:

I'm no expert on this... 
I am. :-)

but I thought that species could be transferred 
from genus to genus as knowledge advances. 
As John pointed out, the epithet stays the same.

And presumably obvious 
spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are 
you saying that if the first publication had Brontosuarus as a typo 
this error would remain for ever?
There are errors and then there are errors. Some are correctable, some 
are not, and botanists and zoologists have different rules about this. 
An example that's not entirely OT: There was a Russian physician with 
the last name  - a cyrillicization of his German family name 
Escholtz. His name was commonly written then and today in German form as 
Johann Friedrich Eschscholtz, the schsch reduplication being a 
reflection of the Cyrillic spelling. He Latinized (language, not 
alphabet) his name (a common occurrence among naturalists) to Eschscholzius.

He was physician to the Kotzebue expedition from Russia to (among other 
places) California; the ship's naturalist was Adelbert von Chamisso 
(author of _Peter Schlemihl_). Chamisso and Eschscholtz were fast 
friends (and some accounts imply that they were lovers). Chamisso named 
several new species of organisms for his friend, including the 
California poppy.

In the original description of the California poppy, he named it 
_Eschscholzia californica_, making the genus name the feminine form of 
Eschscholtz's Latinized name (this is a common occurrence). In the 
caption of the illustration of the plant, however, it was spelled 
_Eschholzia_. But for over a century afterwards, most botanists and 
horticulturists spelled the genus _Eschscholtzia_, assuming that both 
spellings in the original description were typographic errors.

But the rules of nomenclature are very specific about which types of 
errors can be corrected, and, since there is no obvious correct 
spelling of Escholtz, *the spelling that accompanied the original 
description must stand*, and the plant is correctly _Eschscholzia 
californica_.

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/

RE: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread Marco Cimarosti

Doug Ewell wrote:
 I'll go farther than that.  It's always bothered me that speakers of
 European languages, including English but especially French, have seen
 fit to rename the cities and internal subdivisions of other countries.

Rightly said!

There is reason to rename Colonia to Köln, Augusta to Augsburg,
Eboraco to York, Provincia to Provence, and so on.

_ Marco

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Peter Kirk

On 16/12/2003 17:21, Kenneth Whistler wrote:

Correcting myself:

 

Note that none of the 3 sets of equivalence classes violates
*canonical* equivalence, because none of the 8 sequences involved
is canonically equivalent to any other. In other words, no matter
which of the 3 approaches you take to case folding, in no instance
are you claiming that canonically equivalent sequences are to be
interpreted differently.
   

Actually, dotted I *is* canonically equivalent to I, dot above
(I overlooked that when compiling the summary.)
 

This implies (since there are no decomposition exclusions) that NFD, 
used on Turkic text, violates the very sensible rule DO NOT USE 
COMBINING DOTS WITH I's, and leads to all sorts of potential confusion 
e.g. that both simple and full case folding and lowercasing applied to 
NFD Turkic text generate the nonsensical i, dot above. This could be a 
serious problem - although one that may not be worth fixing.
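
A minimal sketch of the effect described here, using Python's standard unicodedata module. It reflects the Unicode data bundled with a current Python 3 interpreter, which is an assumption: the tables in force in 2003 may have differed in detail. The helper codepoints is only a display aid invented for the example.

```python
import unicodedata

DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

def codepoints(text: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Canonical decomposition: dotted I becomes I followed by COMBINING DOT ABOVE.
nfd = unicodedata.normalize("NFD", DOTTED_CAPITAL_I)
print(codepoints(nfd))

# Default (non-Turkic) lowercasing and full case folding of the NFD form
# both leave the combining dot in place after a plain i.
print(codepoints(nfd.lower()))
print(codepoints(nfd.casefold()))

# The decomposition is not excluded from recomposition, so NFC restores it.
print(unicodedata.normalize("NFC", nfd) == DOTTED_CAPITAL_I)
```

The sequence i, dot above therefore appears as soon as default case operations are applied to NFD text, regardless of whether the source language uses U+0307 as a diacritic.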

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Peter Kirk

On 16/12/2003 19:28, John Cowan wrote:

Philippe Verdy scripsit:

 

If we just remove any 0307 from the Turkic texts, there is absolutely no
problem with Turkic CaseFolding, provided that we also define
Turkic-specific uppercase mappings as done above, and don't use the default
locale-neutral uppercase mappings of the UCD.
   

There's no reason to expect that there will be any 0307 whatever in
Turkish/Azeri texts: it's not a diacritic those languages use, AFAIK.
 

Not normally. But it does appear in Turkic text normalised to NFD as the 
dotted I's are decomposed.
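
One way to cope with NFD input, sketched under the assumption that the text is already known to be Turkish or Azeri: recompose to NFC first, then apply the tailored mappings. The function names turkic_lower and turkic_upper are hypothetical; Python's built-in str.lower() and str.upper() are not locale-aware and do not do this.

```python
import unicodedata

def turkic_lower(text: str) -> str:
    """Lowercase with the Turkish/Azeri tailoring: I -> dotless i, dotted I -> i."""
    # Recompose first so that NFD "I + U+0307" becomes the single letter U+0130.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch == "\u0130":        # dotted capital I
            out.append("i")
        elif ch == "I":           # plain capital I pairs with dotless i
            out.append("\u0131")
        else:
            out.append(ch.lower())
    return "".join(out)

def turkic_upper(text: str) -> str:
    """Uppercase with the Turkish/Azeri tailoring: i -> dotted I, dotless i -> I."""
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch == "i":
            out.append("\u0130")
        elif ch == "\u0131":
            out.append("I")
        else:
            out.append(ch.upper())
    return "".join(out)

print(turkic_lower("I\u0307STANBUL"))   # NFD input, dotted I decomposed
print(turkic_lower("DİYARBAKIR"))
print(turkic_upper("diyarbakır"))
```

Because the mapping works on the recomposed letter, no stray combining dot is produced from decomposed dotted I's. A U+0307 that follows some other base letter, for instance in quoted foreign text, is simply carried along unchanged.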

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread jon

 There's no reason to expect that there will be any 0307 whatever in
 Turkish/Azeri texts: it's not a diacritic those languages use, AFAIK.

There's no reason to expect that there won't be, particularly if they quote a 
piece in a language which does use U+0307.

--
Jon Hanna   | Toys and books
http://www.hackcraft.net/ | for hospitals:
| http://santa.boards.ie

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Philippe Verdy

Doug Ewell
 Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
 
  Well Outlook 2000 is unable to represent any e with ogonek and trema
  of your example. So, despite they are canonically equivalent, they are
  rendered differently:
 
 Everything rendered perfectly over here, on Windows 95 and Outlook
 Express 5 (and Uniscribe).  You might try switching to Lucida Sans
 Unicode, if you have it.

I have Lucida Sans Unicode with Office. But there's a difference between
Outlook (2000) and Windows XP's Outlook Express 6 here, even though they are
supposed to share the same Uniscribe engine. Maybe there's a parallel
version of Uniscribe used only in Office 2000 (updated separately from
Windows, through Office Update) and not updated along with Outlook Express
(through Windows Update).



Re: Stability of WG2

2003-12-17 Thread Peter Kirk

On 16/12/2003 19:58, John Cowan wrote:

Peter Kirk scripsit:

 

I'm no expert on this... but I thought that species could be transferred 
from genus to genus as knowledge advances. 
   

True enough, but the specific epithet remains the same, and the old names
are still available (as the jargon has it) though no longer valid
(what I was calling preferred in my previous post).  Linnaeus himself,
working with two different descriptions of chimps, split them into
Homo troglodytes and Simia satyrus (which latter also included bonobos
and orangutans); when the mistake was cleared up, the specific epithet
troglodytes, being the older, was retained for chimps, whereas bonobos
got satyrus, both now in the new genus Pan; orangs were moved to Pongo
and given the new epithet pygmaeus.  (There's now a move afoot to
move all of these, plus gorillas, into Homo; I don't give it much chance,
though I think it's a cool idea.)
Nobody would call chimps Homo troglodytes, or orangs Simia satyrus,
today, but those names can't ever be assigned to other species in future.
(If chimps were folded into Homo, they would be H. troglodytes again.)
 

And that is more or less what I would like to see with Unicode character 
names. Old names can remain valid as deprecated synonyms (or perhaps 
non-deprecated synonyms e.g. if Corean becomes officially preferred 
but Korean is still in widespread use), and not reusable for other 
characters, but should be gradually replaceable by new, correct or 
updated names.
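
For comparison, a small sketch of how a corrected name can coexist with an immutable formal name. It relies on two assumptions that postdate this thread: that the Unicode data bundled with the interpreter defines a formal name alias of type "correction" for U+1D0C5, and that unicodedata.lookup() accepts such aliases (documented for Python 3.3 and later).

```python
import unicodedata

ch = "\U0001D0C5"

# The formal character name keeps the original misspelling.
formal = unicodedata.name(ch)
print(formal)

# The corrected spelling is accepted as an alias and resolves to the same
# code point, while the formal name itself stays valid.
corrected = formal.replace("FHTORA", "FTHORA")
print(unicodedata.lookup(corrected) == ch)
print(unicodedata.lookup(formal) == ch)
```

This is close to the "deprecated synonym" idea with the roles reversed: the old name remains the stable identifier and the corrected spelling is the added synonym.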



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Peter Kirk

On 16/12/2003 14:59, Kent Karlsson wrote:

...

Peter Kirk wrote:
 

If the Swedish registry allows all the letters used in Swedish and Sami, 
and far eastern registries allow Chinese characters, the Turkish and 
Azerbaijani registries should allow, and be allowed to allow, all the 
letters of the alphabets of their national languages.
   

Note that ß (sharp s) casefolds to ss, and ſ (long s) casefolds to s. So
straße, straſse, and strasse also both map to the same (strasse)
subname.
 

The difference here is that Germans recognise ss and sharp s as variant 
spellings in the same words, whereas in Turkish i and dotless i are 
quite different letters, just as in Swedish, Turkish and German o and o 
umlaut are quite different letters. I know Germans tolerate o umlaut  
written as oe, but I don't think Turks do. But surely the whole point of 
getting away from ASCII-only domain names is to respect national and 
language-specific alphabets. What is needed for Germany and Sweden 
should not be denied to Turkey.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread jon

Quoting Marco Cimarosti [EMAIL PROTECTED]:

 Doug Ewell wrote:
  I'll go farther than that.  It's always bothered me that speakers of
  European languages, including English but especially French, have seen
  fit to rename the cities and internal subdivisions of other countries.
 
 Rightly said!
 
 There is reason to rename Colonia to Köln, Augusta to Augsburg,
 Eboraco to York, Provincia to Provence, and so on.
 

I doubt Christians mean offence when they refer to Jesus through any of the 
countless transcriptions, spellings and pronunciations used in various 
languages. I think this is analogous to assuming that anyone dreaming of 
packing it all in and buying a villa in Provence similarly means no offence 
when expressing that desire in English (Zapan though would appear to be a 
different matter).

--
Jon Hanna   | Toys and books
http://www.hackcraft.net/ | for hospitals:
| http://santa.boards.ie

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Philippe Verdy

Peter Kirk wrote:
 This implies (since there are no decomposition exclusions) that NFD, 
 used on Turkic text, violates the very sensible rule DO NOT USE 
 COMBINING DOTS WITH I's, and leads to all sorts of potential confusion 
 e.g. that both simple and full case folding and lowercasing applied to 
 NFD Turkic text generate the nonsensical i, dot above. This could be a 
 serious problem - although one that may not be worth fixing.

Yes NFD is an issue, but not a critical one, because the decomposition is
canonical, and not excluded from recomposition.

However you're wrong here: only Full CaseFolding generates i, dot-above
from dotted-I, not the default lowercase mapping in the UCD which is just
left unchanged, or the locale-specific tr/az lowercase mapping which
maps it to (soft-dotted-)i.

Typical Turkish and Azeri texts will not use dot-above, except in the NFD
form I, dot-above for dotted-I, which is just needed because of the Full
CaseFolding mapping to make it respect canonical equivalence.

I do hope that dotless-j and dotted-J will avoid these confusions, by not
trying to decompose dotted-J in the NFD form, and by not generating j,
dot-above in Full CaseFolding of dotted-J, but just (soft-dotted-)j. Or
will it add more confusion there, if j is treated differently than i?



RE: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread Philippe Verdy

Marco Cimarosti wrote:
 Doug Ewell wrote:
  I'll go farther than that.  It's always bothered me that speakers of
  European languages, including English but especially French, have seen
  fit to rename the cities and internal subdivisions of other countries.
 
 Rightly said!
 
 There is reason to rename Colonia to Köln, Augusta to Augsburg,
 Eboraco to York, Provincia to Provence, and so on.

Or even Aix-la-Chapelle to Aachen because that's its _current_ German
name (the French name was official in the history, and is still used in
French).

Cities sometimes change name, some of them being famous like the _current_
Saint-Pétersbourg (French name revived in Russia with just a
transliteration, the Latin transcription being also widely used by Russians)
which has also been Léningrad or Pétrograd or Stalingrad (in the Latin
transliteration of the official and changing Russian script name, this Latin
transliteration changing a bit among various languages which used them), and
even Saint-Pétersbourg officially for some time in the tsar's Russia.



Re: Stability of scientific names, was Stability of WG2

2003-12-17 Thread Alexander Savenkov

Hello,

2003-12-17T11:06:32Z Curtis Clark [EMAIL PROTECTED] wrote:

 on 2003-12-16 15:27 Peter Kirk wrote:

 I'm no expert on this... 

 I am. :-)

 but I thought that species could be transferred 
 from genus to genus as knowledge advances. 

 As John pointed out, the epithet stays the same.

 And presumably obvious 
 spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are
 you saying that if the first publication had Brontosuarus as a typo
 this error would remain for ever?

 There are errors and then there are errors. Some are correctable, some
 are not, and botanists and zoologists have different rules about this.
 An example that's not entirely OT: There was a Russian physician with
 the last name  - a cyrillicization of his German family name

He was  actually. You forgot the soft sign.

(I'm not sure everyone will see the name - the editor replaced the
encoding with windows-1251, and there's no UTF-8 support).

Regards,
-- 
  Alexander Savenkov  http://www.xmlhack.ru/
  [EMAIL PROTECTED] http://www.xmlhack.ru/authors/croll/

RE: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread jarkko.hietaniemi

Or even Aix-la-Chapelle to Aachen because that's its _current_ German name (the 
French name was official in the history, and is still used in French).

You better tell the Bundespost about this :-) AFAIK (not being a German) 
Aachen is very much the current German name.
(go to http://www.deutschepost.de/ and search for PLZ Suchen)

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Kent Karlsson

 The difference here is that Germans recognise ss and sharp s 
 as variant spellings in the same words, 

Not altogether, taking into account spelling rules.
They are *ordered* the same, but that is another matter.

 whereas in Turkish i and dotless i are 
 quite different letters, just as in Swedish, Turkish and 
 German o and o 
 umlaut are quite different letters. I know Germans tolerate o umlaut  
 written as oe,

No, again an ordering rule, not a spelling rule. It has been used as
fallback too, like ss for ß. But it is not correct spelling.
(I will not go into the German spelling reform, since I'm not well
familiar with it.)

 but I don't think Turks do. But surely the whole point of 
 getting away from ASCII-only domain names is to respect national and 
 language-specific alphabets. What is needed for Germany and Sweden 
 should not be denied to Turkey.

There was never an intent to deny Turkey anything. The thing was that
the uppercase of i is I (usually) and the uppercase of ı is also I, so i, I,
and ı used to be folded together (to i) in the drafts for IDN. Apparently
that was deemed too harsh and was modified. (I think I complained at
some point, but it wasn't modified then, but apparently much later.)
Still, for IDNs there is no language dependence in the case folding, as
there is for the case *mappings*. So I is turned into i (not ı) also
for Turkish for IDNs. On the other hand, domain names are most often
written in lowercase anyway.

/kent k

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Arcane Jill






Far be it from me to stir things up even further, but...

QUESTION - Is the rendering of {U+0069} {U+0302} (that's i,
combining circumflex above) locale-dependent?

I may have got this totally wrong, but it occurs to me that in
non-Turkic fonts, U+0069 is "soft-dotted". That is, the dot disappears
in the presence of any COMBINING ... ABOVE modifier. But in Turkic,
U+0069 is "hard-dotted", so the dot must not be removed if a circumflex
is added. I freely admit I don't know whether Turkic uses circumflex or
not, but the question will work just as well with any
COMBINING ... ABOVE modifier.

If this is so, how can a character be considered "soft-dotted" in one
locale and "hard-dotted" in another?

Would it not make more sense to have not two, but three
different kinds of lowercase i: non-dotted i, soft-dotted
i and hard-dotted i?. (And similarly for uppercase). Of
course, then you might as well invent COMBINING SOFT DOT ABOVE so we
can use it elsewhere.

It gets better. (You're gonna hate me). If we then make the set {
soft-dotted-i, soft-dotted-I, non-dotted-i, non-dotted-I } a casefold
equivalence class which lowercases to soft-dotted-i (except in
the Turkic locale, where it lowercases to non-dotted-i), and uppercases
to non-dotted-I in all locales; and if we similarly make {
hard-dotted-i, hard-dotted-I } a separate casefold equivalence class
lowercasing to hard-dotted-i and uppercasing to
hard-dotted-I (in all locales), then all of the problems
outlined by Philippe would go away. And we could do the same with j too.

Of course - it would have one nasty side-effect. The Turks would then
have to use hard-dotted-i instead of soft-dotted-i, but
since the characters (in this new scheme) now have completely different
meanings, that's fair enough. Hey ho.

Just musing
Jill

RE: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread Michael Everson

At 11:30 +0000 2003-12-17, [EMAIL PROTECTED] wrote:

I doubt Christians mean offence when they refer to Jesus through any of the
countless transcriptions, spellings and pronunciations used in various
languages.
It's odd that in English Judas and Jude are distinguished; in the 
original they are not.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

RE: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread Michael Everson

At 11:04 +0100 2003-12-17, Marco Cimarosti wrote:

There is reason to rename Colonia to Köln, Augusta to Augsburg,
Eboraco to York, Provincia to Provence, and so on.
Nicely said. Subtle irony tends to go over some 
people's heads on this list though.

Eboraco is called Eabhrac in Irish. :-)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re[2]: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread Alexander Savenkov

Hello,

2003-12-17T14:36:37Z Philippe Verdy [EMAIL PROTECTED] wrote:

 Marco Cimarosti wrote:
 Doug Ewell wrote:
  I'll go farther than that.  It's always bothered me that speakers of
  European languages, including English but especially French, have seen
  fit to rename the cities and internal subdivisions of other countries.
 
 Rightly said!
 
 There is reason to rename Colonia to Koln, Augusta to Augsburg,
 Eboraco to York, Provincia to Provence, and so on.

 Or even Aix-la-Chapelle to Aachen because that's its _current_ German
 name (the French name was official in the history, and is still used in
 French).

 Cities sometimes change name, some of them being famous like the _current_
 Saint-Petersbourg (French name revived in Russia with just a

It's Saint-Petersburg (or St. Petersburg) if you write in English.
The name has German roots, not French ones.

 transliteration, the Latin transcription being also widely used by Russians)

Why would Russians use the Latin transcription for a Russian name?

 which has also been Leningrad or Petrograd or Stalingrad

Stalingrad was the previous name for Volgograd, not St. Petersburg.
The initial name was Tsaritsyn.

Petrograd on the other hand *was* the name of St. Petersburg in
1914-1924. Leningrad was the name of it in 1924-1991.

 (in the Latin
 transliteration of the official and changing Russian script name, this Latin
 transliteration changing a bit among various languages which used them), and
 even Saint-Petersbourg officially for some time in the tsar's Russia.

I wonder what you meant by the "some time" part. St. Petersburg was
founded in 1703, and therefore stayed St. Petersburg for more than 200
years, that is it was St. Petersburg *most* of the time.

You mixed everything up, Philippe.

Regards,
-- 
  Alexander Savenkov  http://www.xmlhack.ru/
  [EMAIL PROTECTED] http://www.xmlhack.ru/authors/croll/

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Kent Karlsson


[resending; better set the encoding to UTF-8...]


Peter Kirk wrote:
...
 used on Turkic text, violates the very sensible rule DO NOT USE 
 COMBINING DOTS WITH I's, and leads to all sorts of potential 
 confusion 
 e.g. that both simple and full case folding and lowercasing 
 applied to 
 NFD Turkic text generate the nonsensical i, dot above. This 
 could be a 
 serious problem - although one that may not be worth fixing.

i, dot above is not nonsensical. It is used in Lithuanian for
such things as i, dot above, tilde above, as well as other
additional accents above an i or a j that keeps its dot.

/kent k


Lithuanian alphabet (not listing all the uppercase
accented letters)

 Aa (,{}{}), Bb, Cc (CHch), , Dd, 
 Ee (,   {} {}  {} {}), Ff, Gg, Hh, 
 Ii ({i} {i} {i}  {}{} {}{}, Yy, , 
),
 Jj ({J}{j}), Kk, Ll ({l}), Mm ({m}), Nn (), 
 Oo (, , ), Pp, [Qq], Rr (r), Ss, , Tt, 
 Uu ({} {}  {}), Vv, [Ww], [Xx], Zz,

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Arcane Jill








 Would it not make more sense to have not two, but three
different kinds of lowercase i: non-dotted i, soft-dotted
i and hard-dotted i?. (And similarly for uppercase). Of
course, then you might as well invent COMBINING SOFT DOT ABOVE so we
can use it elsewhere.


I should have mentioned that in this hypothetical scheme, the following
would be canonically equivalent:

soft-dotted-i = non-dotted-i combining-soft-dot-above
soft-dotted-I = non-dotted-I combining-soft-dot-above
hard-dotted-i = non-dotted-i combining-dot-above
hard-dotted-I = non-dotted-I combining-dot-above

Sorry for the omission in previous email
Jill

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Kent Karlsson


Peter Kirk wrote:
...
 used on Turkic text, violates the very sensible rule DO NOT USE 
 COMBINING DOTS WITH I's, and leads to all sorts of potential 
 confusion 
 e.g. that both simple and full case folding and lowercasing 
 applied to 
 NFD Turkic text generate the nonsensical i, dot above. This 
 could be a 
 serious problem - although one that may not be worth fixing.

i, dot above is not nonsensical. It is used in Lithuanian for
such things as i, dot above, tilde above, as well as other
additional accents above an i or a j that keeps its dot.

/kent k


Lithuanian alphabet (not listing all the uppercase
accented letters)

 Aa (Àà, Áá Ãã Aa {A´}{a´}), Bb, Cc (CHch), Cc, Dd, 
 Ee (Ee, Ee  è é ? e {e´} {e~} e {e´} {e~}), Ff, Gg, Hh, 
 Ii (Ì{i?`} Í{i?´} I{i?~} Ii {I´}{i?´} {I~}{i?~}, Yy, Ýý, ??),
 Jj ({J~}{j?~}), Kk, Ll ({l~}), Mm ({m~}), Nn (Ññ), 
 Oo (ò, ó, õ), Pp, [Qq], Rr (r~), Ss, , Tt, 
 Uu (ù ú u Uu {u´} {u~} Uu {u´}), Vv, [Ww], [Xx], Zz,

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Kent Karlsson


Philippe Verdy wrote:
 I do hope that dotless-j and dotted-J ...

Dotless j. That's in the works.

A precomposed dotted uppercase J? No, I think I can predict
that there will be no such encoded character.  If you want a
dotted uppercase J, use J, combining-dot-above.
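
A quick check of that sequence in Python, assuming the Unicode data shipped with a current Python 3 (which is newer than the data discussed in this thread and includes the dotless j that was still "in the works" here):

```python
import unicodedata

# Dotted uppercase J as a combining sequence: J + COMBINING DOT ABOVE.
dotted_J = "J\u0307"
composed = unicodedata.normalize("NFC", dotted_J)
print(len(composed))   # NFC has no precomposed character to compose into

# By contrast, dotted capital I does have a precomposed form.
print(unicodedata.normalize("NFC", "I\u0307") == "\u0130")

# The dotless j that was later encoded.
print(unicodedata.name("\u0237"))
```

Since no precomposed dotted J exists, the sequence is the same in NFC and NFD, which sidesteps the decomposition issue raised for dotted I.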

/kent k

Arabic Presentation Forms-A

2003-12-17 Thread Philippe Verdy

I was validating some internal processing of strings, and I found these
intriguing decompositions for Arabic Presentation Forms-A. I was surprised
to see that they are compatibility decomposed in (isolated) rows from bottom
to top, in a distinct reading order from the normal Arabic reading order for
rows, but of course with the same right-to-left reading order:

#code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?;
# RIAL SIGN
fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?;

The Arial Unicode MS font does not have a glyph for the Rial currency sign,
so I won't comment much on it, even though it's a special ligature of its
component letters:
- where the medial form of U+06CC ARABIC LETTER FARSI YEH (?) is shown on
charts only as two dots (and not with its Arabic letter alef maksura base
form, as the comment in the Arabic chart suggests for Arabic letter yeh),
which is
- located below and to the left of the medial form of U+0627 (?),
- and where the initial form of U+0631 (?) kerns below its next two
characters (sometimes with an additional kashida below its next three
characters). However, the general layout is still one row, so the
decomposition seems quite reasonable; it's just regrettable that it's
not found in Arial Unicode MS (unless this Rial sign is traditional and no
longer in actual use today).

I'm not sure that the compatibility decomposition gives the accurate form
for rendering the traditional glyph coded for the currency symbol...

--

Now I have this one:

#code;name;cc;
#   nfd;nfkdFolded;
#   #CHAR?; NFD?; NFKDFOLDED?;
FDFA;ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM;0;
FDFA;isolated 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639
0644 064a 0647 0020 0648 0633 0644 0645;
# ??; ??; ???   ?;

#code;name;cc;
#   nfd;nfkdFolded;
#   #CHAR?; NFD?; NFKDFOLDED?;
FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0;
FDFB;isolated 062c 0644 0020 062c 0644 0627 0644 0647;
# ??; ??; ?? ??;

I note that the Unicode charts show them with their complex and highly
ligated form, which corresponds to the Arabic tradition in the Quran. This is
apparently not implemented in Microsoft fonts, which just render the
first two on only 2 bottom-to-top rows.

The compatibility decomposition creates 4 space-separated words WORD1,
WORD2, WORD3, WORD4 that would be rendered normally either in one row as:
WORD4 WORD3 WORD2 WORD1
i.e.
???   ?
or on multiple narrow rows as:
WORD1   or  WORD2 WORD1
WORD2   WORD4 WORD3
WORD3
WORD4
i.e.
??? or  ??? 
 ?

?
using the top-to-bottom normal layout of plain-text rows in Arabic.

I can understand that it's difficult to make them fit more ideally like this
(with kashidas noted by underscores) :
WORD2
___WORD1
W___ORD3
W___ORD4
i.e. actually this order:

???

?

to better match the actual glyph in charts which also uses kashidas, given
the height constraints in fonts, and the difficulty to create the
traditional complex kerning between rows, but the current presentation of
the alternate glyph chosen in Arial Unicode MS does not seem intuitive.
Isn't there some requirement in Unicode to not change the common layout
which is part of the character identity and structural for the script? Such
interpretation problem does not occur in  the presentation of U+FDFB (which
also has two rows in the representative glyph of Arabic Presentation Forms-A
charts). Is there an error here?

---

Now with this one:

#code;name;cc;
#   nfd;nfkdFolded;
#   #CHAR?; NFD?; NFKDFOLDED?;
FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0;
FDFB;isolated 062c 0644 0020 062c 0644 0627 0644 0647;
# ??; ??; ?? ??;

The decomposition into WORD1 WORD2 follows the same principles but is less
complex, and it uses this layout:
WORD2 WORD1
or:
WORD1
WORD2
The second layout is used in Arial Unicode MS to render the ligature.

---

Now I don't know why the last very complex but marvelous ligature U+FDFD in
Unicode does not have a compatibility decomposition. In fact I can't decipher
clearly to what Arabic letters the ligature corresponds (this is not
documented in Unicode, except through its English name, which is probably
too far from the Arabic name to allow this identification)

More generally, my question is related to the allowed modification of
layouts for ligature glyphs in fonts: are they allowed, and how could they
acceptably be represented when the plain-text character is not
compatibility-decomposed but rendered with a single glyph?
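
The decompositions discussed in this message can be inspected with Python's standard unicodedata module. This is a sketch only, assuming the Unicode data bundled with a current Python 3. The helper nfkd_words is a name invented for the example. Code points are listed in logical (storage) order; the right-to-left and multi-row arrangement is entirely up to the renderer and font.

```python
import unicodedata

def nfkd_words(ch: str) -> list:
    """Split the compatibility decomposition of one character on U+0020."""
    return unicodedata.normalize("NFKD", ch).split(" ")

for cp in (0xFDFA, 0xFDFB, 0xFDFC, 0xFDFD):
    ch = chr(cp)
    # decomposition() returns the raw mapping field, or "" when there is none.
    mapping = unicodedata.decomposition(ch) or "(no decomposition)"
    print(f"U+{cp:04X} {unicodedata.name(ch)}")
    print(f"    mapping: {mapping}")
    print(f"    words:   {len(nfkd_words(ch))}")
```

U+FDFA splits into the four words and U+FDFB into the two words described above, while U+FDFD reports no mapping at all and so survives NFKD as a single character.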



Re: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread jcowan

Alexander Savenkov scripsit:

 You mixed everything up, Philippe.

As we say in America, General Grant [1822-1885] Still Dead.

-- 
Do what you will,   John Cowan
   this Life's a Fiction[EMAIL PROTECTED]
And is made up of   http://www.reutershealth.com
   Contradiction.  --William Blake  http://www.ccil.org/~cowan

June Ashton 1999 thesis U Sydney

2003-12-17 Thread Elaine Keown

Elaine Keown
in Austin

Hi,

I wanted to bring the following dissertation--listed
at the bottom--to the attention of the e-discussion
groups.  I'm going to try to have some American
research library or University Microfilms make it
available here in the U.S.

Apparently Dr. Ashton, an Aussie scholar, compared
Greek, Coptic, etc. scribal marks with each other--I
believe she decided everything was Egyptian,
ultimately.  

The dissertation is relevant for encoding Dead Sea
scrolls in Hebrew - Aramaic - Greek etc, TLG, Coptic,
and (probably) Egyptian demotic and hieratic.  

I think Egyptian demotic or hieratic should be done
soon.--Elaine

U SYDNEY DISSERTATION:
The persistence, diffusion and interchangeability of
scribal habits in the ancient Near East before the
codex / by June Ashton.  
Publisher 1999. 


Re: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread jcowan

Michael Everson scripsit:

 It's odd that in English Judas and Jude are distinguished; in the 
 original they are not.

Or for that matter that Jesus and Joshua are distinguished, but here we
can lay the blame on Greek vs. Hebrew.

-- 
Well, I'm back.  --Sam    John Cowan [EMAIL PROTECTED]

RE: [OT] CJK - CJC (Re: Corea?)

2003-12-17 Thread Marco Cimarosti

Michael Everson wrote:
 At 11:04 +0100 2003-12-17, Marco Cimarosti wrote:
 
 There is reason to rename Colonia to Köln, Augusta to 
 Augsburg,
 Eboraco to York, Provincia to Provence, and so on.
 
 Nicely said. Subtle irony tends to go over some 
 people's heads on this list though.

Especially if one forgets an essential "no". :-(
It should have been "There is NO reason to rename..."

 Eboraco is called Eabhrac in Irish. :-)

So, that's who set the bad example in the first place! When the Angles came
they said: if Britanni can mangle place names, why shouldn't Ingevones? :-)

Ciao.
Marco

Re: Stability of WG2

2003-12-17 Thread Doug Ewell

Peter Kirk peterkirk at qaya dot org wrote:

 Nobody would call chimps Homo troglodytes, or orangs Simia satyrus,
 today, but those names can't ever be assigned to other species in
 future. (If chimps were folded into Homo, they would be H.
 troglodytes again.)

 And that is more or less what I would like to see with Unicode
 character names. Old names can remain valid as deprecated synonyms (or
 perhaps non-deprecated synonyms e.g. if Corean becomes officially
 preferred but Korean is still in widespread use), and not reusable
 for other characters, but should be gradually replaceable by new,
 correct or updated names.

I really think this is a deceased Equus caballus.

As a programmer, I can't personally imagine designing a program that
relies on the Unicode names to identify characters uniquely, instead of
relying on the code points.  Of course the names have to be unique, but
beyond that it certainly wouldn't bother me or any of the programs I've
written if some of the names were changed from one version to the next.

But apparently, for whatever reason, it IS very important to some
programmers and programs, and they have made it very clear for years and
years now that the names *must not change* in the interest of stability.
That is the policy of UTC and WG2, and it will not be changed simply
because anyone -- an individual or an entire committee -- determines
that name A' (or B) is more appropriate for a character than name A.
That goes for glaring mistakes like OI and HANGZHOU, and for typos like
FHTORA, and it would go for KOREAN as well.
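As a rough illustration of the point about relying on code points rather than names, here is a minimal Python sketch. It assumes nothing beyond the standard `unicodedata` module; the three code points are simply the characters behind the OI, HANGZHOU and FHTORA names mentioned above.

```python
import unicodedata

# Code points are the stable identifiers. The formal names printed here
# are the ones cited in the thread as mistakes that were kept anyway.
CITED_CODE_POINTS = (0x01A2, 0x3021, 0x1D0C5)

def formal_name(code_point):
    """Return the formal character name for a code point."""
    return unicodedata.name(chr(code_point))

for cp in CITED_CODE_POINTS:
    print(f"U+{cp:04X}", formal_name(cp))
```

A program keyed on the code point is unaffected by how the name is spelled; a program keyed on the name depends on that spelling never changing.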

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Peter Kirk

On 17/12/2003 05:24, Kent Karlsson wrote:

...

There was never an intent to deny Turkey anything. The thing was that
the uppercase of i is I (usually) and the uppercase of ı is also I, so i, I,
and ı used to be folded together (to i) in the drafts for IDN. Apparently
that was deemed too harsh and was modified. (I think I complained at
some point, but it wasn't modified then, but apparently much later.)
Still for IDNs there is no language dependence in the case folding, as
there is for the case *mappings*. So I is turned into i (not ı) also
for Turkish for IDNs. On the other hand, domain names are most often
written in lowercase anyway.
		/kent k

 

OK, that sounds reasonable now. I guess Turks and Azeris will just have 
to make sure they use lower case domain names, which makes more sense 
anyway.
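The difference between language-independent folding and a Turkish-tailored mapping can be sketched in Python. This is only an illustration: Python's built-in `str` methods apply the default, locale-independent mappings, and `turkish_lower` is a hypothetical helper written for this example, not part of any IDN specification or library.

```python
DOTLESS_I = "\u0131"     # LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

def turkish_lower(text):
    """Hypothetical Turkish-tailored lowercasing: I -> dotless i, İ -> i."""
    # Handle the two Turkish-specific pairs first, then fall back to the
    # default mapping for every other character.
    return text.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i").lower()

# Default mappings: both i and dotless i have I as their uppercase.
print("i".upper(), DOTLESS_I.upper())

# Default case folding has no language dependence: I always becomes i.
print("ISTANBUL".casefold())

# The tailored mapping gives a different result for the same input.
print(turkish_lower("ISTANBUL"))
```

A name registered in lower case sidesteps the question, since neither mapping changes it.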

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Arabic Presentation Forms-A

2003-12-17 Thread Marco Cimarosti

Philippe Verdy wrote:
  #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?;
  # RIAL SIGN
  fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?;
  
  The Arial Unicode MS font does not have a glyph for the 
 Rial currency sign so I won't comment lots about it, even if 
 it's a special ligature of its component letters:
  - where the medial form of U+06CC ARABIC LETTER FARSI YEH 
 (?) is shown on charts only as two dots (and not with its 
 Arabic letter alef maksura base form, as the comment in 
 Arabic chart suggests for Arabic letter yeh), which is

I am not sure I understand what you are asking, but it is quite normal that
the initial and medial form of letters Beh, Teh, Theh, Noon and Yeh lose
their tooth and are thus recognizable only by their dots. Similarly, Seen
and Sheen often lose their three teeth.

I find this particularly puzzling with the initial and medial forms of Seen,
which becomes a simple straight line in most calligraphic styles.

  - located on below-left of the medial form of U+0627 (?) ,

U+0627 is Alif, so it has no medial form.

  - and where the initial form of U+0631 (?)  kerns below its 
 next two characters (sometimes with an additional kashida 
 below its next three characters).

This too is quite normal: the tail of Reh, Zain and Waw often kerns below
the next letter. Compare it to Latin lowercase j, which has a similar
behavior.

_ Marco

Cuneiform Base Signs Plus Modifiers

2003-12-17 Thread Dean Snyder

[I am sending this email to both the Initiative for Cuneiform Encoding
email list, [EMAIL PROTECTED], and the general Unicode email list,
[EMAIL PROTECTED], in order to get comments from both the cuneiform and
Unicode communities.]

From the very first Initiative for Cuneiform Encoding conference at Johns
Hopkins University in November 2000, I, along with all others I am aware
of, have accepted unquestioningly the suggestion that we encode the
complex Sumero-Akkadian cuneiform signs as separate code points in Unicode.

For the non-cuneiformists on these lists, one way cuneiformists
categorize cuneiform signs is as simple, compound, and complex signs - a
simple sign being one not formed by combining two or more signs, a
compound sign being one formed by postfixing one or more signs to form a
grapheme cluster; and a complex sign being one formed by infixing one
sign inside another to form a new sign. At both ICE conferences we
decided to encode simple and complex signs but not compound signs.

Recently I have had second thoughts about encoding complex signs.

Modification of base, or simple, signs was a productive process for
making new signs in the earlier periods of cuneiform usage, and included
such modifications as adding or subtracting wedges, rotating signs,
infixing signs, etc. (For some examples of how the ancient scribes
modified base signs to form new complex signs see
http://www.jhu.edu/ice/basesigns/.)

Instead of encoding all 875 post-archaic, base and complex cuneiform
signs, we could instead encode the 280 base signs plus a dozen or so sign
modifiers. (I am not including in these approximate figures the 75 or so
numerical signs being proposed for encoding.) This would be somewhat
analogous to encoding a, e, the acute accent, and the grave accent
instead of encoding a with acute, a with grave, e with acute, etc.
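The Latin analogy can be made concrete with a short Python sketch using the standard `unicodedata` module: each precomposed letter breaks down into a base letter plus a combining accent, so four primitives (a, e, acute, grave) cover all the combinations.

```python
import unicodedata

def primitives(ch):
    """Return the canonical decomposition of ch as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# a with acute, a with grave, e with acute, e with grave
for ch in ("\u00e1", "\u00e0", "\u00e9", "\u00e8"):
    print(unicodedata.name(ch), "->", primitives(ch))
```

By the figures quoted above, the same trade for cuneiform would be roughly 280 base signs plus a dozen or so modifiers instead of 875 separately encoded signs.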

Encoding base signs with modifiers would more closely mirror, in the
encoding, the way the script system itself actually worked and it would
more easily accommodate modern research in archaic cuneiform, a stage in
cuneiform script development we have all decided not to encode for now
due to the current provisional state of its scholarship. By providing in
the encoding the base signs along with their modifiers cuneiformists
working in archaic and other periods could generate newly discovered or
newly analyzed complex signs ad hoc, without having to go through the
time-consuming and expensive Unicode/ISO standardization process.
Compound and complex sign realization would then simply be a matter of
the coordination of input methods with fonts, something now doable by end
users with modern computer operating systems. (This, of course, assumes
that we are more likely to find new combinations and modifications of
existing base signs than to find new base signs themselves. At any rate,
when we do find new base signs we need to encode them anyway.)

To most cuneiformists, of course, the encoding underpinnings would all be
hidden by input methods and fonts. One would simply type the expected
SHUD3 and the input method would map it to 3 code points, KA INFIX and
SHU (mouth sign with hand sign infixed), and the font would render it as
one complex sign (meaning to pray).
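The input-method step described above might look something like the following Python sketch. Everything in it is an assumption made for illustration: no cuneiform code points exist yet, so the values are placeholders from the Private Use Area, and the table names are invented for this example.

```python
# Hypothetical code point assignments (Private Use Area placeholders).
SIGN_CODE_POINTS = {
    "KA": 0xE000,     # base sign: mouth
    "SHU": 0xE001,    # base sign: hand
    "INFIX": 0xE100,  # modifier: following sign is infixed in preceding one
}

# Hypothetical input-method table: typed sign name -> encoded sequence.
COMPLEX_SIGNS = {
    "SHUD3": ("KA", "INFIX", "SHU"),
}

def encode_sign(name):
    """Map a typed sign name to its sequence of code points."""
    parts = COMPLEX_SIGNS.get(name, (name,))
    return "".join(chr(SIGN_CODE_POINTS[part]) for part in parts)

encoded = encode_sign("SHUD3")
print([f"U+{ord(c):04X}" for c in encoded])
```

A newly analyzed complex sign would then need only a new entry in the input-method table and a matching glyph in the font, not a new code point.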

And from a practical point of view encoding only the base signs and their
modifiers would be easy for us to do - we need only remove the complex
signs from our lists and add the 13 or 14 modifiers.


Respectfully,

Dean A. Snyder
Scholarly Technology Specialist
Library Digital Programs, Sheridan Libraries
Garrett Room, MSE Library, 3400 N. Charles St.
Johns Hopkins University
Baltimore, Maryland, USA 21218

office: 410 516-6850 fax: 410-516-6229
Manager, Digital Hammurabi Project: www.jhu.edu/digitalhammurabi

Re: Stability of WG2

2003-12-17 Thread Jim Allan

Doug Ewell wrote:

 But apparently, for whatever reason, it IS very important to some
 programmers and programs, and they have made it very clear for years and
 years now that the names *must not change* in the interest of stability.

On the other hand, there is nothing to prevent the Unicode consortium or 
any other body or any single person from creating a new *additional* 
corrected set of names if the Unicode consortium or any other body or 
any single person wishes to do so.

That would just be an alternative list of character names.

There would be nothing to prevent any particular application or language 
or individual person or standard using such an alternative list in 
preference to the older standard Unicode list of names, if indeed anyone 
is really using these names for much of anything.

The only real purpose I can see the names serve is that writing 
something like MODIFIER LETTER SMALL SCHWA is more easily understood by 
a reader who doesn't have TUS handy than is U+1D4A. At least the reader 
knows that some kind of schwa is being referenced (if the reader knows 
what a schwa is.)  And if they come across the same name in another 
article about phonetic characters in Unicode they can be reasonably sure 
the same character is being discussed.

Also if there is either a typo in the name or in the Unicode identifying 
code then one of these can serve as a check on the other.
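The cross-check between a name and a code point can be sketched in a few lines of Python with the standard `unicodedata` module. The function name is made up for this example.

```python
import unicodedata

def citation_is_consistent(code_point, name):
    """Return True if the cited name and code point identify the same character."""
    try:
        return unicodedata.lookup(name) == chr(code_point)
    except KeyError:
        # The name is not a known character name, e.g. because of a typo.
        return False

print(citation_is_consistent(0x1D4A, "MODIFIER LETTER SMALL SCHWA"))
print(citation_is_consistent(0x1D49, "MODIFIER LETTER SMALL SCHWA"))
print(citation_is_consistent(0x1D4A, "MODIFIER LETTER SMALL SHWA"))
```

A mismatch does not say which of the two is wrong, only that the citation needs a second look.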

But I would not be surprised if at some time in the future a 
second set of names with obvious errors corrected were to be created.

Jim Allan

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Peter Kirk

On 17/12/2003 05:30, Arcane Jill wrote:

Far be it from me to stir things up even further, but...

QUESTION - Is the rendering of {U+0069} {U+0302} (that's i, combining 
circumflex above) locale-dependent?

I may have got this totally wrong, but it occurs to me that in 
non-Turkic fonts, U+0069 is soft-dotted. That is, the dot disappears 
in the presence of any COMBININGABOVE modifier. But in Turkic, 
U+0069 is hard-dotted, so the dot must not be removed if a 
circumflex is added. I freely admit I don't know whether Turkic uses 
circumflex or not, but the question will work just as well with /any/ 
COMBININGABOVE modifier.

...
Turkish does in fact use circumflex above a, i and u, although rather 
rarely and often dropped today (but no other diacritics above except for 
umlaut as part of regular letters, no umlaut on i). i with circumflex is 
especially rare but is sometimes written on Arabic loan words like millî 
(/national/). Note carefully that this is pronounced as a variant of 
*dotted* i, and replaced by dotted i (not dotless i) when the circumflex 
is dropped, but it is written undotted in both upper and lower case. 
Note the following found from a Google search, which gives some upper 
and lower case equivalents.

TÜRK *MİLLÎ* KODLANDIRMA SİSTEMİ. *...* . *Millî* Kodlandırma Sisteminin 
temelini ...

Conclusion: the right thing even for Turkish is to drop the dot on i 
before a circumflex. But by the same argument we would also want to drop 
the dot on dotless I. Oh dear, I have just made the whole issue even 
more complicated!
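At the encoding level, the relationship between i, dotless i and the circumflex can be checked with a small Python sketch using the standard `unicodedata` module. It says nothing about how a font draws the dot, only about which sequences are canonically equivalent.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Ordinary (dotted) i plus circumflex composes to a single character.
composed = unicodedata.normalize("NFC", "i" + CIRCUMFLEX)
print(f"U+{ord(composed):04X}", unicodedata.name(composed))

# Dotless i plus circumflex has no precomposed form and stays a sequence.
dotless = unicodedata.normalize("NFC", DOTLESS_I + CIRCUMFLEX)
print([f"U+{ord(c):04X}" for c in dotless])
```

So the precomposed letter in a word such as millî is built on the dotted i, which fits the observation that dotted i is what remains when the circumflex is dropped.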

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Arabic Presentation Forms-A

2003-12-17 Thread Philippe Verdy

 Philippe Verdy wrote:
   #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?;
   # RIAL SIGN
   fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?;

I should have temporarily disabled my email filter to send this one. All
UTF-8 codes were replaced by ISO-8859-1 characters, substituting '?' instead
of Arabic characters...
I hope that the codepoints that I gave explicitly will still make my message
readable...

Well in your message you comment on the form shown in the charts, and I
don't criticize them.

I was just wondering if their rendering in Arial Unicode MS is correct and
conforming to the required need to keep the interpretation, and in what
measure the beautiful ligatures found in Unicode charts are normative, as
there's a very large difference with what Arial Unicode MS does, with a
distinct character layout, and no ligature, no kerning kashidas, and in some
cases not even the contextual shaping of its embedded letters, so that the
Arial Unicode MS font renders these ligatures as their NFKD decomposition
rendered in a single square.

This may be valid if this was just a ligature, but in that case, why aren't
those decompositions canonical like the ffi ligature?
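The kind of decomposition each character carries can be inspected with Python's standard `unicodedata` module. As far as the character database goes, the ffi ligature (U+FB03) also has only a compatibility decomposition, so neither it nor the Rial sign changes under canonical normalization. A minimal sketch:

```python
import unicodedata

FFI_LIGATURE = "\ufb03"  # LATIN SMALL LIGATURE FFI
RIAL_SIGN = "\ufdfc"     # RIAL SIGN

for ch in (FFI_LIGATURE, RIAL_SIGN):
    mapping = unicodedata.decomposition(ch)        # tagged = compatibility
    unchanged_by_nfd = unicodedata.normalize("NFD", ch) == ch
    expanded = unicodedata.normalize("NFKD", ch)   # compatibility expansion
    print(f"U+{ord(ch):04X}", mapping, unchanged_by_nfd,
          [f"U+{ord(c):04X}" for c in expanded])
```

A mapping that starts with a tag in angle brackets is a compatibility mapping; canonical mappings have no tag.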



RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Philippe Verdy

Peter Kirk wrote:
 Conclusion: the right thing even for Turkish is to drop the dot on i 
 before a circumflex.

I agree. The letter is rare enough to not create an exception here for
the removal of dot on the soft-dotted i followed by circumflex (which
is needed much more often in other languages that use '' and '.

 But by the same argument we would also want to drop 
 the dot on dotless I.

I think you meant "But by the same argument we would also want to drop 
the dot on DOTTED I". I would not recommend it, this would make things
even worse and more complicated.

If Turkish wants to remove the dot on pseudo-dotted I if followed by
a circumflex, the correct thing to do is then to use the ASCII dotless
I and add a circumflex or use its canonical equivalent
LATIN CAPITAL LETTER I WITH CIRCUMFLEX.

With the current specification, both of
LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, and
LATIN CAPITAL LETTER I WITH CIRCUMFLEX
are canonical equivalents and must render the same, without the dot.

To display a dot, one can use one of the four canonical equivalents:
LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT ABOVE
(one is the NFC form, another is the NFD form, two others are also
possible)



Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Chris Jacobs

 To display a dot, one can use one of the four canonical equivalents:
  LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
  LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
  LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING CIRCUMFLEX
  LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT ABOVE
 (one is the NFC form, another is the NFD form, two others are also
 possible)

Those four are not all canonically equivalent, since circumflex and dot above
are both combining class 230, so they interact.

RE: Case mapping of dotless lowercase letters

2003-12-17 Thread Philippe Verdy

Chris Jacobs wrote:
  To display a dot, one can use one of the four canonical equivalents:
   LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
   LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
   LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING CIRCUMFLEX
   LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT ABOVE
  (one is the NFC form, another is the NFD form, two others are also
  possible)
 
 Those four are not all canonical equivalent since circumflex and dot above
 are both combining class 230, so they interact.

You're right. Initially I wanted to verify their combining classes to see
which form was the NFC or NFD, but I did not need to remember these classes
values as they effectively combine at the same (above) class.

So depending on the letters to encode one can use any of:
NFC: LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
NFD: LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING
CIRCUMFLEX
to encode the circumflex above the dot (I think this is what Turkish would
use as the dot is considered part of the base letter),

or any of:
NFC: LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
NFD: LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT
ABOVE
to encode the dot above the circumflex (but maybe Turkish will not make a
difference here and will read it as a glyph variant)
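The two groups can be verified with Python's standard `unicodedata` module. Because both marks have combining class 230, canonical reordering never swaps them, so the four sequences fall into two equivalence classes rather than one. A minimal sketch:

```python
import unicodedata

CIRC = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT = "\u0307"   # COMBINING DOT ABOVE

print(unicodedata.combining(CIRC), unicodedata.combining(DOT))

# Dot first, then circumflex (circumflex drawn above the dot).
dot_then_circ = ["\u0130" + CIRC, "I" + DOT + CIRC]
# Circumflex first, then dot (dot drawn above the circumflex).
circ_then_dot = ["\u00ce" + DOT, "I" + CIRC + DOT]

def nfd(text):
    return unicodedata.normalize("NFD", text)

def nfc(text):
    return unicodedata.normalize("NFC", text)

for group in (dot_then_circ, circ_then_dot):
    print({tuple(f"U+{ord(c):04X}" for c in nfd(s)) for s in group})
```

Each group collapses to a single NFD form, and the two NFD forms differ from each other.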



American English translation of character names (was Re: Stability of WG2)

2003-12-17 Thread Kenneth Whistler

Jim Allan noted:

 On the other hand, there is nothing to prevent the Unicode consortium or 
 any other body or any single person from creating a new *additional* 
 corrected set of names if the Unicode consortium or any other body or 
 any single person wishes to do so.
 
 That would just be an alternative list of character names.

 But I would not be surprised if at some time in the future a 
 second set of names with obvious errors corrected were to be created.

And, indeed, some of us have toyed around with the notion of
publishing an American English translation of the Unicode
names list, including such obvious improvements as:

U+002E FULL STOP  -- PERIOD (or DOT)

U+002F SOLIDUS  -- SLASH

U+0040 COMMERCIAL AT  -- AT SIGN

U+005C REVERSE SOLIDUS -- BACKSLASH

U+005F LOW LINE  -- (SPACING) UNDERSCORE

U+00B6 PILCROW SIGN -- PARAGRAPH SIGN

U+0268 LATIN SMALL LETTER I WITH STROKE -- ... BARRED I

U+019B LATIN SMALL LETTER LAMBDA WITH STROKE -- ... BARRED LAMBDA

U+03BB GREEK SMALL LETTER LAMDA -- ... LAMBDA

U+21B0 UPWARDS ARROW WITH TIP LEFTWARDS -- UP ARROW WITH TIP POINTING LEFT

U+21BA ANTICLOCKWISE OPEN CIRCLE ARROW -- COUNTERCLOCKWISE ...

U+FE4E CENTRELINE LOW LINE -- CENTERLINE UNDERSCORE

and so on and so on, including all the obvious errors that
people are continuing to worry about. ;-)

--Ken

Re: Arabic Presentation Forms-A

2003-12-17 Thread Kenneth Whistler

Philippe asked:

 The Arial Unicode MS font does not have a glyph for the Rial currency sign
 so I won't comment lots about it, even if it's a special ligature of its
 component letters:

 it's just regrettable that it's
 not found in Arial Unicode MS (unless this Rial sign is traditional and no
 more in actual use today).

The Rial currency sign was recently added to the standard, so
many fonts still don't have it. It was added for compatibility
with an Iranian standard.

 I'm not sure that the compatibility decomposition gives the accurate form
 for rendering the traditional glyph coded for the currency symbol...

It isn't supposed to. Compatibility decompositions are approximations,
not necessarily the basis for building an Arabic ligation, especially
for special cases like this currency sign.


 FDFA;ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM;0;
   FDFA;isolated 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639
 0644 064a 0647 0020 0648 0633 0644 0645;

 FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0;
   FDFB;isolated 062c 0644 0020 062c 0644 0627 0644 0647;

 but the current presentation of
 the alternate glyph chosen in Arial Unicode MS does not seem intuitive.

That's an issue for Microsoft customers and testers of Microsoft
fonts to determine.

 Isn't there some requirement in Unicode to not change the common layout
 which is part of the character identity and structural for the script? Such
 interpretation problem does not occur in  the presentation of U+FDFB (which
 also has two rows in the representative glyph of Arabic Presentation Forms-A
 charts). Is there an error here?

Nope. Glyph shapes are not normative or prescriptive. As long as
the identity of the character is clear, there might be an aesthetic
faux pas, but not an error or a failure of conformance to the standard.

 More generally, my question is related to the allowed modification of
 layouts for ligature glyphs in fonts: are they allowed, 

Yes.

 and how could they
 be acceptably represented when the plain-text character is not
 compatibility-decomposed but rendered with a single glyph...

By the code points in question, of course. For these word
ligatures, which are really used as complete symbols, one would
ordinarily not expect to enter the whole compatibility sequence
of characters, anyway. Normal rendering engines don't produce
these highly elaborated ligatures automatically from such
sequences.
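The gap between the single code point and its compatibility sequence can be seen with a short Python sketch using the standard `unicodedata` module, here for U+FDFA, whose decomposition is quoted earlier in this message.

```python
import unicodedata

WORD_LIGATURE = "\ufdfa"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

# In plain text the symbol is one code point, and canonical
# normalization leaves it alone.
unchanged = unicodedata.normalize("NFC", WORD_LIGATURE) == WORD_LIGATURE

# The compatibility decomposition spells the phrase out letter by letter.
expanded = unicodedata.normalize("NFKD", WORD_LIGATURE)

print(unchanged, len(WORD_LIGATURE), len(expanded))
```

Going the other way is not a normalization operation: typing the long sequence gives the long sequence, not the symbol.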

 I was just wondering if their rendering in Arial Unicode MS is correct and
 conforming to the required need to keep the interpretation, 

As long as the identity of the character is correct, which it seems
to be, since you identified it, then one can say the font is
correct.

 and in what measure the beautiful ligatures found in Unicode 
 charts are normative, 

In no measure.

 as there's a very large difference with what Arial Unicode MS does

There are large differences between Arabic fonts for *all* of
the Arabic characters in the standard -- not just these word
ligature symbols.

--Ken

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Christopher John Fynn


 However, could there be an encoding for:
 LATIN CAPITAL LETTER DOTLESS J
 with a lowercase mapping to the new:
 LATIN SMALL LETTER DOTLESS J
 Of course the former would look exactly the same as the
 ASCII uppercase J, except that it would have a distinct
 case mapping. This would avoid, for j/J the nightmare
 of dotless-i/dotted-i/I...


It introduces another difficulty though - If there are languages using a LATIN
SMALL LETTER DOTLESS J and words written in those languages are sometimes
capitalised - then presumably there is already data where  LATIN CAPITAL
LETTER J has already been used as the upper case for LATIN SMALL LETTER
DOTLESS J introducing a separate

A purist might argue that if there are no places where using LATIN CAPITAL
LETTER DOTLESS J instead of LATIN SMALL LETTER DOTLESS J makes a lexical
difference, then one is simply a glyph variant of the other. If that is so, then
there is no need for two characters: one form could be handled by higher-level
mark-up and rendered using a different glyph.


I think Latin has too long been considered a simple script - if one takes
into account the number of languages written in Latin script and all the
additions and modifications used to do this, Latin is a complex script. In view
of this, before adding new Latin characters it might be a good idea to first
consider the kind of solutions used for scripts that have always been
considered complex.

- Chris

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread Christopher John Fynn


Philippe Verdy [EMAIL PROTECTED] wrote:

 Ohhh... I admit this is hypothetic for a possible use, but the candrabindu
 case is a precedent coming from romanization of non-Latin scripts: what if
 there's a combining x above used to interact over a diacritic and mark its
 suppression in corrected texts or in documents related to
 orthographic/grammatical rules, or simply because it is needed for correct
 romanization of some ancient script...

If special rendering rules are needed for romanisation of particular languages
there is a facility in OpenType and other smart-font formats to include
different rules for different languages written with the same script.

One could use this to provide e.g. different rendering behaviour for Turkish
than for other languages written in Latin, and I suspect it could be used in
many cases of transliteration of non-Latin scripts (presuming a particular
language was written in that script).

Orthographic rules can certainly be handled by features and lookups in smart
fonts.

Maybe this is the level on which many of these issues should be handled.  We
only need new characters where it  is necessary to make a distinction, or
resolve something that would otherwise be ambiguous, in plain text.

- Chris

Re: Cuneiform Base Signs Plus Modifiers

2003-12-17 Thread Christopher John Fynn


Dean Snyder [EMAIL PROTECTED] wrote:

 Recently I have had second thoughts about encoding complex signs.

 Modification of base, or simple, signs was a productive process for
 making new signs in the earlier periods of cuneiform usage, and included
 such modifications as adding or subtracting wedges, rotating signs,
 infixing signs, etc. (For some examples of how the ancient scribes
 modified base signs to form new complex signs see
 http://www.jhu.edu/ice/basesigns/.)

 Instead of encoding all 875 post-archaic, base and complex cuneiform
 signs, we could instead encode the 280 base signs plus a dozen or so sign
 modifiers. (I am not including in these approximate figures the 75 or so
 numerical signs being proposed for encoding.) This would be somewhat
 analogous to encoding a, e, the acute accent, and the grave accent
 instead of encoding a with acute, a with grave, e with acute, etc.

This fits in best with the Unicode character encoding model and is definitely
the way to go, particularly if the script was productive.

If additional complex signs are found you will then be able to represent them
straight away and won't have to submit a proposal to add an additional character,
wait for it to be accepted and encoded, and then wait for support for it to
appear in applications and fonts (a process which usually takes several years).

 Encoding base signs with modifiers would more closely mirror, in the
 encoding, the way the script system itself actually worked and it would
 more easily accommodate modern research in archaic cuneiform, a stage in
 cuneiform script development we have all decided not to encode for now
 due to the current provisional state of its scholarship. By providing in
 the encoding the base signs along with their modifiers cuneiformists
 working in archaic and other periods could generate newly discovered or
 newly analyzed complex signs ad hoc, without having to go through the
 time-consuming and expensive Unicode/ISO standardization process.
 Compound and complex sign realization would then simply be a matter of
 the coordination of input methods with fonts, something now doable by end
 users with modern computer operating systems. (This, of course, assumes
 that we are more likely to find new combinations and modifications of
 existing base signs than to find new base signs themselves. At any rate,
 when we do find new base signs we need to encode them anyway.)

I think it is always a good idea to closely mirror in encoding the way a script
system actually works - and break it down into primitives or base characters,
combining marks and  modifiers

It might be helpful to look at how smart-font systems like OpenType and AAT/ATSUI
are already used for rendering complex scripts and to try and think of the
features and lookups a Cuneiform font using this sort of technology might use.

 To most cuneiformists, of course, the encoding underpinnings would all be
 hidden by input methods and fonts. One would simply type the expected
 SHUD3 and the input method would map it to 3 code points, KA INFIX and
 SHU (mouth sign with hand sign infixed), and the font would render it as
 one complex sign (meaning to pray).

This is perfectly feasible.

 And from a practical point of view encoding only the base signs and their
 modifiers would be easy for us to do - we need only remove the complex
 signs from our lists and add the 13 or 14 modifiers.

This seems to be the right approach.

- Chris Fynn

Re: Stability of WG2

2003-12-17 Thread Christopher John Fynn


 Jim Allan [EMAIL PROTECTED] wrote:

 On the other hand, there is nothing to prevent the Unicode consortium or
 any other body or any single person from creating a new *additional*
 corrected set of names if the Unicode consortium or any other body or
 any single person wishes to do so.

 That would just be an alternative list of character names.

Of course anybody can make and use their own name list for their own purposes -
getting a new alternative name list added to the standard is another issue.
There is plenty of disagreement about what the proper name  for many
characters should be - which is probably one of the reasons for the rule that
says once a name is assigned it cannot be changed.

If this rule wasn't there, Unicode and WG2 would get a constant stream of
proposals to correct the name of character U+ - and then have to spend
time on discussing such proposals and voting on them. I think members of UTC
and WG2 have much more useful things to do with their time.


- Chris

Re: Case mapping of dotless lowercase letters

2003-12-17 Thread John Cowan

Christopher John Fynn scripsit:

 It introduces another difficulty though - If there are languages using a
 LATIN SMALL LETTER DOTLESS J

There aren't.  Dotless j as a character (as opposed to a glyph used with
various accents above) is only used in non-IPA phonetic alphabets.

 I think Latin has too long been considered a simple script - if one takes
 into account the number of languages written in Latin script and all the
 additions and modifications used to do this, Latin is a complex script.

Amen.

-- 
I suggest you call for help,                    John Cowan
or learn the difficult art of mud-breathing.    [EMAIL PROTECTED]
        --Great-Souled Sam                      http://www.ccil.org/~cowan
