Re: [OT] CJK - CJC (Re: Corea?)
Anto'nio Martins-Tuva'lkin antonio at tuvalkin dot web dot pt wrote: Every language, whose speaking community ever contacted others, does it. , f.i., is the Chuvash name for neighbouring , which is probably still known in English as Gorky, a clumsy transcription of the 1934-1991 name . No, it's Nizhniy Novgorod to me. I don't think I'll respond to the rest of Anto'nio's charming and respectful post. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Case mapping of dotless lowercase letters
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Well Outlook 2000 is unable to represent any e with ogonek and trema of your example. So, despite they are canonically equivalent, they are rendered differently: Everything rendered perfectly over here, on Windows 95 and Outlook Express 5 (and Uniscribe). You might try switching to Lucida Sans Unicode, if you have it. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Case mapping of dotless lowercase letters
[EMAIL PROTECTED] wrote: [...] Note that ß (sharp s) casefolds to ss, and ſ (long s) casefolds to s. So straße, straſse, and strasse also both map to the same (strasse) subname. [...] According to my Duden, sharp s doesn't uppercase to SS when it is in a name. So 'Großmann' and 'Grossmann' should get distinct domains, where available. BTW, the whole thread on IDN domain names which can be mistaken seems rather pointless. It is an old problem, explored by registering misspellings or names with and without a hyphen. If there is a possibility of confusion, then there is a possibility of a lawsuit, and the older rights and larger legal department will win. AFAIK mircosoft.com was killed this way (whereas rnicrosoft.com is being tolerated, strange). Regards, Peter Jacobi
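The folding behaviour described above can be checked with a short Python sketch. It uses Python's built-in str.casefold() and str.upper(), which follow the default (locale-independent) Unicode mappings; whether a given registry's IDN tables behave the same way is an assumption, not something this snippet shows.

```python
# Sketch: default Unicode case folding as exposed by Python's str.casefold().
# A registry's actual IDN tables are not consulted here.
names = ["straße", "straſse", "strasse", "STRASSE"]
folded = {name: name.casefold() for name in names}
for original, key in folded.items():
    print(f"{original!r} -> {key!r}")

# The default uppercase mapping of sharp s is SS, so the two spellings of
# the surname collapse to one key under either operation.
upper_name = "Großmann".upper()
same_key = "Großmann".casefold() == "Grossmann".casefold()
print(upper_name, same_key)
```

All four spellings end up with the same key, which is why a registry that wants 'Großmann' and 'Grossmann' to be distinct cannot rely on default case folding alone.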
Stability of scientific names, was Stability of WG2
on 2003-12-16 15:27 Peter Kirk wrote: I'm no expert on this... I am. :-) but I thought that species could be transferred from genus to genus as knowledge advances. As John pointed out, the epithet stays the same. And presumably obvious spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are you saying that if the first publication had Brontosuarus as a typo this error would remain for ever? There are errors and then there are errors. Some are correctable, some are not, and botanists and zoologists have different rules about this. An example that's not entirely OT: There was a Russian physician with the last name - a cyrillicization of his German family name Escholtz. His name was commonly written then and today in German form as Johann Friedrich Eschscholtz, the schsch reduplication being a reflection of the Cyrillic spelling. He Latinized (language, not alphabet) his name (a common occurrence among naturalists) to Eschscholzius. He was physician to the Kotzebue expedition from Russia to (among other places) California; the ship's naturalist was Adelbert von Chamisso (author of _Peter Schlemiel_). Chamisso and Eschscholtz were fast friends (and some accounts imply that they were lovers). Chamisso named several new species of organisms for his friend, including the California poppy. In the original description of the California poppy, he named it _Eschscholzia californica_, making the genus name the feminine form of Eschscholtz's Latinized name (this is a common occurrence). In the caption of the illustration of the plant, however, it was spelled _Eschholzia_. But for over a century afterwards, most botanists and horticulturists spelled the genus _Eschscholtzia_, assuming that both spellings in the original description were typographic errors. 
But the rules of nomenclature are very specific about which types of errors can be corrected, and, since there is no obvious correct spelling of Escholtz, *the spelling that accompanied the original description must stand*, and the plant is correctly _Eschscholzia californica_. -- Curtis Clark http://www.csupomona.edu/~jcclark/ Mockingbird Font Works http://www.mockfont.com/
RE: [OT] CJK - CJC (Re: Corea?)
Doug Ewell wrote: I'll go farther than that. It's always bothered me that speakers of European languages, including English but especially French, have seen fit to rename the cities and internal subdivisions of other countries. Rightly said! There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. _ Marco
Re: Case mapping of dotless lowercase letters
On 16/12/2003 17:21, Kenneth Whistler wrote: Correcting myself: Note that none of the 3 sets of equivalence classes violates *canonical* equivalence, because none of the 8 sequences involved is canonically equivalent to any other. In other words, no matter which of the 3 approaches you take to case folding, in no instance are you claiming that canonically equivalent sequences are to be interpreted differently. Actually, dotted I *is* canonically equivalent to I, dot above (I overlooked that when compiling the summary.) This implies (since there are no decomposition exclusions) that NFD, used on Turkic text, violates the very sensible rule DO NOT USE COMBINING DOTS WITH I's, and leads to all sorts of potential confusion e.g. that both simple and full case folding and lowercasing applied to NFD Turkic text generate the nonsensical i, dot above. This could be a serious problem - although one that may not be worth fixing. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
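A minimal sketch of the effect described above, using Python's standard unicodedata module. Python's str.lower() applies the default, locale-neutral mapping, so this only illustrates what happens when no Turkic tailoring is in place.

```python
import unicodedata

dotted_capital_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# NFD splits the letter into plain I followed by COMBINING DOT ABOVE.
nfd = unicodedata.normalize("NFD", dotted_capital_i)
print([f"U+{ord(c):04X}" for c in nfd])

# Default lowercasing of the NFD form yields i followed by U+0307.
lowered = nfd.lower()
print([f"U+{ord(c):04X}" for c in lowered])

# Recomposition restores the single code point for the capital letter,
# but there is no precomposed "i with dot above" for the lowercase result.
recomposed_upper = unicodedata.normalize("NFC", nfd)
recomposed_lower = unicodedata.normalize("NFC", lowered)
print(len(recomposed_upper), len(recomposed_lower))
```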
Re: Case mapping of dotless lowercase letters
On 16/12/2003 19:28, John Cowan wrote: Philippe Verdy scripsit: If we just remove any 0307 from the Turkic texts, there is absolutely no problem with Turkic CaseFolding, provided that we also define Turkic-specific uppercase mappings as done above, and don't use the default locale-neutral uppercase mappings of the UCD. There's no reason to expect that there will be any 0307 whatever in Turkish/Azeri texts: it's not a diacritic those languages use, AFAIK. Not normally. But it does appear in Turkic text normalised to NFD as the dotted I's are decomposed. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Case mapping of dotless lowercase letters
There's no reason to expect that there will be any 0307 whatever in Turkish/Azeri texts: it's not a diacritic those languages use, AFAIK. There's no reason to expect that there won't be, particularly if they quote a piece in a language which does use U+0307. -- Jon Hanna | Toys and books http://www.hackcraft.net/ | for hospitals: | http://santa.boards.ie
RE: Case mapping of dotless lowercase letters
Doug Ewell wrote: Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Well Outlook 2000 is unable to represent any e with ogonek and trema of your example. So, despite they are canonically equivalent, they are rendered differently: Everything rendered perfectly over here, on Windows 95 and Outlook Express 5 (and Uniscribe). You might try switching to Lucida Sans Unicode, if you have it. I have Lucida Sans Unicode with Office. But there's a difference between Outlook (2000) and Windows XP's Outlook Express 6 here, even though they are supposed to share the same Uniscribe engine (or maybe there's a parallel version of Uniscribe used only in Office 2000, updated with Office Update separately from Windows, and not updated along with Outlook Express within Windows Update)...
Re: Stability of WG2
On 16/12/2003 19:58, John Cowan wrote: Peter Kirk scripsit: I'm no expert on this... but I thought that species could be transferred from genus to genus as knowledge advances. True enough, but the specific epithet remains the same, and the old names are still available (as the jargon has it) though no longer valid (what I was calling preferred in my previous post). Linnaeus himself, working with two different descriptions of chimps, split them into Homo troglodytes and Simia satyrus (which latter also included bonobos and orangutans); when the mistake was cleared up, the specific epithet troglodytes, being the older, was retained for chimps, whereas bonobos got satyrus, both now in the new genus Pan; orangs were moved to Pongo and given the new epithet pygmaeus. (There's now a move underfoot to move all of these, plus gorillas, into Homo; I don't give it much chance, though I think it's a cool idea.) Nobody would call chimps Homo troglodytes, or orangs Simia satyrus, today, but those names can't ever be assigned to other species in future. (If chimps were folded into Homo, they would be H. troglodytes again.) And that is more or less what I would like to see with Unicode character names. Old names can remain valid as deprecated synonyms (or perhaps non-deprecated synonyms e.g. if Corean becomes officially preferred but Korean is still in widespread use), and not reusable for other characters, but should be gradually replaceable by new, correct or updated names. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Case mapping of dotless lowercase letters
On 16/12/2003 14:59, Kent Karlsson wrote: ... Peter Kirk wrote: If the Swedish registry allows all the letters used in Swedish and Sami, and far eastern registries allow Chinese characters, the Turkish and Azerbaijani registries should allow, and be allowed to allow, all the letters of the alphabets of their national languages. Note that ß (sharp s) casefolds to ss, and ſ (long s) casefolds to s. So straße, straſse, and strasse also both map to the same (strasse) subname. The difference here is that Germans recognise ss and sharp s as variant spellings in the same words, whereas in Turkish i and dotless i are quite different letters, just as in Swedish, Turkish and German o and o umlaut are quite different letters. I know Germans tolerate o umlaut written as oe, but I don't think Turks do. But surely the whole point of getting away from ASCII-only domain names is to respect national and language-specific alphabets. What is needed for Germany and Sweden should not be denied to Turkey. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: [OT] CJK - CJC (Re: Corea?)
Quoting Marco Cimarosti [EMAIL PROTECTED]: Doug Ewell wrote: I'll go farther than that. It's always bothered me that speakers of European languages, including English but especially French, have seen fit to rename the cities and internal subdivisions of other countries. Rightly said! There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. I doubt Christians mean offence when they refer to Jesus through any of the countless transcriptions, spellings and pronunciations used in various languages. I think this is analogous to assuming that anyone dreaming of packing it all in and buying a villa in Provence similarly means no offence when expressing that desire in English (Zapan though would appear to be a different matter). -- Jon Hanna | Toys and books http://www.hackcraft.net/ | for hospitals: | http://santa.boards.ie
RE: Case mapping of dotless lowercase letters
Peter Kirk wrote: This implies (since there are no decomposition exclusions) that NFD, used on Turkic text, violates the very sensible rule DO NOT USE COMBINING DOTS WITH I's, and leads to all sorts of potential confusion e.g. that both simple and full case folding and lowercasing applied to NFD Turkic text generate the nonsensical i, dot above. This could be a serious problem - although one that may not be worth fixing. Yes, NFD is an issue, but not a critical one, because the decomposition is canonical, and not excluded from recomposition. However you're wrong here: only Full CaseFolding generates i, dot-above from dotted-I, not the default lowercase mapping in the UCD which is just left unchanged, or the locale-specific tr/az lowercase mapping which maps it to (soft-dotted-)i. Typical Turkish and Azeri texts will not use dot-above, except in the NFD form I, dot-above for dotted-I, which is just needed because of the Full CaseFolding mapping to make it respect canonical equivalence. I do hope that dotless-j and dotted-J will avoid these confusions, by not trying to decompose dotted-J in the NFD form, and by not generating j, dot-above in Full CaseFolding of dotted-J, but just (soft-dotted-)j. Or will it add more confusion there, if j is treated differently than i?
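For readers who want to experiment with the locale-specific tr/az mappings discussed in this thread, here is a rough Python sketch. The helper names turkic_lower and turkic_upper are hypothetical (Python's standard library has no locale-aware case mapping), and recomposing with NFC before mapping is just one assumed way to get rid of a stray U+0307 after I; it also recomposes everything else in the string.

```python
import unicodedata

DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def turkic_lower(text: str) -> str:
    """Hypothetical helper: lowercase using Turkish/Azeri rules for I."""
    # Recompose first so that I + U+0307 becomes the single code point U+0130.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch == DOTTED_CAPITAL_I:
            out.append("i")              # dotted capital -> ordinary i
        elif ch == "I":
            out.append(DOTLESS_SMALL_I)  # plain capital -> dotless i
        else:
            out.append(ch.lower())       # everything else: default mapping
    return "".join(out)


def turkic_upper(text: str) -> str:
    """Hypothetical helper: uppercase using Turkish/Azeri rules for i."""
    return "".join(DOTTED_CAPITAL_I if ch == "i" else ch.upper() for ch in text)


print(turkic_lower("I\u0307STANBUL"), turkic_lower("ISPARTA"))
print(turkic_upper("istanbul"), turkic_upper("\u0131sparta"))
```

With these rules no i followed by U+0307 is produced, even when the input was in NFD.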
RE: [OT] CJK - CJC (Re: Corea?)
Marco Cimarosti wrote: Doug Ewell wrote: I'll go farther than that. It's always bothered me that speakers of European languages, including English but especially French, have seen fit to rename the cities and internal subdivisions of other countries. Rightly said! There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. Or even Aix-la-Chapelle to Aachen because that's its _current_ German name (the French name was official in the history, and is still used in French). Cities sometimes change name, some of them being famous like the _current_ Saint-Pétersbourg (French name revived in Russia with just a transliteration, the Latin transcription being also widely used by Russians) which has also been Léningrad or Pétrograd or Stalingrad (in the Latin transliteration of the official and changing Russian script name, this Latin transliteration changing a bit among various languages which used them), and even Saint-Pétersbourg officially for some time in the tsar's Russia.
Re: Stability of scientific names, was Stability of WG2
Hello, 2003-12-17T11:06:32Z Curtis Clark [EMAIL PROTECTED] wrote: on 2003-12-16 15:27 Peter Kirk wrote: I'm no expert on this... I am. :-) but I thought that species could be transferred from genus to genus as knowledge advances. As John pointed out, the epithet stays the same. And presumably obvious spelling mistakes are corrected (contrast FHTORA in U+1D0C5), or are you saying that if the first publication had Brontosuarus as a typo this error would remain for ever? There are errors and then there are errors. Some are correctable, some are not, and botanists and zoologists have different rules about this. An example that's not entirely OT: There was a Russian physician with the last name - a cyrillicization of his German family name He was actually. You forgot the soft sign. (I'm not sure everyone will see the name - the editor replaced the encoding with windows-1251, and there's no UTF-8 support). Regards, -- Alexander Savenkovhttp://www.xmlhack.ru/ [EMAIL PROTECTED] http://www.xmlhack.ru/authors/croll/
RE: [OT] CJK - CJC (Re: Corea?)
Or even Aix-la-Chapelle to Aachen because that's its _current_ German name (the French name was official in the history, and is still used in French). You better tell the Bundespost about this :-) AFAIK (not being a German) Aachen is very much the current German name. (go to http://www.deutschepost.de/ and search for PLZ Suchen)
RE: Case mapping of dotless lowercase letters
The difference here is that Germans recognise ss and sharp s as variant spellings in the same words, Not altogether, taking into account spelling rules. They are *ordered* the same, but that is another matter. whereas in Turkish i and dotless i are quite different letters, just as in Swedish, Turkish and German o and o umlaut are quite different letters. I know Germans tolerate o umlaut written as oe, No, again an ordering rule, not a spelling rule. It has been used as fallback too, like ss for ß. But it is not correct spelling. (I will not go into the German spelling reform, since I'm not well familiar with it.) but I don't think Turks do. But surely the whole point of getting away from ASCII-only domain names is to respect national and language-specific alphabets. What is needed for Germany and Sweden should not be denied to Turkey. There was never an intent to deny Turkey anything. The thing was that the uppercase of i is I (usually) and the uppercase of ı is also I, so i, I, and ı used to be folded together (to i) in the drafts for IDN. Apparently that was deemed too harsh and was modified. (I think I complained at some point, but it wasn't modified then, but apparently much later.) Still for IDNs there is no language dependence in the case folding, as there is for the case *mappings*. So I is turned into i (not ı) also for Turkish for IDNs. On the other hand, domain names are most often written in lowercase anyway. /kent k
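The language-independent folding described above can be illustrated with Python's str.casefold(). This is a sketch under an assumption: casefold() implements the default Unicode full case folding, which is not the same thing as the tables actually used for IDN, so it shows the principle rather than the IDN behaviour itself.

```python
# Sketch: default Unicode case folding has no Turkish tailoring.
samples = {
    "I": "LATIN CAPITAL LETTER I",
    "i": "LATIN SMALL LETTER I",
    "\u0130": "LATIN CAPITAL LETTER I WITH DOT ABOVE",
    "\u0131": "LATIN SMALL LETTER DOTLESS I",
}
folded = {ch: ch.casefold() for ch in samples}
for ch, label in samples.items():
    result = " ".join(f"U+{ord(c):04X}" for c in folded[ch])
    print(f"U+{ord(ch):04X} {label} -> {result}")
```

Plain I folds to i, and dotless i is left alone, so it stays distinct from i; the dotted capital folds to i followed by U+0307.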
RE: Case mapping of dotless lowercase letters
Far be it from me to stir things up even further, but... QUESTION - Is the rendering of {U+0069} {U+0302} (that's i, combining circumflex above) locale-dependent? I may have got this totally wrong, but it occurs to me that in non-Turkic fonts, U+0069 is "soft-dotted". That is, the dot disappears in the presence of any COMBININGABOVE modifier. But in Turkic, U+0069 is "hard-dotted", so the dot must not be removed if a circumflex is added. I freely admit I don't know whether Turkic uses circumflex or not, but the question will work just as well with any COMBININGABOVE modifier. If this is so, how can a character be considered "soft-dotted" in one locale and "hard-dotted" in another? Would it not make more sense to have not two, but three different kinds of lowercase i: non-dotted i, soft-dotted i and hard-dotted i? (And similarly for uppercase.) Of course, then you might as well invent COMBINING SOFT DOT ABOVE so we can use it elsewhere. It gets better. (You're gonna hate me.) If we then make the set { soft-dotted-i, soft-dotted-I, non-dotted-i, non-dotted-I } a casefold equivalence class which lowercases to soft-dotted-i (except in the Turkic locale, where it lowercases to non-dotted-i), and uppercases to non-dotted-I in all locales; and if we similarly make { hard-dotted-i, hard-dotted-I } a separate casefold equivalence class lowercasing to hard-dotted-i and uppercasing to hard-dotted-I (in all locales), then all of the problems outlined by Philippe would go away. And we could do the same with j too. Of course - it would have one nasty side-effect. The Turks would then have to use hard-dotted-i instead of soft-dotted-i, but since the characters (in this new scheme) now have completely different meanings, that's fair enough. Hey ho. Just musing. Jill
RE: [OT] CJK - CJC (Re: Corea?)
At 11:30 + 2003-12-17, [EMAIL PROTECTED] wrote: I doubt Christians mean offence when they refer to Jesus through any of the countless transcriptions, spellings and pronunciations used in various languages. It's odd that in English Judas and Jude are distinguished; in the original they are not. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: [OT] CJK - CJC (Re: Corea?)
At 11:04 +0100 2003-12-17, Marco Cimarosti wrote: There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. Nicely said. Subtle irony tends to go over some people's heads on this list though. Eboraco is called Eabhrac in Irish. :-) -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re[2]: [OT] CJK - CJC (Re: Corea?)
Hello, 2003-12-17T14:36:37Z Philippe Verdy [EMAIL PROTECTED] wrote: Marco Cimarosti wrote: Doug Ewell wrote: I'll go farther than that. It's always bothered me that speakers of European languages, including English but especially French, have seen fit to rename the cities and internal subdivisions of other countries. Rightly said! There is reason to rename Colonia to Koln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. Or even Aix-la-Chapelle to Aachen because that's its _current_ German name (the French name was official in the history, and is still used in French). Cities sometimes change name, some of them being famous like the _current_ Saint-Petersbourg (French name revived in Russia with just a It's Saint-Petersburg (or St. Petersburg) if you write in English. The name has German roots, not French ones. transliteration, the Latin transcription being also widely used by Russians) Why would Russians use the Latin transcription for a Russian name? which has also been Leningrad or Petrograd or Stalingrad Stalingrad was the previous name for Volgograd, not St. Petersburg. The initial name was Tsaritsyn. Petrograd on the other hand *was* the name of St. Petersburg in 1914-1924. Leningrad was the name of it in 1924-1991. (in the Latin transliteration of the official and changing Russian script name, this Latin transliteration changing a bit among various languages which used them), and even Saint-Petersbourg officially for some time in the tsar's Russia. I wonder what you meant by the some time part. St. Petersburg was founded in 1703, and therefore stayed St. Petersburg for more than 200 years, that is it was St. Petersburg *most* of the time. You mixed everything up, Philippe. Regards, -- Alexander Savenkov http://www.xmlhack.ru/ [EMAIL PROTECTED] http://www.xmlhack.ru/authors/croll/
RE: Case mapping of dotless lowercase letters
[resending; better set the encoding to UTF-8...] Peter Kirk wrote: ... used on Turkic text, violates the very sensible rule DO NOT USE COMBINING DOTS WITH I's, and leads to all sorts of potential confusion e.g. that both simple and full case folding and lowercasing applied to NFD Turkic text generate the nonsensical i, dot above. This could be a serious problem - although one that may not be worth fixing. i, dot above is not nonsensical. It is used in Lithuanian for such things as i, dot above, tilde above, as well as other additional accents above an i or a j that keeps its dot. /kent k Lithuanian alphabet (not listing all the uppercase accented letters) Aa (,{}{}), Bb, Cc (CHch), , Dd, Ee (, {} {} {} {}), Ff, Gg, Hh, Ii ({i} {i} {i} {}{} {}{}, Yy, , ), Jj ({J}{j}), Kk, Ll ({l}), Mm ({m}), Nn (), Oo (, , ), Pp, [Qq], Rr (r), Ss, , Tt, Uu ({} {} {}), Vv, [Ww], [Xx], Zz,
RE: Case mapping of dotless lowercase letters
Would it not make more sense to have not two, but three different kinds of lowercase i: non-dotted i, soft-dotted i and hard-dotted i?. (And similarly for uppercase). Of course, then you might as well invent COMBINING SOFT DOT ABOVE so we can use it elsewhere. I should have mentioned that in this hypothetical scheme, the following would be canonically equivalent: soft-dotted-i = non-dotted-i combining-soft-dot-above soft-dotted-I = non-dotted-I combining-soft-dot-above hard-dotted-i = non-dotted-i combining-dot-above hard-dotted-I = non-dotted-I combining-dot-above Sorry for the omission in previous email Jill
RE: Case mapping of dotless lowercase letters
Peter Kirk wrote: ... used on Turkic text, violates the very sensible rule DO NOT USE COMBINING DOTS WITH I's, and leads to all sorts of potential confusion e.g. that both simple and full case folding and lowercasing applied to NFD Turkic text generate the nonsensical i, dot above. This could be a serious problem - although one that may not be worth fixing. i, dot above is not nonsensical. It is used in Lithuanian for such things as i, dot above, tilde above, as well as other additional accents above an i or a j that keeps its dot. /kent k Lithuanian alphabet (not listing all the uppercase accented letters) Aa (Àà, Áá Ãã Aa {A´}{a´}), Bb, Cc (CHch), Cc, Dd, Ee (Ee, Ee è é ? e {e´} {e~} e {e´} {e~}), Ff, Gg, Hh, Ii (Ì{i?`} Í{i?´} I{i?~} Ii {I´}{i?´} {I~}{i?~}, Yy, Ýý, ??), Jj ({J~}{j?~}), Kk, Ll ({l~}), Mm ({m~}), Nn (Ññ), Oo (ò, ó, õ), Pp, [Qq], Rr (r~), Ss, , Tt, Uu (ù ú u Uu {u´} {u~} Uu {u´}), Vv, [Ww], [Xx], Zz,
RE: Case mapping of dotless lowercase letters
Philippe Verdy wrote: I do hope that dotless-j and dotted-J ... Dotless j. That's in the works. A precomposed dotted uppercase J? No, I think I can predict that there will be no such encoded character. If you want a dotted uppercase J, use J, combining-dot-above. /kent k
Arabic Presentation Forms-A
I was validating some internal processing of strings, and I found these intriguing decompositions for Arabic Presentation Forms-A. I was surprised to see that they are compatibility decomposed in (isolated) rows from bottom to top, in a reading order distinct from the normal Arabic reading order for rows, but of course with the same right-to-left reading order: #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?; # RIAL SIGN fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?; The Arial Unicode MS font does not have a glyph for the Rial currency sign so I won't comment lots about it, even if it's a special ligature of its component letters: - where the medial form of U+06CC ARABIC LETTER FARSI YEH (ی) is shown on charts only as two dots (and not with its Arabic letter alef maksura base form, as the comment in Arabic chart suggests for Arabic letter yeh), which is - located on below-left of the medial form of U+0627 (ا), - and where the initial form of U+0631 (ر) kerns below its next two characters (sometimes with an additional kashida below its next three characters). However the general layout is still one row, so the decomposition seems quite reasonable; it's just regrettable that it's not found in Arial Unicode MS (unless this Rial sign is traditional and no more in actual use today). I'm not sure that the compatibility decomposition gives the accurate form for rendering the traditional glyph coded for the currency symbol... -- Now I have this one: #code;name;cc; # nfd;nfkdFolded; # #CHAR?; NFD?; NFKDFOLDED?; FDFA;ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM;0; FDFA;isolated 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064a 0647 0020 0648 0633 0644 0645; # ??; ??; ??? ?; #code;name;cc; # nfd;nfkdFolded; # #CHAR?; NFD?; NFKDFOLDED?; FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0; FDFB;isolated 062c 0644 0020 062c 0644 0627 0644 0647; # ??; ??; ?? 
??; I note that the Unicode charts show them with their complex and highly ligated form, which corresponds to the Arabic tradition in the Quran. This is apparently not implemented in Microsoft fonts which just render only the first two on only 2 bottom-to-top rows. The compatibility decomposition creates 4 space-separated words WORD1, WORD2, WORD3, WORD4 that would be rendered normally either in one row as: WORD4 WORD3 WORD2 WORD1 i.e. ??? ? or on multiple narrow rows as: WORD1 or WORD2 WORD1 WORD2 WORD4 WORD3 WORD3 WORD4 i.e. ??? or ??? ? ? using the top-to-bottom normal layout of plain-text rows in Arabic. I can understand that it's difficult to make them fit more ideally like this (with kashidas noted by underscores) : WORD2 ___WORD1 W___ORD3 W___ORD4 i.e. actually this order: ??? ? to better match the actual glyph in charts which also uses kashidas, given the height constraints in fonts, and the difficulty to create the traditional complex kerning between rows, but the current presentation of the alternate glyph chosen in Arial Unicode MS does not seem intuitive. Isn't there some requirement in Unicode to not change the common layout which is part of the character identity and structural for the script? Such interpretation problem does not occur in the presentation of U+FDFB (which also has two rows in the representative glyph of Arabic Presentation Forms-A charts). Is there an error here? --- Now with this one: #code;name;cc; # nfd;nfkdFolded; # #CHAR?; NFD?; NFKDFOLDED?; FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0; FDFB;isolated 062c 0644 0020 062c 0644 0627 0644 0647; # ??; ??; ?? ??; The decomposition into WORD1 WORD2 follows the same principles but is less complex, and it uses this layout: WORD2 WORD1 or: WORD1 WORD2 The second layout is used in Arial Unicode MS to render the ligature. --- Now I don't know why the last very complex but marvelous ligature U+FDFD in Unicode does not have a compatibility decomposition. 
In fact I can't decipher clearly to what Arabic letters the ligature corresponds (this is not documented in Unicode, except through its English name, which is probably too far from the Arabic name to allow this identification). More generally, my question is related to the allowed modification of layouts for ligature glyphs in fonts: are they allowed, and how could they acceptably be represented when the plain-text character is not compatibility-decomposed but rendered with a single glyph...
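The decompositions quoted in this message can be reproduced with Python's standard unicodedata module. This is only a sketch for inspecting the data; it says nothing about how a font should lay out the resulting words.

```python
import unicodedata

# The four Arabic Presentation Forms-A characters discussed above.
code_points = (0xFDFA, 0xFDFB, 0xFDFC, 0xFDFD)

report = {}
for cp in code_points:
    ch = chr(cp)
    mapping = unicodedata.decomposition(ch)    # empty string if there is none
    nfd = unicodedata.normalize("NFD", ch)     # canonical: leaves these alone
    nfkd = unicodedata.normalize("NFKD", ch)   # compatibility decomposition
    words = nfkd.split(" ")                    # space-separated words, if any
    report[cp] = (mapping, nfd == ch, len(words))
    print(f"U+{cp:04X} {unicodedata.name(ch)}")
    print(f"  decomposition: {mapping or '(none)'}")
    print(f"  unchanged by NFD: {nfd == ch}, words after NFKD: {len(words)}")
```

U+FDFA decomposes into four space-separated words and U+FDFB into two, matching the WORD1 to WORD4 description; U+FDFD has no decomposition mapping at all.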
Re: [OT] CJK - CJC (Re: Corea?)
Alexander Savenkov scripsit: You mixed everything up, Phillippe. As we say in America, General Grant [1822-1885] Still Dead. -- Do what you will, John Cowan this Life's a Fiction[EMAIL PROTECTED] And is made up of http://www.reutershealth.com Contradiction. --William Blake http://www.ccil.org/~cowan
June Ashton 1999 thesis U Sydney
Elaine Keown in Austin Hi, I wanted to bring the following dissertation--listed at the bottom--to the attention of the e-discussion groups. I'm going to try to have some American research library or University Microfilms make it available here in the U.S. Apparently Dr. Ashton, an Aussie scholar, compared Greek, Coptic, etc. scribal marks with each other--I believe she decided everything was Egyptian, ultimately. The dissertation is relevant for encoding Dead Sea scrolls in Hebrew - Aramaic - Greek etc, TLG, Coptic, and (probably) Egyptian demotic and hieratic. I think Egyptian demotic or hieratic should be done soon.--Elaine U SYDNEY DISSERTATION: The persistence, diffusion and interchangeability of scribal habits in the ancient Near East before the codex / by June Ashton. Publisher 1999.
Re: [OT] CJK - CJC (Re: Corea?)
Michael Everson scripsit: It's odd that in English Judas and Jude are distinguished; in the original they are not. Or for that matter that Jesus and Joshua are distinguished, but here we can lay the blame on Greek vs. Hebrew. -- Well, I'm back. --SamJohn Cowan [EMAIL PROTECTED]
RE: [OT] CJK - CJC (Re: Corea?)
Michael Everson wrote: At 11:04 +0100 2003-12-17, Marco Cimarosti wrote: There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. Nicely said. Subtle irony tends to go over some people's heads on this list though. Especially if one forgets an essential no. :-( It should have been There is NO reason to rename... Eboraco is called Eabhrac in Irish. :-) So, that's who set the bad example in the first place! When the Angles came they said: if Britanni can mangle place names, why shouldn't Ingevones? :-) Ciao. Marco
Re: Stability of WG2
Peter Kirk peterkirk at qaya dot org wrote: Nobody would call chimps Homo troglodytes, or orangs Simia satyrus, today, but those names can't ever be assigned to other species in future. (If chimps were folded into Homo, they would be H. troglodytes again.) And that is more or less what I would like to see with Unicode character names. Old names can remain valid as deprecated synonyms (or perhaps non-deprecated synonyms e.g. if Corean becomes officially preferred but Korean is still in widespread use), and not reusable for other characters, but should be gradually replaceable by new, correct or updated names. I really think this is a deceased Equus caballus. As a programmer, I can't personally imagine designing a program that relies on the Unicode names to identify characters uniquely, instead of relying on the code points. Of course the names have to be unique, but beyond that it certainly wouldn't bother me or any of the programs I've written if some of the names were changed from one version to the next. But apparently, for whatever reason, it IS very important to some programmers and programs, and they have made it very clear for years and years now that the names *must not change* in the interest of stability. That is the policy of UTC and WG2, and it will not be changed simply because anyone -- an individual or an entire committee -- determines that name A' (or B) is more appropriate for a character than name A. That goes for glaring mistakes like OI and HANGZHOU, and for typos like FHTORA, and it would go for KOREAN as well. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
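As a small illustration of the point about identifying characters by code point rather than by name, here is a Python sketch using the standard unicodedata module. The names printed are whatever the Unicode data shipped with the interpreter contains.

```python
import unicodedata

# Two of the characters mentioned in this thread, identified by code point.
fthora = "\U0001D0C5"  # the character whose name contains the FHTORA typo
oi = "\u01A2"          # the character named OI

fthora_name = unicodedata.name(fthora)
oi_name = unicodedata.name(oi)
print(f"U+{ord(fthora):04X} {fthora_name}")
print(f"U+{ord(oi):04X} {oi_name}")

# Because the name has never changed, a lookup by the published name still
# round-trips to the same code point.
roundtrip = unicodedata.lookup(fthora_name)
print(roundtrip == fthora)
```

A program keyed on the code points is unaffected by either spelling; a program keyed on the names keeps working only because they are frozen.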
Re: Case mapping of dotless lowercase letters
On 17/12/2003 05:24, Kent Karlsson wrote: ... There was never an intent to deny Turkey anything. The thing was that the uppercase of i is I (usually) and the uppercase of ı is also I, so i, I, and ı used to be folded together (to i) in the drafts for IDN. Apparently that was deemed too harsh and was modified. (I think I complained at some point, but it wasn't modified then, but apparently much later.) Still for IDNs there is no language dependence in the case folding, as there is for the case *mappings*. So I is turned into i (not ı) also for Turkish for IDNs. On the other hand, domain names are most often written in lowercase anyway. /kent k OK, that sounds reasonable now. I guess Turks and Azeris will just have to make sure they use lower case domain names, which makes more sense anyway. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
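To make the distinction between locale-independent folding and language-dependent mapping concrete, here is a minimal Python sketch. It uses Python's built-in str.lower, str.upper and str.casefold only as stand-ins for the default Unicode operations; it is not the IDN nameprep algorithm, and turkish_lower is a hypothetical helper written for illustration.

```python
# Default, locale-independent case operations (what Python's str methods do).
DOTLESS_I = "\u0131"     # LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase with the Turkish/Azeri tailoring."""
    # Map the two special capitals first, then apply the default mapping.
    return text.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i").lower()


# Default mapping: dotless i uppercases to plain I, and I lowercases to dotted i.
print(DOTLESS_I.upper() == "I")
print("I".lower() == "i")

# Default folding has no language dependence: I folds to i, never to dotless i.
print("I".casefold() == "i")
print("STRASSE".casefold() == "stra\u00dfe".casefold())

# The tailored mapping keeps the Turkish pairs I / dotless i and dotted I / i apart.
print(turkish_lower("ISPARTA"))
print(turkish_lower("\u0130ZM\u0130R"))
```

Under the default folding a name typed with capital I and one typed with lowercase i collapse to the same string, which is why sticking to lowercase domain names sidesteps the problem for Turkish and Azeri.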
RE: Arabic Presentation Forms-A
Philippe Verdy wrote: #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?; # RIAL SIGN fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?; The Arial Unicode MS font does not have a glyph for the Rial currency sign so I won't comment lots about it, even if it's a special ligature of its component letters: - where the medial form of U+06CC ARABIC LETTER FARSI YEH (?) is shown on charts only as two dots (and not with its Arabic letter alef maksura base form, as the comment in Arabic chart suggests for Arabic letter yeh), which is I am not sure I understand what you are asking, but it is quite normal that the initial and medial forms of the letters Beh, Teh, Theh, Noon and Yeh lose their tooth and are thus recognizable only by their dots. Similarly, Seen and Sheen often lose their three teeth. I find this particularly puzzling with the initial and medial forms of Seen, which becomes a simple straight line in most calligraphic styles. - located on below-left of the medial form of U+0627 (?) , U+0627 is Alif, so it has no medial form. - and where the initial form of U+0631 (?) kerns below its next two characters (sometimes with an additional kashida below its next three characters). This too is quite normal: the tail of Reh, Zain and Waw often kerns below the next letter. Compare it to Latin lowercase j, which has a similar behavior. _ Marco
Cuneiform Base Signs Plus Modifiers
[I am sending this email to both the Initiative for Cuneiform Encoding email list, [EMAIL PROTECTED], and the general Unicode email list, [EMAIL PROTECTED], in order to get comments from both the cuneiform and Unicode communities.] From the very first Initiative for Cuneiform Encoding conference at Johns Hopkins University in November 2000, I, along with all others I am aware of, have accepted unquestioningly the suggestion that we encode the complex Sumero-Akkadian cuneiform signs as separate code points in Unicode. For the non-cuneiformists on these lists, one way cuneiformists categorize cuneiform signs is as simple, compound, and complex signs - a simple sign being one not formed by combining two or more signs, a compound sign being one formed by postfixing one or more signs to form a grapheme cluster; and a complex sign being one formed by infixing one sign inside another to form a new sign. At both ICE conferences we decided to encode simple and complex signs but not compound signs. Recently I have had second thoughts about encoding complex signs. Modification of base, or simple, signs was a productive process for making new signs in the earlier periods of cuneiform usage, and included such modifications as adding or subtracting wedges, rotating signs, infixing signs, etc. (For some examples of how the ancient scribes modified base signs to form new complex signs see http://www.jhu.edu/ ice/basesigns/.) Instead of encoding all 875 post-archaic, base and complex cuneiform signs, we could instead encode the 280 base signs plus a dozen or so sign modifiers. (I am not including in these approximate figures the 75 or so numerical signs being proposed for encoding.) This would be somewhat analogous to encoding a, e, the acute accent, and the grave accent instead of encoding a with acute, a with grave, e with acute, etc. 
Encoding base signs with modifiers would more closely mirror, in the encoding, the way the script system itself actually worked and it would more easily accommodate modern research in archaic cuneiform, a stage in cuneiform script development we have all decided not to encode for now due to the current provisional state of its scholarship. By providing in the encoding the base signs along with their modifiers cuneiformists working in archaic and other periods could generate newly discovered or newly analyzed complex signs ad hoc, without having to go through the time-consuming and expensive Unicode/ISO standardization process. Compound and complex sign realization would then simply be a matter of the coordination of input methods with fonts, something now doable by end users with modern computer operating systems. (This, of course, assumes that we are more likely to find new combinations and modifications of existing base signs than to find new base signs themselves. At any rate, when we do find new base signs we need to encode them anyway.) To most cuneiformists, of course, the encoding underpinnings would all be hidden by input methods and fonts. One would simply type the expected SHUD3 and the input method would map it to 3 code points, KA INFIX and SHU (mouth sign with hand sign infixed), and the font would render it as one complex sign (meaning to pray). And from a practical point of view encoding only the base signs and their modifiers would be easy for us to do - we need only remove the complex signs from our lists and add the 13 or 14 modifiers. Respectfully, Dean A. Snyder Scholarly Technology Specialist Library Digital Programs, Sheridan Libraries Garrett Room, MSE Library, 3400 N. Charles St. Johns Hopkins University Baltimore, Maryland, USA 21218 office: 410 516-6850 fax: 410-516-6229 Manager, Digital Hammurabi Project: www.jhu.edu/digitalhammurabi
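The input-method step described above (type SHUD3, get the three code points KA, INFIX, SHU) can be sketched roughly as follows. This is only an illustration: no cuneiform code points are assumed to exist, so every value below is a placeholder in the Private Use Area, and the table names and the encode function are hypothetical.

```python
# Placeholder Private Use Area code points; purely illustrative assignments.
BASE_SIGNS = {
    "KA": "\ue000",   # mouth sign (placeholder value)
    "SHU": "\ue001",  # hand sign (placeholder value)
}
MODIFIERS = {
    "INFIX": "\ue100",  # hypothetical infixing modifier
}

# A complex sign is stored as a recipe over base signs and modifiers.
COMPLEX_SIGNS = {
    "SHUD3": ("KA", "INFIX", "SHU"),
}


def encode(sign_name: str) -> str:
    """Map a typed sign name to its sequence of code points."""
    if sign_name in BASE_SIGNS:
        return BASE_SIGNS[sign_name]
    parts = COMPLEX_SIGNS[sign_name]
    # Each part is either a base sign or a modifier.
    return "".join(BASE_SIGNS.get(p) or MODIFIERS[p] for p in parts)


encoded = encode("SHUD3")
print([f"U+{ord(c):04X}" for c in encoded])
```

A newly analyzed complex sign would then need only a new entry in the recipe table and a matching font rule, not a new code point.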
Re: Stability of WG2
Doug Ewell wrote: But apparently, for whatever reason, it IS very important to some programmers and programs, and they have made it very clear for years and years now that the names *must not change* in the interest of stability. On the other hand, there is nothing to prevent the Unicode consortium or any other body or any single person from creating a new *additional* corrected set of names if the Unicode consortium or any other body or any single person wishes to do so. That would just be an alternative list of character names. There would be nothing to prevent any particular application or language or individual person or standard using such an alternative list in preference to the older standard Unicode list of names, if indeed anyone is really using these names for much of anything. The only real purpose I can see the names serve is that writing something like MODIFIER LETTER SMALL SCHWA is more easily understood by a reader who doesn't have TUS handy than is U+1D4A. At least the reader knows that some kind of schwa is being referenced (if the reader knows what a schwa is.) And if they come across the same name in another article about phonetic characters in Unicode they can be reasonably sure the same character is being discussed. Also if there is either a typo in the name or in the Unicode identifying code then one of these can serve as a check on the other. But I would not be surprised if at some time in the future a second set of names with obvious errors corrected were to be created. Jim Allan
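The point that a name and a code point can check each other is easy to try out. The following is a small sketch using Python's standard unicodedata module; check_reference is a hypothetical helper name, and the result depends on the Unicode version bundled with the Python build in use.

```python
import unicodedata


def check_reference(code_point: int, name: str) -> bool:
    """Return True if the cited name matches the cited code point."""
    # An empty default avoids an exception for unnamed code points.
    return unicodedata.name(chr(code_point), "") == name


# A consistent citation: name and code point agree.
print(check_reference(0x1D4A, "MODIFIER LETTER SMALL SCHWA"))

# A typo in the code point is caught because the name no longer matches.
print(check_reference(0x1D49, "MODIFIER LETTER SMALL SCHWA"))

# Going the other way: recover the character from its name.
print(f"U+{ord(unicodedata.lookup('MODIFIER LETTER SMALL SCHWA')):04X}")
```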
Re: Case mapping of dotless lowercase letters
On 17/12/2003 05:30, Arcane Jill wrote: Far be it from me to stir things up even further, but... QUESTION - Is the rendering of {U+0069} {U+0302} (that's i, combining circumflex above) locale-dependent? I may have got this totally wrong, but it occurs to me that in non-Turkic fonts, U+0069 is soft-dotted. That is, the dot disappears in the presence of any COMBININGABOVE modifier. But in Turkic, U+0069 is hard-dotted, so the dot must not be removed if a circumflex is added. I freely admit I don't know whether Turkic uses circumflex or not, but the question will work just as well with /any/ COMBININGABOVE modifier. ... Turkish does in fact use circumflex above a, i and u, although rather rarely and often dropped today (but no other diacritics above except for umlaut as part of regular letters, no umlaut on i). i with circumflex is especially rare but is sometimes written on Arabic loan words like millî (/national/). Note carefully that this is pronounced as a variant of *dotted* i, and replaced by dotted i (not dotless i) when the circumflex is dropped, but it is written undotted in both upper and lower case. Note the following found from a Google search, which gives some upper and lower case equivalents. TÜRK *MİLLÎ* KODLANDIRMA SİSTEMİ. *...* . *Millî* Kodlandırma Sisteminin temelini ... Conclusion: the right thing even for Turkish is to drop the dot on i before a circumflex. But by the same argument we would also want to drop the dot on dotless I. Oh dear, I have just made the whole issue even more complicated! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Arabic Presentation Forms-A
Philippe Verdy wrote: #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?; # RIAL SIGN fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?; I should have temporarily disabled my email filter to send this one. All UTF-8 codes were replaced by ISO-8859-1 characters, substituting '?' for the Arabic characters... I hope that the codepoints that I gave explicitly will still make my message readable... Well, in your message you comment on the form shown in the charts, and I don't criticize them. I was just wondering if their rendering in Arial Unicode MS is correct and conforming to the required need to keep the interpretation, and in what measure the beautiful ligatures found in Unicode charts are normative, as there's a very large difference with what Arial Unicode MS does, with a distinct character layout, and no ligature, no kerning kashidas, and in some cases not even the contextual shaping of its embedded letters, so that the Arial Unicode MS font renders these ligatures as their NFKD decomposition rendered in a single square. This may be valid if this was just a ligature, but in that case, why aren't those decompositions canonical like the ffi ligature?
RE: Case mapping of dotless lowercase letters
Peter Kirk wrote: Conclusion: the right thing even for Turkish is to drop the dot on i before a circumflex. I agree. The letter is rare enough to not create an exception here for the removal of dot on the soft-dotted i followed by circumflex (which is needed much more often in other languages that use '' and '. But by the same argument we would also want to drop the dot on dotless I. I think you meant But by the same argument we would also want to drop the dot on DOTTED I. I would not recommend it; this would make things even worse and more complicated. If Turkish wants to remove the dot on pseudo-dotted I if followed by a circumflex, the correct thing to do is then to use the ASCII dotless I and add a circumflex or use its canonical equivalent LATIN CAPITAL LETTER I WITH CIRCUMFLEX. With the current specification, both of LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, and LATIN CAPITAL LETTER I WITH CIRCUMFLEX are canonical equivalents and must render the same, without the dot. To display a dot, one can use one of the four canonical equivalents:
LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT ABOVE
(one is the NFC form, another is the NFD form, two others are also possible)
Re: Case mapping of dotless lowercase letters
To display a dot, one can use one of the four canonical equivalents:
LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT ABOVE
(one is the NFC form, another is the NFD form, two others are also possible)
Those four are not all canonically equivalent: circumflex and dot above are both combining class 230, so they interact and their relative order is significant.
RE: Case mapping of dotless lowercase letters
Chris Jacobs wrote: To display a dot, one can use one of the four canonical equivalents:
LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING CIRCUMFLEX
LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT ABOVE
(one is the NFC form, another is the NFD form, two others are also possible)
Those four are not all canonical equivalent since circumflex and dot above are both combining class 230, so they interact.
You're right. Initially I wanted to verify their combining classes to see which form was the NFC or NFD, but I did not need to remember these class values as they effectively combine at the same (above) class. So depending on the letters to encode one can use any of:
NFC: LATIN CAPITAL LETTER I WITH DOT ABOVE, COMBINING CIRCUMFLEX
NFD: LATIN CAPITAL LETTER I, COMBINING DOT ABOVE, COMBINING CIRCUMFLEX
to encode the circumflex above the dot (I think this is what Turkish would use as the dot is considered part of the base letter), or any of:
NFC: LATIN CAPITAL LETTER I WITH CIRCUMFLEX, COMBINING DOT ABOVE
NFD: LATIN CAPITAL LETTER I, COMBINING CIRCUMFLEX, COMBINING DOT ABOVE
to encode the dot above the circumflex (but maybe Turkish will not make a difference here and will read it as a glyph variant)
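The two equivalence pairs above can be checked mechanically. Here is a short sketch with Python's standard unicodedata module; the variable names and the hexes helper are just for illustration.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE


def hexes(text: str) -> str:
    """Show a string as space-separated code points."""
    return " ".join(f"{ord(c):04X}" for c in text)


# Both marks share combining class 230, so normalization never reorders them.
print(unicodedata.combining(CIRCUMFLEX), unicodedata.combining(DOT_ABOVE))

dot_then_circ = "I" + DOT_ABOVE + CIRCUMFLEX
circ_then_dot = "I" + CIRCUMFLEX + DOT_ABOVE

for label, text in (("dot, circumflex", dot_then_circ),
                    ("circumflex, dot", circ_then_dot)):
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    print(f"{label}: NFC = {hexes(nfc)}; NFD = {hexes(nfd)}")

# The two orders stay distinct under normalization, so they are not equivalent.
print(unicodedata.normalize("NFD", dot_then_circ)
      == unicodedata.normalize("NFD", circ_then_dot))
```

Comparing NFD (or NFC) forms is the usual test for canonical equivalence: two strings are canonically equivalent exactly when their normalized forms are identical.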
American English translation of character names (was Re: Stability of WG2)
Jim Allan noted: On the other hand, there is nothing to prevent the Unicode consortium or any other body or any single person from creating a new *additional* corrected set of names if the Unicode consortium or any other body or any single person wishes to do so. That would just be an alternative list of character names. But I would not be surprised if at some time in the future a second set of names with obvious errors corrected were to be created.
And, indeed, some of us have toyed around with the notion of publishing an American English translation of the Unicode names list, including such obvious improvements as:
U+002E FULL STOP -- PERIOD (or DOT)
U+002F SOLIDUS -- SLASH
U+0040 COMMERCIAL AT -- AT SIGN
U+005C REVERSE SOLIDUS -- BACKSLASH
U+005F LOW LINE -- (SPACING) UNDERSCORE
U+00B6 PILCROW SIGN -- PARAGRAPH SIGN
U+0268 LATIN SMALL LETTER I WITH STROKE -- ... BARRED I
U+019B LATIN SMALL LETTER LAMBDA WITH STROKE -- ... BARRED LAMBDA
U+03BB GREEK SMALL LETTER LAMDA -- ... LAMBDA
U+21B0 UPWARDS ARROW WITH TIP LEFTWARDS -- UP ARROW WITH TIP POINTING LEFT
U+21BA ANTICLOCKWISE OPEN CIRCLE ARROW -- COUNTERCLOCKWISE ...
U+FE4E CENTRELINE LOW LINE -- CENTERLINE UNDERSCORE
and so on and so on, including all the obvious errors that people are continuing to worry about. ;-) --Ken
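An additional name list of this kind can sit on top of the formal names without touching them. The following Python sketch assumes a private alias table keyed by code point; INFORMAL_NAMES and display_name are hypothetical and hold only a few entries taken from the list above.

```python
import unicodedata

# Private, informal aliases keyed by code point; formal names are never altered.
INFORMAL_NAMES = {
    0x002E: "PERIOD",
    0x002F: "SLASH",
    0x0040: "AT SIGN",
    0x005C: "BACKSLASH",
    0x005F: "UNDERSCORE",
    0x00B6: "PARAGRAPH SIGN",
}


def display_name(ch: str) -> str:
    """Prefer the informal alias, else fall back to the formal Unicode name."""
    fallback = unicodedata.name(ch, f"U+{ord(ch):04X}")
    return INFORMAL_NAMES.get(ord(ch), fallback)


for ch in "./@\\_\u00b6A":
    print(f"U+{ord(ch):04X}  formal: {unicodedata.name(ch)}  shown as: {display_name(ch)}")
```

Because the table is keyed by code point and not by name, it keeps working whatever the formal names are, which fits the view that programs should identify characters by code point.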
Re: Arabic Presentation Forms-A
Philippe asked: The Arial Unicode MS font does not have a glyph for the Rial currency sign so I won't comment lots about it, even if it's a special ligature of its component letters: it's just regrettable that it's not found in Arial Unicode MS (unless this Rial sign is traditional and no more in actual use today). The Rial currency sign was recently added to the standard, so many fonts still don't have it. It was added for compatibility with an Iranian standard. I'm not sure that the compatibility decomposition gives the accurate form for rendering the traditional glyph coded for the currency symbol... It isn't supposed to. Compatibility decompositions are approximations, not necessarily the basis for building an Arabic ligation, especially for special cases like this currency sign.
FDFA;ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM;0; FDFA;isolated 0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064a 0647 0020 0648 0633 0644 0645;
FDFB;ARABIC LIGATURE JALLAJALALOUHOU;0; FDFB;isolated 062c 0644 0020 062c 0644 0627 0644 0647;
but the current presentation of the alternate glyph chosen in Arial Unicode MS does not seem intuitive. That's an issue for Microsoft customers and testers of Microsoft fonts to determine. Isn't there some requirement in Unicode to not change the common layout which is part of the character identity and structural for the script? Such interpretation problem does not occur in the presentation of U+FDFB (which also has two rows in the representative glyph of Arabic Presentation Forms-A charts). Is there an error here? Nope. Glyph shapes are not normative or prescriptive. As long as the identity of the character is clear, there might be an aesthetic faux pas, but not an error or a failure of conformance to the standard. More generally, my question is related to the allowed modification of layouts for ligature glyphs in fonts: are they allowed, Yes. 
and how could they acceptably be represented when the plain-text character is not compatibility-decomposed but rendered with a single glyph... By the code points in question, of course. For these word ligatures, which are really used as complete symbols, one would ordinarily not expect to enter the whole compatibility sequence of characters, anyway. Normal rendering engines don't produce these highly elaborated ligatures automatically from such sequences. I was just wondering if their rendering in Arial Unicode MS is correct and conforming to the required need to keep the interpretation, As long as the identity of the character is correct, which it seems to be, since you identified it, then one can say the font is correct. and in what measure the beautiful ligatures found in Unicode charts are normative, In no measure. as there's a very large difference with what Arial Unicode MS does There are large differences between Arabic fonts for *all* of the Arabic characters in the standard -- not just these word ligature symbols. --Ken
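For anyone who wants to inspect the decompositions discussed in this thread, here is a small sketch using Python's standard unicodedata module. The helper name is_canonical is hypothetical, and the output reflects whatever Unicode version the Python build ships.

```python
import unicodedata

RIAL = "\ufdfc"  # RIAL SIGN
FFI = "\ufb03"   # LATIN SMALL LIGATURE FFI


def is_canonical(ch: str) -> bool:
    """True if ch has a decomposition mapping without a <tag>, i.e. a canonical one."""
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")


for ch in (RIAL, FFI):
    print(unicodedata.name(ch), "|", unicodedata.decomposition(ch),
          "| canonical:", is_canonical(ch))

# Canonical normalization leaves the symbol alone; only NFKD expands it.
print(unicodedata.normalize("NFD", RIAL) == RIAL)
print([f"{ord(c):04X}" for c in unicodedata.normalize("NFKD", RIAL)])
```

Both mappings carry a tag (isolated for the Rial sign, compat for the ffi ligature), so both are compatibility decompositions and neither is canonical, in line with the remark that such decompositions are approximations.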
Re: Case mapping of dotless lowercase letters
However, could there be an encoding for: LATIN CAPITAL LETTER DOTLESS J with a lowercase mapping to the new: LATIN SMALL LETTER DOTLESS J Of course the former would look exactly the same as the ASCII uppercase J, except that it would have a distinct case mapping. This would avoid, for j/J the nightmare of dotless-i/dotted-i/I... It introduces another difficulty though - If there are languages using a LATIN SMALL LETTER DOTLESS J and words written in those languages are sometimes capitalised - then presumably there is already data where LATIN CAPITAL LETTER J has already been used as the upper case for LATIN SMALL LETTER DOTLESS J introducing a separate A purist might argue that if there are no places where using LATIN CAPITAL LETTER DOTLESS J instead of LATIN SMALL LETTER DOTLESS J makes a lexical difference, then one is simply a glyph variant of the other. If that is so, then there is no need for two characters; one form could be handled by higher-level mark-up and rendered using a different glyph. I think Latin has too long been considered a simple script - if one takes into account the number of languages written in Latin script and all the additions and modifications used to do this, Latin is a complex script. In view of this, before adding new Latin characters it might be a good idea to first consider the kind of solutions used for scripts that have always been considered complex. - Chris
Re: Case mapping of dotless lowercase letters
Philippe Verdy [EMAIL PROTECTED] wrote: Ohhh... I admit this is hypothetical for a possible use, but the candrabindu case is a precedent coming from romanization of non-Latin scripts: what if there's a combining x above used to interact over a diacritic and mark its suppression in corrected texts or in documents related to orthographic/grammatical rules, or simply because it is needed for correct romanization of some ancient script... If special rendering rules are needed for romanisation of particular languages, there is a facility in OpenType and other smart-font formats to include different rules for different languages written with the same script. One could use this to provide, e.g., different rendering behaviour for Turkish than for other languages written in Latin, and I suspect it could be used in many cases of transliteration of non-Latin scripts (presuming a particular language was written in that script). Orthographic rules can certainly be handled by features and lookups in smart fonts. Maybe this is the level on which many of these issues should be handled. We only need new characters where it is necessary to make a distinction, or resolve something that would otherwise be ambiguous, in plain text. - Chris
Re: Cuneiform Base Signs Plus Modifiers
Dean Snyder [EMAIL PROTECTED] wrote: Recently I have had second thoughts about encoding complex signs. Modification of base, or simple, signs was a productive process for making new signs in the earlier periods of cuneiform usage, and included such modifications as adding or subtracting wedges, rotating signs, infixing signs, etc. (For some examples of how the ancient scribes modified base signs to form new complex signs see http://www.jhu.edu/ ice/basesigns/.) Instead of encoding all 875 post-archaic, base and complex cuneiform signs, we could instead encode the 280 base signs plus a dozen or so sign modifiers. (I am not including in these approximate figures the 75 or so numerical signs being proposed for encoding.) This would be somewhat analogous to encoding a, e, the acute accent, and the grave accent instead of encoding a with acute, a with grave, e with acute, etc. This fits in best with the Unicode character encoding model and is definitely the way to go, particularly if the script was productive. If additional complex signs are found you will then be able to represent them straight away and won't have to submit a proposal to add an additional character, wait for it to be accepted and encoded, and then wait for support for it to appear in applications and fonts (a process which usually takes several years). Encoding base signs with modifiers would more closely mirror, in the encoding, the way the script system itself actually worked and it would more easily accommodate modern research in archaic cuneiform, a stage in cuneiform script development we have all decided not to encode for now due to the current provisional state of its scholarship. By providing in the encoding the base signs along with their modifiers cuneiformists working in archaic and other periods could generate newly discovered or newly analyzed complex signs ad hoc, without having to go through the time-consuming and expensive Unicode/ISO standardization process. 
Compound and complex sign realization would then simply be a matter of the coordination of input methods with fonts, something now doable by end users with modern computer operating systems. (This, of course, assumes that we are more likely to find new combinations and modifications of existing base signs than to find new base signs themselves. At any rate, when we do find new base signs we need to encode them anyway.) I think it is always a good idea to closely mirror in encoding the way a script system actually works - and break it down into primitives or base characters, combining marks and modifiers. It might be helpful to look at how smart-font systems like OpenType and AAT/ATSUI are already used for rendering complex scripts and to try and think of the features and lookups a Cuneiform font using this sort of technology might use. To most cuneiformists, of course, the encoding underpinnings would all be hidden by input methods and fonts. One would simply type the expected SHUD3 and the input method would map it to 3 code points, KA INFIX and SHU (mouth sign with hand sign infixed), and the font would render it as one complex sign (meaning to pray). This is perfectly feasible. And from a practical point of view encoding only the base signs and their modifiers would be easy for us to do - we need only remove the complex signs from our lists and add the 13 or 14 modifiers. This seems to be the right approach. - Chris Fynn
Re: Stability of WG2
Jim Allan [EMAIL PROTECTED] wrote: On the other hand, there is nothing to prevent the Unicode consortium or any other body or any single person from creating a new *additional* corrected set of names if the Unicode consortium or any other body or any single person wishes to do so. That would just be an alternative list of character names. Of course anybody can make and use their own name list for their own purposes - getting a new alternative name list added to the standard is another issue. There is plenty of disagreement about what the proper name for many characters should be - which is probably one of the reasons for the rule that says once a name is assigned it cannot be changed. If this rule wasn't there, Unicode and WG2 would get a constant stream of proposals to correct the name of character U+ - and then have to spend time discussing such proposals and voting on them. I think members of UTC and WG2 have much more useful things to do with their time. - Chris
Re: Case mapping of dotless lowercase letters
Christopher John Fynn scripsit: It introduces another difficulty though - If there are languages using a LATIN SMALL LETTER DOTLESS J There aren't. Dotless j as a character (as opposed to a glyph used with various accents above) is only used in non-IPA phonetic alphabets. I think Latin has too long been considered a simple script - if one takes into account the number of languages written in Latin script and all the additions and modifications used to do this, Latin is a complex script. Amen. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan I suggest you call for help, or learn the difficult art of mud-breathing. --Great-Souled Sam