Re: DUCET and supplementary foldings (was: Looking for transcription or transliteration standards latin- arabic)
From: Asmus Freytag [EMAIL PROTECTED]

I have a certain sympathy for the idea of designing UCA so that the untailored *default* works for such kind of multilingual usage. However, the other use of the DUCET is to be the most convenient base for applying all tailorings. I have a certain sympathy for the position that claims that there are important, but perhaps specialized or not economically powerful classes of users that will not likely have access to a tailored UCA for their language or writing system. If that is really the case, i.e. appreciable numbers of smaller languages would be able to survive without tailoring, then the alternative to fixing the DUCET could be a separate publication of a common base tailoring for multilingual data access. (A base tailoring would be applied before further tailoring for a specific language).

I much appreciate this analysis. The DUCET effectively has two intended uses whose purposes are opposed. If it is used as a base collation from which a language-specific collation can be built with just a few rules, then the other common use, multilanguage searching, is not easy to build on top of it.

Maybe we could think about designing a new standard collation tailoring table that could be used as an alternative to the DUCET but that targets multilanguage searches. Such a tailoring would include more folding than the DUCET, moving the differences to a higher weight level. It could be given a name (MUCET, for Multilanguage Unicode Collation Elements Table?) and be supported as well.

The DUCET is now quite stable and there is no need to change it: it is well known and certainly used in many applications that depend on it (notably RDBMS engines). But a MUCET would certainly be useful, including for users who would no longer need to search for multiple words in a multilanguage database, or simply on the web.

Nothing prevents, in addition, sorting the matching entries by relevance, using the DUCET as a secondary collation order. After all, a collation elements table works exactly like a custom decomposition table that creates additional strings whose encoding is not portable, because it depends on weight values. Using custom decompositions is often much simpler than implementing a multilevel collation, since it can reuse existing algorithms implemented for NFD and NFKD decompositions. In such a view, some extra decompositions are needed, using non-standard Unicode characters for some elements (for example, decomposing an AE letter into a ligature plus an extra custom control with a higher collation level, to be used only for full collation order but ignorable for searches limited to level 1 or 2).
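The custom-decomposition idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the private-use code point U+E000 standing in for the "extra custom control", the two-entry table, and the function names are all hypothetical and not taken from any Unicode data file.

```python
# Hypothetical private-use code point standing in for the "extra custom control".
LIGATURE_MARK = "\uE000"

# Illustrative custom decomposition table (an assumption, not real Unicode data).
CUSTOM_DECOMPOSITION = {
    "Æ": "AE" + LIGATURE_MARK,
    "æ": "ae" + LIGATURE_MARK,
}


def full_key(text):
    """Expand custom decompositions, keeping the marker for full ordering."""
    return "".join(CUSTOM_DECOMPOSITION.get(ch, ch) for ch in text)


def search_key(text):
    """Key for loose searching: the marker is ignored and case is folded."""
    return full_key(text).replace(LIGATURE_MARK, "").casefold()
```

With this sketch, "Æsop" and "Aesop" share a search key but keep distinct full keys, so a later full comparison can still tell them apart.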
User Expectations for collation (was Re: Looking for transcription or transliteration standards latin-arabic)
These provide good examples. It would be interesting to see, of the people on the [EMAIL PROTECTED] list, how many non-Poles would expect to find the following orders:

Ab Ąb Ac
Eb Ęb Ec
Ob Ób Oc
Ce Će Cy
Ne Ńe Ny
Sa Śa Sy
Za Źa Zy
Za Ża Zy

and either (a) or (b):

a) La Ła Ly // interleaved
b) La Ly Ła // non-interleaved

Mark

- Original Message -
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 10, 2004 01:02
Subject: Re: Looking for transcription or transliteration standards latin-arabic

In a message of Fri, 09-07-2004, 19:34 -0700, Asmus Freytag wrote:

o-slash, can be analyzed as o and slash, even though that's not done canonically in Unicode. Allowing users outside Scandinavia to perform fuzzy searches for words with this character is useful. In this view of folding, Language-specific fuzzy searches would be tailored (usually by being based on collation information, rather than on generic diacritic folding).

In Polish letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the corresponding letters without. Omitting diacritics is an error, even though text without them is generally readable. They are removed when the given protocol requires or encourages ASCII (e.g. filenames to be used in URLs, login names, variable names in programming languages, ancient computer systems). There is no alternate spelling scheme like German AE/OE/UE/SS.

Polish letters are never folded when sorting lexicographically. This applies to Ł in the same way as to other eight letters. Foreign diacritics are always folded though, at least I don't remember seeing any other case. I think Ó would be folded together with O in an encyclopaedia if this is a foreign O with some accent, unrelated to Polish Ó which is a separate letter (can you suggest some non-Polish word starting with Ó which could be found in an encyclopaedia?).

But there are cases when I would prefer to fold Polish diacritics in searches.
It's basically every case when you are not sure that all stored data is using diacritics, for example in generic WWW searching. There are still people who don't use diacritics in usenet and email, or in entries in guest books and other unprofessional web content. There are even sometimes people who insist that Polish letters *should not* be used in usenet and email because some computer systems can't handle them. Diacritics are rare on IRC (because the IRC protocol doesn't distinguish between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers (because of laziness). This is why for searching archives of unknown data it's generally better to fold them.

As far as I know, the default UCA folds these letters except Ł, and standard Polish tailoring doesn't fold any Polish letter. While not folding them in searching is technically correct and nobody would be surprised that they are not folded, it's often more useful to fold them and people would be pleasantly surprised if they don't have to repeat the search with omitted diacritics.

If one wants to find data containing a word, rather than collect statistics about usage of a word with and without diacritics, it's very rare that folding does some harm.

Hmm, it's not that simple. When I'm searching for JĘZYK (existing word), I will be happy to find occurrences of JEZYK too (non-existing word, must have had diacritics stripped), but it makes no sense to return JEŻYK (another existing word). It's not just making the letters equivalent.

--
   __(  Marcin Kowalczyk
   \__/ [EMAIL PROTECTED]
    ^^  http://qrnik.knm.org.pl/~qrczak/
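The JĘZYK example suggests a one-directional match rather than a symmetric equivalence. The following is a minimal sketch of one way to read it, assuming precomposed (NFC) input and equal-length comparison; the table and the function name `matches` are illustrative, not from the message.

```python
# Illustrative folding of the nine Polish letters (both cases) to ASCII.
POLISH_FOLD = str.maketrans("ĄĆĘŁŃÓŚŹŻąćęłńóśźż", "ACELNOSZZacelnoszz")


def matches(query, candidate):
    """True if the candidate equals the query, or is the query with some
    (or all) of the query's own diacritics stripped.

    A candidate that carries a diacritic the query does not have is rejected.
    """
    if len(query) != len(candidate):
        return False
    return all(
        c == q or c == q.translate(POLISH_FOLD)
        for q, c in zip(query, candidate)
    )
```

Under this rule a query for JĘZYK accepts JĘZYK and JEZYK but rejects JEŻYK. Whether a stripped query such as JEZYK should in turn find JĘZYK is a separate design choice that this sketch deliberately leaves out.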
Re: Looking for transcription or transliteration standards latin- arabic
At 01:02 AM 7/10/2004, Marcin 'Qrczak' Kowalczyk wrote:

But there are cases when I would prefer to fold Polish diacritics in searches. It's basically every case when you are not sure that all stored data is using diacritics,

Or when you are unsure how it is spelled, for example, looking up a personal or geographic name you are not familiar with. The discussion started around the case where searching is not localized (tailored) to the language, which, by definition, means that users will not be familiar with the spelling of the items they are trying to retrieve.

If one wants to find data containing a word, rather than collect statistics about usage of a word with and without diacritics, it's very rare that folding does some harm.

Hmm, it's not that simple. When I'm searching for JĘZYK (existing word), I will be happy to find occurrences of JEZYK too (non-existing word, must have had diacritics stripped), but it makes no sense to return JEŻYK (another existing word). It's not just making the letters equivalent.

There are other types of searches than 'google'. One example is searches for station names on services such as http://www.bahn.de. Unlike air-travel sites, the number of destinations (all across Europe, by the way) is huge, as the site also includes commuter train services. They've changed their search algorithm a number of times over the years, but at one time you could enter a destination without diacritics and it would attempt to match that to the list of known station names. In case of multiple hits it would give you a list to pick from. They also supported alternative non-native names (such as Cologne). I haven't used it in a while, so I don't know what they support today, but when I did, I found it very useful in looking up destinations.

I have a certain sympathy for the idea of designing UCA so that the untailored *default* works for such kind of multilingual usage.
However, the other use of the DUCET is to be the most convenient base for applying all tailorings. I have a certain sympathy for the position that claims that there are important, but perhaps specialized or not economically powerful classes of users that will not likely have access to a tailored UCA for their language or writing system. If that is really the case, i.e. appreciable numbers of smaller languages would be able to survive without tailoring, then the alternative to fixing the DUCET could be a separate publication of a common base tailoring for multilingual data access. (A base tailoring would be applied before further tailoring for a specific language). A./
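The station-name lookup described above can be approximated with a folded index. This is a rough sketch only: the station list, the alias table and the function names are hypothetical, and nothing here reflects how bahn.de actually works.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Decompose, drop nonspacing marks, and fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.casefold()


# Hypothetical sample data for illustration.
STATIONS = ["Köln", "München", "Münster", "Munster", "Zürich"]
ALIASES = {"Cologne": "Köln", "Munich": "München"}


def build_index(stations, aliases):
    """Map each folded name (and folded alias) to the native station names."""
    index = defaultdict(set)
    for name in stations:
        index[fold(name)].add(name)
    for alias, name in aliases.items():
        index[fold(alias)].add(name)
    return index


INDEX = build_index(STATIONS, ALIASES)


def lookup(query):
    """Return all candidate stations; the caller shows a pick list if several."""
    return sorted(INDEX.get(fold(query), ()))
```

Here `lookup("munster")` returns both Munster and Münster, which corresponds to the "list to pick from" case, while `lookup("Cologne")` resolves the non-native name to Köln.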
Re: User Expectations for collation (was Re: Looking for transcription or transliteration standards latin-arabic)
I missed Mark's change in subject - so I replied to Marcin's message right now under the old subject line: - Original Message - From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, July 10, 2004 01:02 Subject: Re: Looking for transcription or transliteration standards latin-arabic W liście z pią, 09-07-2004, godz. 19:34 -0700, Asmus Freytag napisał: o-slash, can be analyzed as o and slash, even though that's not done canonically in Unicode. Allowing users outside Scandinavia to perform fuzzy searches for words with this character is useful. In this view of folding, Language-specific fuzzy searches would be tailored (usually by being based on collation information, rather than on generic diacritic folding). In Polish letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the corresponding letters without. Omitting diacritics is an error, even though text without them is generally readable. They are removed when the given protocol requires or encourages ASCII (e.g. filenames to be used in URLs, login names, variable names in programming languages, ancient computer systems). There is no alternate spelling scheme like German AE/OE/UE/SS. Polish leters are never folded when sorting lexicographically. This applies to Ł in the same way as to other eight letters. Foreign diacritics are always folded though, at least I don't remember seeing any other case. I think Ó would be folded together with O in an encyclopaedia if this is a foreign O with some accent, unrelated to Polish Ó which is a separate letter (can you suggest some non-Polish word starting with Ó which could be found in an encyclopaedia?). But there are cases when I would prefer to fold Polish diacritics in searches. It's basically every case when you are not sure that all stored data is using diacritics, for example in generic WWW searching. There are still people who don't use diacritics in usenet and email, or in entries in guest books and other unprofessional web content. 
There are even sometimes people who insist that Polish letters *should not* be used in usenet and email because some computer systems can't handle them. Diacritics are rare on IRC (because the IRC protocol doesn't distinguish between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers (because of laziness). This is why for searching archives of unknown data it's generally better to fold them. As far as I know, the default UCA folds these letters except Ł, and standard Polish tailoring doesn't fold any Polish letter. While not folding them in searching is technically correct and nobody would be surprised that they are not folded, it's often more useful to fold them and people would be pleasantly surprised if they don't have to repeat the search with omitted diacritics. If one wants to find data containing a word, rather than collect statistics about usage of a word with and without diacritics, it's very rare than folding does some harm. Hmm, it's not that simple. When I'm searching for JĘZYK (existing word), I will be happy to find occurrences of JEZYK too (non-existing word, must have had diacritics stripped), but it makes no sense to return JEŻYK (another existing word). It's not just making the letters equivalent.
RE: Looking for transcription or transliteration standards latin- arabic
transliteration is no longer needed or useful. Transliteration is a one-to-one mapping between scripts, and the reader needs to be familiar with both scripts and the transliteration rules to make sense of it.

That's not true. Looking at Wright's Historical German Grammar, I see Goth. baírand, OHG. bërant=Skr. bháranti. It would be illegible to me, and probably to many Germanists, if it were written in three scripts instead of one. Using foreign scripts is rarely of help to the casual reader, especially in the frequent cases where it's not important that they understand the details of the transliteration scheme.
Re: Looking for transcription or transliteration standards latin- arabic
Jony Rosenne wrote:

Cologne is not a transliteration of Köln but the English name of the city, just as Munich, Rome, Moscow, The Hague, Longhorn, Venice, Jaffa and Jerusalem.

Would that be the English name for Windows Ligorno?
RE: Looking for transcription or transliteration standards latin- arabic
Sorry, I meant Leghorn. Jony -Original Message- From: Simon Montagu [mailto:[EMAIL PROTECTED] Sent: Friday, July 09, 2004 9:19 AM To: Jony Rosenne Cc: [EMAIL PROTECTED] Subject: Re: Looking for transcription or transliteration standards latin- arabic Jony Rosenne wrote: Cologne is not a transliteration of Köln but the English name of the city, just as Munich, Rome, Moscow, The Hague, Longhorn, Venice, Jaffa and Jerusalem. Would that be the English name for Windows Ligorno?
Re: Looking for transcription or transliteration standards latin- arabic
Jony Rosenne scripsit:

I doubt it makes much sense to the casual reader. Witness how nearly every radio and television pronounces New Delhi as New Del-hi.

O pity the poor poor Zippity,
For he can eat nothing but Greli,
A plant that grows only
In New Caledony,
While the Zippity lives in New Delhi.
--Shel Silverstein

--
John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com
Take two turkeys, one goose, four cabbages, but no duck, and mix them together. After one taste, you'll duck soup the rest of your life. --Groucho
RE: Looking for transcription or transliteration standards latin- arabic
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of D. Starner
Sent: Friday, July 09, 2004 9:13 AM
To: [EMAIL PROTECTED]
Subject: RE: Looking for transcription or transliteration standards latin- arabic

transliteration is no longer needed or useful. Transliteration is a one-to-one mapping between scripts, and the reader needs to be familiar with both scripts and the transliteration rules to make sense of it.

That's not true. Looking at Wright's Historical German Grammar, I see Goth. baírand, OHG. bërant=Skr. bháranti. It would be illegible to me, and probably to many Germanists, if it were written in three scripts instead of one. Using foreign scripts is rarely of help to the casual reader, especially in the frequent cases where it's not important that they understand the details of the transliteration scheme.

I doubt it makes much sense to the casual reader. Witness how nearly every radio and television pronounces New Delhi as New Del-hi.

Jony
Re: Looking for transcription or transliteration standards latin- arabic
On 09/07/2004 01:41, Michael (michka) Kaplan wrote:

From: Michael Everson [EMAIL PROTECTED]

I think it's stupid (in general) to argue for stripping a letter of diacritics. If a reader is ignorant of their meaning, that can be cured. But if they are meaningful, stripping them is just misspelling the words they belong to. Why would anyone want to do that?

I think it's inadvisable (in general) to call things stupid merely because one does not see the need. On the whole, that is a better time to ask the question than to make the judgment.

There is actually a great deal of both European and American data (in programs like Microsoft Exchange and Outlook, as well as in web search) that folding away diacritics as a part of giving full lists of possible matches is indeed preferred by users. Now they would (also) prefer the exact matches to have priority, but having additional matches without the diacritics is a common request, and one that has been built into many scenarios.

It seems to me that you two Michaels are talking at cross purposes.

Everson was apparently referring to the practice of stripping diacritics from foreign words as rendered typographically, e.g. in magazines and presumably online texts. And I tend to agree with him (from my European perspective) that this is unnecessary. On the other hand, if some people want to do it, they should not be prevented.

But Kaplan is referring to something quite different: optionally ignoring diacritics in search operations. This is indeed desirable, so that a single search can match both Dvorak and Dvořák, for example, and so that the one doing the search does not need to remember exactly which diacritics are used in the name. And it is already covered by the Unicode collation algorithm and default table, in which diacritics are distinguished only at the second level and so folded by a top-level-only collation.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
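Kaplan's point that users want exact matches first, with diacritic-folded matches as additional hits, can be sketched as below. This is an approximation under stated assumptions: it uses NFD plus mark removal from Python's `unicodedata` rather than a real UCA primary-strength comparison, and the function names are illustrative.

```python
import unicodedata


def fold(text):
    """Approximate a primary-level key: decompose, drop marks, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.casefold()


def ranked_matches(query, items):
    """Return items matching the query after folding, exact matches first."""
    folded_query = fold(query)
    hits = [item for item in items if fold(item) == folded_query]
    # False sorts before True, so items identical to the query lead the list.
    return sorted(hits, key=lambda item: (item != query, item))
```

A search for either spelling finds both Dvorak and Dvořák, and the spelling the user actually typed is listed first.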
Re: Looking for transcription or transliteration standards latin- arabic
At 17:43 -0700 2004-07-08, Mark Davis wrote: Why would anyone want to do that? I tend to be with you on this, that it does little harm to retain accents. However, most major periodic popular publications have this practice; for example The Economist keeps accents for French, German, Spanish, Italian words and names but discards others (as I recall). I wouldn't consider that good typography, that's all I'm saying. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Looking for transcription or transliteration standards latin- arabic
Pronunciation keys in dictionaries are a kind of transliteration. We still need those (well, I do, at least). Ted On Friday, July 09, 2004 1:08 AM, Jony Rosenne wrote: Now that we have moved from the world of typewriters, that imposed technical constraints on the writer, such as being able to use only the limited set of characters implemented, to the world of Unicode which removes this constraint, transliteration is no longer needed or useful. Ted Hopp, Ph.D. ZigZag, Inc. [EMAIL PROTECTED] +1-301-990-7453 newSLATE is your personal learning workspace ...on the web at http://www.newSLATE.com/
Re: Looking for transcription or transliteration standards latin- arabic
Of course, that's true about Köln. My point was that after all this time, the use of Dvorak or Tchaikovsky are *now* the English names for what originated in a different language.

Mark

- Original Message -
From: Jony Rosenne [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 22:12
Subject: RE: Looking for transcription or transliteration standards latin- arabic

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mark Davis
Sent: Friday, July 09, 2004 3:43 AM
To: [EMAIL PROTECTED]; Michael Everson
Subject: Re: Looking for transcription or transliteration standards latin- arabic

... In one sense, using Dvorak in English for Dvořák is little different than using Cologne in English for Köln. Both are transcriptions into a form that has become more or less customary.

Cologne is not a transliteration of Köln but the English name of the city, just as Munich, Rome, Moscow, The Hague, Longhorn, Venice, Jaffa and Jerusalem. Why a foreign city should have an English name is an interesting philosophical question, but not directly concerned with Unicode. This is however common in many languages. The transliteration of Köln would be Koln.

Jony

Mark
Re: Looking for transcription or transliteration standards latin- arabic
Whether it is a matter of typography or not depends on what the input text is. Setting the letters D v o ř á k as Dvorak would indeed be bad typography. Setting the letters D v o r a k as Dvorak would be perfectly fine typography.

Mark

- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, July 09, 2004 02:29
Subject: Re: Looking for transcription or transliteration standards latin- arabic

At 17:43 -0700 2004-07-08, Mark Davis wrote:

Why would anyone want to do that? I tend to be with you on this, that it does little harm to retain accents. However, most major periodic popular publications have this practice; for example The Economist keeps accents for French, German, Spanish, Italian words and names but discards others (as I recall).

I wouldn't consider that good typography, that's all I'm saying.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Looking for transcription or transliteration standards latin- arabic
At 06:55 -0700 2004-07-09, Mark Davis wrote: Of course, that's true about Köln. My point was that after all this time, the use of Dvorak or Tchaikovsky are *now* the English names for what originated in a different language. I don't agree that Dvorak is the English name for the composer. But I don't agree that façade is correctly spelled in English without the ç either. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Looking for transcription or transliteration standards latin- arabic
Quoting Michael Everson [EMAIL PROTECTED]: At 06:55 -0700 2004-07-09, Mark Davis wrote: Of course, that's true about Köln. My point was that after all this time, the use of Dvorak or Tchaikovsky are *now* the English names for what originated in a different language. I don't agree that Dvorak is the English name for the composer. But I don't agree that façade is correctly spelled in English without the ç either. Yes, Dvorak is the name of the American branch of the family; after they changed the spelling of their name. It's not even pronounced the same. They have a famous typewriter keyboard inventor in their line, but no famous composers. -- Jon Hanna http://www.hackcraft.net/ Write a wise saying and your name will live forever - Anonymous
Re: Looking for transcription or transliteration standards latin- arabic
From: Peter Kirk [EMAIL PROTECTED]

But Kaplan is referring to something quite different, optionally ignoring diacritics in search operations. This is indeed desirable, so that a single search can match both Dvorak and Dvořák for example, and so that the one doing the search does not need to remember exactly which diacritics are used in the name. And it is already covered by the Unicode collation algorithm and default table, in which diacritics are distinguished only at the second level and so folded by a top level only collation.

(a) If this were true and it were the only need, then case folding would also just be a UCA issue, yet case folding is in the document.

(b) Not everyone who uses Unicode uses the UCA (most of the corporate member companies in Unicode -- including IBM -- had alternate collation methods that existed prior to the UCA and which to this day support more languages, in their databases and operating systems).

(c) Since the operation (diacritic folding) is a valid one that implementations may want to do, and the UCA is a UTS and thus not required for Unicode conformance, it is a sensible folding operation to define.

Does diacritic folding destroy information provided by the distinctions that diacritics provide? Of course it does. But then again, the same can be said of all foldings. This does not diminish their potential usefulness in specific tasks/operations.

MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies
Windows International Division
RE: Looking for transcription or transliteration standards latin- arabic
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Michael Everson
Sent: Friday, July 09, 2004 7:13 AM

At 06:55 -0700 2004-07-09, Mark Davis wrote:

Of course, that's true about Köln. My point was that after all this time, the use of Dvorak or Tchaikovsky are *now* the English names for what originated in a different language.

I don't agree that Dvorak is the English name for the composer.

"The English name" is, I think, a poor choice of words. "Standard anglicization" would be better.

But I don't agree that façade is correctly spelled in English without the ç either.

On this, we must resign ourselves to disagreement.

/|/|ike
Re: Looking for transcription or transliteration standards latin- arabic
#CYRILLIC SMALL LETTER SHORT I WITH TAIL
049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH DESCENDER
04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE
04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS

Mark

- Original Message -
From: Michael (michka) Kaplan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, July 09, 2004 07:40
Subject: Re: Looking for transcription or transliteration standards latin- arabic

From: Peter Kirk [EMAIL PROTECTED]

But Kaplan is referring to something quite different, optionally ignoring diacritics in search operations. This is indeed desirable, so that a single search can match both Dvorak and Dvořák for example, and so that the one doing the search does not need to remember exactly which diacritics are used in the name. And it is already covered by the Unicode collation algorithm and default table, in which diacritics are distinguished only at the second level and so folded by a top level only collation.

(a) If this were true and it were the only need, then case folding would also just be a UCA issue, yet case folding is in the document.

(b) Not everyone uses the UCA who uses Unicode (most of the corporate members companies in Unicode -- including IBM -- had alternate collation methods that existed prior to the UCA and which to this day support more languages, in their databases and operating systems)

(c) Since the operation (diacritic folding) is a valid one that implementations may want to do and the UCA is a UTS and thus not required for Unicode conformance, it is a sensible folding operation to define.

Does diacritic folding destroy information provided by the distinctions that diacritics provide? Of course it does. But then again, the same can be said of all foldings. This does not diminish their potential usefulness in specific tasks/operations.

MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies
Windows International Division
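The folding entries quoted at the top of this message use a semicolon-separated layout with a trailing `#` comment. A minimal parsing sketch follows, on the assumption that the first field is the source code point and the second is the folded target; the remaining fields (such as `!uca`) are kept as opaque strings because their exact meaning is not stated in this thread. The function names are hypothetical.

```python
def parse_entry(line):
    """Parse one 'SOURCE; TARGET; flag; flag #NAME' entry.

    Returns (source_char, target_char, flags, name), or None for lines
    that carry no data (blank or comment-only lines).
    """
    data, _, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2 or not fields[0]:
        return None
    source = chr(int(fields[0], 16))
    target = chr(int(fields[1], 16))
    flags = [field for field in fields[2:] if field]  # empty fields are skipped
    return source, target, flags, comment.strip()


def build_translation(lines):
    """Build a str.translate() table from the parsed entries."""
    table = {}
    for line in lines:
        entry = parse_entry(line)
        if entry is not None:
            source, target, _flags, _name = entry
            table[ord(source)] = target
    return table
```

A table built this way can be applied with `text.translate(table)`, so that U+049B folds to U+043A as the first data entry specifies.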
Re: Looking for transcription or transliteration standards latin- arabic
Michael Everson writes:

> I don't agree that Dvorak is the English name for the composer. But I don't agree that façade is correctly spelled in English without the ç either.

The Society for Pure English (http://www.gutenberg.net/1/2/3/9/12390/12390-h/12390-h.htm) disagreed:

"We still borrow as freely as ever; but half the benefit of this borrowing is lost to us, owing to our modern and pedantic attempts to preserve the foreign sounds and shapes of imported words, which make their current use unnecessarily difficult. Owing to our false taste in this matter many words which have been long naturalized in the language are being now put back into their foreign forms, and our speech is being thus gradually impoverished. This process of de-assimilation generally begins with the restoration of foreign accents to such words as have them in French; thus role is now written rôle; debris, débris; detour, détour; depot, dépôt; and the old words long established in our language, levee, naivety, now appear as levée, and naïveté."
Re: Looking for transcription or transliteration standards latin- arabic
On 2004.07.09, 17:06, Mark Davis wrote:

> we do not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I have felt from the beginning that it was a mistake to not be consistent in our decompositions

Where can one join your party? ;-)

> -- but that is water under the bridge.]

Hm, there is a Nature's Cycle of Water, you know? ;-)

--
António MARTINS-Tuválkin
PT-1XXX-XXX LISBOA
+351 934 821 700
http://www.tuvalkin.web.pt/bandeira/
http://pagina.de/bandeiras/
Não me invejo de quem tem carros, parelhas e montes,
só me invejo de quem bebe a água em todas as fontes
Re: Looking for transcription or transliteration standards latin- arabic
On 09/07/2004 17:06, Mark Davis wrote:

> I agree with Michael -- diacritic folding is a useful folding to add, independent of the UCA. Also, Peter's remark that:
>
>> And it is already covered by the Unicode collation algorithm and default table...
>
> is incorrect. ...

Well, I think this depends on whether the stroke in characters like U+00D8 and similar additional marks are considered to be diacritics. I am not sure that they are diacritics in the strict sense, and the current DUCET mappings don't treat them as such, but John Cowan's list does treat them as such.

> ... The UCA generally follows our decompositions in determining many primary weights, and we do not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I have felt from the beginning that it was a mistake to not be consistent in our decompositions -- but that is water under the bridge.]
>
> If you look at John's suggested file for diacritic folding (http://www.ccil.org/~cowan/DiacriticFolding.txt), ...

I have just reviewed this list and found it odd that Hebrew presentation forms are included but Arabic ones are not. But in fact surely not only the Hebrew presentation forms but also most of the precomposed characters are redundant in this list. For the basic folding algorithm (in http://www.unicode.org/reports/tr30/) is:

a. Apply optional folding operations
b. Apply canonical decomposition
c. Repeat (a) and (b) until stable
d. Apply composition if necessary

Step (b) will decompose not only presentation forms but also all precomposed characters with canonical decompositions, and the combining marks will be deleted by the repeat of step (a). It is therefore necessary to list in the specification of the folding only all (?) combining marks, which are to be deleted, and all precomposed characters which do *not* have canonical decompositions. Letters like O with stroke are presumably in this latter list, along with many of the listed Cyrillic characters.
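
The four steps (a) to (d) above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the TR#30 reference procedure: the "optional folding" is taken to be deletion of all Mn-class characters plus a tiny hypothetical table `EXTRA_FOLDS` for letters that have no canonical decomposition, and the function names are invented for the example.

```python
import unicodedata

# Hypothetical extra mappings for letters WITHOUT canonical decompositions.
# A real folding table would be far longer; these two entries are examples only.
EXTRA_FOLDS = {
    "\u00D8": "O",  # LATIN CAPITAL LETTER O WITH STROKE
    "\u00F8": "o",  # LATIN SMALL LETTER O WITH STROKE
}

def optional_fold(text):
    """Step (a): drop nonspacing marks (Mn) and apply the extra table."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            continue  # combining mark: delete it
        out.append(EXTRA_FOLDS.get(ch, ch))
    return "".join(out)

def fold_diacritics(text):
    """Steps (a)-(d): fold, decompose, repeat until stable, then recompose."""
    while True:
        # Step (a) followed by step (b), canonical decomposition (NFD).
        folded = unicodedata.normalize("NFD", optional_fold(text))
        if folded == text:  # step (c): stop once nothing changes
            break
        text = folded
    # Step (d): recompose whatever is left.
    return unicodedata.normalize("NFC", text)
```

With this sketch, precomposed letters such as U+0159 never need their own table entry, because decomposition exposes the combining mark to the next pass of step (a). It also shows the side effect discussed in this thread: U+0419 loses its breve and becomes U+0418, while U+0429 is untouched because it has no decomposition.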
But I would suggest some caution about listing for diacritic folding some of the Cyrillic characters below, especially those with descenders. I note that 0429 is not folded to 0428 etc, and this is correct because within the Cyrillic writing system these are entirely separate characters. But the difference between these two is in fact exactly the same descender which is removed in 0496 etc.

I am also surprised to note that no folding is given for 0419/0439; although in some ways this is desirable because Russians do not consider this breve to be a diacritic (and after all we would not want the dot on i to be removed as a diacritic!), these characters have canonical decompositions to 0418/0438 and breve and the principle of canonical equivalence and the folding algorithm (which works on decomposed characters) more or less demand that the breve be deleted. Also 048A/048B should then fold to 0418/0438 rather than 0419/0439.

> ...
> 04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
> 04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
> 0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
> 0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
> 0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
> 04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
> 0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
> 04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
> 0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
> 04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
> 04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
> 048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
> 049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH DESCENDER
> 049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
> 049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
> 04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
> 04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
> 04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
> 04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH DESCENDER
> 04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
> 04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
> 04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
> 04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
> 048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
> 04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH DESCENDER
> 04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH DESCENDER
> 04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
> 04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
> 04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL
Re: Looking for transcription or transliteration standards latin- arabic
Peter Kirk scripsit:

> I have just reviewed this list and found it odd that Hebrew presentation forms are included but Arabic ones are not.

The specification actually called only for Latin, Greek, and Cyrillic; I added Hebrew pour la lagniappe. If someone wants to add Arabic, I encourage them to do so.

> the Hebrew presentation forms but also most of the precomposed characters are redundant in this list.

True; however, the current list indicates the scope of what actually happens, even if it is overlong.

> It is therefore necessary to list in the specification of the folding only all (?) combining marks, which are to be deleted,

I believe that all Mn-class characters, and only they, are deleted by this.

> I note that 0429 is not folded to 0428 etc, and this is correct because within the Cyrillic writing system these are entirely separate characters. But the difference between these two is in fact exactly the same descender which is removed in 0496 etc.

I don't think that matters. Long historical practice has made SHCHA a separate letter, just as G, J, U, and W are now separate Latin letters from C, I, V, and VV-ligature.

> I am also surprised to note that no folding is given for 0419/0439; although in some ways this is desirable because Russians do not consider this breve to be a diacritic (and after all we would not want the dot on i to be removed as a diacritic!), these characters have canonical decompositions to 0418/0438 and breve and the principle of canonical equivalence and the folding algorithm (which works on decomposed characters) more or less demand that the breve be deleted. Also 048A/048B should then fold to 0418/0438 rather than 0419/0439.

I think I agree with this: i-breve does not have the same universal status as shch.
-- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] 'Tis the Linux rebellion / Let coders take their place, The Linux-nationale / Shall Microsoft outpace, We can write better programs / Our CPUs won't stall, So raise the penguin banner of / The Linux-nationale.
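
The distinction drawn in this exchange, between letters that decomposition plus mark removal already handles and letters that a folding table would have to list explicitly, can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `canonical_decomposition` and `needs_explicit_entry` are invented for the illustration.

```python
import unicodedata

def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of characters.

    An empty list means the character has no canonical decomposition
    (compatibility decompositions, tagged like '<compat>', are ignored).
    """
    raw = unicodedata.decomposition(ch)
    if not raw or raw.startswith("<"):
        return []
    return [chr(int(cp, 16)) for cp in raw.split()]

def needs_explicit_entry(ch):
    """True when ch has no canonical decomposition, so decomposing and
    deleting marks cannot change it; folding it requires a table entry."""
    return not canonical_decomposition(ch)

# Code points taken from the discussion above.
for cp in (0x0419, 0x0429, 0x0496, 0x048A, 0x00D8):
    ch = chr(cp)
    parts = " ".join(f"{ord(p):04X}" for p in canonical_decomposition(ch))
    print(f"{cp:04X} {unicodedata.name(ch)}: {parts or 'no canonical decomposition'}")
```

Under this check U+0419 decomposes to 0418 plus the combining breve 0306, whereas U+0429, U+0496, U+048A and U+00D8 have no canonical decomposition and therefore fold only if the table says so.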
Re: Looking for transcription or transliteration standards latin- arabic
At 08:33 PM 7/9/2004, John Cowan wrote:

>> I have just reviewed this list and found it odd that Hebrew presentation forms are included but Arabic ones are not.
>
> The specification actually called only for Latin, Greek, and Cyrillic; I added Hebrew pour la lagniappe. If someone wants to add Arabic, I encourage them to do so.
>
>> the Hebrew presentation forms but also most of the precomposed characters are redundant in this list.
>
> True; however, the current list indicates the scope of what actually happens, even if it is overlong.

I have taken the file from the server today and massaged it to be in a form suitable for inclusion in the next draft of TR#30, which will be issued in time for the UTC to review it in August. Once the review issue opens for this draft, please comment on the review form, so that the UTC has formal input to evaluate.

My understanding of the folding is that it would be more aggressive in diacritic folding than some languages would be, so that it is useful in cross-language searching. For example, it should allow English users to search for words with accented characters in them by supplying the equivalent word spelled in base letters only.

'i' has a dot, but doesn't have a base letter that's more 'basic' than itself, since dotless-i, while theoretically there, is more specialized and not universally accessible from input devices.

o-slash can be analyzed as o and slash, even though that's not done canonically in Unicode. Allowing users outside Scandinavia to perform fuzzy searches for words with this character is useful.

In this view of folding, language-specific fuzzy searches would be tailored (usually by being based on collation information, rather than on generic diacritic folding).

A./
Re: FW: Looking for transcription or transliteration standards latin- arabic
John Cowan wrote:

> The Unicode people are probably going to standardize on calling it diacritic folding, by analogy to the term case folding.

Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân ättëmpt tò fóòl spåm fîltêrs?

--
Curtis Clark http://www.csupomona.edu/~jcclark/
Web Coordinator, Cal Poly Pomona +1 909 979 6371
Professor, Biological Sciences +1 909 869 4062
Re: FW: Looking for transcription or transliteration standards latin- arabic
Curtis Clark jcclark dash lists at earthlink dot net wrote:

> John Cowan wrote:
>
>> The Unicode people are probably going to standardize on calling it diacritic folding, by analogy to the term case folding.
>
> Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân ättëmpt tò fóòl spåm fîltêrs?

.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
RE: FW: Looking for transcription or transliteration standards latin- arabic
Thus spoke Curtis Clark:

> Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân ättëmpt tò fóòl spåm fîltêrs?

http://en.wikipedia.org/wiki/Heavy_metal_umlaut
Re: FW: Looking for transcription or transliteration standards latin- arabic
On 08/07/2004 06:44, Curtis Clark wrote:

> John Cowan wrote:
>
>> The Unicode people are probably going to standardize on calling it diacritic folding, by analogy to the term case folding.
>
> Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân ättëmpt tò fóòl spåm fîltêrs?

An opportunity for spam filters to employ diacritic folding. National Geographic may not need this folding but spam filters could certainly use it.

--
Peter Kirk
http://www.qaya.org/
Re: FW: Looking for transcription or transliteration standards latin- arabic
On 2004.07.08, 09:56, Peter Kirk wrote:

>> Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân ättëmpt tò fóòl spåm fîltêrs?
>
> An opportunity for spam filters to employ diacritic folding.

What about things like PEN|S en|argement or G00D L00KING |\/|EN?

--
António MARTINS-Tuválkin
Re: FW: Looking for transcription or transliteration standards latin- arabic
This thread seems to have gone far enough off-topic. Please keep to the topic or take comments off list.

Regards from your,
-- Sarasvati

> Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân ättëmpt tò fóòl spåm fîltêrs?

> What about things like PEN|S en|argement or G00D L00KING |\/|EN?
Re: Looking for transcription or transliteration standards latin-arabic
You will need a Unicode font with Central European and IPA characters to read my examples.

Mike Ayers wrote:

>> Perhaps it is. But then it's partly due to the lazy tradition.
>
> Are you implying that, had printers throughout the centuries put the effort into faithfully reproducing every obscure symbol from every foreign language, that the modern American would accept words with arbitrary diacritics?

I do not pretend to know, but "accept" is probably not the best word to use in this context; after all, it's not about the spelling of English words. And not every tradition needs to be hundreds of years old.

>> I don't think it's a problem with any given diacritical. It's rather an indistinct horror of diacriticals in general in speakers of a language without any diacriticals at all, like English. E.g. Hungarian uses three diacriticals and Hungarian speakers make no big deal of just ignoring the meaningless caron in Czech or the grave and the cedilla in Roumanian names. On the other hand, I must admit that we also can be quite brutal to diacriticals in some newspapers or when it comes to a language like Vietnamese...
>
> In other words, you're pretty comfortable with your own diacritics. You make my point for me.

Our own are the acute (to show vowel length), the diaeresis (to show timbre, like in German) and the double acute (actually a stretched diaeresis, to show both timbre and length at the same time). The caron or the cedilla are just as foreign for us as e.g. the odd question marks above Vietnamese vowels, even if they may be less unusual. And the case of the newspapers I'm talking about may be just a classic example of lazy typography; at least the silly spelling mistakes and other inaccuracies they allow themselves point in that direction. In books by any serious publisher, it would definitely be completely unacceptable to write e.g. Hašek's name (a famous Czech satirist) as Hasek.

Now that we have got into this debate, let me quote an example where distinguishing between diacritics as familiar and unfamiliar may lead to undesirable results. Imagine someone writes an article about a person named Törőcsik [tørøːtʃik] (we happen to have an actress by that surname). Suppose the journalist thinks it reasonable to retain the familiar diaeresis, because it is found in German and many other well-known orthographies. But what should be the fate of the double acute (which is actually nothing but a special kind of diaeresis, as I mentioned above)? As an unfamiliar diacritic, it should be discarded if the principle is applied mechanically. This would result in the form Törocsik [tørotʃik]; however, as you may see from the phonetic transcription, this is not simply incomplete information in such a context, but explicit misinformation.

The less cruel approach would be to replace the special diaeresis with the normal one and write Töröcsik [tørøtʃik]. This is undoubtedly the least unacceptable of the diacritic-folded variants mathematically possible, but it is neither a proper English transcription, because of the diaereses and the unusual value of the consonant cluster cs, nor correct Hungarian, because it denies the long vowel. So what is it after all?

There may not be an easy way to solve such situations so that everybody would be pleased, but at least thinking about them does no harm. Sorry for being so long; perhaps someone finds my data interesting.

Regards,
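
The three variants discussed in the example above can be reproduced mechanically. The sketch below is only an illustration: it assumes the surname is spelled Törőcsik (o with diaeresis, then o with double acute), and the helper name `replace_mark` is invented for the example. It decomposes the text, substitutes or deletes one combining mark, and recomposes.

```python
import unicodedata

DIAERESIS = "\u0308"     # COMBINING DIAERESIS
DOUBLE_ACUTE = "\u030B"  # COMBINING DOUBLE ACUTE ACCENT

def replace_mark(text, old, new):
    """Replace combining mark `old` with `new` (use "" to delete it)."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(old, new))

def strip_all_marks(text):
    """Delete every nonspacing mark (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

name = "T\u00f6r\u0151csik"                          # assumed spelling
print(replace_mark(name, DOUBLE_ACUTE, ""))          # drop only the unfamiliar mark
print(replace_mark(name, DOUBLE_ACUTE, DIAERESIS))   # soften it to a diaeresis
print(strip_all_marks(name))                         # strip everything
```

The first call yields the misleading mixed form, the second the "least unacceptable" form with two plain diaereses, and the third the fully stripped spelling.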
RE: Looking for transcription or transliteration standards latin- arabic
On Behalf Of: busmanus
Sent: Thursday, July 08, 2004 1:27 PM

> I do not pretend to know, but accept is probably not the best word to use in this context, after all it's not about the spelling of English words. And not every tradition needs to be hundreds of years old.

Actually, I was stating the most extreme case. "Hundreds of years" meant, basically, the lifetime of any given reader, so that maximum familiarity could be achieved. I do not believe that even in such a case the average reader would become comfortable with foreign diacritics. Although I speak with regard to English, as it is the only language I know well enough, I believe the principle applies to all languages, as it is an issue of familiarity, which is rather general to humanity.

> it would definitely be completely unacceptable to write e.g. Hašek's name (a famous Czech satirist) as Hasek.

When transcribing to English, however, removal of the caron (macron? Apologies, but I tend to forget the names of most accents) would be most acceptable (for American English, at least).

> Once we got into this debate, let me quote an example where distinguishing between diacritics as familiar and unfamiliar may lead to undesirable results.

<SNIP/>

Interesting case, and one reason why diacritic stripping, although brutal, may be desirable - it doesn't pretend to be accurate. Accuracy can be very hard to achieve when transcribing, especially since diacritics can be used to indicate very different things in different languages.

> There may not be an easy way to solve such situations, so that everybody would be pleased, but at least thinking about them does no harm. Sorry for being so long, perhaps someone finds my data interesting.

I do find it interesting. It gave me some insight into the European view of diacritics, which is very different from mine. For instance, it seems that diacritics have similar effects on vowels, and that those vowels have similar sounds both before and after modification, across most (all?) European languages - am I reading correctly here?

Thanks,

/|/|ike
Re: Looking for transcription or transliteration standards latin- arabic
Mike Ayers wrote:

>> it would definitely be completely unacceptable to write e.g. Hašek's name (a famous Czech satirist) as Hasek.
>
> When transcribing to English, however, removal of the caron (macron? Apologies, but I tend to forget the names of most accents) would be most acceptable (for American English, at least).

Caron, or more commonly hacek. A macron is a shortish overline.

English-speaking classical music buffs quickly learn to associate the diacritic-free spelling Dvorak with the (approximate) pronunciation /'dvrk/. Whether Dvorak is an acceptable way to spell Dvořák probably depends on who's doing the accepting. For the computer columnist and the keyboard layout inventor, whose names are apparently pronounced /'dvk/ anyway, it's fine.

>> Once we got into this debate, let me quote an example where distinguishing between diacritics as familiar and unfamiliar may lead to undesirable results.
>
> <SNIP/>
>
> Interesting case, and one reason why diacritic stripping, although brutal, may be desirable - it doesn't pretend to be accurate. Accuracy can be very hard to achieve when transcribing, especially since diacritics can be used to indicate very different things in different languages.

"Desirable because it doesn't pretend to be accurate." That's a useful philosophy at times, but I have to admit I'm surprised to see it expressed on the Unicode list.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
RE: Looking for transcription or transliteration standards latin- arabic
At 14:57 -0700 2004-07-08, Mike Ayers wrote:

> When transcribing to English, however, removal of the caron (macron? Apologies, but I tend to forget the names of most accents) would be most acceptable (for American English, at least).

NOT in good typography, ever.

> It gave me some insight into the European view of diacritics, which is very different from mine. For instance, it seems that diacritics have similar effects on vowels, and that those vowels have similar sounds both before and after modification, across most (all?) European languages - am I reading correctly here?

Not really. Diacritics may affect the quantity of a vowel, the quality of a vowel, or simply indicate something about a word's history.

I think it's stupid (in general) to argue for stripping a letter of diacritics. If a reader is ignorant of their meaning, that can be cured. But if they are meaningful, stripping them is just misspelling the words they belong to. Why would anyone want to do that?

--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Looking for transcription or transliteration standards latin- arabic
From: Michael Everson

> I think it's stupid (in general) to argue for stripping a letter of diacritics. If a reader is ignorant of their meaning, that can be cured. But if they are meaningful, stripping them is just misspelling the words they belong to. Why would anyone want to do that?

I think it's inadvisable (in general) to call things stupid merely because one does not see the need. On the whole, that is a better time to ask the question than to make the judgment.

There is actually a great deal of both European and American data (in programs like Microsoft Exchange and Outlook, as well as in web search) showing that folding away diacritics as a part of giving full lists of possible matches is indeed preferred by users. Now they would (also) prefer the exact matches to have priority, but having additional matches without the diacritics is a common request, and one that has been built into many scenarios.

Formalizing that operation in Unicode is only a bad thing (or a stupid thing, to use your words) if creating a standard that meets real world needs (as opposed to ideal typographic or linguistic preferences) is considered a bad (or stupid) thing. As far as I know, most of the members of the Unicode Consortium have those real world use cases as their first priority.

MichKa [MS]
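
The behaviour described here, folded matches offered in addition to exact ones with exact matches listed first, might look roughly like the sketch below. It is not how Exchange, Outlook or any search engine actually implements matching; `strip_marks` and `search` are hypothetical names, and the fold is the simple "decompose, then drop Mn-class marks" operation plus case folding.

```python
import unicodedata

def strip_marks(text):
    """Decompose canonically and delete nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def search(query, candidates):
    """Return exact matches first, then candidates equal after folding."""
    key = strip_marks(query).casefold()
    exact = [c for c in candidates if c == query]
    folded = [c for c in candidates
              if c != query and strip_marks(c).casefold() == key]
    return exact + folded

names = ["Dvo\u0159\u00e1k", "Dvorak", "Novak"]
print(search("Dvorak", names))   # exact hit first, then the accented form
```

Keeping the exact matches in a separate list is a deliberate choice in this sketch: the folded key decides what is included, while the unfolded comparison decides the order.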
Re: Looking for transcription or transliteration standards latin- arabic
> Why would anyone want to do that?

I tend to be with you on this, that it does little harm to retain accents. However, most major popular periodicals follow this practice; for example, The Economist keeps accents for French, German, Spanish, and Italian words and names but discards others (as I recall).

In one sense, using Dvorak in English for Dvořák is little different than using Cologne in English for Köln. Both are transcriptions into a form that has become more or less customary.

Mark

----- Original Message -----
From: Michael Everson
Sent: Thursday, July 08, 2004 15:13
Subject: RE: Looking for transcription or transliteration standards latin- arabic

> At 14:57 -0700 2004-07-08, Mike Ayers wrote:
>
>> When transcribing to English, however, removal of the caron (macron? Apologies, but I tend to forget the names of most accents) would be most acceptable (for American English, at least).
>
> NOT in good typography, ever.
>
>> It gave me some insight into the European view of diacritics, which is very different from mine. For instance, it seems that diacritics have similar effects on vowels, and that those vowels have similar sounds both before and after modification, across most (all?) European languages - am I reading correctly here?
>
> Not really. Diacritics may affect the quantity of a vowel, the quality of a vowel, or simply indicate something about a word's history.
>
> I think it's stupid (in general) to argue for stripping a letter of diacritics. If a reader is ignorant of their meaning, that can be cured. But if they are meaningful, stripping them is just misspelling the words they belong to. Why would anyone want to do that?
>
> --
> Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Looking for transcription or transliteration standards latin- arabic
Transcription is useful and necessary, transliteration less so.

When transcribing from, for example, Czech into English, we should not be misled by the fact that in Unicode both use the Latin script. In fact, Czech uses the Czech script (= writing system, in this case), and English uses the English script. The Czech script includes letter-diacritic combinations that are not part of the English script or that may have a different meaning. To the English or American reader who does not know Czech they are incomprehensible, so he relies on transcription. The purpose of transcription is to copy the word into the English script. If the reader, or all intended readers, are comfortable with the Czech script then transcription is not necessary.

The situation is only slightly different from Russian to English transcription. It appears to be different because the Russian script looks different.

Now that we have moved from the world of typewriters, which imposed technical constraints on the writer, such as being able to use only the limited set of characters implemented, to the world of Unicode, which removes this constraint, transliteration is no longer needed or useful. Transliteration is a one-to-one mapping between scripts, and the reader needs to be familiar with both scripts and the transliteration rules to make sense of it.

Jony

-----Original Message-----
On Behalf Of: Michael Everson
Sent: Friday, July 09, 2004 1:13 AM
Subject: RE: Looking for transcription or transliteration standards latin- arabic

> At 14:57 -0700 2004-07-08, Mike Ayers wrote:
>
>> When transcribing to English, however, removal of the caron (macron? Apologies, but I tend to forget the names of most accents) would be most acceptable (for American English, at least).
>
> NOT in good typography, ever.
>
>> It gave me some insight into the European view of diacritics, which is very different from mine. For instance, it seems that diacritics have similar effects on vowels, and that those vowels have similar sounds both before and after modification, across most (all?) European languages - am I reading correctly here?
>
> Not really. Diacritics may affect the quantity of a vowel, the quality of a vowel, or simply indicate something about a word's history.
>
> I think it's stupid (in general) to argue for stripping a letter of diacritics. If a reader is ignorant of their meaning, that can be cured. But if they are meaningful, stripping them is just misspelling the words they belong to. Why would anyone want to do that?
>
> --
> Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Looking for transcription or transliteration standards latin- arabic
-----Original Message-----
On Behalf Of: Mark Davis
Sent: Friday, July 09, 2004 3:43 AM
To: Michael Everson
Subject: Re: Looking for transcription or transliteration standards latin- arabic

> ...
> In one sense, using Dvorak in English for Dvořák is little different than using Cologne in English for Köln. Both are transcriptions into a form that has become more or less customary.

Cologne is not a transliteration of Köln but the English name of the city, just as Munich, Rome, Moscow, The Hague, Longhorn, Venice, Jaffa and Jerusalem. Why a foreign city should have an English name is an interesting philosophical question, but not directly concerned with Unicode. This is however common in many languages.

The transliteration of Köln would be Koln.

Jony

> Mark
Re: Looking for transcription or transliteration standards latin- arabic
From: Mark Davis

> In one sense, using Dvorak in English for Dvořák is little different than using Cologne in English for Köln. Both are transcriptions into a form that has become more or less customary.

If at all, Köln is a German and Cologne is a French/English transcription of the Latin name Colonia.

Adam
Re: Looking for transcription or transliteration standards latin- arabic
Peter Kirk writes:

> This is more complicated than it looks. The Greek form Istimboli is impossible for the period as Greek had no [b] sound, for β was pronounced [v] except that later and perhaps already at that period μπ was pronounced [b] at least in foreign words. So is the Greek consonant cluster , or , or , or what? Also is the previous consonant cluster as transliterated, or corresponding to isthmus? And then what are the Greek vowels?

I was only trying to grasp the sense of Gerd's throw-away remark (which I hope he will explain), but I appreciate the difficulties you raise, especially the point about the Greek beta as the phoneme /v/. That particular difficulty at least doesn't apply to the Ottoman b, if we look for a Turkish -bul.

Raymond Mercier
http://ourworld.compuserve.com/homepages/RaymondM
Re: Looking for transcription or transliteration standards latin- arabic
On 07/07/2004 07:08, Raymond Mercier wrote:

> ... I was only trying to grasp the sense of Gerd's throw-away remark (which I hope he will explain), but I appreciate the difficulties you raise, especially the point about the Greek beta as the phoneme /v/. That particular difficulty at least doesn't apply to the Ottoman b, if we look for a Turkish -bul.

The last part is uncontroversial, I think. The uncertainty is over the first part of the word.

Google gives only three hits for istimboli, one of which (http://linguistlist.org/issues/3/3-929.html) says:

"An interesting historical case is Istanbul, whose name comes from the Greek phrase eis ten poli (to the city -- first e is epsilon, and second e is eta). That phrase tended to be pronounced istimboli and with dissimilation istamboli. So when the Turks changed the name from Constantinople to Istanbul, they simply changed from a name with an obvious Greek derivation to one with a nonobvious Greek derivation."

This is a possible derivation. If this is Gerd's source, he failed to make the point that istimboli was not a Greek name of the city but a colloquial pronunciation of a phrase. And the source of that may be the following old German text, from http://www.staff.ncl.ac.uk/jon.west/get/hc0144_3.htm:

"Constantinopel hayssen die Chrichen Istimboli und die Thürcken hayssends Stambol;"

And according to http://www.fotoist.8m.com/ad.htm (in Turkish) this information comes from the 14th-15th century German traveller Johan Schildtberger.

But I have my suspicions about this information. The Greeks had no problem with initial consonant clusters but the Turks did, so it is much more likely that the Turks added the initial I to a Greek word starting with ST, just as Spanish and French add initial E before such clusters.

--
Peter Kirk
http://www.qaya.org/
[OT] Istanbul [was: Re: Looking for transcription or transliteration standards latin- arabic]
Constantinopel hayssen die Chrichen Istimboli und die Thürcken hayssends Stambol; The Greeks had no problem with initial consonant clusters but the Turks did, so it is much more likely that the Turks added the initial I to a Greek word starting with ST, just as Spanish and French add initial E before such clusters. Are you sure about the Turks and the initial consonant clusters? I always thought it depends on the actual cluster structure. Modern Turkish at least has loanwords such as brokoli, graten or the notorious spor where the problem is the word-*final* cluster, not the word-*initial* one. While Turkic roots usually do not begin with consonant clusters, it appears to be OK in loans. The situation is a bit difficult because of the Persian and Arabic adstrata in Ottoman Turkish. Both Arabic and Persian definitely do not allow word-initial consonant clusters at all, which led to a lot of words with auxiliary vowels in Turkish. However, these words already had the auxiliary vowels when Philipp -- Was für Japan ist der Tenno, ist für Frankfurt Brezel-Benno. - Brezelverkäufer in Frankfurt/Main
Re: [OT] Istanbul [was: Re: Looking for transcription or transliteration standards latin- arabic]
On 07/07/2004 11:22, Philipp Reichmuth wrote: ... Are you sure about the Turks and the initial consonant clusters? I always thought it depends on the actual cluster structure. Modern Turkish at least has loanwords such as brokoli, graten or the notorious spor where the problem is the word-*final* cluster, not the word *initial* one. While Turkic roots usually do not begin with consonant clusters, it appears to be OK in loans. There are certainly no word initial consonant clusters in native Turkic words. Looking at the specific ST cluster in my Turkish-English dictionary, there are a number of words listed, but they are all transparently loans from western languages and the kinds of words which were probably borrowed in the 20th century: stabilize, stadyum/stat, staj, stajyer, stand, standart, star, statik, stat, statko, sten, steno(grafi), step (steppe), stepne (spare tyre), stereo(foni(k)), stereotip, steril(ize/izasyon), sterlin, stetoskop, setyn (station wagon), stil, stilistik, stilo, stok(u/lamak), stop, stopaj, stor, strateji(k), stratosfer, stratus, streptokok, streptomisin, stres (medical), striptiz(ci), stdyo. But here are the corresponding words with word initial added vowels: stampa, stavroz/istavroz (from Greek stavros), istasyon, istatistik(i), istavrit (a fish), istep (steppe), istim, istimbot, istiridye (? oyster), istop, usturmaa (a nautical term probably from storm). These words seem to be rather older loans, some perhaps 19th century but stavroz/istavroz is surely much earlier, also istavrit if that is a loan from Greek stavrit- as seems likely. These are complete lists for ST but the same happens with other consonant clusters e.g. SP, SV. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Looking for transcription or transliteration standards latin- arabic
Peter Kirk wrote: On 07/07/2004 07:08, Raymond Mercier wrote: This is a possible derivation. If this is Gerd's source, he failed to make the point that istimboli was not a Greek name of the city but a colloquial pronunciation of a phrase. And the source of that may be the following old German text, from http://www.staff.ncl.ac.uk/jon.west/get/hc0144_3.htm: Constantinopel hayssen die Chrichen Istimboli und die Thürcken hayssends Stambol; And according to http://www.fotoist.8m.com/ad.htm (in Turkish) this information comes from the 14th-15th century German traveller Johan Schildtberger. But I have my suspicions about this information. The Greeks had no problem with initial consonant clusters but the Turks did, so it is much more likely that the Turks added the initial I to a Greek word starting with ST, just as Spanish and French add initial E before such clusters. French has not added an initial E in front of ST for the last five centuries (see: stop, start, sport (*), stage, stature, station, etc.); historically, in Old French, it did (estable [stable], estamper [to stamp], estat [state, station], esterlin [sterling], estrange [strange, stranger]). Old French predates the fall of Constantinople and the end of the Hundred Years' War (both in 1453, as all French-speaking schoolchildren learn). Spanish still does (or at least did recently); see the recent loanwords esquí (ski) and esprint (sprint). P. A. (*) English word derived from an Old French word desport / deport (entertainment); see deporte in Spanish and desporto/desporte in Portuguese (but esporte in Brazil).
Re: Looking for transcription or transliteration standards latin- arabic
An interesting historical case is Istanbul, whose name comes from the Greek phrase eis ten poli (to the city -- first e is epsilon, and second e is eta). That phrase tended to be pronounced istimboli and with dissimilation istamboli. So when the Turks changed the name from Constantinople to Istanbul, they simply changed from a name with an obvious Greek derivation to one with a nonobvious Greek derivation. This explanation seems rather Byzantine to me. -- Curtis Clark http://www.csupomona.edu/~jcclark/ Web Coordinator, Cal Poly Pomona +1 909 979 6371 Professor, Biological Sciences +1 909 869 4062
RE: Looking for transcription or transliteration standards latin- arabic
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Anto'nio Martins-Tuva'lkin Sent: Tuesday, July 06, 2004 9:04 PM On 2004.07.07, 00:49, Mike Ayers [EMAIL PROTECTED] wrote: Are you implying that, had printers throughout the centuries put the effort into faithfully reproducing every obscure symbol I spell my own name with some of those obscure symbols, thank you. Yep. Hope you don't mind my inability to pronounce it. However, grave (and acute) accents hardly rate as obscure, so I could pronounce through them and get passably close. Even here in the cultural boondocks we know that. Obscure indeed -- that's the last thing I'd expect in a list such as this! Is internationalization is serious issue, or just a toy to kill off idle time? Oh, calm down. We were originally talking about Vietnamese diacritics, many of which definitely qualify as obscure, the rest being obscure uses of more familiar diacritics. Just because you don't like the kind of internationalization I mentioned does not mean it shouldn't be discussed. from every foreign language, that the modern American would accept words with arbitrary diacritics? Foreign? American? I obviously misunderstood the whole purpose of these discussions, then. Bye bye -- will back as soon as I get my Green Card, señor! ;-) Are you just trying to kick up dirt here, or were you genuinely unaware that National Geographic is an American publication? I specified American, as opposed to English speaking in this case for that reason, also because the British are known to be more familiar with, and therefore tolerant of, various diacritics. I doubt, however, that this would have any bearing on Vietnamese, which, while it uses familiar looking diacritics, uses them in very unfamiliar (to Europeans in general, as best I understand it) ways.
Now, in a last desperate hope to address the issue I raised: does the practice of stripping diacritics have a name? Thanks, /|/|ike
Re: Looking for transcription or transliteration standards latin- arabic
On 07/07/2004 17:04, Mike Ayers wrote: ... Are you just trying to kick up dirt here, or were you genuinely unaware that National Geographic is an American publication? I specified American, as opposed to English speaking in this case for that reason, also because the British are known to be more familiar with, and therefore tolerant of, various diacritics. I doubt, however, that this would have any bearing on Vietnamese, which, while it uses familiar looking diacritics, uses them in very unfamiliar (to Europeans in general, as best I understand it) ways. Indeed we British are more tolerant. Most of us have learned at least a little French and so vaguely know what e acute sounds like, perhaps also e grave, and that e with an accent is not silent, as in café. Other accents we tend to understand as marking stress and/or length, which works for Spanish and probably also António's Portuguese. So we do a lot better in guessing pronunciation than we would do if the diacritics were stripped off completely, even if we don't actually understand properly what they mean. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
FW: Looking for transcription or transliteration standards latin- arabic
John notified me that he intended to CC the list, so here it is: -----Original Message----- From: John Cowan [mailto:[EMAIL PROTECTED]] Sent: Wednesday, July 07, 2004 8:32 AM To: Mike Ayers Subject: Re: Looking for transcription or transliteration standards latin- arabic Mike Ayers scripsit: Now, in a last desperate hope to address the issue I raised: does the practice of stripping diacritics have a name? The Unicode people are probably going to standardize on calling it diacritic folding, by analogy to the term case folding. I have provided them with a table that does diacritic folding for the Latin, Greek, Cyrillic, and Hebrew scripts; it does not, however, remove combining diacritics (which is easy to do on your own). -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com There are three kinds of people in the world: those who can count, and those who can't.
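[Editorial sketch] The "easy to do on your own" part, removing combining diacritics, might look like the following in Python. This is only an illustration under stated assumptions: the function name is made up here, it is not John Cowan's table, and it folds only what canonical decomposition exposes. Letters with no decomposition (such as Ø or ð) pass through unchanged, which is the reason a hand-built folding table is still needed.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Fold diacritics by decomposing and dropping nonspacing marks.

    Hypothetical helper for illustration only; not the folding table
    mentioned in the message above.
    """
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every nonspacing mark (general category Mn).
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    # Recompose whatever is left so the result is in NFC.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    for word in ["Hà Tĩnh", "António", "café", "Ørsted"]:
        print(word, "->", strip_combining_marks(word))
```

Note that "Ørsted" comes back unchanged: the stroke is part of the letter, not a combining mark, so decomposition alone cannot fold it.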
Re: Looking for transcription or transliteration standards latin- arabic
On 03/07/2004 00:07, Patrick Andries wrote: Jony Rosenne wrote: -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John H. Jenkins Peking for Běijīng. :-) Or Constantinople for Istanbul. :-) Two very different political realities (before and after 1453). Cities change names without going through transliterations, cf. Berlin (Ontario) becoming Kitchener in 1916. But Constantinople - Istanbul is not in fact this kind of name change, for Istanbul (that is, İstanbul) is probably a corrupted and shortened version of Constantinople, with the initial I added to fit Turkish phonology (cf. the old western version Stamboul, still used in Russian, also Smyrna - Izmir). (I have also heard it said that Istanbul comes from Greek EIS TĒN POLIN "to the city", but that seems less likely to me.) So the change is more like Beijing - Peking than Berlin - Kitchener. I guess another similar change would be Danzig - Gdansk, but I don't know where the initial G came from so possibly the Polish form is older than the German. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Looking for transcription or transliteration standards latin- arabic
In a message of Tue, 06-07-2004, at 10:50 +0100, Peter Kirk wrote: I guess another similar change would be Danzig - Gdansk, but I don't know where the initial G came from so possibly the Polish form is older than the German. A name with initial Gd is older than one with D: http://encyclopedia.thefreedictionary.com/Gdansk http://en.wikipedia.org/wiki/Gda%C5%84sk#Names but Wikipedia now has a heated dispute about what it should call the city: http://en.wikipedia.org/wiki/Talk:Gdansk/Naming_convention -- Marcin Kowalczyk [EMAIL PROTECTED] http://qrnik.knm.org.pl/~qrczak/
Re: Looking for transcription or transliteration standards latin- arabic
Peter Kirk wrote: On 03/07/2004 00:07, Patrick Andries wrote: Two very different political realities (before and after 1453). Cities change names without going through transliterations, cf. Berlin (Ontario) becoming Kitchener in 1916. But Constantinople - Istanbul is not in fact this kind of name change, for Istanbul (that is, İstanbul) is probably a corrupted and shortened version of Constantinople, with the initial I added to fit Turkish phonology (cf. the old western version Stamboul, still used in Russian, also Smyrna - Izmir). (I have also heard it said that Istanbul comes from Greek EIS TĒN POLIN "to the city", but that seems less likely to me.) Yes, I have heard this. So the change is more like Beijing - Peking than Berlin - Kitchener. Without a political change Constantinople would not have changed name in a matter of days (at least as far as the officials were concerned). In any case, it is not a transliteration problem (Beijing -- Pékin). P. A.
Re: Looking for transcription or transliteration standards latin- arabic
Patrick Andries wrote: So the change is more like Beijing - Peking than Berlin - Kitchener. Without a political change Constantinople would not have changed name in a matter of days (at least as far as the officials were concerned). In any case, it is not a transliteration problem (Beijing -- Pékin). [PA] I wrote this a bit too fast this morning (first message!). I believe the origin of Istanbul is a bit too obscure to decide whether it is due to a transcription or a complete name change. Just to confuse things further, Konstantaniye was apparently used by the Turkish administration and a Greek form Istimboli is attested in the XIVth century. P. A.
Re: Looking for transcription or transliteration standards latin- arabic
On 06/07/2004 13:05, Patrick Andries wrote: Patrick Andries wrote: So the change is more like Beijing - Peking than Berlin - Kitchener. Without a political change Constantinople would not have changed name in a matter of days (at least as far as the officials were concerned). In any case, it is not a transliteration problem (Beijing -- Pékin). Well, did Gdansk/Danzig change its name backwards and forwards several times over history (thank you, Qrczak, for the interesting information about that), or was it simply that it had different names in different languages? This makes it not a transliteration problem but a translation problem, one which is common to many geographical names - sometimes the names in different languages are related, and sometimes they are not e.g. Turku/Åbo in Finland, or Yerushalayim/al-Quds, or Dublin/(I'll let Michael tell us the correct Irish form). [PA] I wrote this a bit too fast this morning (first message!). I believe the origin of Istanbul is a bit too obscure to decide whether it is due to a transcription or a complete name change. Just to confuse things further, Konstantaniye was apparently used by the Turkish administration and a Greek form Istimboli is attested in the XIVth century. Thanks for this. The matter is indeed not so simple. P. A. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Looking for transcription or transliteration standards latin- arabic
Patrick Andries scripsit: [PA] I wrote this a bit too fast this morning (first message!). I believe the origin of Istanbul is a bit too obscure to decide whether it is due to a transcription or a complete name change. Just to confuse things further, Konstantaniye was apparently used by the Turkish administration and a Greek form Istimboli is attested in the XIVth century. Thanks a lot for this interesting information. I think the underlying meaning of Istimboli must be "town at the isthmus", which indeed makes sense. Gerd
Re: Looking for transcription or transliteration standards latin- arabic
The latter problem could be solved easily by transcribing ð as dh, but English speakers seem really terrified of the sequence dh. Not quite so fast. Where a d can end a syllable and an h can start one, then it can collide with dh representing ð. The general issue is that whenever you use a sequence of letters in the target for transliteration/transcription, and the elements of that sequence can individually be targets, then you can get ambiguity. There are mechanisms to separate a sequence of letters that would otherwise be read as a unit: apostrophe: as in Japanese transliterations (When vowels or consonant y follow the syllabic nasal n, ng, m, add apostrophe (') after n. Example: ren'ai / gen'in / sin'en / kon'ya -- Cabinet Order (Kunrei) No.1); hyphen: as in the Korean Ministry of Education transliteration, to distinguish jeong-eum versus jeon-geum; diaeresis on second element: (doesn't work very well, since it only really sits well on vowels). Transcriptions are another matter; the reader can read Tchaikovsky or Beijing without knowing anything at all about Cyrillic or Chinese, and still come close (theoretically) to the real pronunciation. Agreed about the distinction in meaning between 'transcription' and 'transliteration'. However, the two examples of transcriptions are not necessarily good ones, at least for English speakers: the only reason that English speakers will read 'Tchaikovsky' reasonably is because they have learned the word, since it doesn't follow normal English orthographic rules.
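[Editorial sketch] The collision described above, and the apostrophe remedy, can be made concrete with a toy scheme in Python. Everything here is an assumption for illustration: the two-entry mapping, the function names, and the restriction to lowercase input containing no apostrophes of its own. It is not any published transliteration standard.

```python
# Hypothetical toy scheme: two source letters are written as digraphs.
DIGRAPHS = {"ð": "dh", "þ": "th"}
REVERSE = {latin: letter for letter, latin in DIGRAPHS.items()}
SEPARATOR = "'"


def transliterate(source: str, separate: bool = True) -> str:
    """Map source letters to Latin, optionally marking accidental d+h / t+h."""
    out = []
    prev_plain = ""  # last letter copied through unchanged, if any
    for ch in source:
        if ch in DIGRAPHS:
            out.append(DIGRAPHS[ch])
            prev_plain = ""
        else:
            # A literal h after a literal d or t would read as a digraph,
            # so the separator keeps the two letters apart.
            if separate and ch == "h" and prev_plain in ("d", "t"):
                out.append(SEPARATOR)
            out.append(ch)
            prev_plain = ch
    return "".join(out)


def restore(target: str) -> str:
    """Invert transliterate(); only reliable when separate=True was used."""
    out = []
    i = 0
    while i < len(target):
        pair = target[i:i + 2]
        if pair in REVERSE:
            out.append(REVERSE[pair])
            i += 2
        elif target[i] == SEPARATOR:
            i += 1  # the separator carries no letter of its own
        else:
            out.append(target[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Without a separator, two different sources collide.
    print(transliterate("að", separate=False), transliterate("adh", separate=False))
    # With it, both survive a round trip.
    for word in ["að", "adh", "cathouse"]:
        latin = transliterate(word)
        print(word, "->", latin, "->", restore(latin))
```

With the separator the mapping round-trips, which is what makes it a transliteration in the sense used in this thread; without it, information is lost and the result is only a transcription.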
Mark ----- Original Message ----- From: Doug Ewell [EMAIL PROTECTED] To: Unicode Mailing List [EMAIL PROTECTED] Cc: Mark Davis [EMAIL PROTECTED]; Mike Ayers [EMAIL PROTECTED] Sent: Saturday, July 03, 2004 09:40 Subject: Re: Looking for transcription or transliteration standards latin- arabic Mark Davis wrote: In that case, we'd call it a transcription, since it doesn't roundtrip from source to target back to source. It is actually quite common for style guides for non-academic publications to have a restricted list of characters and character + accent combinations, and convert all others. For example, the Economist style guide, as I recall, recommends keeping accents in French, German, Italian, and Spanish names and words, but dropping them otherwise; and converting characters like þ and ð to nearest equivalents, th. Note that the latter loses information in two ways; the obvious one is that the distinction between þ and ð is lost; the less obvious one is that the distinction between them and a *real* 't' followed by 'h' in the source is lost. So that loses the distinction in sounds between 'th' in 'cathode' and 'cathouse', as well as between 'thy' and 'thigh'. The latter problem could be solved easily by transcribing ð as dh, but English speakers seem really terrified of the sequence dh. The former problem is only a problem if t + h combinations (like cathouse) are actually used in the language. I don't know if this is true for Icelandic. It is certainly true for Old English, where þ and ð are also seen.
-Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/ - Original Message - From: Doug Ewell [EMAIL PROTECTED] To: Unicode Mailing List [EMAIL PROTECTED] Cc: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED] Sent: Saturday, July 03, 2004 14:22 Subject: Re: Looking for transcription or transliteration standards latin- arabic Anto'nio Martins-Tuva'lkin antonio at tuvalkin dot web dot pt wrote: Only specialists can make sense of them, Pray tell, why so? Is the letter an usuperable obstacle for those who know only the letter a?... Can't the remove diacriticals action be performed in the reader's brain, instead of in the typesetter's office? But if the reader merely removes the diacriticals, that destroys the whole purpose of using a *transliteration* scheme, where 'a' and '' represent different letters in the source writing system. Jony's point (I think) was that only specialists can keep track of which target characters represent which source characters, especially when obscure diacritics or digits or other symbols are used. At that point, the specialist probably knows the source characters well enough to read them directly, and the widespread use of Unicode enables document producers to use them directly. Transcriptions are another matter; the reader can read Tchaikovsky or Beijing without knowing anything at all about Cyrillic or Chinese, and still come close (theoretically) to the real pronunciation. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Looking for transcription or transliteration standards latin- arabic
Title: RE: Looking for transcription or transliteration standards latin- arabic From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Anto'nio Martins-Tuva'lkin Sent: Saturday, July 03, 2004 7:28 AM On 2004.07.02, 21:53, Mike Ayers [EMAIL PROTECTED] wrote: On the other hand, maybe Ha Tinh is just lazy typography. From National Geographic? Medoubts. This is a deliberate removal of the diacritics unfamiliar to English readers, and is a traditional way to present foreign words. It is lazy typography, then. Deliberate, traditional and lazy. ;-) No. Lazy implies not doing something to avoid doing the work. This is not the case here. It's an accessibility issue. From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Anto'nio Martins-Tuva'lkin Sent: Saturday, July 03, 2004 12:37 PM Pray tell, why so? Is the letter an usuperable obstacle for those who know only the letter a?... For some of us, at least, yes. The diacritic implies, by its very existence, that it has meaning, but I do not know what that meaning is, so I am stymied. Removing the diacritics yields a strange word, but one which I can probably absorb. Can't the remove diacriticals action be performed in the reader's brain, instead of in the typesetter's office? Again, for at least some of us (and I suspect this is a majority of the population unfamiliar with a given diacritic), simply ignoring diacritics is not an option, just as ignoring letters would not be. /|/|ike
Re: Looking for transcription or transliteration standards latin- arabic
On 2004.07.06, 14:00, Peter Kirk [EMAIL PROTECTED] wrote: sometimes the names in different languages are related, and sometimes they are not e.g. Turku/Åbo in Finland, or Yerushalayim/al-Quds, or Dublin/ Baile Átha Cliath. (Formerly, with U+1E6B for the th.) This makes it not a transliteration problem but a translation problem, Quite right! --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
Re: Looking for transcription or transliteration standards latin- arabic
Gerd Schumacher wrote: I think, the underlying meaning of Istimboli must be "town at the isthmus", which makes sense, indeed. How does that work? Do you mean istim, bol? Raymond Mercier
Re: Looking for transcription or transliteration standards latin- arabic
Peter Kirk scripsit: Well, did Gdansk/Danzig change its name backwards and forwards several times over history (thank you, Qrczak, for the interesting information about that), or was it simply that it had different names in different languages? Yes to both. Its name in Polish is Gdan'sk, in German Danzig. Which one is the dominant name is determined by which power is dominant at a given time. What foreigners call the city is influenced, though not determined, by when the city first became important to them. There is hardly a city in Europe that isn't like this. What makes this one special, though hardly unique, is the repeated changes of sovereignty. Consider Strassburg/Strasbourg. This makes it not a transliteration problem but a translation problem, one which is common to many geographical names - sometimes the names in different languages are related, and sometimes they are not e.g. Turku/Åbo in Finland, or Yerushalayim/al-Quds, or Dublin/(I'll let Michael tell us the correct Irish form). Baile Atha Cliath. Dublin is also an Irish name, though used mostly by Norse and English (and now by anglophone Irish, of course). -- My confusion is rapidly waxing John Cowan For XML Schema's too taxing:[EMAIL PROTECTED] I'd use DTDshttp://www.reutershealth.com If they had local trees -- http://www.ccil.org/~cowan I think I best switch to RELAX NG.
Re: Looking for transcription or transliteration standards latin- arabic
Patrick Andries scripsit: So the change is more like Beijing - Peking than Berlin - Kitchener. Without a political change Constantinople would not have changed name in a matter of days (at least as far as the officials were concerned). In any case, it is not a transliteration problem (Beijing -- Pékin). Not just a transliteration problem, either: Mandarin Chinese underwent a sound-shift in the 17th century that changed the second syllable from ging to jing, but the English name was already set (and the change did not affect Southern Sinitic in any case; cf. Cantonese pak king). In addition, when it isn't the capital (bei jing = North-capital), i.e. 1928-49, its name is Beiping (north-peace). -- Here lies the Christian,John Cowan judge, and poet Peter, http://www.reutershealth.com Who broke the laws of God http://www.ccil.org/~cowan and man and metre. [EMAIL PROTECTED]
Re: Looking for transcription or transliteration standards latin- arabic
On 06/07/2004 20:47, Raymond Mercier wrote: Gerd Schumacher wrote I think, the underying meaning of Istimboli must be town at the isthmus, which makes sense, indeed. How does that work ? Do you mean istim , bol ? Raymond Mercier This is more complicated than it looks. The Greek form Istimboli is impossible for the period as Greek had no [b] sound, for was pronounced [v] except that later and perhaps already at that period was pronounced [b] at least in foreign words. So is the Greek consonant cluster , or , or , or what? Also is the previous consonant cluster as transliterated, or corresponding to isthmus? And then what are the Greek vowels? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Looking for transcription or transliteration standards latin-arabic
Mike Ayers wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Anto'nio Martins-Tuva'lkin Sent: Saturday, July 03, 2004 7:28 AM On 2004.07.02, 21:53, Mike Ayers [EMAIL PROTECTED] wrote: On the other hand, maybe Ha Tinh is just lazy typography. From National Geographic? Medoubts. This is a deliberate removal of the diacritics unfamiliar to English readers, and is a traditional way to present foreign words. It is lazy typography, then. Deliberate, traditional and lazy. ;-) No. Lazy implies not doing something to avoid doing the work. This is not the case here. It's an accessibility issue. Perhaps it is. But then it's partly due to the lazy tradition. Can't the remove diacriticals action be performed in the reader's brain, instead of in the typesetter's office? Again, for at least some of us (and I suspect this is a majority of the population unfamiliar with a given diacritic), simply ignoring diacritics is not an option I don't think it's a problem with any given diacritical. It's rather an indistinct horror of diacriticals in general in speakers of a language without any diacriticals at all, like English. E.g. Hungarian uses three diacriticals and Hungarian speakers make no big deal of just ignoring the meaningless caron in Czech or the grave and the cedilla in Roumanian names. On the other hand, I must admit that we can also be quite brutal to diacriticals in some newspapers or when it comes to a language like Vietnamese...
Re: Looking for transcription or transliteration standards latin- arabic
On 2004.07.07, 00:49, Mike Ayers [EMAIL PROTECTED] wrote: Are you implying that, had printers throughout the centuries put the effort into faithfully reproducing every obscure symbol I spell my own name with some of those obscure symbols, thank you. Obscure indeed -- that's the last thing I'd expect in a list such as this! Is internationalization is serious issue, or just a toy to kill off idle time? from every foreign language, that the modern American would accept words with arbitrary diacritics? Foreign? American? I obviously misunderstood the whole purpose of these discussions, then. Bye bye -- will back as soon as I get my Green Card, señor! ;-) --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
Re: Looking for transcription or transliteration standards latin-arabic
busmanus wrote: Philipp Reichmuth wrote: If we were starting from scratch today, we'd probably do better. (I hope we would retain the v sound in instead of converting it to f.) Except there is no v sound, only an f sound in the Russian pronunciation of due to regressive assimilation. Just like in English or French, as far as I can perceive. The reason for spellings like Stroganoff for Stroganov is word-final devoicing in Russian, which is absent from French and at least much less marked in English, so it had to be denoted explicitly. I was inaccurate here: word final devoicing does occur in French sometimes, but not in the voiced member of a voiced-unvoiced pair like /v/-/f/. In Russian it _only_ occurs in such pairs.
Re: Looking for transcription or transliteration standards latin- arabic
Doug Ewell scripsit: On the contrary, untransliterated (or untranscribed) text can only be read by people who know the original script. Transliterations and transcriptions at least give the Latin-script-only reader a fighting chance to pronounce the text. Transliterations don't work so well for that, but transliterating some scripts to Latin is a necessity (for me, at least) to even recognize them. I can cope with Greek, Hebrew, and Cyrillic, but an English text full of Arabic or Chinese names presented in the usual scripts for those languages would be hopeless -- I wouldn't be able to reliably tell one name from another. This is true even though I have no more Greek, Hebrew, or Russian than I have Arabic or Chinese. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com If he has seen farther than others, it is because he is standing on a stack of dwarves. --Mike Champion, describing Tim Berners-Lee (adapted)
Re: Looking for transcription or transliteration standards latin- arabic
Doug Ewell wrote: Transcription does not require roundtrip. It is intended in this case for the English speaker to be able to deliver an approximate pronunciation adapted to his native vocal capabilities. Except when it doesn't. We write Tchaikovsky, not Chykoffskee. But then, English spelling isn't really logical anyway, and the average English speaker will be able to produce something from Tchaikovsky that would be more or less recognizable by a Russian. If we were starting from scratch today, we'd probably do better. (I hope we would retain the v sound in instead of converting it to f.) Except there is no v sound, only an f sound in the Russian pronunciation of due to regressive assimilation. Chykoffskee is pretty accurate, actually. I'd say Tchaikovsky is just a spelling taken over from French at a time when French was pretty much the international common language at least in diplomacy and art. Philipp -- Nur Miele schwärmt die Kuh Roswitha und gibt so manchen Extra-Liter. - Miele-Melkmaschinenwerbung, 70er
Re: Looking for transcription or transliteration standards latin- arabic
Philipp Reichmuth wrote: Except there is no v sound, only an f sound in the Russian pronunciation of due to regressive assimilation. Chykoffskee is pretty accurate, actually. I'd say Tchaikovsky is just a spelling taken over from French at a time when French was pretty much the international common language at least in diplomacy and art. [PA] And the prevalence of French in the Russian imperial nobility. In French it is today Tchaïkovsky (with tréma), but the v looks like an attempt to transliterate; Russian names written in French in the XIXth century would usually transcribe as ff: boeuf Strogonoff, Michel Strogoff (Jules Verne), *Princesse Demidoff* née Strogonoff, Tchkoff as an émigré name in France [2 born in Paris between 1916 and 1940].
Re: Looking for transcription or transliteration standards latin- arabic
Philipp Reichmuth scripsit: Chykoffskee is pretty accurate, actually. Thank you. I have long since forgotten all the (very small amount of) Russian I ever learned, but I retain a firm grip on its phonology due to an interesting paedagogical device. My Russian instructor spent the first week or so of class teaching us to speak English with a Russian accent (and this I can do to this day). The idea was that having mastered this, we could then begin to speak Russian as well with a Russian accent, which is to say, perfectly. I'd say Tchaikovsky is just a spelling taken over from French at a time when French was pretty much the international common language at least in diplomacy and art. Doubtless. I have even seen it spelled in German fashion in English a time or two. -- I suggest you call for help,John Cowan or learn the difficult art of mud-breathing.[EMAIL PROTECTED] --Great-Souled Sam http://www.ccil.org/~cowan
Re: Looking for transcription or transliteration standards latin-arabic
Philipp Reichmuth wrote: If we were starting from scratch today, we'd probably do better. (I hope we would retain the v sound in Чайковский instead of converting it to f.) Except there is no v sound, only an f sound in the Russian pronunciation of Чайковский due to regressive assimilation. Just like in English or French, as far as I can perceive. The reason for spellings like Stroganoff for Stroganov is word-final devoicing in Russian, which is absent from French and at least much less marked in English, so it had to be denoted explicitly.
Re: Hausa: Boko-Ajami? (RE: Looking for transcription or transliteration standards latin- arabic)
You might take a look at what we have in ICU for doing transliteration. It is rule-based, where each of the rules can take the context of surrounding letters into account. For information, see http://oss.software.ibm.com/icu/userguide/Transform.html http://oss.software.ibm.com/icu/userguide/TransformRule.html You can try out the rules with an interactive demo at http://oss.software.ibm.com/cgi-bin/icu/tr rk - Original Message - From: Donald Z. Osborn [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Friday, July 02, 2004 21:52 Subject: Hausa: Boko-Ajami? (RE: Looking for transcription or transliteration standards latin- arabic) I've read selected messages in this thread (on Unicode list) and some messages bring to mind the thought of developing routines or standards to permit toggling back and forth between standard Latin and Arabic transcriptions for the same language, such as between the Boko and Ajami writing of Hausa. (Same applies to any two or three transcription systems used for particular languages.) One of the benefits of ICT is, theoretically anyway, that one can have text both (all) ways. Which would mean that the user has options, people using alternative systems are not excluded, and the society does not have to debate a decision of which writing system to use, etc. Because there is generally not a 1-to-1 character correspondence in spellings in different transcriptions, I wonder if you don't end up having to consider something that operates a bit like machine translation, analyzing the context of words in cases where transcription of a word in one system could be transliterated into something misspelled or taken as more than one word in the other system. Necessarily, I think, such routines would have to be language-specific. Any feedback would be appreciated. TIA... Don Osborn Bisharat.net
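As a rough illustration of what "rules that take the context of surrounding letters into account" can mean, here is a minimal Python sketch. It is not ICU's rule syntax or API; the rule table is a hypothetical toy that covers only the letters of one example from this thread (Stroganov vs. Stroganoff), with a word-final rule placed ahead of the plain letter rules.

```python
import re

# Hypothetical toy table: only the letters needed for the example below.
SIMPLE = {"С": "S", "т": "t", "р": "r", "о": "o", "г": "g", "а": "a", "н": "n"}

# Ordered rules (pattern, replacement). Contextual rules come first so that
# they see the letter before the context-free fallback rewrites it.
RULES = [
    (re.compile(r"в\b"), "ff"),  # word-final в, as in older French-style spellings
    (re.compile(r"в"), "v"),     # в anywhere else
] + [(re.compile(re.escape(src)), dst) for src, dst in SIMPLE.items()]


def transform(text, rules=RULES):
    """Apply each rule in order over the whole string."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(transform("Строганов"))   # word-final в takes the contextual rule
print(transform("Строганова"))  # medial в takes the fallback rule
```

A real rule set would need a complete alphabet and, for the Boko/Ajami case, language-specific context; the point here is only that rule order plus a lookaround-style pattern is enough to express "this letter, in this position".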
Re: Looking for transcription or transliteration standards latin- arabic
On 2004.07.02, 21:53, Mike Ayers [EMAIL PROTECTED] wrote: On the other hand, maybe Ha Tinh is just lazy typography. From National Geographic? Medoubts. This is a deliberate removal of the diacritics unfamiliar to English readers, and is a traditional way to present foreign words. It is lazy typography, then. Deliberate, traditional and lazy. ;-) --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
Re: Looking for transcription or transliteration standards latin- arabic
Mike Ayers wrote: Trivia question: Which Vietnamese city does my atlas spell correctly, much to the chagrin of the Vietnamese? Probably Saigon. (Or is it Sai Gon?) -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Looking for transcription or transliteration standards latin- arabic
Mark Davis wrote: In that case, we'd call it a transcription, since it doesn't roundtrip from source to target back to source. It is actually quite common for style guides for non-academic publications to have a restricted list of characters and character + accent combinations, and convert all others. For example, the Economist style guide, as I recall, recommends keeping accents in French, German, Italian, and Spanish names and words, but dropping them otherwise; and converting characters like þ and ð to nearest equivalents, th. Note that the latter loses information in two ways; the obvious one is that the distinction between þ and ð is lost; the less obvious one is that the distinction between them and a *real* 't' followed by 'h' in the source is lost. So that loses the distinction in sounds between 'th' in 'cathode' and 'cathouse', as well as between 'thy' and 'thigh'. The latter problem could be solved easily by transcribing ð as dh, but English speakers seem really terrified of the sequence dh. The former problem is only a problem if t + h combinations (like cathouse) are actually used in the language. I don't know if this is true for Icelandic. It is certainly true for Old English, where þ and ð are also seen. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
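The two kinds of information loss described above can be made concrete with a small Python sketch. It assumes the two letters being folded are thorn and eth (þ, ð); the input strings are made-up letter sequences chosen to collide, not real words, and the function names are hypothetical.

```python
# Two hypothetical folding tables: one folds both letters to "th",
# the other keeps them apart as "th" and "dh".
LOSSY = {"þ": "th", "ð": "th"}
DISTINCT = {"þ": "th", "ð": "dh"}


def transcribe(word, table):
    """Replace each character found in the table; leave the rest alone."""
    return "".join(table.get(ch, ch) for ch in word)


def collisions(words, table):
    """Group source strings that end up with the same transcription."""
    seen = {}
    for word in words:
        seen.setdefault(transcribe(word, table), []).append(word)
    return {out: srcs for out, srcs in seen.items() if len(srcs) > 1}


samples = ["baþ", "bað", "bath"]  # made-up strings, not real words
print(collisions(samples, LOSSY))     # all three collapse to "bath"
print(collisions(samples, DISTINCT))  # þ still collides with a real t + h
```

Keeping ð apart as dh removes one collision, but þ and a genuine t followed by h remain indistinguishable, so neither table round-trips.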
Re: Looking for transcription or transliteration standards latin- arabic
Jony Rosenne rosennej at qsm dot co dot il wrote: And with the availability of Unicode, I think the need for transliteration is fading. It seems that these schemes can only be used by people who know the transliterated script. On the contrary, untransliterated (or untranscribed) text can only be read by people who know the original script. Transliterations and transcriptions at least give the Latin-script-only reader a fighting chance to pronounce the text. (Without them, those of us who can't read Arabic would have a real struggle reading today's news: Saddam Hussein, Al Qaeda, Osama bin Laden, etc.) The availability of Unicode means that scores of writing systems and orthographies can be represented in computers, all at once, unambiguously. It doesn't mean that humans have become capable of reading scripts they previously couldn't read. Sorry if this wasn't what you meant. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Looking for transcription or transliteration standards latin- arabic
John Cowan jcowan at reutershealth dot com wrote: Jony Rosenne scripsit: Transcription does not require roundtrip. It is intended in this case for the English speaker to be able to deliver an approximate pronunciation adapted to his native vocal capabilities. Except when it doesn't. We write Tchaikovsky, not Chykoffskee. Approximate is the operative word here. Like English spelling in general, our transcription schemes for personal names have derived from numerous sources across many years, and so are irregular. If we were starting from scratch today, we'd probably do better. (I hope we would retain the v sound in Чайковский instead of converting it to f.) -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Looking for transcription or transliteration standards latin- arabic
These are transcriptions. I was talking about transliterations, which use various uncommon letter and diacritic combinations to achieve roundtrip accuracy. Only specialists can make sense of them, and they can just as easily read the original. Jony -Original Message- From: Doug Ewell [mailto:[EMAIL PROTECTED] Sent: Saturday, July 03, 2004 7:50 PM To: Unicode Mailing List Cc: Jony Rosenne Subject: Re: Looking for transcription or transliteration standards latin- arabic Jony Rosenne rosennej at qsm dot co dot il wrote: And with the availability of Unicode, I think the need for transliteration is fading. It seems that these schemes can only be used by people who know the transliterated script. On the contrary, untransliterated (or untranscribed) text can only be read by people who know the original script. Transliterations and transcriptions at least give the Latin-script-only reader a fighting chance to pronounce the text. (Without them, those of us who can't read Arabic would have a real struggle reading today's news: Saddam Hussein, Al Qaeda, Osama bin Laden, etc.) The availability of Unicode means that scores of writing systems and orthographies can be represented in computers, all at once, unambiguously. It doesn't mean that humans have become capable of reading scripts they previously couldn't read. Sorry if this wasn't what you meant. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Looking for transcription or transliteration standards latin- arabic
On 2004.07.03, 18:02, Jony Rosenne [EMAIL PROTECTED] wrote: transliterations, which use various uncommon letter and diacritics combinations to achieve roundtrip accuracy. OK. Only specialists can make sense of them, Pray tell, why so? Is the letter â an insuperable obstacle for those who know only the letter a?... Can't the "remove diacriticals" action be performed in the reader's brain, instead of in the typesetter's office? --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
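For comparison, the typesetter's-office version of the "remove diacriticals" action is a few lines with Python's standard `unicodedata` module. This is a minimal sketch: the function name is hypothetical, and the spelling Hà Tĩnh for the Vietnamese city discussed elsewhere in this thread is an assumption. Letters that have no canonical decomposition, such as þ or ð, pass through unchanged.

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" is "Mark, nonspacing": accents, tildes, macrons, etc.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


print(strip_marks("â"))        # a
print(strip_marks("Hà Tĩnh"))  # Ha Tinh
print(strip_marks("Tōkyō"))    # Tokyo
```

The operation is one-way: nothing in "Ha Tinh" says which marks were removed, which is why it cannot count as a round-tripping transliteration.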
Re: Looking for transcription or transliteration standards latin- arabic
At 14:22 -0700 2004-07-03, Doug Ewell wrote: António Martins-Tuválkin antonio at tuvalkin dot web dot pt wrote: Only specialists can make sense of them, Pray tell, why so? Is the letter â an insuperable obstacle for those who know only the letter a?... Can't the "remove diacriticals" action be performed in the reader's brain, instead of in the typesetter's office? But if the reader merely removes the diacriticals, He means, I think, that the reader ignores them, not knowing what they mean. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Looking for transcription or transliteration standards latin-arabic
Yes, transliterations are between different scripts. However, there are often different transliterations *between the same two scripts* that vary by language. To take your example, the transliterations customarily used between the Greek script and the Latin script are different in the cases:

(a) for ancient Greek and English (e.g. ευ = eu)
(b) for modern Greek and English (e.g. ευ = ev, ef) (see http://www.eki.ee/wgrs/rom1_el.pdf)

For that matter, the transliterations customarily used between Cyrillic and Latin are different for the cases:

(a) Russian and English
(b) Russian and French
(c) Russian and German
(d) Serbian and English
...

Note: I am still speaking of transliterations (e.g. transformations that 'roundtrip'), not transcriptions (which try to match the pronunciation more precisely, and may lose information). Thus, for brevity, one may and does speak of a transliteration between Russian and English, as shorthand for a transliteration between the Cyrillic script and the Latin script following customary conventions for Russian and English.

Mark

- Original Message - From: António Martins-Tuválkin [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, July 01, 2004 17:19 Subject: Re: Looking for transcription or transliteration standards latin-arabic On 2004.07.01, 18:06, Mark Davis [EMAIL PROTECTED] wrote: different transliterations for different languages, Strictly speaking, transliterations are between two given scripts, the language being transparent -- I mean *real* transliterating from, say Greek to Latin, uses the same rules for the Iliad as for Cypriot or Greek phone books or license plates... --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
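The Greek case can be sketched in Python as one script pair with two language conventions. Everything below is a simplified assumption for illustration: the letter table covers only the two sample words, accents are ignored, the digraph is taken to be ευ, the modern rule is reduced to "ef before a voiceless consonant, ev otherwise" with no special handling at the end of a word, and the names `translit_greek`, `BASE` and `VOICELESS` are hypothetical.

```python
# Consonants before which modern Greek ευ is romanized "ef" (simplified).
VOICELESS = set("θκξπστφχψ")

# Toy letter table: only what the two sample words need.
BASE = {"ε": "e", "υ": "y", "ρ": "r", "ο": "o", "κ": "k", "α": "a",
        "λ": "l", "π": "p", "τ": "t", "σ": "s", "ς": "s"}


def translit_greek(text, variant):
    """Greek -> Latin; variant is "ancient" or "modern"."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("ευ", i):
            following = text[i + 2:i + 3]  # next letter, or "" at the end
            if variant == "ancient":
                out.append("eu")
            elif following in VOICELESS:
                out.append("ef")
            else:
                out.append("ev")
            i += 2
        else:
            out.append(BASE.get(text[i], text[i]))
            i += 1
    return "".join(out)


for word in ("ευρο", "ευκαλυπτος"):
    print(word, translit_greek(word, "ancient"), translit_greek(word, "modern"))
```

Same source script, same target script, and the output still depends on which convention is selected, which is the sense in which one speaks of a transliteration "for" a language.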
RE: Looking for transcription or transliteration standards latin- arabic
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Mark Davis Sent: Friday, July 02, 2004 8:36 AM Note: I am still speaking of transliterations (e.g. transformations that 'roundtrip'), not transcriptions (which try to match the pronunciation more precisely, and may lose information).

OK, just because I do so love monkey wrenches, please explain what I found in my atlas:

Vietnamese    English
----------    -------
Hà Tĩnh       Ha Tinh

In which we have a transcription/transliteration/taxonomy problem between Latin and Latin. Since this does not remotely roundtrip (Ha, for instance, has 18 Vietnamese equivalents), and no attempt is made to match pronunciation, how do we refer to it?

/|/|ike

Trivia question: Which Vietnamese city does my atlas spell correctly, much to the chagrin of the Vietnamese?
RE: Looking for transcription or transliteration standards latin- arabic
OK, just because I do so love monkey wrenches, please explain what I found in my atlas: Vietnamese: Hà Tĩnh; English: Ha Tinh. In which we have a transcription/transliteration/taxonomy problem between Latin and Latin. Since this does not remotely roundtrip (Ha, for instance, has 18 Vietnamese equivalents), and no attempt is made to match pronunciation, how do we refer to it? Perhaps one could think of Ha Tinh as the English word for the city, like Rome (English) for Roma (Italian), or Tokyo (English) for Tōkyō (English transliteration of Japanese), or Kahnawake (English/French) for Kahnawà:ke (Mohawk). In these and many other cases, place-names as used in foreign languages should not be considered transliterations, but linguistic borrowings, where pronunciation and spelling are often changed in the new language. On the other hand, maybe Ha Tinh is just lazy typography. Chris Harvey languagegeek.com
RE: Looking for transcription or transliteration standards latin- arabic
Transcription does not require roundtrip. It is intended in this case for the English speaker to be able to deliver an approximate pronunciation adapted to his native vocal capabilities. And with the availability of Unicode, I think the need for transliteration is fading. It seems that these schemes can only be used by people who know the transliterated script. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mike Ayers Sent: Friday, July 02, 2004 8:24 PM To: 'Mark Davis'; [EMAIL PROTECTED] Subject: RE: Looking for transcription or transliteration standards latin- arabic From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Mark Davis Sent: Friday, July 02, 2004 8:36 AM Note: I am still speaking of transliterations (e.g. transformations that 'roundtrip'), not transcriptions (which try to match the pronunciation more precisely, and may lose information). OK, just because I do so love monkey wrenches, please explain what I found in my atlas: Vietnamese: Hà Tĩnh; English: Ha Tinh. In which we have a transcription/transliteration/taxonomy problem between Latin and Latin. Since this does not remotely roundtrip (Ha, for instance, has 18 Vietnamese equivalents), and no attempt is made to match pronunciation, how do we refer to it? /|/|ike Trivia question: Which Vietnamese city does my atlas spell correctly, much to the chagrin of the Vietnamese?
Re: Looking for transcription or transliteration standards latin- arabic
In that case, we'd call it a transcription, since it doesn't roundtrip from source to target back to source. It is actually quite common for style guides for non-academic publications to have a restricted list of characters and character + accent combinations, and convert all others. For example, the Economist style guide, as I recall, recommends keeping accents in French, German, Italian, and Spanish names and words, but dropping them otherwise; and converting characters like þ and ð to nearest equivalents, "th". Note that the latter loses information in two ways; the obvious one is that the distinction between þ and ð is lost; the less obvious one is that the distinction between them and a *real* 't' followed by 'h' in the source is lost. So that loses the distinction in sounds between 'th' in 'cathode' and 'cathouse', as well as between 'thy' and 'thigh'. Mark - Original Message - From: Mike Ayers To: 'Mark Davis' ; [EMAIL PROTECTED] Sent: Friday, July 02, 2004 10:24 Subject: RE: Looking for transcription or transliteration standards latin- arabic From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Mark Davis Sent: Friday, July 02, 2004 8:36 AM Note: I am still speaking of transliterations (e.g. transformations that 'roundtrip'), not transcriptions (which try to match the pronunciation more precisely, and may lose information). OK, just because I do so love monkey wrenches, please explain what I found in my atlas: Vietnamese: Hà Tĩnh; English: Ha Tinh. In which we have a transcription/transliteration/taxonomy problem between Latin and Latin. Since this does not remotely roundtrip (Ha, for instance, has 18 Vietnamese equivalents), and no attempt is made to match pronunciation, how do we refer to it? /|/|ike Trivia question: Which Vietnamese city does my atlas spell correctly, much to the chagrin of the Vietnamese?
Re: Looking for transcription or transliteration standards latin- arabic
Jul 2, 2004 11:17 AM Chris Harvey Perhaps one could think of Ha Tinh as the English word for the city, like Rome (English) for Roma (Italian), or Tokyo (English) for Tōkyō (English transliteration of Japanese), or Kahnawake (English/French) for Kahnawà:ke (Mohawk). Or Peking for Běijīng. :-) John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/
Re: Looking for transcription or transliteration standards latin- arabic
Jony Rosenne scripsit: Transcription does not require roundtrip. It is intended in this case for the English speaker to be able to deliver an approximate pronunciation adapted to his native vocal capabilities. Except when it doesn't. We write Tchaikovsky, not Chykoffskee. -- I could dance with you till the cows John Cowan come home. On second thought, I'd http://www.ccil.org/~cowan rather dance with the cows when you http://www.reutershealth.com came home. --Rufus T. Firefly [EMAIL PROTECTED]
RE: Looking for transcription or transliteration standards latin- arabic
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of John H. Jenkins Jul 2, 2004 11:17 AM Chris Harvey Perhaps one could think of Ha Tinh as the English word for the city, like Rome (English) for Roma (Italian), or Tokyo (English) for Tōkyō (English transliteration of Japanese), or Kahnawake (English/French) for Kahnawà:ke (Mohawk). Or Peking for Běijīng. :-) Or either of those for 北京? Hmmm - can't really transcribe 北京, now can we? After all, it doesn't have a definitive pronunciation, various government mandates aside. We can only transcribe pronunciation, not spelling. And isn't that the real difference? I always thought it was. Transcribing is making sounds readable, whereas transliteration is making letters familiar, yes? I think this is a bit of a Rorschach, though - I doubt any definition or definitions would cover all the available ground well. /|/|ike
RE: Looking for transcription or transliteration standards latin- arabic
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John H. Jenkins Sent: Friday, July 02, 2004 9:48 PM To: [EMAIL PROTECTED] Subject: Re: Looking for transcription or transliteration standards latin- arabic Jul 2, 2004 11:17 AM Chris Harvey Perhaps one could think of Ha Tinh as the English word for the city, like Rome (English) for Roma (Italian), or Tokyo (English) for Tōkyō (English transliteration of Japanese), or Kahnawake (English/French) for Kahnawà:ke (Mohawk). Or Peking for Běijīng. :-) Or Constantinople for Istanbul. :-) Jony John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/
RE: Looking for transcription or transliteration standards latin- arabic
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Chris Harvey Sent: Friday, July 02, 2004 11:17 AM Perhaps one could think of Ha Tinh as the English word for the city, like Rome (English) for Roma (Italian), or Tokyo (English) for Tōkyō (English transliteration of Tōkyō is not an English transliteration of Japanese, as it uses diacritics not found in English. The correct English transliteration is in fact Tokyo, which does not round trip. Japanese), or Kahnawake (English/French) for Kahnawà:ke Errr - didn't the English/French usage predate the Mohawk alphabet? Pretty perverse case there. (Mohawk). In these and many other cases, place-names as used in foreign languages should not be considered transliterations, but linguistic borrowings, where pronunciation and spelling are often changed in the new language. In part you are correct, but this really only holds where the place name gets enough usage to develop its own name in the other language. Most famous places (Paris, New York, et al.) have language-specific names in most languages, but lesser-known ones such as Ha Tinh are unlikely to have such names. On the other hand, maybe Ha Tinh is just lazy typography. From National Geographic? Medoubts. This is a deliberate removal of the diacritics unfamiliar to English readers, and is a traditional way to present foreign words. If we're going to categorize trans-thingies, I think this deserves its own category, but since it's all relative and vague, I'm not terribly concerned. Mostly I just wondered if it did fit in anywhere. Seems it doesn't. /|/|ike
RE: Looking for transcription or transliteration standards latin- arabic
Tōkyō is not an English transliteration of Japanese, as it uses diacritics not found in English. The correct English transliteration is in fact Tokyo, which does not round trip. My mistake, I meant Latin/Roman transliteration. or Kahnawake (English/French) for Kahnawà:ke Errr - didn't the English/French usage predate the Mohawk alphabet? Pretty perverse case there. Not as such. The previous English/French spelling of the community was Caughnawaga, pronounced in the local English as [kgnwg]. As society has changed somewhat, there has been a trend for Canadian society to go back to using the original Native names (which the Native people have been using all along). So what happened was, the government looked at the way the Mohawk name was already spelled in Mohawk, Kahnawà:ke [khnwke], and modified it to suit English/French orthographical practice. My point here was that the Mohawk language uses a grave accent and long vowel marker, which are discarded in English and French. Today, the local English speakers still by and large call the town Caughnawaga, but the English speakers call the golf course (which uses the new name) [knwki]. So for people living in that part of Québec, you could say that the word Kahnawake is treated like Paris. Chris Harvey languagegeek.com
[totally OT] Mohawk, Re: Looking for transcription or transliteration standards latin- arabic
Mike Ayers wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Chris Harvey Sent: Friday, July 02, 2004 11:17 AM Perhaps one could think of Ha Tinh as the English word for the city, like Rome (English) for Roma (Italian), or Tokyo (English) for Tōkyō (English transliteration of Tōkyō is not an English transliteration of Japanese, as it uses diacritics not found in English. The correct English transliteration is in fact Tokyo, which does not round trip. Japanese), or Kahnawake (English/French) for Kahnawà:ke Errr - didn't the English/French usage predate the Mohawk alphabet? Pretty perverse case there. Yes, the Mohawk alphabet certainly postdates the French transcriptions. Just a few pieces of information about Mohawk (Agnier in its traditional French form) names around Montreal (Kanesatake on the North Shore, Kahnawake on the South Shore): 1) I heard one of the Mohawk leaders speak on the radio the other day, and he pronounced the K of Kanesatake as Kansatgu to my French ear, which seems to be validated by the old French spelling Canessedage (first attested in 1695). The name was apparently first used when the Agniers found refuge at the foot of Mont Royal on Montréal Island, then already occupied by the French for quite a time, before the Sulpicians moved them to another area outside Montreal. The French adopted Oka (an Algonquian name, if I recall properly) to designate the same place the Mohawk named Kanesatake. 2) As far as Kahnawake is concerned, the settlement again occurred while the French had settled the area (long story, but the small group of Mohawk that had converted to Catholicism and found refuge around Montreal went through several settlements before settling in Kahnawake). At the same time, the priests and French settlers that accompanied the Mohawk called the place (now Kahnawake) Saint-François-Xavier-du-Sault or simply Le Sault.
In Mohawk (agnier) the present-day Kahnawake was successively called Kahnawake ("au rapide", "by the rapids") in 1676, Kahnawakon ("dans le rapide", "in the rapids") in 1690, Kanatakwenke ("d'où on est parti", "whence we left") in 1696, and Caughnawaga in 1716, with many other spellings thereafter until 1980, when Kahnawake was chosen as the official spelling. P. A.
Re: Looking for transcription or transliteration standards latin- arabic
Jony Rosenne wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John H. Jenkins Peking for Běijīng. :-) Or Constantinople for Istanbul. :-) Two very different political realities (before and after 1453). Cities change names without going through transliterations, cf. Berlin (Ontario) becoming Kitchener in 1916. In any case, it is Istamboul and Pékin. P. A.
Hausa: Boko-Ajami? (RE: Looking for transcription or transliteration standards latin- arabic)
I've read selected messages in this thread (on Unicode list) and some messages bring to mind the thought of developing routines or standards to permit toggling back and forth between standard Latin and Arabic transcriptions for the same language, such as between the Boko and Ajami writing of Hausa. (Same applies to any two or three transcription systems used for particular languages.) One of the benefits of ICT is, theoretically anyway, that one can have text both (all) ways. Which would mean that the user has options, people using alternative systems are not excluded, and the society does not have to debate a decision of which writing system to use, etc. Because there is generally not a 1-to-1 character correspondence in spellings in different transcriptions, I wonder if you don't end up having to consider something that operates a bit like machine translation, analyzing the context of words in cases where transcription of a word in one system could be transliterated into something misspelled or taken as more than one word in the other system. Necessarily, I think, such routines would have to be language-specific. Any feedback would be appreciated. TIA... Don Osborn Bisharat.net
Re: Looking for transcription or transliteration standards latin-arabic
On 2004.06.30, 18:56, Jorg Knappen [EMAIL PROTECTED] wrote: Are there standards for transcribing or transliterating western languages written in latin to arabic? A real transliteration should work both ways, shouldn't it? (I managed to deeply shock a former KGB bureaucrat when applying for a Russian residence permit by spelling my sixth brother's name, Henrique as ...) --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
Re: Looking for transcription or transliteration standards latin-arabic
When we looked into this, the problem we found is that there are many standards. We ended up with the following in ICU (see http://oss.software.ibm.com/cgi-bin/icu/tr for a demo, http://oss.software.ibm.com/icu/userguide/Transform.html for descriptions). I believe that we followed the UNGEGN conventions, with added accents to support round-tripping. Note that while we have the ability to have different transliterations for different languages, or for variant transliterations, we have not added any as yet. (The '' means 'transforms into' below). '.'; ','; ','; ';'; '?'; '%'; 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; a; u; i; th; dh; sh; s; d; t; z; gh; t; zh; ng; v; y; ; a; b; t; j; h; kh; d; r; z; s; ; ; f; q; k; l; m; n; h; w; y; y; a; u; i; a; u; i; ; ; ; ; ; p; ch; v; g; rk - Original Message - From: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, July 01, 2004 08:34 Subject: Re: Looking for transcription or transliteration standards latin-arabic On 2004.06.30, 18:56, Jorg Knappen [EMAIL PROTECTED] wrote: Are there standards for transscribing or transliterating western languages written in latin to arabic? A real transliteration should work both ways, shouldn't it? (I managed to deeply shock a former KGB-bueraucrat when applying for a Russian residence permit by spelling my sixth brother's name, Henrique as ...) --. Antnio MARTINS-Tuvalkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Nao me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ so me invejo de quem bebe| http://pagina.de/bandeiras/ a agua em todas as fontes|
Re: Looking for transcription or transliteration standards latin-arabic
On 2004.07.01, 18:06, Mark Davis [EMAIL PROTECTED] wrote: different transliterations for different languages, Strictly speaking, transliterations are between two given scripts, the language being transparent -- I mean *real* transliterating from, say Greek to Latin, uses the same rules for the Iliad as for Cypriot or Greek phone books or license plates... --. António MARTINS-Tuválkin | ()| [EMAIL PROTECTED]|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
Looking for transcription or transliteration standards latin-arabic
Are there standards for transcribing or transliterating Western languages written in Latin script into Arabic script? I am specifically interested in German-Arabic, but English-Arabic and French-Arabic are of interest, too. --Jorg Knappen