Re: Unicode transliterations (and other operations)
Mike Ayers wrote with the solution to the mathematical puzzle. Kudos, Mike! Substituting digits rather than letters, shoulda known. Is there a prize? Best regards, James Kass.
FW: Re: Unicode transliterations (and other operations)
Have you a better idea? That is not low. Low is when I scare myself. You do not want to see what I think. Low is why I ought to be kept away from real, living women because of what I might do after 700 or 800 millilitres of sake. Low would be bad. And there is lower. Let us not go there. I wish nobody had brought this up. You know not low. I can see myself doing it even without the sake. Yes, I ought to go. らんま ★じゅういっちゃん★ 　×あかね http://www.trigeminal.com/
Re: Unicode transliterations (and other operations)
-- http://www.lonelyplanet.com/destinations/south_east_asia/myanmar/ Burma became Myanmar in 1989 after the State Law and Order Restoration Council decided that the old name implied the dominance of Burmese culture; the Burmese are just one of the many ethnic groups in the country... An interesting site with writings from various people favoring either Burma or Myanmar suggests that Burma and Myanmar are separate words with different etymologies. http://ffmemorial.hypermart.net/burma_or_myanmar.html - There is apparently some controversy about this, which is beside the point. Perhaps Cambodia would make a better example? Ever read a technical paper in a field not your own, but a field which may be of interest or related to your field? Maybe you'd have to read it a second or third time (or more) before eventually beginning to understand the message. Does trade jargon (the technical language in a particular field) exist to clarify a trade, or is its purpose more to exclude anyone not part of the inner circle? Technical writing by techies for techies is a bit of a peeve for me (in case this isn't already evident). If we need to make distinctions and it is possible to make these distinctions using plain language, don't we reach more people with such plain language? I've often wondered about this with regards to subjects like programming languages. Is this practice (trade jargon) unique to English? In other words, does a Hindi speaker wishing to learn, for example, the C programming language have any advantage over the English speaker because the C programming instructions in Hindi are in 'plain-Hindi' rather than 'tech-speak'? Quoting from McCormick on Evidence Third Edition (1984): In cases where privity in the strict sense does not exist between a person suing for injuries and his administrator suing for death caused thereby, identity of interest is advanced as a basis for admitting in the later case testimony given in the former. 
(from footnote 12 on page 765, a random selection) And we all thought privity and identity of interest were synonyms. smiley Well, unless someone comes out with Rules of Evidence for Dummies, I suppose it would be necessary to hire a lawyer. We need precision, sure, but clarity is important too. Let's try the phrase from the purpose.html page quoted earlier again: It is indispensable in that it permits the univocal transmission of a written message between two countries using different writing systems or exchanging a message the writing of which is different from their own. Or, to paraphrase: 'It's needed because it makes straightforward message exchange possible between groups which use different writing systems.' By the way, univocal is a word, after all. It's in a bigger, hardback Webster's and means having one voice, just as its roots suggest. I'd mistakenly assumed that the author was going for unequivocal and had made a typo. Shucks. John Cowan wrote: ... In transliteration, we are mapping one script to another in a language-independent way. In transcription, we are mapping the writing conventions of one language to those of another. This is clear enough and precise. It's also concise in that it condenses much of the verbose page purpose.html down to two sentences. The reason it makes me uncomfortable is that these definitions don't match the standard meanings of the words as contained in dictionaries. I'm afraid to suggest alternatives like machine transliteration and phonetic transcription, though, because they are a bit cumbersome and would possibly only add to the confusion. Peter Constable wrote: True, though of course they do have the authority to say, In the context of our standards we use term x to mean X. (and, in a different letter) ...it is my impression that many people use the term transliteration in a broader sense than the strict definition defined by TC 46. 
That appears to be the case for the help file associated with the ICU demo, which defines transliteration as, the general process of converting characters from one particular script to another one. So, if words are to be re-defined, let's assure that they are explicitly re-defined and that these re-definitions are accessible. Meanwhile, when someone uses the terms in the 'broader sense' (id est: dictionary definition), please let's not chide them for it. Best regards, James Kass. - Original Message - From: John Cowan [EMAIL PROTECTED] To: James Kass [EMAIL PROTECTED] Cc: Unicode List [EMAIL PROTECTED]; Lukas Pietsch [EMAIL PROTECTED]; J M Sykes [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Wednesday, July 04, 2001 9:23 PM Subject: Re: Unicode transliterations (and other operations) James Kass scripsit: Does the vocabulary make things clearer or cause confusion? If we need to distinguish between reversible
Re: Unicode transliterations (and other operations)
Just FYI: For a history of the practices and terminology debates around transliteration, transcription, etc., see: Wellisch, Hans H., 1920-, The conversion of scripts, its nature, history, and utilization / Hans H. Wellisch. -- New York : Wiley, c1978, xviii, 509 p. : ill. ; 24 cm. The same author has a much shorter bibliography, I think superseded by this book. Martin Heijdra - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, July 04, 2001 4:37 AM Subject: Re: Unicode transliterations (and other operations) On 07/02/2001 02:56:16 PM Mark Davis wrote: For those interested in Transliteration (and other Unicode transformations), there is a new ICU web demo program on http://oss.software.ibm.com/developerworks/opensource/icu/translitdemo... This opens an area of some interest to me and some of my colleagues. There have been some messages in this thread discussing whether something is transliteration or transcription. On that point I have two comments: first, ISO TC 46 has created definitions for these two terms that apply to ISO standards under their purview; these definitions can be found at http://www.elot.gr/tc46sc2/purpose.html. Secondly, it is my impression that many people use the term transliteration in a broader sense than the strict definition defined by TC 46. That appears to be the case for the help file associated with the ICU demo, which defines transliteration as, the general process of converting characters from one particular script to another one. Moreover, there is a need for a term to describe a particular situation that is very common around the world, and so far as I know the term transliteration is the only term that comes close to describing that phenomenon. It is this phenomenon which is the focus of interest for me and my SIL colleagues: a single language that is written by different portions of the language community in different writing systems, particularly different writing systems based on different scripts. 
For example, Kashmiri (India / Pakistan) is written in Devanagari and in Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script and in Roman with Vietnamese-style diacritics. This phenomenon is of particular interest and concern for applied linguists involved in literacy and literature development: for literacy, they might need to assist people in learning how to make the transition between one writing system and another, and they certainly need to develop different sets of literacy materials for each writing system (probably with significant duplication in content). For those working on literature development, there is a repeated need to publish documents in multiple writing systems. For large publications that are developed over long periods of time, such as dictionaries or translations of long works such as the Bible, issues of versioning and data management become particularly focal: the opus is going to be edited and revised literally hundreds of times: if one has to maintain three copies (corresponding to three writing systems) of a document through dozens of changes each working day over (say) an eight-year period, that is a lot of additional work. Clearly in situations such as this, there would be a significant benefit to be gained if it were possible for a person to create a document in one writing system and have the parallel documents in the other writing systems generated by some automated processes. There are, in principle, three potential ways to deal with publishing in multiple writing systems: 1. Separate documents are created manually, one for each writing system. 2. A document is created manually in one writing system, and different parallel documents are generated through an automated process for the other writing systems. 3. 
A single document is created that can be displayed in terms of alternate writing systems using font mechanisms, possibly relying on transduction done within smart fonts. (Note that I say these are *potential* possibilities; there are additional factors such as whether a spelling in one writing system contains adequate information to determine a unique spelling in a different writing system - can one be generated deterministically from the other.) There are plenty of cases in which the first method has been used. We have done some implementations of both the second and the third varieties. For example, last year we developed a system of the second variety that simultaneously supports both Ethiopic and Roman writing systems using a custom encoding and Worldscript and GX (yes, GX, not AAT), and that is being used by a linguist for work on the Koorete language in Ethiopia. Our SIL Hebrew font package includes the third variety as a capability: the Ezra Standard Encoding permits changing between Hebrew script and Roman-based
Re: Unicode transliterations (and other operations)
James Kass scripsit: An interesting site with writings from various people favoring either Burma or Myanmar suggests that Burma and Myanmar are separate words with different etymologies. I don't think so. But the question has become politicized, because the change (in Latin transliteration only, note) was made by a government which many believe to be illegitimate. I agree that the example was a bad one for that reason. Does trade jargon (the technical language in a particular field) exist to clarify a trade, or is its purpose more to exclude anyone not part of the inner circle? Some of each, to be sure. I've often wondered about this with regards to subjects like programming languages. Is this practice (trade jargon) unique to English? In other words, does a Hindi speaker wishing to learn, for example, the C programming language have any advantage over the English speaker because the C programming instructions in Hindi are in 'plain-Hindi' rather than 'tech-speak'? On the contrary, it is often worse in other languages, because most of the technical jargon is typically adopted straight from English. ... In transliteration, we are mapping one script to another in a language-independent way. In transcription, we are mapping the writing conventions of one language to those of another. This is clear enough and precise. It's also concise in that it condenses much of the verbose page purpose.html down to two sentences. Thank you. Note that I used the jargon verb map, which is old enough in this sense that it does appear in dictionaries, but is still probably unfamiliar to many. The reason it makes me uncomfortable is that these definitions don't match the standard meanings of the words as contained in dictionaries. So much the worse for dictionaries, then. :-) I'm afraid to suggest alternatives like machine transliteration and phonetic transcription, though, because they are a bit cumbersome and would possibly only add to the confusion. Right. 
And note that until a decade or two ago, all transliteration *and* transcription was very much by hand: no machines involved. Meanwhile, when someone uses the terms in the 'broader sense' (id est: dictionary definition), please let's not chide them for it. Well, fine. But when someone is talking about physics, and uses energy, power, and force interchangeably, do we accept this as a broader sense of the terms, or do we explain to them that in this field, the terms are definitely *not* interchangeable? -- John Cowan [EMAIL PROTECTED] One art/there is/no less/no more/All things/to do/with sparks/galore --Douglas Hofstadter
Re: Unicode transliterations (and other operations)
John Cowan wrote: I don't think so. But the question has become politicized, because the change (in Latin transliteration only, note) was made by a government which many believe to be illegitimate. ... in every sense of the word, apparently. I agree that the example was a bad one for that reason. Yet coming across that web page while probing the issue was quite an eye-opener for me, and I am grateful. ...advantage over the English speaker because the C programming instructions in Hindi are in 'plain-Hindi' rather than 'tech-speak'? On the contrary, it is often worse in other languages, because most of the technical jargon is typically adopted straight from English. Then member variable would be transcribed to Devanagari? If so, how unfortunate. Note that I used the jargon verb map, which is old enough in this sense that it does appear in dictionaries, but is still probably unfamiliar to many. Using map in this fashion shouldn't be too much of a problem, though, it's generic enough that the meaning can be derived from context. The reason it makes me uncomfortable is that these definitions don't match the standard meanings of the words as contained in dictionaries. So much the worse for dictionaries, then. :-) And for standards? (-: Right. And note that until a decade or two ago, all transliteration *and* transcription was very much by hand: no machines involved. Yes, and the dictionary definitions seem to derive from the manuscript era. Perhaps a newer dictionary... Well, fine. But when someone is talking about physics, and uses energy, power, and force interchangeably, do we accept this as a broader sense of the terms, or do we explain to them that in this field, the terms are definitely *not* interchangeable? Physics isn't my forte, but even in the vernacular the terms aren't necessarily interchangeable: Energy shortage, power to the people, and may the Force be with you. Best regards, James Kass.
RE: Unicode transliterations (and other operations)
From: James Kass [mailto:[EMAIL PROTECTED]] てんどうりゅうじ wrote: Still haven't got the multiplication riddle solved, Mr. Kass? Sorry, I didn't know it was required. Almost asked 'which riddle?', but now notice the × in the signature portion as follows...

　　らんま
　×あかね
ーーーーー
　あまんけ
ねけあず
らんま
ーーーーー
いいなずけ

The key:

0 - ん    5 - な
1 - あ    6 - け
2 - ま    7 - い
3 - ね    8 - ず
4 - ら    9 - か

So we get:

  402
  193
-----
 1206
3618
402
-----
77586

...which you can verify on your calculator. Colloquial Japanese by Noboru Inamoto doesn't include any of these words in the vocabulary list. Easy Japanese by Samuel E. Martin doesn't list them in PART IV 3000 Useful Japanese Words, either. (But, the Japanese word for riddle is nazo.) Surely there are better references around here somewhere, but your CD collection is probably better organized than my books at present. Hee hee - unless you're packing a guide to anime, you'll never find 'em anyway. らんま is Ranma, as in Ranma Saotome, and あかね is Akane, as in Akane Tendo, the two main stars of Rumiko Takahashi's bizarre (if monothematic) sex comedy Ranma 1/2. /|/|ike
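For anyone without a calculator handy, the solution above can be checked mechanically. This is only a sketch: the kana-to-digit key is taken from the message, while the names KEY and decode are made up here for illustration.

```python
# Kana-to-digit key as given in the message above.
KEY = {
    "ん": 0, "あ": 1, "ま": 2, "ね": 3, "ら": 4,
    "な": 5, "け": 6, "い": 7, "ず": 8, "か": 9,
}


def decode(word: str) -> int:
    """Read a kana string as a base-10 number using KEY (hypothetical helper)."""
    value = 0
    for kana in word:
        value = value * 10 + KEY[kana]
    return value


multiplicand = decode("らんま")
multiplier = decode("あかね")
partials = [decode(w) for w in ("あまんけ", "ねけあず", "らんま")]
product = decode("いいなずけ")

# Each partial product is the multiplicand times one digit of the
# multiplier, starting from the units digit, as in long multiplication.
digits = [int(d) for d in reversed(str(multiplier))]
assert partials == [multiplicand * d for d in digits]

# Shifting each partial product one more place to the left and adding
# them up must give the final line of the riddle.
assert sum(p * 10**i for i, p in enumerate(partials)) == product
assert multiplicand * multiplier == product

print(multiplicand, "x", multiplier, "=", product)  # 402 x 193 = 77586
```

The three partial-product rows are checked separately so that a wrong key would fail on the row it breaks, not just on the final product.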
Re: Unicode transliterations (and other operations)
Hee hee - unless you're packing a guide to anime, you'll never find 'em anyway. らんま is Ranma, as in Ranma Saotome, and あかね is Akane, as in Akane Tendo, the two main stars of Rumiko Takahashi's bizarre (if monothematic) sex comedy Ranma 1/2. Seeing this wonderful use of Unicode text in e-mail brings a quote to mind: Marvelous technology is at our disposal; but instead of reaching up to new heights, we're going to see how far down we can go, how deep into the muck we can immerse ourselves. -- Barry Champlain (Eric Bogosian), Oliver Stone's Talk Radio <g> MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: Unicode transliterations (and other operations)
Doug Ewell wrote: Maybe not. This is the part I got wrong several weeks ago when we had this discussion, and I hope my understanding is better now. Transliteration is about building a reversible mapping between the original (in this case, Japanese) sounds and a set of (in this case, Latin) characters, with the focus on reversibility rather than legibility. You might even use numbers or other symbols to ensure that the transliterated version can be mapped unambiguously back to Japanese. The reader might have to go through a learning curve to equate your symbols with the desired sounds. Transcription is about optimizing the Latin-script version for, say, a Polish-language reader. A transcription has not only a target script but also a target language, and it might be different for each of Polish, German, French, English, etc. The goal is enabling the Polish reader to pronounce the Japanese text with a minimal learning curve. snip Unfortunately, the terms transcription and transliteration are commonly mixed up by non-experts, causing much confusion. Please, somebody let me know if this is still not right. Transliteration just means to write something using the characters of another alphabet. Legibility is the focus, so numbers or symbols shouldn't enter the picture. A transcription is simply a copy (usually in the same language/script as the source, otherwise it wouldn't be a copy). An exception would be a typed transcript of something originally written in shorthand. This according to Webster's New World Dictionary (of English), a recognized authority (on English). Best regards, James Kass.
Re: Unicode transliterations (and other operations)
Maybe we are just being weird here. We ought to try to avoid twisting language, even if we do pretty much operate within our own little techie world here. Still haven't got the multiplication riddle solved, Mr. Kass? らんま ★じゅういっちゃん★ 　×あかね
Re: Unicode transliterations (and other operations)
On 07/03/2001 09:47:17 PM Doug Ewell wrote: Unfortunately, the terms transcription and transliteration are commonly mixed up by non-experts, causing much confusion. Please, somebody let me know if this is still not right. See my comments on this and the URL for ISO definitions in my other message. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
RE: Unicode transliterations (and other operations)
Peter Constable wrote: It is this phenomenon which is the focus of interest for me and my SIL colleagues: a single language that is written by different portions of the language community in different writing systems, particularly different writing systems based on different scripts. I would include braille in this scenario. Braille transcription is another facet of the issue, but it involves practically every written language. Some languages have an approximately 1-to-1 relationship between the graphemes in their visual scripts and braille patterns. Some other languages have a more complex and indirect relationship between the two. E.g., English and other languages normally require grade 2 braille, which is a quasi-logographic system of abbreviations for words or parts of words. Conversely, Chinese brailles are strictly phonetic, having signs for initial consonants and final rhymes. Japanese braille uses a single kana syllabary, etc. Any project in the area of automatic transcription should start by analyzing what is already done in the braille world, in order not to reinvent the wheel. Conversely, any progress in automatic transcription across scripts could potentially be reused in braille technology. It may turn out that the two kinds of transcription have a lot in common, such as conversions depending on context, or the need for dictionary lookup in some cases. 3. A single document is created that can be displayed in terms of alternate writing systems using font mechanisms, possibly relying on transduction done within smart fonts. Peter, the fact that SIL has a nice smart font technology available does not mean that this technology should be used also for brewing beer! IMHO, font technology should be used only for displaying text, which is where it applies. Other tasks, unrelated to this problem, should be handled with different tools. 
Of course, some basic algorithms could be in common, such as moving letters around, splitting or joining ligatures, etc. But the similarity ends here, methinks. _ Marco
Re: Unicode transliterations (and other operations)
てんどうりゅうじ wrote: We ought to try to avoid twisting language, even if we do pretty much operate within our own little techie world here. Indeed! Or, at least if we need a correct definition of an English word, we should consult an English dictionary. The web page cited by Mr. Constable is simply misleading, unless it were to be amended to clearly state for the purposes of this and related documents... these words mean c. Languages change over time and so do the definitions of words or phrases within a language. Blind pig meant something other than a sightless farm critter in the 1920s and '30s, for example, and my guess is that a larger percentage of subscribers to this list would recognize that term than the average ranihan on the streets. (Hope ranihan is spelled correctly, for some reason it isn't in the paperback Webster's here.) No international body has any authority to alter the meaning of existing words in my language or any of our languages. Still haven't got the multiplication riddle solved, Mr. Kass? Sorry, I didn't know it was required. Almost asked 'which riddle?', but now notice the × in the signature portion as follows...

　　らんま
　×あかね
ーーーーー
　あまんけ
ねけあず
らんま
ーーーーー
いいなずけ

So, here goes with a transliteration... ranma × akane - amanke nekeazu ranma iinazuke Japanese class was a long time ago... Colloquial Japanese by Noboru Inamoto doesn't include any of these words in the vocabulary list. Easy Japanese by Samuel E. Martin doesn't list them in PART IV 3000 Useful Japanese Words, either. (But, the Japanese word for riddle is nazo.) Surely there are better references around here somewhere, but your CD collection is probably better organized than my books at present. If the riddle is a Japanese cryptogram, there is little hope for me. Has anyone solved the riddle, てんどうりゅうじ-san ? (Besides Sarasvati, who probably figured it out at once.) Perhaps you will take some sake, become magnanimous, and enlighten us? Back on topic, with regards to the terminology... 
The page in question ( http://www.elot.gr/tc46sc2/purpose.html ) uses the word transcription where the word transliteration should be, and what they call transliteration could easily be referred to as reversible transliteration in plain English, without 'breaking existing applications' like my dictionary. English is too complicated already, let's not make it more complex. Back off topic... PTKA IZGT F SFNNGYGB ZRMSFTB WM NFEGT FM MGYWPRMKA FM F SFNNGYGB IWOG IWKK QGT FT IPQGT ZFXG GHRFK YWJZNM. Only when a battered husband is taken as seriously as a battered wife will men an women have equal rights. The typo in the third line threw me off for a moment... Best regards, James Kass.
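The off-topic cryptogram above is a simple substitution cipher, and the solution given can be checked by lining the two texts up letter by letter. The sketch below does that in Python; the function names derive_key and apply_key are invented for this note, the plain text is upper-cased with the final full stop dropped, and the "an" typo is kept exactly as it stands in the message because the cipher text encodes it that way.

```python
CIPHER = ("PTKA IZGT F SFNNGYGB ZRMSFTB WM NFEGT FM MGYWPRMKA FM F "
          "SFNNGYGB IWOG IWKK QGT FT IPQGT ZFXG GHRFK YWJZNM")
PLAIN = ("ONLY WHEN A BATTERED HUSBAND IS TAKEN AS SERIOUSLY AS A "
         "BATTERED WIFE WILL MEN AN WOMEN HAVE EQUAL RIGHTS")


def derive_key(cipher: str, plain: str) -> dict[str, str]:
    """Build the cipher-to-plain letter table, failing on any inconsistency."""
    if len(cipher) != len(plain):
        raise ValueError("texts differ in length")
    key: dict[str, str] = {}
    for c, p in zip(cipher, plain):
        if c == " " or p == " ":
            if c != p:
                raise ValueError("word boundaries differ")
            continue
        # setdefault records the first pairing and returns it afterwards,
        # so a later conflicting pairing is detected here.
        if key.setdefault(c, p) != p:
            raise ValueError(f"{c} maps to both {key[c]} and {p}")
    if len(set(key.values())) != len(key):
        raise ValueError("two cipher letters map to the same plain letter")
    return key


def apply_key(text: str, key: dict[str, str]) -> str:
    """Substitute every letter found in the key; leave spaces untouched."""
    return "".join(key.get(ch, ch) for ch in text)


key = derive_key(CIPHER, PLAIN)
assert apply_key(CIPHER, key) == PLAIN
print(len(key), "cipher letters used")
```

Because FT decodes consistently as AN, the check also confirms that the typo sits in the cipher text itself and is not a slip in the decoding.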
Re: Unicode transliterations (and other operations)
James Kass wrote: Indeed! Or, at least if we need a correct definition of an English word, we should consult an English dictionary. The web page cited by Mr. Constable is simply misleading, unless it were to be amended to clearly state for the purposes of this and related documents... these words mean c. well, the English dictionaries give usages of words in everyday language, and that's fine. But in their usage as technical terms, the distinction between transcription and transliteration (roughly along the lines of the http://www.elot.gr/tc46sc2/purpose.html page) seems to me to be a fairly well-established one, in the field of linguistics at least. No international body has any authority to alter the meaning of existing words in my language or any of our languages. Sure, but we're dealing with a scholarly discipline's technical vocabulary here, and it's not such a bad idea in this case if computer people dealing with language adopt the usage of linguists, is it? what they call transliteration could easily be referred to as reversible transliteration in plain English, without 'breaking existing applications' like my dictionary. You must understand: this isn't about breaking existing applications, it's about a higher-level protocol! ;-) Lukas Pietsch
Re: Unicode transliterations (and other operations)
- Original Message - From: James Kass [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wednesday, July 04, 2001 8:10 AM Subject: Re: Unicode transliterations (and other operations) Doug Ewell wrote: Maybe not. This is the part I got wrong several weeks ago when we had this discussion, and I hope my understanding is better now. Transliteration is about building a reversible mapping between the original (in this case, Japanese) sounds and a set of (in this case, Latin) characters, with the focus on reversibility rather than legibility. You might even use numbers or other symbols to ensure that the transliterated version can be mapped unambiguously back to Japanese. The reader might have to go through a learning curve to equate your symbols with the desired sounds. Transcription is about optimizing the Latin-script version for, say, a Polish-language reader. A transcription has not only a target script but also a target language, and it might be different for each of Polish, German, French, English, etc. The goal is enabling the Polish reader to pronounce the Japanese text with a minimal learning curve. snip Unfortunately, the terms transcription and transliteration are commonly mixed up by non-experts, causing much confusion. Please, somebody let me know if this is still not right. Transliteration just means to write something using the characters of another alphabet. Legibility is the focus, so numbers or symbols shouldn't enter the picture. From the New Shorter OED: Transliterate: Replace (letters or characters of one language) by those of another used to represent the same sounds; write (a word etc.) in the closest corresponding characters of another alphabet or language. A transcription is simply a copy (usually in the same language/script as the source, otherwise it wouldn't be a copy). An exception would be a typed transcript of something originally written in shorthand. 
From the New Shorter OED: Transcribe (among other meanings) v.t. Transliterate; write out (shorthand, notes, etc.) in ordinary characters or continuous prose. Formerly also, translate. I'm relieved to find that OED and Webster agree, though note that the OED recognises that transcribe is sometimes used as a synonym of transliterate. This is not to say that I don't recognise the useful distinction between a reversible transformation and a non-reversible one. Experts redefine words at the risk of confusing non-experts; when they do, they should not be surprised at the ensuing confusion -- they brought it on themselves. Regards, (non-expert) Mike. Impenetrability! That's what I say!
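The reversible versus non-reversible distinction granted above can be made concrete with a toy round-trip test. Everything in this sketch is an assumption made for illustration: the two four-letter Greek-to-Latin tables (STRICT and LOOSE) are invented and come from no standard, every target is a single character, and the input contains only letters found in the table.

```python
def convert(text: str, table: dict[str, str]) -> str:
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def invert(table: dict[str, str]) -> dict[str, str]:
    """Swap sources and targets. If two sources share a target, one is lost."""
    return {target: source for source, target in table.items()}


def is_reversible(table: dict[str, str]) -> bool:
    """True when no two source characters share a target character."""
    return len(set(table.values())) == len(table)


# Hypothetical tables. STRICT gives eta its own symbol, at some cost in
# legibility; LOOSE writes eta and iota alike, closer to how they sound.
STRICT = {"ν": "n", "ι": "i", "κ": "k", "η": "h"}
LOOSE = {"ν": "n", "ι": "i", "κ": "k", "η": "i"}

word = "νικη"
for name, table in (("STRICT", STRICT), ("LOOSE", LOOSE)):
    latin = convert(word, table)
    back = convert(latin, invert(table))
    print(name, latin, back, is_reversible(table), back == word)
```

The strict table survives the round trip and the loose one does not, which is the whole of the technical distinction, whatever names the two operations end up being given.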
Re: Unicode transliterations (and other operations)
[EMAIL PROTECTED] writes: There have been some messages in this thread discussing whether something is transliteration or transcription. On that point I have two comments: first, ISO TC 46 has created definitions for these two terms that apply to ISO standards under their purview; these definitions can be found at http://www.elot.gr/tc46sc2/purpose.html. Secondly, it is my impression that many people use the term transliteration in a broader sense than the strict definition defined by TC 46. That appears to be the case for the help file associated with the ICU demo, which defines transliteration as, the general process of converting characters from one particular script to another one. Moreover, there is a need for a term to described a This is because ICU implementation of transliteration actually allows for even more general thing - converting characters according to a given set of rules. It can be used both for transliteration and transcription as defined in TC 46. For example, Kashmiri (India / Pakistan) is written in Devanagari and in Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script and in Roman with Vietnamese-style diacritics. Let me add Serbian to this list - it is written both in Latin and Cyrillic scripts with mapping that is almost one to one. In case of Serbian, There are, in principle, three potential ways to deal with publishing in multiple writing systems: 1. Separate documents are created manually, one for each writing system. This method is not feasible at all in case of Serbian. . 2. A document is created manually in one writing system, and different parallel documents are generated through an automated process for the other writing systems. This is the most common practice used, although with some interesting consequences, see below. 3. 
A single document is created that can be displayed in terms of alternate writing systems using font mechanisms, possibly relying on transduction done within smart fonts. This one is also used. Here is the case of Serbian. It uses 30 cyrillic letters or 30 latin letters. However, some of the letters in the latin alphabet are represented as two letters - here are the pairs: \u0409/\u0459 == Lj/lj \u040A/\u045A == Nj/nj \u040F/\u045F == D\u017E/d\u017E \u0402/\u0452 occasionally represented in latin as Dj/dj, but usually represented by \u0110/\u0111 Transliteration from cyrillic to latin is very easy. The only problem is transliteration of upper case letters above, which can be transliterated either to upper/lower case combination or to two upper case letters, depending on the case of following letters. A little bit more complicated is transliteration of Serbian from latin to cyrillic, even when Unicode encoded, for two reasons: 1) if foreign names are not transcribed or tagged, they will be simply transliterated to cyrillic form, which is always a source of good laugh for Serbian readers, 2) this one happens extremely rarely - some words that use two-letter latin letters should be transliterated to two cyrillic letters, instead of just one. This is the case with some adopted foreign words. However, it is not of interest in everyday practice. Interesting and wrong practice used by a lot of magazines that print in cyrillic and also have a latin Internet publication is using a latin based encoding for cyrillic version, where q, w, x and y are used for cyrillic letters that use two letters in latin representation, for example, W and w represent \u040A and \u045A. However, foreign names are not transcribed, but written in original form in latin script. So, after moving from cyrillic to latin, Washington becomes Njashington. Of course, if Unicode was used for storing the text, transliteration from cyrillic to latin would be correct and almost trivial. 
My experience in transliteration says that 'pure' Unicode text is not enough for comfortable transliteration, especially for texts that tend to mix Latin and Cyrillic, as is the case with most technical texts. Some additional tagging is required to make it fully automatic. Otherwise, additional proofreading is required. I had reasonable success in writing MS Word macros that did transliteration; what helped was formatting foreign words differently, using italic or bold. Hope this makes sense, V. -- Vladimir Weinstein, IBM GCoC-Unicode/ICU Cupertino, CA, [EMAIL PROTECTED]
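The case rule described above for the two-letter combinations can be sketched in a few lines of Python. This is only an illustration, not the MS Word macros mentioned in the message: the function name cyr_to_lat is made up, the full 30-letter table is filled in here (the message lists only the digraph pairs and \u0402/\u0452), and the tie-break for a capital digraph at the end of a word (look at the preceding letter) is an assumption.

```python
# Serbian Cyrillic -> Latin, lowercase table (30 letters).
# Only the digraph pairs and dj/\u0111 come from the message; the rest is
# the standard one-to-one correspondence, filled in for completeness.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "\u0111",
    "е": "e", "ж": "\u017e", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "\u0107", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "\u010d", "џ": "d\u017e", "ш": "\u0161",
}


def cyr_to_lat(text):
    """Transliterate Serbian Cyrillic to Latin, choosing Lj vs. LJ by context."""
    out = []
    for i, ch in enumerate(text):
        low = ch.lower()
        lat = CYR_TO_LAT.get(low)
        if lat is None:          # not a Serbian Cyrillic letter: copy unchanged
            out.append(ch)
        elif ch == low:          # lowercase source letter
            out.append(lat)
        elif len(lat) == 1:      # uppercase, single Latin letter
            out.append(lat.upper())
        else:                    # uppercase digraph: inspect the neighbours
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            # Assumption: at the end of a word, follow the previous letter.
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(lat.upper() if all_caps else lat.capitalize())
    return "".join(out)


print(cyr_to_lat("Љубав"))   # capital followed by lowercase
print(cyr_to_lat("ЉУБАВ"))   # all capitals
```

The reverse direction is where tagging matters: a Latin-to-Cyrillic pass would have to skip any run marked as foreign (italic or bold in the macros described above), and plain text carries no such mark.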
Re: Unicode transliterations (and other operations)
Lukas Pietsch wrote: well, the English dictionaries give usages of words in everyday language, and that's fine. But in their usage as technical terms, the distinction between transcription and transliteration (roughly along the lines of the http://www.elot.gr/tc46sc2/purpose.html page) seems to me to be a fairly well-established one, in the field of linguistics at least. Yes, this would seem to be fairly widespread in the field. Sure, but we're dealing with a scholarly discipline's technical vocabulary here, and it's not such a bad idea in this case if computer people dealing with language adopt the usage of linguists, is it? Does the vocabulary make things clearer or cause confusion? If we need to distinguish between reversible script conversion and irreversible script conversion, could we not simply say reversible script conversion and so forth? We speak of code page conversions, but we haven't re-defined existing words to differentiate between the kind that's reversible and the kind that isn't (as far as I know). what they call transliteration could easily be referred to as reversible transliteration in plain English, without 'breaking existing applications' like my dictionary. You must understand: this isn't about breaking existing applications, it's about a higher-level protocol! ;-) It's about clarity and precision, too. When someone obviously intelligent like Doug Ewell admits to still being unclear weeks after being educated by hair-splitting techies, isn't there a problem? With regards to the 'purpose.html' page linked above, how seriously should we take a page which includes phraseology like: It is indispensable in that it permits the univocal transmission of a written message between two countries using different writing systems or exchanging a message the writing of which is different from their own. ...? The page was last updated in 1996, yet the first line of the page has the typo were for where. 
The sentence quoted above is needlessly redundant and there is no such word as univocal (as far as I know). My apologies to the authors of that page for mentioning this in a public forum. I make typos, too. J. M. Sykes wrote: I'm relieved to find that OED and Webster agree, though note that the OED recognises that transcribe is sometimes used as a synonym of transliterate. Perhaps it is sometimes mis-used as a synonym, I'm tempted to say, but must bow to the higher authority of the Oxford English Dictionary. Experts redefine words at the risk of confusing non-experts; when they do, they should not be surprised at the ensuing confusion -- they brought it on themselves. This is an excellent point, thank you for making it. Best regards, James Kass.
Re: Unicode transliterations (and other operations)
James Kass scripsit: Does the vocabulary make things clearer or cause confusion? If we need to distinguish between reversible script conversion and irreversible script conversion, could we not simply say reversible script conversion and so forth? No, that does not capture the distinction. In transliteration, we are mapping one script to another in a language-independent way. In transcription, we are mapping the writing conventions of one language to those of another. Handy example: the name of the country written Myanmar (in transliteration) is pronounced ['b@m@]. This was transcribed into (British) English as Burma. Of course, to represent the pronunciation I am using an ASCII transliteration of IPA! -- John Cowan [EMAIL PROTECTED] One art/there is/no less/no more/All things/to do/with sparks/galore --Douglas Hofstadter
RE: Unicode transliterations (and other operations)
Looks interesting. How are you approaching the complication that transliteration is between pairs of languages? E.g. Russian to English, Russian to French, Russian to German, and Russian to Finnish are all slightly different (as far as I know), because the goal of transliteration is to create something that is pronounceable in the target language but still close enough to the pronunciation of the origin language.
Re: Unicode transliterations (and other operations)
Looks interesting. How are you approaching the complication that transliteration is between pairs of languages? I know what you mean: Gorbachev is Gorbatschow in German. I think that the rules that we have in ICU are probably English-centric where it makes a difference. Note that some of the transliterator functions like uppercasing and any-name are just wrappers around Unicode functions, and so not language-dependent. The strength of the API is that you can roll your own rules at runtime and at compile-time. If you have different rules for Finnish as a target language for transliteration, then you can modify the ICU rules or supply a whole different set of your own. The rules are written somewhat similarly to regular expressions. See the (draft, somewhat outdated) user guide chapter: http://oss.software.ibm.com/icu/userguide/Transliteration.html and the API references: http://oss.software.ibm.com/icu/apiref/class_Transliterator.html and http://oss.software.ibm.com/icu/apiref/utrans_h.html markus
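To make the idea of swapping rule sets per target language concrete, here is a rough Python sketch. It does not use ICU or its rule syntax; it is a plain dictionary lookup, and the names BASE, OVERRIDES and make_converter are invented for the illustration. The tables cover only the letters of one name and are assumptions chosen to reproduce the spellings mentioned in this thread (Gorbachev, Gorbatschow, and the Finnish form with s-caron).

```python
# English-centric default rules for the handful of letters in the example.
BASE = {
    "г": "g", "о": "o", "р": "r", "б": "b", "а": "a",
    "ч": "ch", "\u0451": "e", "в": "v",   # \u0451 is Cyrillic small io
}

# Per-target-language overrides, layered on top of BASE (hypothetical tables).
OVERRIDES = {
    "en": {},
    "de": {"ч": "tsch", "\u0451": "o", "в": "w"},
    "fi": {"ч": "t\u0161", "\u0451": "o"},   # \u0161 is s with caron
}


def make_converter(target):
    """Build a converter for one target language from BASE plus its overrides."""
    table = {**BASE, **OVERRIDES[target]}

    def convert(text):
        out = []
        for ch in text:
            mapped = table.get(ch.lower())
            if mapped is None:
                out.append(ch)                    # unknown character: copy
            elif ch.isupper():
                out.append(mapped.capitalize())   # keep initial capital
            else:
                out.append(mapped)
        return "".join(out)

    return convert


name = "Горбач\u0451в"
for lang in ("en", "de", "fi"):
    print(lang, make_converter(lang)(name))
```

Real rule sets need context (neighbouring letters, word boundaries), which is why ICU's rules resemble regular expressions rather than a flat table like this one.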
Re: Unicode transliterations (and other operations)
As Markus says, one can do that right now, by making your own (say) German-Serbian transliterator, one that is different from Latin-Cyrillic, Latin-Serbian, or German-Cyrillic. In ICU 2.0, we are examining the possibility of a lookup hierarchy, similar to the resource hierarchy, that would allow us to organize them more effectively. Our goal for the script-script rules will be to try to be as neutral as we can, while preserving round-tripping. See Guidelines for... in (the slightly out-of-date) http://oss.software.ibm.com/icu/userguide/Transliteration.html We are also adding variant tags, since there are many transliteration schemes that are not associated with language per se, but rather with a particular standard. For example, Latin-Greek/ISO-834. Since the goal for these rule sets will be to match the standard, they will not, in general, round-trip. Also, here are some responses to a private mail I got on my original message. Горбачев, Михаил = Gorbachèv, Mìkhaìl Hmmm. First, is it Горбачев, or Горбачёв? These were names given to us by our Russian center, so I assume it is correct (but don't know otherwise). Then, your transliteration uses grave accents, which I never saw for Russian (or even Cyrillic). The Cyrillic and Devanagari rules are preliminary. We'll be fixing those once we get some more of the code features in place. For Devanagari, we already have an Interindic representation that goes to and from all of the Indic scripts. We will be developing a Latin-Interindic that lets us get from Latin to (and from) Interindic, which can then pivot to (and from) the others. 
And here are some pages that might be of interest: - Transliteration of Non-Roman Alphabets and Scripts [http://homepage.mac.com/sirbinks/translit.html] - TC46 Transliteration Links [http://www.elot.gr/tc46sc2/bookmarks.html] - UN Working Group on Geographical Names [http://www.eki.ee/wgrs] Mark (In reply to the message from Markus Scherer sent Tuesday, July 03, 2001 10:00.)
Re: Unicode transliterations (and other operations)
I trust that 'moving' a name or a term between languages would be called transcription, not transliteration. Transliteration just tries to 'move' from script to script. Markus Scherer writes: Looks interesting. How are you approaching the complication that transliteration is between pairs of languages? I know what you mean: Gorbachev is Gorbatschow in German. This would then be an example of transcription, which differs on language pair basis, as it tries to get the speakers to pronounce the same word. I think that the rules that we have in ICU are probably English-centric where it makes a difference. V. -- Vladimir Weinstein, IBM GCoC-Unicode/ICU Cupertino, CA, [EMAIL PROTECTED]
RE: Unicode transliterations (and other operations)
I know what you mean: Gorbachev is Gorbatschow in German. Gorbatsov in Finnish transliteration, the ch would be very unwieldy for a Finnish mouth. (The s is used solely in transliteration, not in Finnish proper.) I think that the rules that we have in ICU are probably English-centric where it makes a difference. Note that some of the transliterator functions like uppercasing and any-name are just wrappers around Unicode functions, and so not language-dependent. The strength of the API is that you can roll your own rules at runtime and at compile-time. If you have different rules for Finnish as a target language for transliteration, then you can modify the ICU rules or supply a whole different set for your own. The rules are written somewhat similarly to regular expressions. See the (draft, somewhat outdated) user guide chapter: http://oss.software.ibm.com/icu/userguide/Transliteration.html One thing you could update in this page is the very first line :-) where it is claimed that transliteration is between scripts...
RE: Unicode transliterations (and other operations)
On Tuesday, July 03, 2001 2:56 PM, I wrote: Gorbatsov in Finnish transliteration, the ch would be very unwieldy for a Finnish mouth. (The s is used solely in transliteration, not in Finnish proper.) Grrr. Something ate the caron from the s in "ts": that should have read Gorbatšov, with š (U+0161). The same happened in the parenthetical: it is š that is used solely in transliteration; Finnish does of course have a plain s.
Re: Unicode transliterations (and other operations)
So if I was trying to write my fake name in Polish, or for a Pole to read, I would write it as "Tendou Rjuud{U+017E}i"? That would be transliteration, right?
Re: Unicode transliterations (and other operations)
In a message dated 2001-07-03 21:06:50 Pacific Daylight Time, [EMAIL PROTECTED] writes: So if I was trying to write my fake name in Polish, or for a Pole to read, I would write it as "Tendou Rjuud{U+017E}i"? That would be transliteration, right? Maybe not. This is the part I got wrong several weeks ago when we had this discussion, and I hope my understanding is better now. Transliteration is about building a reversible mapping between the original (in this case, Japanese) sounds and a set of (in this case, Latin) characters, with the focus on reversibility rather than legibility. You might even use numbers or other symbols to ensure that the transliterated version can be mapped unambiguously back to Japanese. The reader might have to go through a learning curve to equate your symbols with the desired sounds. Transcription is about optimizing the Latin-script version for, say, a Polish-language reader. A transcription has not only a target script but also a target language, and it might be different for each of Polish, German, French, English, etc. The goal is enabling the Polish reader to pronounce the Japanese text with a minimal learning curve. A classic example of Russian-to-X transcription (where X is some Latin-script language) is a well-known name like Khrushchev or Gorbachev. Here the spellings I have used are those that would likely lead an English speaker to pronounce the names reasonably correctly. A transcription intended for German speakers might be Khruschtschow. None of these would be a proper transliteration, because they are not completely reversible (the 'shch' and 'schtsch' combinations could be U+0449, or U+0448 plus U+0447). Unfortunately, the terms transcription and transliteration are commonly mixed up by non-experts, causing much confusion. Please, somebody let me know if this is still not right. -Doug Ewell Fullerton, California
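The reversibility point can be checked mechanically. The Python sketch below is only an illustration: the function names are made up, and both tables are tiny fragments. The first uses the English-style sh / ch / shch spellings from the message; the second is a hypothetical one-symbol-per-letter scheme, included only to show what a reversible mapping looks like.

```python
# English-style spellings: U+0448 -> sh, U+0447 -> ch, U+0449 -> shch.
ENGLISH_STYLE = {"ш": "sh", "ч": "ch", "щ": "shch"}

# Hypothetical scheme with one distinct Latin symbol per Cyrillic letter.
ONE_TO_ONE = {"ш": "\u0161", "ч": "\u010d", "щ": "\u015d"}


def romanize(text, table):
    """Replace each character found in the table; copy everything else."""
    return "".join(table.get(ch, ch) for ch in text)


def is_reversible(words, table):
    """True if no two different source strings produce the same output."""
    seen = {}
    for word in words:
        latin = romanize(word, table)
        if seen.setdefault(latin, word) != word:
            return False          # two sources collapsed into one spelling
    return True


samples = ["щ", "шч"]             # U+0449 versus U+0448 followed by U+0447
print(romanize("щ", ENGLISH_STYLE), romanize("шч", ENGLISH_STYLE))
print(is_reversible(samples, ENGLISH_STYLE))
print(is_reversible(samples, ONE_TO_ONE))
```

With the English-style table both samples come out as the same Latin string, so the original cannot be recovered; with one symbol per letter the two stay distinct, at the cost of the learning curve mentioned above.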