Re: Unicode transliterations (and other operations)

2001-07-06 Thread James Kass


Mike Ayers wrote with the solution to the mathematical
puzzle.

Kudos, Mike!

Substituting digits rather than letters, shoulda known.

Is there a prize?

Best regards,

James Kass.






FW: Re: Unicode transliterations (and other operations)

2001-07-06 Thread てんどうりゅうじ

Have you a better idea?

That is not low.

Low is when I scare myself. You do not want to see what I think.
Low is why I ought to be kept away from real, living women because of what I might do 
after 700 or 800 millilitres of sake.

Low would be bad.

And there is lower. Let us not go there.

I wish nobody had brought this up. You know not low.

I can see myself doing it even without the sake. Yes, I ought to go.


らんま ★じゅういっちゃん★
　×あかね
http://www.trigeminal.com/






Re: Unicode transliterations (and other operations)

2001-07-05 Thread James Kass


--
http://www.lonelyplanet.com/destinations/south_east_asia/myanmar/

Burma became Myanmar in 1989 after the State Law and
Order Restoration Council decided that the old name implied
the dominance of Burmese culture; the Burmese are just one
of the many ethnic groups in the country...


An interesting site with writings from various people
favoring either Burma or Myanmar suggests that
Burma and Myanmar are separate words with different
etymologies.

http://ffmemorial.hypermart.net/burma_or_myanmar.html

-

There is apparently some controversy about this, which is
beside the point.  Perhaps Cambodia would make a better
example?

Ever read a technical paper in a field not your own, but a field
which may be of interest or related to your field?  Maybe you'd
have to read it a second or third time (or more) before eventually
beginning to understand the message.

Does trade jargon (the technical language in a particular field) exist
to clarify a trade, or is its purpose more to exclude anyone not
part of the inner circle?

Technical writing by techies for techies is a bit of a peeve for me
(in case this isn't already evident).  If we need to make distinctions
and it is possible to make these distinctions using plain language,
don't we reach more people with such plain language?

I've often wondered about this with regards to subjects like
programming languages.  Is this practice (trade jargon) unique
to English?  In other words, does a Hindi speaker wishing to
learn, for example, the C programming language have any
advantage over the English speaker because the C programming
instructions in Hindi are in 'plain-Hindi' rather than 'tech-speak'?

Quoting from McCormick on Evidence Third Edition (1984):

 In cases where privity in the strict sense does not exist
  between a person suing for injuries and his administrator
  suing for death caused thereby, identity of interest is
  advanced as a basis for admitting in the later case
  testimony given in the former.

(from footnote 12 on page 765, a random selection)

And we all thought privity and identity of interest
were synonyms. <smiley>

Well, unless someone comes out with Rules of Evidence for
Dummies, I suppose it would be necessary to hire a lawyer.

We need precision, sure, but clarity is important too.

Let's try the phrase from the purpose.html page quoted
earlier again:

  It is indispensable in that it permits the univocal 
   transmission of a written message between two 
   countries using different writing systems or 
   exchanging a message the writing of which is 
   different from their own.

Or, to paraphrase:

 'It's needed because it makes straightforward message
  exchange possible between groups which use different
  writing systems.'

By the way, "univocal" is a word, after all.  It's in a bigger,
hardback Webster's and means "having one voice", just as its
roots suggest.  I'd mistakenly assumed that the author was
going for "unequivocal" and had made a typo.  Shucks.

John Cowan wrote:
 ... In transliteration, we
 are mapping one script to another in a language-independent way.
 In transcription, we are mapping the writing conventions of one
 language to those of another.

This is clear enough and precise.  It's also concise in that it condenses
much of the verbose page purpose.html down to two sentences.

The reason it makes me uncomfortable is that these definitions
don't match the standard meanings of the words as contained in
dictionaries.  I'm afraid to suggest alternatives like "machine
transliteration" and "phonetic transcription", though, because
they are a bit cumbersome and would possibly only add to the
confusion.

Peter Constable wrote:

 True, though of course they do have the authority to 
 say, "In the context of our standards we use term x to mean X."

(and, in a different letter)

 ...it is my impression that many people use the term 
 "transliteration" in a broader sense than the strict definition 
 defined by TC 46. That appears to be the case for the help file 
 associated with the ICU demo, which defines transliteration as
 "the general process of converting characters from one particular 
 script to another one."

So, if words are to be re-defined, let's assure that they are
explicitly re-defined and that these re-definitions are
accessible.  Meanwhile, when someone uses the terms in the
'broader sense' (id est: dictionary definition), please let's not
chide them for it.

Best regards,

James Kass.

- Original Message -
From: John Cowan [EMAIL PROTECTED]
To: James Kass [EMAIL PROTECTED]
Cc: Unicode List [EMAIL PROTECTED]; Lukas Pietsch 
[EMAIL PROTECTED]; J M Sykes [EMAIL PROTECTED];
[EMAIL PROTECTED]
Sent: Wednesday, July 04, 2001 9:23 PM
Subject: Re: Unicode transliterations (and other operations)


 James Kass scripsit:

  Does the vocabulary make things clearer or cause confusion?
  If we need to distinguish between reversible

Re: Unicode transliterations (and other operations)

2001-07-05 Thread Martin Heijdra

Just FYI:

For a history of the practices and terminology debates surrounding
transliteration, transcription, etc., see:

Wellisch, Hans H., 1920-. The conversion of scripts, its nature, history,
and utilization / Hans H. Wellisch. -- New York : Wiley, c1978. xviii,
509 p. : ill. ; 24 cm.

The same author has a much shorter bibliography, which I think is
superseded by this book.

Martin Heijdra

- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, July 04, 2001 4:37 AM
Subject: Re: Unicode transliterations (and other operations)



 On 07/02/2001 02:56:16 PM Mark Davis wrote:

 For those interested in Transliteration (and other Unicode
 transformations), there is a new ICU web demo program on
 
 http://oss.software.ibm.com/developerworks/opensource/icu/translitdemo...

 This opens an area of some interest to me and some of my colleagues.

 There have been some messages in this thread discussing whether something
 is transliteration or transcription. On that point I have two comments:
 first, ISO TC 46 has created definitions for these two terms that apply to
 ISO standards under their purview; these definitions can be found at
 http://www.elot.gr/tc46sc2/purpose.html. Secondly, it is my impression
 that many people use the term "transliteration" in a broader sense than
 the strict definition defined by TC 46. That appears to be the case for
 the help file associated with the ICU demo, which defines transliteration
 as "the general process of converting characters from one particular
 script to another one." Moreover, there is a need for a term to describe
 a particular situation that is very common around the world, and so far
 as I know the term "transliteration" is the only term that comes close to
 describing that phenomenon. It is this phenomenon which is the focus of
 interest for me and my SIL colleagues: a single language that is written
 by different portions of the language community in different writing
 systems, particularly different writing systems based on different
 scripts.

 For example, Kashmiri (India / Pakistan) is written in Devanagari and in
 Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written
 in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script
 and in Roman with Vietnamese-style diacritics.

 This phenomenon is of particular interest and concern for applied
 linguists involved in literacy and literature development: for literacy,
 they might need to assist people in learning how to make the transition
 between one writing system and another, and they certainly need to
 develop different sets of literacy materials for each writing system
 (probably with significant duplication in content). For those working on
 literature development, there is a repeated need to publish documents in
 multiple writing systems. For large publications that are developed over
 long periods of time, such as dictionaries or translations of long works
 such as the Bible, issues of versioning and data management become
 particularly focal: the opus is going to be edited and revised literally
 hundreds of times: if one has to maintain three copies (corresponding to
 three writing systems) of a document through dozens of changes each
 working day over (say) an eight-year period, that is a lot of additional
 work.

 Clearly in situations such as this, there would be a significant benefit
 to be gained if it were possible for a person to create a document in one
 writing system and have the parallel documents in the other writing
 systems generated by some automated processes.

 There are, in principle, three potential ways to deal with publishing in
 multiple writing systems:

 1. Separate documents are created manually, one for each writing system.

 2. A document is created manually in one writing system, and different
 parallel documents are generated through an automated process for the
 other writing systems.

 3. A single document is created that can be displayed in terms of
 alternate writing systems using font mechanisms, possibly relying on
 transduction done within smart fonts.

 (Note that I say these are *potential* possibilities; there are additional
 factors such as whether a spelling in one writing system contains adequate
 information to determine a unique spelling in a different writing system -
 can one be generated deterministically from the other.)

 There are plenty of cases in which the first method has been used. We have
 done some implementations of both the second and the third varieties. For
 example, last year we developed a system of the second variety that
 simultaneously supports both Ethiopic and Roman writing systems using a
 custom encoding and Worldscript and GX (yes, GX, not AAT), and that is
 being used by a linguist for work on the Koorete language in Ethiopia. Our
 SIL Hebrew font package includes the third variety as a capability: the
 Ezra Standard Encoding permits changing between Hebrew script and
 Roman-based

Re: Unicode transliterations (and other operations)

2001-07-05 Thread John Cowan

James Kass scripsit:

 An interesting site with writings from various people
 favoring either Burma or Myanmar suggests that
 Burma and Myanmar are separate words with different
 etymologies.

I don't think so.  But the question has become politicized, because
the change (in Latin transliteration only, note) was made by
a government which many believe to be illegitimate.

I agree that the example was a bad one for that reason.

 Does trade jargon (the technical language in a particular field) exist
 to clarify a trade, or is its purpose more to exclude anyone not
 part of the inner circle?

Some of each, to be sure.

 I've often wondered about this with regards to subjects like
 programming languages.  Is this practice (trade jargon) unique
 to English?  In other words, does a Hindi speaker wishing to
 learn, for example, the C programming language have any
 advantage over the English speaker because the C programming
 instructions in Hindi are in 'plain-Hindi' rather than 'tech-speak'?

On the contrary, it is often worse in other languages, because most of the
technical jargon is typically adopted straight from English.

  ... In transliteration, we
  are mapping one script to another in a language-independent way.
  In transcription, we are mapping the writing conventions of one
  language to those of another.
 
 This is clear enough and precise.  It's also concise in that it condenses
 much of the verbose page purpose.html down to two sentences.

Thank you.

Note that I used the jargon verb "map", which is old enough in this
sense that it does appear in dictionaries, but is still probably
unfamiliar to many.

 The reason it makes me uncomfortable is that these definitions
 don't match the standard meanings of the words as contained in
 dictionaries.

So much the worse for dictionaries, then.  :-)

 I'm afraid to suggest alternatives like machine
 transliteration and phonetic transcription, though, because
 they are a bit cumbersome and would possibly only add to the
 confusion.

Right.  And note that until a decade or two ago, all transliteration
*and* transcription was very much by hand: no machines involved.

 Meanwhile, when someone uses the terms in the
 'broader sense' (id est: dictionary definition), please let's not
 chide them for it.

Well, fine.  But when someone is talking about physics, and
uses "energy", "power", and "force" interchangeably, do we
accept this as a broader sense of the terms, or do we
explain to them that in this field, the terms are definitely
*not* interchangeable?

-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter




Re: Unicode transliterations (and other operations)

2001-07-05 Thread James Kass


John Cowan wrote:

 
 I don't think so.  But the question has become politicized, because
 the change (in Latin transliteration only, note) was made by
 a government which many believe to be illegitimate.
 

... in every sense of the word, apparently.

 I agree that the example was a bad one for that reason.


Yet coming across that web page while probing the issue was
quite an eye-opener for me, and I am grateful.

  ...advantage over the English speaker because the C programming
  instructions in Hindi are in 'plain-Hindi' rather than 'tech-speak'?
 
 On the contrary, it is often worse in other languages, because most of the
 technical jargon is typically adopted straight from English.
 

Then "member variable" would be transcribed to Devanagari?  
If so, how unfortunate.

 Note that I used the jargon verb map, which is old enough in this
 sense that it does appear in dictionaries, but is still probably
 unfamiliar to many.
 

Using "map" in this fashion shouldn't be too much of a problem,
though; it's generic enough that the meaning can be derived from
context.

  The reason it makes me uncomfortable is that these definitions
  don't match the standard meanings of the words as contained in
  dictionaries.
 
 So much the worse for dictionaries, then.  :-)
 

And for standards? (-:

 
 Right.  And note that until a decade or two ago, all transliteration
 *and* transcription was very much by hand: no machines involved.
 

Yes, and the dictionary definitions seem to derive from the
manuscript era.  Perhaps a newer dictionary...

 
 Well, fine.  But when someone is talking about physics, and
 uses energy, power, and force interchangeably, do we
 accept this as a broader sense of the terms, or do we
 explain to them that in this field, the terms are definitely
 *not* interchangeable?
 

Physics isn't my forte, but even in the vernacular the terms
aren't necessarily interchangeable:  Energy shortage, power 
to the people, and may the Force be with you.

Best regards,

James Kass.






RE: Unicode transliterations (and other operations)

2001-07-05 Thread Ayers, Mike


 From: James Kass [mailto:[EMAIL PROTECTED]] 
 
 てんどうりゅうじ wrote:
 
  Still haven't got the multiplication riddle solved, Mr. Kass?
 
 
 Sorry, I didn't know it was required.  Almost asked 'which
 riddle?', but now notice the × in the signature portion as
 follows...
 
 
 　　らんま
 　×あかね
 ーーーーー
 　あまんけ
 ねけあず
 らんま
 ーーーーー
 いいなずけ
 

The key:

0 - ん 5 - な
1 - あ 6 - け
2 - ま 7 - い
3 - ね 8 - ず
4 - ら 9 - か

So we get:

    402
  × 193
  -----
   1206
  3618
  402
  -----
  77586

...which you can verify on your calculator.
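
For anyone without a calculator to hand, the key can also be checked mechanically. What follows is only a minimal Python sketch added for illustration, not part of the original message; the names KEY, kana_to_int and check_riddle are invented here. It reads each kana string as a decimal number and checks the three partial products as well as the final product.

```python
# Kana-to-digit key exactly as listed in the message above.
KEY = {"ん": 0, "あ": 1, "ま": 2, "ね": 3, "ら": 4,
       "な": 5, "け": 6, "い": 7, "ず": 8, "か": 9}


def kana_to_int(word):
    """Read a kana string as a decimal number using KEY."""
    return int("".join(str(KEY[ch]) for ch in word))


def check_riddle():
    """Return True if every row of the long multiplication is consistent."""
    multiplicand = kana_to_int("らんま")
    multiplier = kana_to_int("あかね")
    # Partial products in the order they appear in the riddle, i.e. for the
    # least significant digit of the multiplier first.
    partials = [kana_to_int(w) for w in ("あまんけ", "ねけあず", "らんま")]
    digits = [int(d) for d in reversed(str(multiplier))]
    rows_ok = all(multiplicand * d == p for d, p in zip(digits, partials))
    total_ok = multiplicand * multiplier == kana_to_int("いいなずけ")
    return rows_ok and total_ok


print(kana_to_int("らんま"), kana_to_int("あかね"), kana_to_int("いいなずけ"))
print(check_riddle())
```

The check only confirms that this key is consistent with the riddle; it does not show that the key is the only possible one.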

 Colloquial Japanese by Noboru Inamoto doesn't include any
 of these words in the vocabulary list.  Easy Japanese by
 Samuel E. Martin doesn't list them in PART IV 3000 Useful
 Japanese Words, either.  (But, the Japanese word for riddle
 is nazo.)  Surely there are better references around here
 somewhere, but your CD collection is probably better
 organized than my books at present.

Hee hee - unless you're packing a guide to anime, you'll never find
'em anyway.  らんま is Ranma, as in Ranma Saotome, and あかね is Akane, as in
Akane Tendo, the two main stars of Rumiko Takahashi's bizarre (if
monothematic) sex comedy Ranma 1/2.


/|/|ike




Re: Unicode transliterations (and other operations)

2001-07-05 Thread Michael \(michka\) Kaplan

 Hee hee - unless you're packing a guide to anime, you'll never find
 'em anyway.  らんま is Ranma, as in Ranma Saotome, and あかね is Akane, as in
 Akane Tendo, the two main stars of Rumiko Takahashi's bizarre (if
 monothematic) sex comedy Ranma 1/2.

Seeing this wonderful use of Unicode text in e-mail brings a quote to mind:

"Marvelous technology is at our disposal; but instead of reaching up to new
heights, we're going to see how far down we can go, how deep into the muck
we can immerse ourselves." -- Barry Champlain (Eric Bogosian), Oliver
Stone's Talk Radio

<g>

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: Unicode transliterations (and other operations)

2001-07-04 Thread James Kass


Doug Ewell wrote:


 
 Maybe not.  This is the part I got wrong several weeks ago when we had this 
 discussion, and I hope my understanding is better now.
 
 Transliteration is about building a reversible mapping between the original 
 (in this case, Japanese) sounds and a set of (in this case, Latin) 
 characters, with the focus on reversibility rather than legibility.  You 
 might even use numbers or other symbols to ensure that the transliterated 
 version can be mapped unambiguously back to Japanese.  The reader might have 
 to go through a learning curve to equate your symbols with the desired sounds.
 
 Transcription is about optimizing the Latin-script version for, say, a 
 Polish-language reader.  A transcription has not only a target script but 
 also a target language, and it might be different for each of Polish, German, 
 French, English, etc.  The goal is enabling the Polish reader to pronounce 
 the Japanese text with a minimal learning curve.
 
<snip>
 
 Unfortunately, the terms transcription and transliteration are commonly 
 mixed up by non-experts, causing much confusion.
 
 Please, somebody let me know if this is still not right.

Transliteration just means to write something using the characters
of another alphabet.  Legibility is the focus, so numbers or
symbols shouldn't enter the picture.  

A transcription is simply a copy (usually in the same 
language/script as the source, otherwise it wouldn't be a copy).
An exception would be a typed transcript of something
originally written in shorthand.

This according to Webster's New World Dictionary (of English),
a recognized authority (on English).

Best regards,

James Kass.







Re: Unicode transliterations (and other operations)

2001-07-04 Thread てんどうりゅうじ

Maybe we are just being weird here.

We ought to try to avoid twisting language, even if we do pretty much operate within 
our own little techie world here.

Still haven't got the multiplication riddle solved, Mr. Kass?


らんま ★じゅういっちゃん★
　×あかね

Re: Unicode transliterations (and other operations)

2001-07-04 Thread Peter_Constable


On 07/03/2001 09:47:17 PM Doug Ewell wrote:

Unfortunately, the terms transcription and transliteration are commonly
mixed up by non-experts, causing much confusion.

Please, somebody let me know if this is still not right.

See my comments on this and the URL for ISO definitions in my other
message.


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]






RE: Unicode transliterations (and other operations)

2001-07-04 Thread Marco Cimarosti

Peter Constable wrote:
 It is this phenomenon which is the focus of
 interest for me and my SIL colleagues: a single language that 
 is written by different portions of the language community
 in different writing systems, particularly different writing
 systems based on different scripts.

I would include braille in this scenario.

Braille transcription is another facet of the issue, but it involves
practically every written language.

Some languages have an approximately 1-to-1 relationship between the
graphemes in their visual scripts and braille patterns. Other languages
have a more complex and indirect relationship between the two worlds. E.g.,
English and other languages normally require grade 2 braille, which is a
quasi-logographic system of abbreviations for words or parts of words.
Conversely, Chinese brailles are strictly phonetic, having signs for initial
consonants and final rhymes. Japanese braille uses a single kana
syllabary, etc.

Any project in the area of automatic transcription should start by analyzing
what is already done in the braille world, in order not to reinvent the
wheel. Conversely, any progress in the automatic transcription across
scripts could potentially be reused in braille technology.

It may turn out that the two kinds of transcription have a lot in common,
such as conversions depending on context, or the need for dictionary lookup
in some cases.
 
 3. A single document is created that can be displayed in
 terms of alternate writing systems using font mechanisms,
 possibly relying on transduction done within smart fonts.

Peter, the fact that SIL has a nice smart font technology available does
not mean that this technology should be used also for brewing beer!

IMHO, font technology should be used only for displaying text, which is
where it applies. Other tasks, unrelated to this problem, should be handled
with different tools.

Of course, some basic algorithms could be in common, such as moving letters
around, splitting or joining ligatures, etc. But the similarity ends here,
methinks.

_ Marco




Re: Unicode transliterations (and other operations)

2001-07-04 Thread James Kass


てんどうりゅうじ wrote:


 We ought to try to avoid twisting language, even if we do pretty much
 operate within our own little techie world here.

Indeed!  Or, at least if we need a correct definition of
an English word, we should consult an English dictionary.
The web page cited by Mr. Constable is simply misleading, unless
it were to be amended to clearly state "for the purposes of
this and related documents... these words mean &c."

Languages change over time and so do the definitions of words
or phrases within a language.  Blind pig meant something
other than a sightless farm critter in the 1920s and '30s, for
example, and my guess is that a larger percentage of subscribers
to this list would recognize that term than the average ranihan
on the streets.  (Hope ranihan is spelled correctly, for some
reason it isn't in the paperback Webster's here.)

No international body has any authority to alter the meaning of
existing words in my language or any of our languages.


 Still haven't got the multiplication riddle solved, Mr. Kass?


Sorry, I didn't know it was required.  Almost asked 'which
riddle?', but now notice the × in the signature portion as
follows...


　　らんま
　×あかね
ーーーーー
　あまんけ
ねけあず
らんま
ーーーーー
いいなずけ


So, here goes with a transliteration...
  ranma
× akane
--------
amanke
nekeazu
ranma
--------
iinazuke

Japanese class was a long time ago...

Colloquial Japanese by Noboru Inamoto doesn't include any
of these words in the vocabulary list.  Easy Japanese by
Samuel E. Martin doesn't list them in PART IV 3000 Useful
Japanese Words, either.  (But, the Japanese word for riddle
is nazo.)  Surely there are better references around here
somewhere, but your CD collection is probably better
organized than my books at present.

If the riddle is a Japanese cryptogram, there is little hope
for me.

Has anyone solved the riddle, てんどうりゅうじ-san ?  (Besides
Sarasvati, who probably figured it out at once.)  Perhaps you
will take some sake, become magnanimous, and enlighten us?

Back on topic, with regards to the terminology...  The page
in question ( http://www.elot.gr/tc46sc2/purpose.html )
uses the word "transcription" where the word "transliteration"
should be, and what they call "transliteration" could easily be
referred to as "reversible transliteration" in plain English,
without 'breaking existing applications' like my dictionary.

English is too complicated already; let's not make it more complex.

Back off topic...

 PTKA IZGT F SFNNGYGB ZRMSFTB WM
 NFEGT FM MGYWPRMKA FM F SFNNGYGB IWOG
 IWKK QGT FT IPQGT ZFXG GHRFK YWJZNM.

Only when a battered husband is
taken as seriously as a battered wife
will men an women have equal rights.

The typo in the third line threw me off for a moment...
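
As an aside, the solution above can be checked by deriving the substitution table from the two texts. This is just an illustrative Python sketch (the names CIPHER, PLAIN, derive_key and decode are invented here). It assumes a simple one-to-one letter substitution, builds the table letter by letter, and fails if any cipher letter would need two different plain letters.

```python
CIPHER = ("PTKA IZGT F SFNNGYGB ZRMSFTB WM "
          "NFEGT FM MGYWPRMKA FM F SFNNGYGB IWOG "
          "IWKK QGT FT IPQGT ZFXG GHRFK YWJZNM.")
PLAIN = ("ONLY WHEN A BATTERED HUSBAND IS "
         "TAKEN AS SERIOUSLY AS A BATTERED WIFE "
         "WILL MEN AN WOMEN HAVE EQUAL RIGHTS.")


def derive_key(cipher, plain):
    """Build the cipher-letter -> plain-letter table, raising on any conflict."""
    if len(cipher) != len(plain):
        raise ValueError("texts differ in length")
    key = {}
    for c, p in zip(cipher, plain):
        if not c.isalpha():
            if c != p:
                raise ValueError("punctuation or spacing does not line up")
            continue
        if key.setdefault(c, p) != p:
            raise ValueError(f"{c} maps to both {key[c]} and {p}")
    if len(set(key.values())) != len(key):
        raise ValueError("two cipher letters map to the same plain letter")
    return key


def decode(text, key):
    """Apply the table; characters without an entry are copied unchanged."""
    return "".join(key.get(ch, ch) for ch in text)


key = derive_key(CIPHER, PLAIN)
print(decode(CIPHER, key) == PLAIN)
print(decode("FTB", key))  # what the third line presumably meant instead of FT
```

The second print shows that FTB would decode to AND, which fits the typo mentioned above: the third line of the cryptogram has FT (AN) where FTB was presumably intended.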

Best regards,

James Kass.









Re: Unicode transliterations (and other operations)

2001-07-04 Thread Lukas Pietsch

James Kass wrote:

 Indeed!  Or, at least if we need a correct definition of
 an English word, we should consult an English dictionary.
 The web page cited by Mr. Constable is simply misleading, unless
 it were to be amended to clearly state for the purposes of
 this and related documents... these words mean c.

Well, the English dictionaries give usages of words in everyday language,
and that's fine. But in their usage as technical terms, the distinction
between transcription and transliteration (roughly along the lines of the
http://www.elot.gr/tc46sc2/purpose.html page) seems to me to be a fairly
well-established one, in the field of linguistics at least.

 No international body has any authority to alter the meaning of
 existing words in my language or any of our languages.

Sure, but we're dealing with a scholarly discipline's technical vocabulary here, and 
it's not such a bad idea in this case if computer people dealing with language adopt 
the usage of linguists, is it?

 what they call transliteration could easily be
 referred to as reversible transliteration in plain English,
 without 'breaking existing applications' like my dictionary.

You must understand: this isn't about breaking existing applications, it's about a 
higher-level protocol! ;-)


Lukas Pietsch





Re: Unicode transliterations (and other operations)

2001-07-04 Thread J M Sykes


- Original Message -
From: James Kass [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, July 04, 2001 8:10 AM
Subject: Re: Unicode transliterations (and other operations)



 Doug Ewell wrote:


 
  Maybe not.  This is the part I got wrong several weeks ago when we had
  this discussion, and I hope my understanding is better now.
 
  Transliteration is about building a reversible mapping between the
  original (in this case, Japanese) sounds and a set of (in this case,
  Latin) characters, with the focus on reversibility rather than
  legibility.  You might even use numbers or other symbols to ensure that
  the transliterated version can be mapped unambiguously back to Japanese.
  The reader might have to go through a learning curve to equate your
  symbols with the desired sounds.
 
  Transcription is about optimizing the Latin-script version for, say, a
  Polish-language reader.  A transcription has not only a target script
  but also a target language, and it might be different for each of
  Polish, German, French, English, etc.  The goal is enabling the Polish
  reader to pronounce the Japanese text with a minimal learning curve.
 
 <snip>
 
  Unfortunately, the terms transcription and transliteration are
  commonly mixed up by non-experts, causing much confusion.
 
  Please, somebody let me know if this is still not right.

 Transliteration just means to write something using the characters
 of another alphabet.  Legibility is the focus, so numbers or
 symbols shouldn't enter the picture.

From the New Shorter OED:
Transliterate:
Replace (letters or characters of one language) by those of another used to
represent the same sounds; write (a word etc.) in the closest corresponding
characters of another alphabet or language.

 A transcription is simply a copy (usually in the same
 language/script as the source, otherwise it wouldn't be a copy).
 An exception would be a typed transcript of something
 originally written in shorthand.

From the New Shorter OED:
Transcribe (among other meanings)
v.t. Transliterate; write out (shorthand, notes, etc.) in ordinary
characters or continuous prose. Formerly also, translate.

I'm relieved to find that OED and Webster agree, though note that the OED
recognises that transcribe is sometimes used as a synonym of transliterate.

This is not to say that I don't recognise the useful distinction between a
reversible transformation and a non-reversible one.

Experts redefine words at the risk of confusing non-experts; when they do,
they should not be surprised at the ensuing confusion -- they brought it on
themselves.

Regards,

(non-expert) Mike.

Impenetrability! That's what I say!






Re: Unicode transliterations (and other operations)

2001-07-04 Thread Vladimir Weinstein

[EMAIL PROTECTED] writes:
  There have been some messages in this thread discussing whether something
  is transliteration or transcription. On that point I have two comments:
  first, ISO TC 46 has created definitions for these two terms that apply to
  ISO standards under their purview; these definitions can be found at
  http://www.elot.gr/tc46sc2/purpose.html. Secondly, it is my impression that
  many people use the term transliteration in a broader sense than the
  strict definition defined by TC 46. That appears to be the case for the
  help file associated with the ICU demo, which defines transliteration as,
  the general process of converting characters from one particular script to
  another one. Moreover, there is a need for a term to described a

This is because the ICU implementation of transliteration actually allows for an even
more general thing: converting characters according to a given set of rules. It can be
used both for transliteration and transcription as defined by TC 46.

  For example, Kashmiri (India / Pakistan) is written in Devanagari and in
  Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written
  in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script
  and in Roman with Vietnamese-style diacritics.

Let me add Serbian to this list - it is written both in Latin and Cyrillic scripts 
with mapping that is almost one to one. 

In the case of Serbian:
  There are, in principle, three potential ways to deal with publishing in
  multiple writing systems:
  
  1. Separate documents are created manually, one for each writing system.

This method is not feasible at all in the case of Serbian.

  2. A document is created manually in one writing system, and different
  parallel documents are generated through an automated process for the other
  writing systems.

This is the most common practice used, although with some interesting consequences, 
see below.

  3. A single document is created that can be displayed in terms of alternate
  writing systems using font mechanisms, possibly relying on transduction
  done within smart fonts.

This one is also used.

Here is the case of Serbian. It uses 30 Cyrillic letters or 30 Latin letters. However,
some of the letters in the Latin alphabet are written as two characters. Here are
the pairs:
\u0409/\u0459 (Љ/љ) == Lj/lj
\u040A/\u045A (Њ/њ) == Nj/nj
\u040F/\u045F (Џ/џ) == D\u017E/d\u017E (Dž/dž)
\u0402/\u0452 (Ђ/ђ) is occasionally represented in Latin as Dj/dj, but usually by
\u0110/\u0111 (Đ/đ)

Transliteration from Cyrillic to Latin is very easy. The only problem is the
transliteration of the uppercase letters above, which can become either an
upper/lower case combination or two uppercase letters, depending on the case of the
following letters.
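
The case rule just described can be sketched in a few lines of Python. This is only an illustration, not the Word macros or the ICU rules mentioned in this thread: the name cyr_to_lat is hypothetical, and looking at the preceding letter when the digraph ends a word is an added assumption.

```python
# Sketch: Serbian Cyrillic -> Latin, choosing "Lj" or "LJ" from context.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
TABLE = dict(zip(CYR, LAT))  # 30 lowercase Cyrillic letters -> Latin


def cyr_to_lat(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        lat = TABLE.get(ch.lower())
        if lat is None:
            out.append(ch)            # not a Serbian Cyrillic letter: copy
        elif ch.islower():
            out.append(lat)
        elif len(lat) == 1:
            out.append(lat.upper())
        else:
            # Uppercase letter that becomes two Latin characters.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            # Assumption: with no following letter, fall back to the
            # case of the preceding character.
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(lat.upper() if all_caps else lat.capitalize())
    return "".join(out)


print(cyr_to_lat("Љубав"), cyr_to_lat("ЉУБАВ"), cyr_to_lat("КОЊ"))
```

With this heuristic a capitalised word gets "Lj" while an all-caps word gets "LJ"; real text (acronyms, single-letter words) would need more care.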

Transliteration of Serbian from Latin to Cyrillic is a little more complicated,
even when the text is Unicode encoded, for two reasons:
1) if foreign names are not transcribed or tagged, they are simply transliterated
letter by letter into Cyrillic, which is always good for a laugh among Serbian readers;
2) very rarely, a two-character Latin sequence should be transliterated to two Cyrillic
letters instead of just one. This is the case with some adopted foreign words.
However, it rarely matters in everyday practice.
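
A minimal Python sketch of the Latin-to-Cyrillic direction, assuming lowercase input and greedy matching of the two-character sequences. The keep_split parameter is a hypothetical way to list the rare words from point 2; "injekcija" is a commonly cited example of such a word, used here only for illustration.

```python
# Sketch: Serbian Latin -> Cyrillic with greedy digraph matching.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
TO_CYR = dict(zip(LAT, CYR))  # includes the keys "lj", "nj" and "dž"


def lat_to_cyr(word: str, keep_split=frozenset()) -> str:
    # Words listed in keep_split never merge n+j, l+j or d+ž (a crude rule:
    # it switches off every digraph in that word).
    greedy = word not in keep_split
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if greedy and len(pair) == 2 and pair in TO_CYR:
            out.append(TO_CYR[pair])
            i += 2
        else:
            # Characters outside the table (w, q, x, y, digits) pass through.
            out.append(TO_CYR.get(word[i], word[i]))
            i += 1
    return "".join(out)


print(lat_to_cyr("konj"))
print(lat_to_cyr("injekcija"))                 # merges n+j
print(lat_to_cyr("injekcija", {"injekcija"}))  # keeps n and j apart
```

An untagged foreign name run through the same function is converted letter by letter, which is the effect described in point 1.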

An interesting but wrong practice, used by a lot of magazines that print in Cyrillic and
also have a Latin Internet edition, is to use a Latin-based encoding for the Cyrillic
version, in which q, w, x and y stand for the Cyrillic letters that take two characters
in Latin; for example, W and w represent \u040A and \u045A. However, foreign names
are not transcribed, but written in their original form in Latin script. So, after
moving from Cyrillic to Latin, Washington becomes Njashington. Of course, if Unicode
were used for storing the text, transliteration from Cyrillic to Latin would be
correct and almost trivial.
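
The Washington/Njashington effect can be reproduced with a short Python sketch. Only the W/w assignment is stated above, so the table contains just that pair; the assignments for q, x and y are not given and are left out. All other characters pass through both steps unchanged, which is a simplification.

```python
# Sketch of the legacy pipeline: Latin-encoded "Cyrillic" -> Cyrillic -> Latin.
LEGACY_TO_CYR = {"W": "\u040A", "w": "\u045A"}   # W/w stand for Њ/њ
CYR_TO_LAT = {"\u040A": "Nj", "\u045A": "nj"}


def legacy_page_to_latin(text: str) -> str:
    # Step 1: every character is assumed to be legacy-encoded Cyrillic,
    # including foreign names that were really typed in Latin script.
    cyr = "".join(LEGACY_TO_CYR.get(c, c) for c in text)
    # Step 2: ordinary Cyrillic -> Latin transliteration.
    return "".join(CYR_TO_LAT.get(c, c) for c in cyr)


print(legacy_page_to_latin("Washington"))  # Njashington
```

The converter has no way to tell that the W in the name was never Cyrillic, because the encoding reuses Latin code points for both purposes.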

My experience with transliteration is that 'pure' Unicode text is not enough for
comfortable transliteration, especially for texts that tend to mix Latin and Cyrillic,
as is the case with most technical texts. Some additional tagging is required to
make it fully automatic. Otherwise, additional proofreading is required.

I had reasonable success writing MS Word macros that did transliteration. What
helped was formatting foreign words differently, using italic or bold.
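
As a rough Python sketch of the tagging idea (not the Word macros themselves): assume foreign words are wrapped in *asterisks*, a hypothetical stand-in for italic or bold formatting, and are copied verbatim while everything else is converted. The tiny translation table covers only the letters of the example.

```python
import re

# Toy Latin -> Cyrillic table for the example sentence only (u, g, r, a, d).
TOY_TABLE = str.maketrans("ugrad", "уград")


def transliterate_untagged(text: str, convert) -> str:
    """Apply convert() outside *...* spans; keep tagged spans verbatim."""
    parts = re.split(r"(\*[^*]*\*)", text)
    out = []
    for part in parts:
        if len(part) >= 2 and part.startswith("*") and part.endswith("*"):
            out.append(part[1:-1])      # tagged foreign word: drop the marks
        else:
            out.append(convert(part))
    return "".join(out)


print(transliterate_untagged("u gradu *Washington*",
                             lambda s: s.translate(TOY_TABLE)))
```

Untagged foreign words would still be converted, so some proofreading remains necessary unless the tagging is complete.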

Hope this makes sense,
V.

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  Cupertino, CA,  [EMAIL PROTECTED]







Re: Unicode transliterations (and other operations)

2001-07-04 Thread James Kass


Lukas Pietsch wrote:

 
 well, the English dictionaries give usages of words in everyday 
 language, and that's fine. But in their usage as technical terms, 
 the distinction between transcription and transliteration 
 (roughly along the lines of the 
 http://www.elot.gr/tc46sc2/purpose.html page) seems to me 
 to be a fairly well-established one, in the field of linguistics at least.
 

Yes, this would seem to be fairly widespread in the field.

 
 Sure, but we're dealing with a scholarly discipline's technical 
 vocabulary here, and it's not such a bad idea in this case if 
 computer people dealing with language adopt the usage of 
 linguists, is it?
 

Does the vocabulary make things clearer or cause confusion?
If we need to distinguish between reversible script conversion
and irreversible script conversion, could we not simply say
reversible script conversion and so forth?

We speak of code page conversions, but we haven't re-defined
existing words to differentiate between the kind that's reversible 
and the kind that isn't (as far as I know).

  what they call transliteration could easily be
  referred to as reversible transliteration in plain English,
  without 'breaking existing applications' like my dictionary.
 
 You must understand: this isn't about breaking existing 
 applications, it's about a higher-level protocol! ;-)

It's about clarity and precision, too.  When someone obviously
intelligent like Doug Ewell admits to still being unclear weeks
after being educated by hair-splitting techies, isn't there
a problem?

With regards to the 'purpose.html' page linked above, how
seriously should we take a page which includes phraseology
like:

 It is indispensable in that it permits the univocal 
  transmission of a written message between two 
  countries using different writing systems or 
  exchanging a message the writing of which is 
  different from their own.

...?  The page was last updated in 1996, yet the first line
of the page has the typo "were" for "where".  The sentence
quoted above is needlessly redundant and there is no such
word as "univocal" (as far as I know).

My apologies to the authors of that page for mentioning 
this in a public forum.  I make typos, too.

J. M. Sykes wrote:

 I'm relieved to find that OED and Webster agree, though 
 note that the OED recognises that transcribe is sometimes 
 used as a synonym of transliterate.

Perhaps it is sometimes mis-used as a synonym, I'm tempted
to say, but must bow to the higher authority of the Oxford
English Dictionary.

 Experts redefine words at the risk of confusing non-experts; 
 when they do, they should not be surprised at the ensuing 
 confusion -- they brought it on themselves.

This is an excellent point, thank you for making it.

Best regards,

James Kass.







Re: Unicode transliterations (and other operations)

2001-07-04 Thread John Cowan

James Kass scripsit:

 Does the vocabulary make things clearer or cause confusion?
 If we need to distinguish between reversible script conversion
 and irreversible script conversion, could we not simply say
 reversible script conversion and so forth?

No, that does not capture the distinction.  In transliteration, we
are mapping one script to another in a language-independent way.
In transcription, we are mapping the writing conventions of one
language to those of another.

Handy example:  the name of the country written Myanmar (in
transliteration) is pronounced ['b@m@].  This was transcribed
into (British) English as Burma.

Of course, to represent the pronunciation I am using an ASCII
transliteration of IPA!


-- 
John Cowan   [EMAIL PROTECTED]
One art/there is/no less/no more/All things/to do/with sparks/galore
--Douglas Hofstadter




RE: Unicode transliterations (and other operations)

2001-07-03 Thread jarkko . hietaniemi



Looks interesting. How are you approaching the complication that transliteration
is between pairs of languages?
E.g. Russian to English, Russian to French, Russian to German, and Russian to
Finnish, all these are slightly different (as far as I know), because the
goal of transliteration is to create something that is pronounceable in the
target language but still close enough to the pronunciation of the origin
language.




Re: Unicode transliterations (and other operations)

2001-07-03 Thread Markus Scherer

 Looks interesting.  How are you approaching the complication that transliteration is 
between pairs of languages?

I know what you mean: Gorbachev is Gorbatschow in German.

I think that the rules that we have in ICU are probably English-centric where it makes 
a difference.
Note that some of the transliterator functions like uppercasing and any-name are just 
wrappers around Unicode functions, and so not language-dependent.

The strength of the API is that you can roll your own rules at runtime and at
compile-time. If you have different rules for Finnish as a target language for
transliteration, then you can modify the ICU rules or supply a whole different set of
your own.
The rules are written somewhat similarly to regular expressions.
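
To illustrate what "a whole different set" per target language might look like, here is a toy Python sketch. It does not use ICU or its rule syntax; the tables are hypothetical, cover only the letters of one name, and were chosen so that they reproduce the English, German and Finnish spellings quoted in this thread.

```python
# Shared letters plus per-target-language overrides (toy data only).
BASE = {"г": "g", "о": "o", "р": "r", "б": "b", "а": "a"}
PER_LANGUAGE = {
    "en": {"ч": "ch", "ё": "e", "в": "v"},
    "de": {"ч": "tsch", "ё": "o", "в": "w"},
    "fi": {"ч": "tš", "ё": "o", "в": "v"},
}


def transcribe(word: str, target: str) -> str:
    table = {**BASE, **PER_LANGUAGE[target]}
    out = []
    for ch in word:
        lat = table.get(ch.lower(), ch)   # unknown characters pass through
        out.append(lat.capitalize() if ch.isupper() else lat)
    return "".join(out)


for lang in ("en", "de", "fi"):
    print(lang, transcribe("Горбачёв", lang))
```

Swapping in another override table changes the output without touching the shared part, which is the property the paragraph above is pointing at.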

See the (draft, somewhat outdated) user guide chapter: 
http://oss.software.ibm.com/icu/userguide/Transliteration.html
and the API references: 
http://oss.software.ibm.com/icu/apiref/class_Transliterator.html and 
http://oss.software.ibm.com/icu/apiref/utrans_h.html

markus




Re: Unicode transliterations (and other operations)

2001-07-03 Thread Mark Davis

As Markus says, one can do that right now, by making your own (say)
German-Serbian transliterator, one that is different from Latin-Cyrillic,
Latin-Serbian, or German-Cyrillic. In ICU 2.0, we are examining the
possibility of a lookup hierarchy, similar to the resource hierarchy, that
would allow us to organize them more effectively. Our goal for the
script-script rules will be to try to be as neutral as we can, while
preserving round-tripping. See Guidelines for... in (the slightly
out-of-date) http://oss.software.ibm.com/icu/userguide/Transliteration.html

We are also adding variant tags, since there are many transliteration
schemes that are not associated with language per se, but rather with a
particular standard. For example, Latin-Greek/ISO-843. Since the goal for
these rule sets will be to match the standard, they will not, in general,
roundtrip.

Also, here are some responses to a private mail I got on my original
message.

  Горбачев, Михаил = Gorbachèv, Mìkhaìl

 Hmmm.
 First, is it Горбачев, or Горбачёв ?

These were names given to us by our Russian center, so I assume it is
correct (but don't know otherwise).

 Then, your transliteration uses grave accents, which I never saw for Russian
 (or even Cyrillic).

The Cyrillic and Devanagari rules are preliminary. We'll be fixing those
once we get some more of the code features in place. For Devanagari, we
already have an interindic representation, that goes to and from all of
the indic scripts. We will be developing a Latin-Interindic that lets us
get from Latin to (and from) interindic, which can then pivot to (and from)
the others.
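
The pivot idea can be sketched in Python. This is not ICU's interindic representation, whose details are not given here. It relies only on the fact that the Unicode blocks for the major Indic scripts were laid out in parallel, so the offset of a character inside its block is used as a crude pivot value. Not every offset is assigned in every script, so real rules need exception handling that this sketch omits.

```python
# Start of each 128-code-point block in Unicode.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gujarati": 0x0A80,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
}
BLOCK_SIZE = 0x80


def to_pivot(ch: str, script: str) -> int:
    offset = ord(ch) - BLOCK_START[script]
    if not 0 <= offset < BLOCK_SIZE:
        raise ValueError(f"{ch!r} is not in the {script} block")
    return offset


def from_pivot(offset: int, script: str) -> str:
    # May produce an unassigned code point for some offsets.
    return chr(BLOCK_START[script] + offset)


def convert(text: str, source: str, target: str) -> str:
    out = []
    for ch in text:
        try:
            out.append(from_pivot(to_pivot(ch, source), target))
        except ValueError:
            out.append(ch)    # spaces, digits, other scripts pass through
    return "".join(out)


print(convert("\u0915", "Devanagari", "Bengali"))  # KA in Bengali
```

With a pivot, each script needs one mapping in and one mapping out, rather than a separate rule set for every pair of scripts.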

And here are some pages that might be of interest:
 - Transliteration of Non-Roman Alphabets and Scripts
[http://homepage.mac.com/sirbinks/translit.html]
 - TC46 Transliteration Links [http://www.elot.gr/tc46sc2/bookmarks.html]
 - UN Working Group on Geographical Names [http://www.eki.ee/wgrs]

Mark








Re: Unicode transliterations (and other operations)

2001-07-03 Thread Vladimir Weinstein

I trust that 'moving' a name or a term between languages would be called 
transcription, not transliteration. Transliteration just tries to 'move' from script 
to script.

Markus Scherer writes:
   Looks interesting.  How are you approaching the complication that transliteration 
 is between pairs of languages?
  
  I know what you mean: Gorbachev is Gorbatschow in German.

This would then be an example of transcription, which differs on a language-pair basis,
as it tries to get the speakers to pronounce the same word.


  
  I think that the rules that we have in ICU are probably English-centric where it 
 makes a difference.


V.

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  Cupertino, CA,  [EMAIL PROTECTED]







RE: Unicode transliterations (and other operations)

2001-07-03 Thread jarkko . hietaniemi

 I know what you mean: Gorbachev is Gorbatschow in German.

Gorbatsov in Finnish transliteration, the ch would be very unwieldy
for a Finnish mouth.  (The s is used solely in transliteration, not
in Finnish proper.) 

 I think that the rules that we have in ICU are probably 
 English-centric where it makes a difference.
 Note that some of the transliterator functions like 
 uppercasing and any-name are just wrappers around Unicode 
 functions, and so not language-dependent.
 
 The strength of the API is that you can roll your own rules 
 at runtime and at compile-time. If you have different rules 
 for Finnish as a target language for transliteration, then 
 you can modify the ICU rules or supply a whole different set 
 for your own.
 The rules are written somewhat similarly to regular expressions.
 
 See the (draft, somewhat outdated) user guide chapter: 
 http://oss.software.ibm.com/icu/userguide/Transliteration.html

One thing you could update in this page is the very first line :-)
where it is claimed that transliteration is between scripts...




RE: Unicode transliterations (and other operations)

2001-07-03 Thread jarkko . hietaniemi

 From: ext [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, July 03, 2001 2:56 PM
 To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: RE: Unicode transliterations (and other operations)
 
 
  I know what you mean: Gorbachev is Gorbatschow in German.
 
 Gorbatsov in Finnish transliteration, the ch would be very unwieldy

Grrr.  Something ate the caron from the s in ts...

 for a Finnish mouth.  (The s is used solely in transliteration, not
 in Finnish proper.)

...just like in here, Finnish does have s...
 




Re: Unicode transliterations (and other operations)

2001-07-03 Thread てんどうりゅうじ
So if I was trying to write my fake name in Polish, or for a Pole to read, I would 
write it as "Tendou Rjuud{U+017E}i"?

That would be transliteration, right?


らんま ★じゅういっちゃん★
　×あかね







Re: Unicode transliterations (and other operations)

2001-07-03 Thread DougEwell2

In a message dated 2001-07-03 21:06:50 Pacific Daylight Time, [EMAIL PROTECTED] 
writes:

 So if I was trying to write my fake name in Polish, or for a Pole to read,
 I would write it as Tendou Rjuud{U+017E}i?

 That would be transliteration, right?

Maybe not.  This is the part I got wrong several weeks ago when we had this 
discussion, and I hope my understanding is better now.

Transliteration is about building a reversible mapping between the original 
(in this case, Japanese) sounds and a set of (in this case, Latin) 
characters, with the focus on reversibility rather than legibility.  You 
might even use numbers or other symbols to ensure that the transliterated 
version can be mapped unambiguously back to Japanese.  The reader might have 
to go through a learning curve to equate your symbols with the desired sounds.

Transcription is about optimizing the Latin-script version for, say, a 
Polish-language reader.  A transcription has not only a target script but 
also a target language, and it might be different for each of Polish, German, 
French, English, etc.  The goal is enabling the Polish reader to pronounce 
the Japanese text with a minimal learning curve.

A classic example of Russian-to-X transcription (where X is some Latin-script 
language) is a well-known name like Khrushchev or Gorbachev.  Here the 
spellings I have used are those that would likely lead an English speaker to 
pronounce the names reasonably correctly.  A transcription intended for 
German speakers might be Khruschtschow.  None of these would be a proper 
transliteration, because they are not completely reversible (the 'shch' and
'schtsch' combinations could be U+0449 or U+0448 plus U+0447).
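
The reversibility point can be checked mechanically. In this Python sketch the first table follows the English-style spellings above, and the second is an illustrative one-character-per-letter table (the particular Latin letters are chosen only for the example). The function name romanize is hypothetical.

```python
# ш = U+0448, ч = U+0447, щ = U+0449
ENGLISH_STYLE = {"ш": "sh", "ч": "ch", "щ": "shch"}
ONE_TO_ONE = {"ш": "š", "ч": "č", "щ": "ŝ"}


def romanize(text: str, table: dict) -> str:
    return "".join(table.get(ch, ch) for ch in text)


# Two different Cyrillic strings collapse to the same Latin string,
# so the English-style mapping cannot be inverted unambiguously.
print(romanize("щ", ENGLISH_STYLE) == romanize("шч", ENGLISH_STYLE))  # True

# With one Latin character per Cyrillic letter they stay distinct.
print(romanize("щ", ONE_TO_ONE) == romanize("шч", ONE_TO_ONE))        # False
```

The second table is harder for an English reader to pronounce, which is the trade-off between reversibility and legibility described above.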

Unfortunately, the terms transcription and transliteration are commonly 
mixed up by non-experts, causing much confusion.

Please, somebody let me know if this is still not right.

-Doug Ewell
 Fullerton, California