Re: Unicode transliterations (and other operations)

Mark Davis Tue, 03 Jul 2001 11:01:24 -0700
As Markus says, one can do that right now, by making your own (say)
"German-Serbian" transliterator, one that is different from Latin-Cyrillic,
Latin-Serbian, or German-Cyrillic. In ICU 2.0, we are examining the
possibility of a lookup hierarchy, similar to the resource hierarchy, that
would allow us to organize them more effectively. Our goal for the
"script-script" rules will be to try to be as neutral as we can, while
preserving round-tripping. See "Guidelines for..." in (the slightly
out-of-date) http://oss.software.ibm.com/icu/userguide/Transliteration.html

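To make the lookup idea concrete, here is a minimal sketch in plain Python
(not the ICU API). The registry contents, the language-to-script table, and
the fallback order are all assumptions for illustration; the message above
only says that such a hierarchy is being examined.

```python
# Hypothetical registry: transliterator ID -> description of a rule set.
REGISTRY = {
    "Latin-Cyrillic": "generic script-to-script rules",
    "German-Serbian": "rules for a German source and a Serbian target",
}

# Hypothetical table giving the script each language is written in.
SCRIPT_OF = {"German": "Latin", "Serbian": "Cyrillic"}


def candidates(source, target):
    """Yield IDs from most to least specific (assumed fallback order)."""
    src_script = SCRIPT_OF.get(source, source)
    tgt_script = SCRIPT_OF.get(target, target)
    seen = set()
    for s, t in (
        (source, target),          # e.g. German-Serbian
        (source, tgt_script),      # e.g. German-Cyrillic
        (src_script, target),      # e.g. Latin-Serbian
        (src_script, tgt_script),  # e.g. Latin-Cyrillic
    ):
        ident = f"{s}-{t}"
        if ident not in seen:
            seen.add(ident)
            yield ident


def lookup(source, target):
    """Return the first registered ID along the fallback chain."""
    for ident in candidates(source, target):
        if ident in REGISTRY:
            return ident
    raise KeyError(f"no transliterator found for {source}-{target}")
```

With this registry, lookup("German", "Serbian") finds the specific entry,
while lookup("German", "Cyrillic") falls back to "Latin-Cyrillic".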
We are also adding variant tags, since there are many transliteration
schemes that are not associated with language per se, but rather with a
particular standard. For example, "Latin-Greek/ISO-843". Since the goal for
these rule sets will be to match the standard, they will not, in general,
roundtrip.

Also, here are some responses to a private mail I got on my original
message.

> > "Горбачев, Михаил" => "Gorbachèv, Mìkhaìl"
>
> Hmmm.
> First, is it Горбачев, or Горбачёв ?

These were names given to us by our Russian center, so I assume they are
correct (but I don't know for certain).

> Then, your transliteration uses grave accents, which I never saw for
> Russian (or even Cyrillic).

The Cyrillic and Devanagari rules are preliminary. We'll be fixing those
once we get some more of the code features in place. For Devanagari, we
already have an "interindic" representation that goes to and from all of
the Indic scripts. We will be developing a "Latin-Interindic" that lets us
get from Latin to (and from) interindic, which can then pivot to (and from)
the others.
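
The pivot idea can be sketched in a few lines of plain Python. This is not
ICU's interindic data: it assumes only that the Unicode blocks for these
scripts are laid out in parallel, so a character's offset within its block
can serve as a stand-in pivot value. It ignores unassigned code points and
script-specific exceptions, which real rules would have to handle.

```python
# Start of each script's Unicode block; every block here is 0x80 wide.
BLOCK_BASE = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Tamil": 0x0B80,
}
BLOCK_SIZE = 0x80


def to_pivot(ch, script):
    """Return the offset of ch within the script's block, or None."""
    offset = ord(ch) - BLOCK_BASE[script]
    return offset if 0 <= offset < BLOCK_SIZE else None


def from_pivot(offset, script):
    """Map a pivot offset back into the given script's block."""
    return chr(BLOCK_BASE[script] + offset)


def convert(text, source, target):
    """Convert source-script text to the target script via the pivot."""
    out = []
    for ch in text:
        offset = to_pivot(ch, source)
        # Characters outside the source block pass through unchanged.
        out.append(ch if offset is None else from_pivot(offset, target))
    return "".join(out)
```

The point of the design is that N scripts need only N mappings to and from
the pivot, rather than one mapping per pair of scripts.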

And here are some pages that might be of interest:
 - Transliteration of Non-Roman Alphabets and Scripts
[http://homepage.mac.com/sirbinks/translit.html]
 - TC46 Transliteration Links [http://www.elot.gr/tc46sc2/bookmarks.html]
 - UN Working Group on Geographical Names [http://www.eki.ee/wgrs]

Mark

----- Original Message -----
From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "unicode" <[EMAIL PROTECTED]>
Sent: Tuesday, July 03, 2001 10:00
Subject: Re: Unicode transliterations (and other operations)


> > Looks interesting.  How are you approaching the complication that
> > transliteration is between pairs of languages?
>
> I know what you mean: Gorbachev is Gorbatschow in German.
>
> I think that the rules that we have in ICU are probably English-centric
> where it makes a difference.
> Note that some of the transliterator functions like uppercasing and
> any-name are just wrappers around Unicode functions, and so not
> language-dependent.
>
> The strength of the API is that you can roll your own rules at runtime and
> at compile-time. If you have different rules for Finnish as a target
> language for transliteration, then you can modify the ICU rules or supply a
> whole different set of your own.
> The rules are written somewhat similarly to regular expressions.
>
> See the (draft, somewhat outdated) user guide chapter:
> http://oss.software.ibm.com/icu/userguide/Transliteration.html
> and the API references:
> http://oss.software.ibm.com/icu/apiref/class_Transliterator.html and
> http://oss.software.ibm.com/icu/apiref/utrans_h.html
>
> markus
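
The point about supplying a different rule set per target language can be
illustrated with a small table-driven sketch in plain Python. It does not
use the ICU rule syntax or API, and both rule tables are simplified
assumptions that follow no particular standard. The input uses the
Горбачёв spelling so that the German table can show a multi-character rule.

```python
def make_transliterator(rules):
    """Build a function that applies the rules, longest match first."""
    # Try longer source strings first so "чё" wins over "ч" alone.
    keys = sorted(rules, key=len, reverse=True)

    def transliterate(text):
        out = []
        i = 0
        while i < len(text):
            for key in keys:
                if text.startswith(key, i):
                    out.append(rules[key])
                    i += len(key)
                    break
            else:
                # No rule matched: copy the character unchanged.
                out.append(text[i])
                i += 1
        return "".join(out)

    return transliterate


# Illustrative rules shared by both targets (an assumption, not a standard).
COMMON = {"Г": "G", "о": "o", "р": "r", "б": "b", "а": "a"}

# English-centric choices for the remaining letters.
ENGLISH = {**COMMON, "ч": "ch", "ё": "e", "в": "v"}

# German-centric choices, including one two-character rule.
GERMAN = {**COMMON, "ч": "tsch", "чё": "tscho", "ё": "jo", "в": "w"}

to_english = make_transliterator(ENGLISH)
to_german = make_transliterator(GERMAN)

NAME = "Горбачёв"
print(to_english(NAME))  # Gorbachev
print(to_german(NAME))   # Gorbatschow
```

Swapping the table is all it takes to change the target convention, which
is the kind of flexibility described above.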