Re: DUCET and supplementary foldings (was: Looking for transcription or transliteration standards latin- arabic)

2004-07-13 Thread Philippe Verdy
From: Asmus Freytag [EMAIL PROTECTED]
 I have a certain sympathy for the idea of designing UCA so that the
 untailored *default* works for such kind of multilingual usage. However,
 the other use of the DUCET is to be the most convenient base for applying
 all tailorings. I have a certain sympathy for the position that claims that
 there are important, but perhaps specialized or not economically powerful
 classes of users that will not likely have access to a tailored UCA for
 their language or writing system.

 If that is really the case, i.e. appreciable numbers of smaller languages
 would be able to survive without tailoring, then the alternative to fixing
 the DUCET could be a separate publication of a common base tailoring for
 multilingual data access. (A base tailoring would be applied before further
 tailoring for a specific language).

I very much appreciate this analysis. The DUCET effectively has two intended
uses whose purposes are opposed. If it is used as a base collation from which
a language-specific collation can be built with just a few rules, then the
other common use, multilanguage searching, is not easy to build on it.

Maybe we could think about designing a new standard collation tailoring
table that could be used as an alternative to the DUCET, but targeting
multilanguage searches.

Such a tailoring would include more folding than the DUCET, moving the
differences to a higher weight level. It could be given a name (MUCET, for
Multilanguage Unicode Collation Elements Table?) and be supported as
well.

The DUCET is now quite stable and there's no need to change it, as it is
well known and certainly used in many applications that depend on it (RDBMS
engines notably). But a MUCET would certainly be useful, including for users
who would no longer need to search for multiple words in a multilanguage
database or simply on the web. In addition, nothing prevents sorting the
matching entries by relevance, using the DUCET as a secondary collation
order.

After all, a collation elements table works exactly like a custom
decomposition table that creates additional strings whose encoding is not
portable, because it depends on weight values. Using custom decompositions is
often much simpler than implementing a multilevel collation, since it can
reuse existing algorithms implemented for NFD and NFKD decompositions. In
such a view, some extra decompositions are needed, using non-standard Unicode
characters for some elements (for example, decomposing an AE letter into a
ligature with an extra custom control that has a higher collation level, to be
used only for full collation order but ignorable for searches limited to level
1 or 2).
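
The idea of folding through a custom decomposition table, then ordering the
hits, could be sketched roughly as follows. This is only an illustration:
CUSTOM_DECOMPOSITIONS, fold and search are hypothetical names, the table
holds just a few entries, and plain code point order stands in for the DUCET
ordering of the matches.

```python
import unicodedata

# Hypothetical extra decompositions for letters that NFD leaves intact.
CUSTOM_DECOMPOSITIONS = {
    "Ł": "L", "ł": "l",
    "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae",
}

def fold(text):
    """Apply the custom table, then NFD, then drop combining marks and case."""
    expanded = "".join(CUSTOM_DECOMPOSITIONS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", expanded)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def search(query, entries):
    """Return entries whose folded form equals the folded query.

    Exact matches come first; the rest follow in code point order, which is
    only a stand-in for a real secondary collation order.
    """
    wanted = fold(query)
    hits = [entry for entry in entries if fold(entry) == wanted]
    return sorted(hits, key=lambda entry: (entry != query, entry))

results = search("Dvořák", ["Dvorak", "Dvořák", "Dvorák", "Mozart"])
```

Here the folded key plays the role of a level 1 comparison, while the final
sort is where a full multilevel order would be applied.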




User Expectations for collation (was Re: Looking for transcription or transliteration standards latin-arabic)

2004-07-12 Thread Mark Davis
These provide good examples. It would be interesting to see, of the people
on the [EMAIL PROTECTED] list, how many non-Poles would expect to find the
following orders:

Ab < Ąb < Ac
Eb < Ęb < Ec
Ob < Ób < Oc

Ce < Će < Cy
Ne < Ńe < Ny
Sa < Śa < Sy
Za < Źa < Zy
Za < Ża < Zy

and either (a) or (b):

a) La < Ła < Ly  // interleaved
b) La < Ly < Ła  // non-interleaved

Mark

- Original Message - 
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 10, 2004 01:02
Subject: Re: Looking for transcription or transliteration standards
latin-arabic


 In a message of Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:

  o-slash, can be analyzed as o and slash, even though that's not done
  canonically in Unicode. Allowing users outside Scandinavia to perform
  fuzzy  searches for words with this character is useful.
 
  In this view of folding, Language-specific fuzzy searches would be tailored
  (usually by being based on collation information, rather than on generic
  diacritic folding).

 In Polish letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the
 corresponding letters without. Omitting diacritics is an error, even
 though text without them is generally readable. They are removed when
 the given protocol requires or encourages ASCII (e.g. filenames to be
 used in URLs, login names, variable names in programming languages,
 ancient computer systems). There is no alternate spelling scheme like
 German AE/OE/UE/SS.

 Polish letters are never folded when sorting lexicographically. This
 applies to Ł in the same way as to the other eight letters. Foreign
 diacritics are always folded though, at least I don't remember seeing
 any other case. I think Ó would be folded together with O in an
 encyclopaedia if this is a foreign O with some accent, unrelated to
 Polish Ó which is a separate letter (can you suggest some non-Polish
 word starting with Ó which could be found in an encyclopaedia?).

 But there are cases when I would prefer to fold Polish diacritics in
 searches.

 It's basically every case when you are not sure that all stored data is
 using diacritics, for example in generic WWW searching. There are still
 people who don't use diacritics in usenet and email, or in entries in
 guest books and other unprofessional web content. There are even
 sometimes people who insist that Polish letters *should not* be used in
 usenet and email because some computer systems can't handle them.
 Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
 between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
 (because of laziness). This is why for searching archives of unknown
 data it's generally better to fold them.

 As far as I know, the default UCA folds these letters except Ł, and
 standard Polish tailoring doesn't fold any Polish letter. While not
 folding them in searching is technically correct and nobody would be
 surprised that they are not folded, it's often more useful to fold them
 and people would be pleasantly surprised if they don't have to repeat
 the search with omitted diacritics.

 If one wants to find data containing a word, rather than collect
 statistics about usage of a word with and without diacritics, it's very
 rare that folding does some harm.

 Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
 I will be happy to find occurrences of JEZYK too (non-existing word,
 must have had diacritics stripped), but it makes no sense to return
 JEŻYK (another existing word). It's not just making the letters
 equivalent.

 -- 
__( Marcin Kowalczyk
\__/   [EMAIL PROTECTED]
 ^^ http://qrnik.knm.org.pl/~qrczak/








Re: Looking for transcription or transliteration standards latin- arabic

2004-07-12 Thread Asmus Freytag
At 01:02 AM 7/10/2004, Marcin 'Qrczak' Kowalczyk wrote:
 But there are cases when I would prefer to fold Polish diacritics in
 searches.
 It's basically every case when you are not sure that all stored data is
 using diacritics,

Or when you are unsure how it is spelled, for example, looking up a
personal or geographic name you are not familiar with.

The discussion started around the case where searching is not localized 
(tailored) to the language, which, by definition means that users will not 
be familiar with the spelling of the items they are trying to retrieve.

 If one wants to find data containing a word, rather than collect
 statistics about usage of a word with and without diacritics, it's very
 rare that folding does some harm.
 Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
 I will be happy to find occurrences of JEZYK too (non-existing word,
 must have had diacritics stripped), but it makes no sense to return
 JEŻYK (another existing word). It's not just making the letters
 equivalent.

There are other types of searches than 'google'. One example is searches
for station names on services such as http://www.bahn.de. Unlike
air-travel sites, the number of destinations (all across Europe, by the 
way), is huge, as the site also includes commuter train services.

They've changed their search algorithm a number of times over the years, 
but at one time, you could enter a destination without diacritics and it 
would attempt to match that to the list of known station names. In case of 
multiple hits it would give you a list to pick from. They also supported 
alternative non-native names (such as Cologne). I haven't used it in a 
while, so I don't know what they support today, but when I did, I found it 
very useful in looking up destinations.
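
The kind of lookup described above might be sketched like this. It is not a
description of how bahn.de actually works: the station list, the EXONYMS
table and the prefix-matching rule are all assumptions made up for
illustration.

```python
import unicodedata

# Hypothetical data: a few station names and non-native names for cities.
STATIONS = ["Köln Hbf", "Köln Messe/Deutz", "München Hbf", "Kolding"]
EXONYMS = {"cologne": "Köln", "munich": "München"}

def fold(text):
    """Drop combining marks after NFD and ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def lookup(user_input):
    """Return every station whose folded name starts with the folded input.

    A single hit can be used directly; several hits would be shown to the
    user as a list to pick from.
    """
    key = fold(user_input.strip())
    key = fold(EXONYMS.get(key, key))  # map a known exonym to the native name
    return [name for name in STATIONS if fold(name).startswith(key)]

candidates = lookup("koln")
```

Alternate spellings such as "Koeln" are deliberately not handled in this
sketch; they would need an additional rule or table.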

I have a certain sympathy for the idea of designing UCA so that the 
untailored *default* works for such kind of multilingual usage. However, 
the other use of the DUCET is to be the most convenient base for applying 
all tailorings. I have a certain sympathy for the position that claims that 
there are important, but perhaps specialized or not economically powerful 
classes of users that will not likely have access to a tailored UCA for 
their language or writing system.

If that is really the case, i.e. appreciable numbers of smaller languages 
would be able to survive without tailoring, then the alternative to fixing 
the DUCET could be a separate publication of a common base tailoring for 
multilingual data access. (A base tailoring would be applied before further 
tailoring for a specific language).

A./




Re: User Expectations for collation (was Re: Looking for transcription or transliteration standards latin-arabic)

2004-07-12 Thread Asmus Freytag
I missed Mark's change in subject - so I replied to Marcin's message right 
now under the old subject line:

- Original Message -
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 10, 2004 01:02
Subject: Re: Looking for transcription or transliteration standards
latin-arabic
 In a message of Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:

  o-slash, can be analyzed as o and slash, even though that's not done
  canonically in Unicode. Allowing users outside Scandinavia to perform
  fuzzy  searches for words with this character is useful.
 
  In this view of folding, Language-specific fuzzy searches would be tailored
  (usually by being based on collation information, rather than on generic
  diacritic folding).

 In Polish letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the
 corresponding letters without. Omitting diacritics is an error, even
 though text without them is generally readable. They are removed when
 the given protocol requires or encourages ASCII (e.g. filenames to be
 used in URLs, login names, variable names in programming languages,
 ancient computer systems). There is no alternate spelling scheme like
 German AE/OE/UE/SS.

 Polish letters are never folded when sorting lexicographically. This
 applies to Ł in the same way as to the other eight letters. Foreign
 diacritics are always folded though, at least I don't remember seeing
 any other case. I think Ó would be folded together with O in an
 encyclopaedia if this is a foreign O with some accent, unrelated to
 Polish Ó which is a separate letter (can you suggest some non-Polish
 word starting with Ó which could be found in an encyclopaedia?).

 But there are cases when I would prefer to fold Polish diacritics in
 searches.

 It's basically every case when you are not sure that all stored data is
 using diacritics, for example in generic WWW searching. There are still
 people who don't use diacritics in usenet and email, or in entries in
 guest books and other unprofessional web content. There are even
 sometimes people who insist that Polish letters *should not* be used in
 usenet and email because some computer systems can't handle them.
 Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
 between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
 (because of laziness). This is why for searching archives of unknown
 data it's generally better to fold them.

 As far as I know, the default UCA folds these letters except Ł, and
 standard Polish tailoring doesn't fold any Polish letter. While not
 folding them in searching is technically correct and nobody would be
 surprised that they are not folded, it's often more useful to fold them
 and people would be pleasantly surprised if they don't have to repeat
 the search with omitted diacritics.

 If one wants to find data containing a word, rather than collect
 statistics about usage of a word with and without diacritics, it's very
 rare that folding does some harm.

 Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
 I will be happy to find occurrences of JEZYK too (non-existing word,
 must have had diacritics stripped), but it makes no sense to return
 JEŻYK (another existing word). It's not just making the letters
 equivalent.





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:

 o-slash, can be analyzed as o and slash, even though that's not done 
 canonically in Unicode. Allowing users outside Scandinavia to perform 
 fuzzy  searches for words with this character is useful.
 
 In this view of folding, Language-specific fuzzy searches would be tailored 
 (usually by being based on collation information, rather than on generic 
 diacritic folding).

In Polish letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the
corresponding letters without. Omitting diacritics is an error, even
though text without them is generally readable. They are removed when
the given protocol requires or encourages ASCII (e.g. filenames to be
used in URLs, login names, variable names in programming languages,
ancient computer systems). There is no alternate spelling scheme like
German AE/OE/UE/SS.

Polish letters are never folded when sorting lexicographically. This
applies to Ł in the same way as to the other eight letters. Foreign
diacritics are always folded though, at least I don't remember seeing
any other case. I think Ó would be folded together with O in an
encyclopaedia if this is a foreign O with some accent, unrelated to
Polish Ó which is a separate letter (can you suggest some non-Polish
word starting with Ó which could be found in an encyclopaedia?).

But there are cases when I would prefer to fold Polish diacritics in
searches.

It's basically every case when you are not sure that all stored data is
using diacritics, for example in generic WWW searching. There are still
people who don't use diacritics in usenet and email, or in entries in
guest books and other unprofessional web content. There are even
sometimes people who insist that Polish letters *should not* be used in
usenet and email because some computer systems can't handle them.
Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
(because of laziness). This is why for searching archives of unknown
data it's generally better to fold them.

As far as I know, the default UCA folds these letters except Ł, and
standard Polish tailoring doesn't fold any Polish letter. While not
folding them in searching is technically correct and nobody would be
surprised that they are not folded, it's often more useful to fold them
and people would be pleasantly surprised if they don't have to repeat
the search with omitted diacritics.
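
One way to see why Ł stands apart is to look at canonical decomposition: of
the nine letters, Ł is the only one that NFD leaves as a single code point.
Whether that is the actual reason the default table treats it differently is
an assumption here; the snippet only shows the decompositions, and the names
POLISH, decomposition and undecomposed are made up for the example.

```python
import unicodedata

# The nine Polish letters with diacritics, forced to precomposed form.
POLISH = unicodedata.normalize("NFC", "ĄĆĘŁŃÓŚŹŻ")

def decomposition(letter):
    """List the code points of the letter's canonical decomposition (NFD)."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", letter)]

# Letters that NFD does not split into base letter plus combining mark.
undecomposed = [
    letter for letter in POLISH
    if unicodedata.normalize("NFD", letter) == letter
]

for letter in POLISH:
    print(letter, decomposition(letter))
```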

If one wants to find data containing a word, rather than collect
statistics about usage of a word with and without diacritics, it's very
rare that folding does some harm.

Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
I will be happy to find occurrences of JEZYK too (non-existing word,
must have had diacritics stripped), but it makes no sense to return
JEŻYK (another existing word). It's not just making the letters
equivalent.
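
A rough sketch of such a one-directional match, as opposed to a symmetric
folding, might look like the following. The rule used (a stored letter may
equal the query letter or its stripped form, but never carry a different
diacritic) is just one possible reading of the example above, and EXTRA,
strip_letter and matches are hypothetical names.

```python
import unicodedata

# Letters with no canonical decomposition need an explicit entry.
EXTRA = {"Ł": "L", "ł": "l"}

def strip_letter(ch):
    """Return the letter without its diacritic."""
    if ch in EXTRA:
        return EXTRA[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def matches(query, stored):
    """True if stored equals query, possibly with some diacritics stripped."""
    q = unicodedata.normalize("NFC", query)
    s = unicodedata.normalize("NFC", stored)
    if len(q) != len(s):
        return False
    return all(sc == qc or sc == strip_letter(qc) for qc, sc in zip(q, s))

def symmetric_fold(text):
    """Plain folding, shown only for contrast."""
    return "".join(strip_letter(ch) for ch in unicodedata.normalize("NFC", text))

# JĘZYK finds itself and the stripped JEZYK, but not the different word JEŻYK.
found = [w for w in ["JĘZYK", "JEZYK", "JEŻYK"] if matches("JĘZYK", w)]
```

With symmetric folding all three words collapse to JEZYK, which is exactly
the unwanted result. Note that in this sketch a query typed without
diacritics does not find words that have them; whether it should is a
separate design choice.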

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread D. Starner
 transliteration is no longer needed or useful. Transliteration 
 is a one-to-one mapping between scripts, and the reader needs to be familiar 
 with both scripts and the transliteration rules to make sense of it. 

That's not true. Looking at Wright's Historical German Grammar, I
see Goth. baírand, OHG. bërant=Skr. bháranti. It would be illegible
to me, and probably many Germanists, if it were written in three
scripts instead of one. Using foreign scripts is rarely of help to
the casual reader, especially in the frequent cases where it's not
important that they understand the details of the transliteration scheme.




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Simon Montagu
Jony Rosenne wrote:
Cologne is not a transliteration of Köln but the English name of the city, just as Munich,
 Rome, Moscow, The Hague, Longhorn, Venice, Jaffa and Jerusalem.
   
Would that be the English name for Windows Ligorno?



RE: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Jony Rosenne
Sorry, I meant Leghorn.

Jony

 -Original Message-
 From: Simon Montagu [mailto:[EMAIL PROTECTED] 
 Sent: Friday, July 09, 2004 9:19 AM
 To: Jony Rosenne
 Cc: [EMAIL PROTECTED]
 Subject: Re: Looking for transcription or transliteration 
 standards latin- arabic
 
 
 Jony Rosenne wrote:
  Cologne is not a transliteration of Köln but the English name of the
  city, just as Munich, Rome, Moscow, The Hague, Longhorn, Venice, Jaffa
  and Jerusalem.
 
 
 Would that be the English name for Windows Ligorno?
 
 
 





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread John Cowan
Jony Rosenne scripsit:

 I doubt it makes much sense to the casual reader. Witness how nearly every
 radio and television pronounces New Delhi as New Del-hi.

O pity the poor poor Zippity,
For he can eat nothing but Greli,
   A plant that grows only
   In New Caledony,
While the Zippity lives in New Delhi.
--Shel Silverstein

-- 
Take two turkeys, one goose, four  John Cowan
cabbages, but no duck, and mix them http://www.ccil.org/~cowan
together. After one taste, you'll duck  [EMAIL PROTECTED]
soup the rest of your life.http://www.reutershealth.com
--Groucho



RE: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Jony Rosenne


 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of D. Starner
 Sent: Friday, July 09, 2004 9:13 AM
 To: [EMAIL PROTECTED]
 Subject: RE: Looking for transcription or transliteration 
 standards latin- arabic
 
 
  transliteration is no longer needed or useful. Transliteration
  is a one-to-one mapping between scripts, and the reader needs to be
  familiar with both scripts and the transliteration rules to make
  sense of it.
 
 That's not true. Looking at Wright's Historical German Grammar, I
 see Goth. baírand, OHG. bërant=Skr. bháranti. It would be
 illegible to me, and probably many Germanists, if it were
 written in three scripts instead of one. Using foreign
 scripts is rarely of help to the casual reader, especially in
 the frequent cases where it's not important that they understand
 the details of the transliteration scheme.

I doubt it makes much sense to the casual reader. Witness how nearly every
radio and television pronounces New Delhi as New Del-hi.

Jony

 
 
 
 





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Peter Kirk
On 09/07/2004 01:41, Michael (michka) Kaplan wrote:
From: Michael Everson [EMAIL PROTECTED]
 

I think it's stupid (in general) to argue for stripping a letter of
diacritics. If a reader is ignorant of their meaning, that can be
cured. But if they are meaningful, stripping them is just misspelling
the words they belong to. Why would anyone want to do that?
   

I think it's inadvisable (in general) to call things stupid merely because
one does not see the need. On the whole, that is a better time to ask the
question than to make the judgment.
There is actually a great deal of both European and American data (in
programs like Microsoft Exchange and Outlook, as well as in web search)
showing that folding away diacritics as a part of giving full lists of
possible matches is indeed preferred by users. Now they would (also) prefer
the exact matches to have priority, but having additional matches without the
diacritics is a common request, and one that has been built into many
scenarios.
 

It seems to me that you two Michaels are talking at cross purposes.
Everson was apparently referring to the practice of stripping diacritics 
from foreign words as rendered typographically, e.g. in magazines and 
presumably online texts. And I tend to agree with him (from my European 
perspective) that this is unnecessary. On the other hand, if some people 
want to do it, they should not be prevented.

But Kaplan is referring to something quite different, optionally 
ignoring diacritics in search operations. This is indeed desirable, so 
that a single search can match both Dvorak and Dvořák for example, and 
so that the one doing the search does not need to remember exactly which 
diacritics are used in the name. And it is already covered by the 
Unicode collation algorithm and default table, in which diacritics are 
distinguished only at the second level and so folded by a top level only 
collation.
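
As a toy illustration of what "folded by a top level only collation" means,
the sketch below builds a two-level key in which base letters form the first
level and combining marks the second. It is not the UCA and does not use
DUCET weights; collation_key and its strength parameter are hypothetical, and
case is simply ignored rather than given a level of its own.

```python
import unicodedata

def collation_key(text, strength=2):
    """Build a simplified sort key.

    Level 1 holds the base letters (case ignored), level 2 the combining
    marks found after NFD. With strength=1 only level 1 is kept, so strings
    differing only in diacritics get equal keys.
    """
    decomposed = unicodedata.normalize("NFD", text)
    primary = tuple(
        ch.casefold() for ch in decomposed if not unicodedata.combining(ch)
    )
    if strength == 1:
        return (primary,)
    secondary = tuple(ch for ch in decomposed if unicodedata.combining(ch))
    return (primary, secondary)

# Equal at the top level, distinct once the second level is considered.
same_at_level_1 = collation_key("Dvořák", 1) == collation_key("Dvorak", 1)
same_at_level_2 = collation_key("Dvořák", 2) == collation_key("Dvorak", 2)
ordered = sorted(["Dvořák", "Dvorak"], key=collation_key)
```

A real implementation would take its weights from a collation element table
and let the caller choose the comparison strength, which this key function
only imitates.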

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Michael Everson
At 17:43 -0700 2004-07-08, Mark Davis wrote:
  Why would anyone want to do that?
I tend to be with you on this, that it does little harm to retain accents.
However, most major periodic popular publications have this practice; for
example The Economist keeps accents for French, German, Spanish, Italian
words and names but discards others (as I recall).
I wouldn't consider that good typography, that's all I'm saying.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Ted Hopp
Pronunciation keys in dictionaries are a kind of transliteration. We still
need those (well, I do, at least).

Ted

On Friday, July 09, 2004 1:08 AM, Jony Rosenne wrote:
 Now that we have moved from the world of typewriters, that imposed technical
 constraints on the writer, such as being able to use only the limited set of
 characters implemented, to the world of Unicode which removes this
 constraint, transliteration is no longer needed or useful.


Ted Hopp, Ph.D.
ZigZag, Inc.
[EMAIL PROTECTED]
+1-301-990-7453

newSLATE is your personal learning workspace
   ...on the web at http://www.newSLATE.com/





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Mark Davis
Of course, that's true about Köln. My point was that after all this time,
the use of Dvorak or Tchaikovsky are *now* the English names for what
originated in a different language.

Mark

- Original Message - 
From: Jony Rosenne [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 22:12
Subject: RE: Looking for transcription or transliteration standards latin-
arabic




  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of Mark Davis
  Sent: Friday, July 09, 2004 3:43 AM
  To: [EMAIL PROTECTED]; Michael Everson
  Subject: Re: Looking for transcription or transliteration
  standards latin- arabic
 
 

 ...

 
  In one sense, using Dvorak in English for Dvořák is
  little different than using Cologne in English for Köln.
  Both are transcriptions into a form that has become more or
  less customary.

 Cologne is not a transliteration of Köln but the English name of the city,
 just as Munich, Rome, Moscow, The Hague, Longhorn, Venice, Jaffa and
 Jerusalem.

 Why a foreign city should have an English name is an interesting
 philosophical question, but not directly concerned with Unicode. This is
 however common in many languages.

 The transliteration of Köln would be Koln.
 

 Jony

 
  Mark
 








Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Mark Davis
Whether it is a matter of typography or not depends on what the input text
is. Setting the letters D v o ř á k as Dvorak would indeed be bad
typography. Setting the letters D v o r a k as Dvorak would be perfectly
fine typography.

Mark

- Original Message - 
From: Michael Everson [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, July 09, 2004 02:29
Subject: Re: Looking for transcription or transliteration standards latin-
arabic


 At 17:43 -0700 2004-07-08, Mark Davis wrote:
Why would anyone want to do that?
 
 I tend to be with you on this, that it does little harm to retain accents.
 However, most major periodic popular publications have this practice; for
 example The Economist keeps accents for French, German, Spanish, Italian
 words and names but discards others (as I recall).

 I wouldn't consider that good typography, that's all I'm saying.
 -- 
 Michael Everson * * Everson Typography *  * http://www.evertype.com





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Michael Everson
At 06:55 -0700 2004-07-09, Mark Davis wrote:
Of course, that's true about Köln. My point was that after all this time,
the use of Dvorak or Tchaikovsky are *now* the English names for what
originated in a different language.
I don't agree that Dvorak is the English name 
for the composer. But I don't agree that façade 
is correctly spelled in English without the ç 
either.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Jon Hanna
Quoting Michael Everson [EMAIL PROTECTED]:

 At 06:55 -0700 2004-07-09, Mark Davis wrote:
 Of course, that's true about Köln. My point was that after all this time,
 the use of Dvorak or Tchaikovsky are *now* the English names for what
 originated in a different language.
 
 I don't agree that Dvorak is the English name 
 for the composer. But I don't agree that façade 
 is correctly spelled in English without the ç 
 either.

Yes, Dvorak is the name of the American branch of the family, after they changed
the spelling of their name. It's not even pronounced the same. They have a
famous typewriter keyboard inventor in their line, but no famous composers.

-- 
Jon Hanna
http://www.hackcraft.net/
Write a wise saying and your name will live forever - Anonymous



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Michael \(michka\) Kaplan
From: Peter Kirk [EMAIL PROTECTED]

 But Kaplan is referring to something quite different, optionally
 ignoring diacritics in search operations. This is indeed desirable, so
 that a single search can match both Dvorak and Dvořák for example, and
 so that the one doing the search does not need to remember exactly which
 diacritics are used in the name. And it is already covered by the
 Unicode collation algorithm and default table, in which diacritics are
 distinguished only at the second level and so folded by a top level only
 collation.

(a) If this were true and it were the only need, then case folding would
also just be a UCA issue, yet case folding is in the document.

(b) Not everyone who uses Unicode uses the UCA (most of the corporate
member companies in Unicode -- including IBM -- had alternate collation
methods that existed prior to the UCA and which to this day support more
languages, in their databases and operating systems).

(c) Since the operation (diacritic folding) is a valid one that
implementations may want to do and the UCA is a UTS and thus not required
for Unicode conformance, it is a sensible folding operation to define.

Does diacritic folding destroy information provided by the distinctions that
diacritics provide? Of course it does. But then again, the same can be said
of all foldings. This does not diminish their potential usefulness in
specific tasks/operations.


MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies
Windows International Division




RE: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Mike Ayers






 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Michael Everson
 Sent: Friday, July 09, 2004 7:13 AM


 At 06:55 -0700 2004-07-09, Mark Davis wrote:
 Of course, that's true about Köln. My point was that after 
 all this time,
 the use of Dvorak or Tchaikovsky are *now* the English names for what
 originated in a different language.
 
 I don't agree that Dvorak is the English name 
 for the composer.


 The English name is, I think, a poor choice of words. Standard anglicization would be better.


 But I don't agree that façade 
 is correctly spelled in English without the ç 
 either.


 On this, we must resign ourselves to disagreement.



/|/|ike





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Mark Davis
 #CYRILLIC SMALL LETTER SHORT I WITH TAIL
049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH DESCENDER
04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE
04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS

Mark

- Original Message - 
From: Michael (michka) Kaplan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, July 09, 2004 07:40
Subject: Re: Looking for transcription or transliteration standards latin-
arabic


 From: Peter Kirk [EMAIL PROTECTED]

  But Kaplan is referring to something quite different, optionally
  ignoring diacritics in search operations. This is indeed desirable, so
  that a single search can match both Dvorak and Dvořák for example, and
  so that the one doing the search does not need to remember exactly which
  diacritics are used in the name. And it is already covered by the
  Unicode collation algorithm and default table, in which diacritics are
  distinguished only at the second level and so folded by a top level only
  collation.

 (a) If this were true and it were the only need, then case folding would
 also just be a UCA issue, yet case folding is in the document.

 (b) Not everyone uses the UCA who uses Unicode (most of the corporate
 member companies in Unicode -- including IBM -- had alternate collation
 methods that existed prior to the UCA and which to this day support more
 languages, in their databases and operating systems)

 (c) Since the operation (diacritic folding) is a valid one that
 implementations may want to do and the UCA is a UTS and thus not required
 for Unicode conformance, it is a sensible folding operation to define.

 Does diacritic folding destroy information provided by the distinctions that
 diacritics provide? Of course it does. But then again, the same can be said
 of all foldings. This does not diminish their potential usefulness in
 specific tasks/operations.


 MichKa [MS]
 NLS Collation/Locale/Keyboard Development
 Globalization Infrastructure and Font Technologies
 Windows International Division







Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread D. Starner
Michael Everson writes:

 I don't agree that Dvorak is the English name 
 for the composer. But I don't agree that façade 
 is correctly spelled in English without the ç 
 either. 

The Society for Pure English 
(http://www.gutenberg.net/1/2/3/9/12390/12390-h/12390-h.htm) disagreed:

We still borrow as freely as ever; but half the benefit of this 
borrowing is lost to us, owing to our modern and pedantic attempts 
to preserve the foreign sounds and shapes of imported words, which 
make their current use unnecessarily difficult. Owing to our false 
taste in this matter many words which have been long naturalized 
in the language are being now put back into their foreign forms, 
and our speech is being thus gradually impoverished. This process 
of de-assimilation generally begins with the restoration of foreign 
accents to such words as have them in French; thus ‘role’ is now 
written ‘rôle’; ‘debris’, ‘débris’; ‘detour’, ‘détour’; ‘depot’, 
‘dépôt’; and the old words long established in our language, 
‘levee’, ‘naivety’, now appear as ‘levée’, and ‘naïveté’.





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Anto'nio Martins-Tuva'lkin
On 2004.07.09, 17:06, Mark Davis [EMAIL PROTECTED] wrote:

 we do not decompose characters like U+00D8 LATIN CAPITAL LETTER O
 WITH STROKE. [I have felt from the beginning that it was a mistake
 to not be consistent in our decompositions

Where can one join your party? ;-)

 -- but that is water under the bridge.]
 
Hm, there is a Nature's Cycle of Water, you know? ;-)

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Peter Kirk
On 09/07/2004 17:06, Mark Davis wrote:
I agree with Michael -- diacritic folding is a useful folding to add,
independent of the UCA.
Also, Peter's remark that: And it is already covered by the Unicode
collation algorithm and default table... is incorrect. ...
Well, I think this depends on whether the stroke in characters like 
U+00D8 and similar additional marks are considered to be diacritics. I 
am not sure that they are diacritics in the strict sense, and the 
current DUCET mappings don't treat them as such, but John Cowan's list 
does treat them as such.

... The UCA generally
follows our decompositions in determining many primary weights, and we do
not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I
have felt from the beginning that it was a mistake to not be consistent in
our decompositions -- but that is water under the bridge.] If you look at
John's suggested file for diacritic
folding(http://www.ccil.org/~cowan/DiacriticFolding.txt), ...
I have just reviewed this list and found it odd that Hebrew presentation 
forms are included but Arabic ones are not. But in fact surely not only 
the Hebrew presentation forms but also most of the precomposed 
characters are redundant in this list. For the basic folding algorithm 
(in http://www.unicode.org/reports/tr30/) is:

a. Apply optional folding operations
b. Apply canonical decomposition
c. Repeat (a) and (b) until stable
d. Apply composition if necessary

Step (b) will decompose not only presentation forms but also all 
precomposed characters with canonical decompositions, and the combining 
marks will be deleted by the repeat of step (a). It is therefore 
necessary to list in the specification of the folding only all (?) 
combining marks, which are to be deleted, and all precomposed characters 
which do *not* have canonical decompositions. Letters like O with stroke 
are presumably in this latter list, along with many of the listed 
Cyrillic characters.
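
The loop described above can be sketched in Python with the standard unicodedata module. This is only an illustration, assuming that the optional folding in step (a) means "delete every character of general category Mn"; the function names are made up for this example and are not taken from TR30.

```python
import unicodedata

def remove_marks(text):
    # Step (a), assumed here to mean: drop nonspacing marks (category Mn).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def fold_diacritics(text):
    # Repeat (a) folding and (b) canonical decomposition until the
    # string stops changing, then (d) recompose.
    while True:
        folded = unicodedata.normalize("NFD", remove_marks(text))
        if folded == text:
            break
        text = folded
    return unicodedata.normalize("NFC", text)

print(fold_diacritics("Dvo\u0159\u00e1k"))  # precomposed r-caron and a-acute
print(fold_diacritics("\u00d8"))            # O WITH STROKE has no decomposition
```

Characters without a canonical decomposition, such as U+00D8 or U+0496, pass through this loop unchanged, which is why they would need explicit entries in the folding table.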

But I would suggest some caution about listing for diacritic folding 
some of the Cyrillic characters below, especially those with descenders. 
I note that 0429 is not folded to 0428 etc, and this is correct because 
within the Cyrillic writing system these are entirely separate 
characters. But the difference between these two is in fact exactly the 
same descender which is removed in 0496 etc. I am also surprised to note 
that no folding is given for 0419/0439; although in some ways this is 
desirable because Russians do not consider this breve to be a diacritic 
(and after all we would not want the dot on i to be removed as a 
diacritic!), these characters have canonical decompositions to 0418/0438 
and breve and the principle of canonical equivalence and the folding 
algorithm (which works on decomposed characters) more or less demand 
that the breve be deleted. Also 048A/048B should then fold to 0418/0438 
rather than 0419/0439.
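
Which of these characters actually have canonical decompositions can be checked against the Unicode Character Database. A small sketch using Python's unicodedata module follows; the helper name is invented for this example.

```python
import unicodedata

def canonical_decomposition(ch):
    # unicodedata.decomposition() returns "" when there is no mapping and
    # prefixes compatibility mappings with a tag such as "<compat>".
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return None
    return [int(cp, 16) for cp in mapping.split()]

for cp in (0x0419, 0x0429, 0x0496, 0x048A, 0x00D8):
    ch = chr(cp)
    print(f"{cp:04X} {unicodedata.name(ch)}: {canonical_decomposition(ch)}")
```

Only 0419 reports a decomposition (0418 plus the combining breve 0306); 0429, 0496, 048A and 00D8 report none, so any folding of those has to come from an explicit table entry.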

...
04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
0490; 0413; !nfd+remove_marks;  #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH DESCENDER
049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH DESCENDER
04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH DESCENDER
04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH DESCENDER
04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL 

Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread John Cowan
Peter Kirk scripsit:

 I have just reviewed this list and found it odd that Hebrew presentation 
 forms are included but Arabic ones are not. 

The specification actually called only for Latin, Greek, and Cyrillic;
I added Hebrew pour la lagniappe.  If someone wants to add Arabic, I
encourage them to do so.

 the Hebrew presentation forms but also most of the precomposed 
 characters are redundant in this list. 

True; however, the current list indicates the scope of what actually
happens, even if it is overlong.

 It is therefore
 necessary to list in the specification of the folding only all (?) 
 combining marks, which are to be deleted, 

I believe that all Mn-class characters, and only they, are deleted by this.
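
To make the Mn-only rule concrete, here is a short Python sketch that reports the general category of a few combining characters. The predicate name is hypothetical, and the sample code points were picked for illustration only; they are not drawn from the folding file.

```python
import unicodedata

def deleted_by_mn_rule(ch):
    # True when the character has general category Mn (nonspacing mark).
    return unicodedata.category(ch) == "Mn"

for cp in (0x0301, 0x0306, 0x030B, 0x0903, 0x20DD):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch)} "
          f"deleted={deleted_by_mn_rule(ch)}")
```

Under this rule the combining acute, breve and double acute are removed, while a spacing mark (Mc, such as U+0903) or an enclosing mark (Me, such as U+20DD) would be left in place.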

 I note that 0429 is not folded to 0428 etc, and this is correct because 
 within the Cyrillic writing system these are entirely separate 
 characters. But the difference between these two is in fact exactly the 
 same descender which is removed in 0496 etc.

I don't think that matters.  Long historical practice has made SHCHA a
separate letter, just as G, J, U, and W are now separate Latin letters
from C, I, V, and VV-ligature.

 I am also surprised to note 
 that no folding is given for 0419/0439; although in some ways this is 
 desirable because Russians do not consider this breve to be a diacritic 
 (and after all we would not want the dot on i to be removed as a 
 diacritic!), these characters have canonical decompositions to 0418/0438 
 and breve and the principle of canonical equivalence and the folding 
 algorithm (which works on decomposed characters) more or less demand 
 that the breve be deleted. Also 048A/048B should then fold to 0418/0438 
 rather than 0419/0439.

I think I agree with this: i-breve does not have the same universal status as
shch.

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
'Tis the Linux rebellion / Let coders take their place,
The Linux-nationale / Shall Microsoft outpace,
We can write better programs / Our CPUs won't stall,
So raise the penguin banner of / The Linux-nationale.



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-09 Thread Asmus Freytag
At 08:33 PM 7/9/2004, John Cowan wrote:
 I have just reviewed this list and found it odd that Hebrew presentation
 forms are included but Arabic ones are not.
The specification actually called only for Latin, Greek, and Cyrillic;
I added Hebrew pour la lagniappe.  If someone wants to add Arabic, I
encourage them to do so.
 the Hebrew presentation forms but also most of the precomposed
 characters are redundant in this list.
True; however, the current list indicates the scope of what actually
happens, even if it is overlong.
I have taken the file from the server today and massaged it to be in a form 
suitable for inclusion in the next draft of TR#30, which will be issued in 
time for the UTC to review it in August.

Once the review issue opens for this draft, please comment on the review 
form, so that the UTC has formal input to evaluate.

My understanding of the folding is that it would be more aggressive in 
diacritic folding than some languages would want, so that it is useful in 
cross-language searching. For example, it should allow English users to search 
for words with accented characters in them by supplying the equivalent word 
spelled in base letters only.

'i' has a dot, but doesn't have a base letter that's more 'basic' than 
itself, since dotless-i, while theoretically there, is more specialized and 
not universally accessible from input devices.

o-slash can be analyzed as o and slash, even though that's not done 
canonically in Unicode. Allowing users outside Scandinavia to perform 
fuzzy searches for words with this character is useful.
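
A rough Python sketch of that idea: characters such as o-slash, which Unicode does not decompose canonically, get an explicit mapping, and everything else goes through decomposition and mark removal. The table below is a hypothetical four-entry sample, not the content of the TR#30 data file, and the names are invented for this example.

```python
import unicodedata

# Hypothetical supplementary table for letters that have no canonical
# decomposition; a real folding file would list many more entries.
EXTRA_FOLDINGS = str.maketrans({
    "\u00D8": "O",  # LATIN CAPITAL LETTER O WITH STROKE
    "\u00F8": "o",  # LATIN SMALL LETTER O WITH STROKE
    "\u0141": "L",  # LATIN CAPITAL LETTER L WITH STROKE
    "\u0142": "l",  # LATIN SMALL LETTER L WITH STROKE
})

def fold_for_search(text):
    # Apply the explicit table first, then decompose and drop Mn marks.
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA_FOLDINGS))
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(fold_for_search("Troms\u00F8"))
print(fold_for_search("\u0141\u00F3d\u017A"))
```

Plain 'i' is untouched by this sketch, since its dot is not a separate combining mark and no table entry maps it to dotless i.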

In this view of folding, Language-specific fuzzy searches would be tailored 
(usually by being based on collation information, rather than on generic 
diacritic folding).

A./ 




Re: FW: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Curtis Clark
John Cowan wrote:
The Unicode people are probably going to standardize on calling it
diacritic folding, by analogy to the term case folding.
Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân 
ättëmpt tò fóòl spåm fîltêrs?

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Web Coordinator, Cal Poly Pomona +1 909 979 6371
Professor, Biological Sciences   +1 909 869 4062


Re: FW: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Doug Ewell
Curtis Clark jcclark dash lists at earthlink dot net wrote:

 John Cowan wrote:
 The Unicode people are probably going to standardize on calling
 it diacritic folding, by analogy to the term case folding.

 Ad wht shll w cll th ddtin of dacrtcs b spmmrs, 
 n ttmpt t fl spm fltrs?

.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





RE: FW: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread jarkko.hietaniemi
Sanan virkkoi, noin nimesi Curtis Clark:
 Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân 
 ättëmpt tò fóòl spåm fîltêrs?

http://en.wikipedia.org/wiki/Heavy_metal_umlaut




Re: FW: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Peter Kirk
On 08/07/2004 06:44, Curtis Clark wrote:
John Cowan wrote:
The Unicode people are probably going to standardize on calling it
diacritic folding, by analogy to the term case folding.

Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân 
ättëmpt tò fóòl spåm fîltêrs?

An opportunity for spam filters to employ diacritic folding. National 
Geographic may not need this folding but spam filters could certainly 
use it.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: FW: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Anto'nio Martins-Tuva'lkin
On 2004.07.08, 09:56, Peter Kirk [EMAIL PROTECTED] wrote:

 Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ
 ân ättëmpt tò fóòl spåm fîltêrs?

 An opportunity for spam filters to employ diacritic folding.

What about things like PEN|S en|argement or G00D L00KING |\/|EN?

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




Re: FW: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Sarasvati
This thread seems to have gone far enough off-topic.
Please keep to the topic or take comments off list.

Regards from your,
-- Sarasvati


 Añd whàt shåll wë câll thë ãddítiõn of dîacrìtícs bÿ spämmêrs, ïñ ân
 ättëmpt tò fóòl spåm fîltêrs? 

 What about things like PEN|S en|argement or G00D L00KING |\/|EN? 




Re: Looking for transcription or transliteration standards latin-arabic

2004-07-08 Thread busmanus
You will need a Unicode font with Central-European and IPA
characters to read my examples.
Mike Ayers wrote:
  Perhaps it is. But then it's partly due to the lazy tradition.
Are you implying that, had printers throughout the centuries put 
the effort into faithfully reproducing every obscure symbol from every 
foreign language, that the modern American would accept words with 
arbitrary diacritics?
I do not pretend to know, but accept is probably not the best word
to use in this context, after all it's not about the spelling of
English words. And not every tradition needs to be hundreds of
years old.
  I don't think it's a problem with any given diacritical. Its rather
  an indistinct horror of diacriticals in general in speakers of a
  language without any diacriticals at all, like English. E.g.
  Hungarian uses three diacriticals and Hungarian speakers make no
  big deal of just ignoring the meaningless caron in Czech or
  the grave
  and the cedilla in Roumanian names.
  On the other hand, I must admit, that we also can be quite brutal
  to diacriticals in some newspapers or when it comes to a language
  like Vietnamese...
In other words, you're pretty comfortable with your own 
diacritics.  You make my point for me.
Our own are the acute (to show vowel length), the diaeresis
(to show timbre, like in German) and the doubleacute (=a stretched
diaeresis actually, to show both timbre and length at the same
time). The caron or the cedilla are just as foreign for us as e.g.
the odd question marks above Vietnamese vowels, even if they
may be less unusual. And the newspapers I'm talking about
may be just classic examples of lazy typography; at least the silly
spelling mistakes and other inaccuracies they allow themselves point
in that direction. In books by any serious publisher, it would
definitely be completely unacceptable to write e.g. Hašek's name
(a famous Czech satirist) as Hasek.
Once we got into this debate, let me quote an example where
distinguishing between diacritics as familiar and unfamiliar may
lead to undesirable results. Imagine someone writes an article about
a person named Törőcsik [tørøːtʃik] (we happen to have an actress
by that surname). Suppose the journalist thinks it reasonable to retain
the familiar diaeresis, because it is found in German and many other
well-known orthographies. But what should be the fate of the
doubleacute (which is actually nothing but a special kind of diaeresis,
as I mentioned above)? As an unfamiliar diacritic, it should be
discarded if the principle is applied mechanically. This would result
in the form Törocsik [tørotʃik]; however, as you may see from the
phonetic transcription, this is not simply incomplete information
in such a context, but explicit misinformation. The less cruel
approach would be to replace the special diaeresis with the normal
one and write Töröcsik [tørøtʃik]. This is undoubtedly the least
unacceptable of the diacritic-folded variants mathematically possible,
but it is neither a proper English transcription because of the
diaereses and the unusual value of the consonant cluster cs, nor
correct Hungarian because of denying the long vowel, so what is it
after all?
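
The three outcomes discussed above can be reproduced mechanically. The Python sketch below assumes the surname in question is spelled with o-diaeresis (U+00F6) followed by o-double-acute (U+0151); the function names are invented for this illustration.

```python
import unicodedata

DOUBLE_ACUTE = "\u030B"  # COMBINING DOUBLE ACUTE ACCENT
DIAERESIS = "\u0308"     # COMBINING DIAERESIS

def strip_all_marks(text):
    # Plain diacritic folding: every nonspacing mark is removed.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def drop_double_acute(text):
    # "Unfamiliar" mark discarded, "familiar" diaeresis retained.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(DOUBLE_ACUTE, ""))

def double_acute_to_diaeresis(text):
    # The "less cruel" variant: double acute replaced by a diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize(
        "NFC", decomposed.replace(DOUBLE_ACUTE, DIAERESIS))

name = "T\u00F6r\u0151csik"  # assumed spelling of the surname
for fold in (strip_all_marks, drop_double_acute, double_acute_to_diaeresis):
    print(fold.__name__, fold(name))
```
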
There may not be an easy way to solve such situations, so that
everybody would be pleased, but at least thinking about them does no
harm. Sorry for being so long, perhaps someone finds my data
interesting.
Regards,



RE: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Mike Ayers






 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of busmanus
 Sent: Thursday, July 08, 2004 1:27 PM


 I do not pretend to know, but accept is probably not the best word
 to use in this context, after all it's not about the spelling of
 English words. And not every tradition needs to be hundreds of
 years old.


 Actually, I was stating the most extreme case. Hundreds of years meant, basically, the lifetime of any given reader, so that maximum familiarity could be achieved. I do not believe that even in such a case the average reader would become comfortable with foreign diacritics. Although I speak with regard to English, as it is the only language I know well enough, I believe the principle applies to all languages, as it is an issue of familiarity, which is rather general to humanity.

 it would
 definitely be completely unacceptable to write e.g. Hašek's name
 (a famous Czech satirist) as Hasek.


 When transcribing to English, however, removal of the caron (macron? Apologies, but I tend to forget the names of most accents) would be most acceptable (for American English, at least).

 Once we got into this debate, let me quote an example where
 distinguishing between diacritics as familiar and unfamiliar may
 lead to undesirable results.


SNIP/


 Interesting case, and one reason why diacritic stripping, although brutal, may be desirable - it doesn't pretend to be accurate. Accuracy can be very hard to achieve when transcribing, especially since diacritics can be used to indicate very different things in different languages.

 There may not be an easy way to solve sucht situations, so that
 everybody would be pleased, but at least thinking about them does no
 harm. Sorry for being so long, perhaps someone finds my data
 interesting.


 I do find it interesting. It gave me some insight into the European view of diacritics, which is very different from mine. For instance, it seems that diacritics have similar effects on vowels, and that those vowels have similar sounds both before and after modification, across most (all?) European languages - am I reading correctly here?


 Thanks,


/|/|ike





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Doug Ewell
Mike Ayers wrote:

 it would
 definitely be completely unacceptable to write e.g. Hašek's name
 (a famous Czech satirist) as Hasek.

When transcribing to English, however, removal of the caron
 (macron?  Apologies, but I tend to forget the names of most accents)
 would be most acceptable (for American English, at least).

Caron, or more commonly hacek.  A macron is a shortish overline.

English-speaking classical music buffs quickly learn to associate the
diacritic-free spelling Dvorak with the (approximate) pronunciation
/'dvrk/.  Whether Dvorak is an acceptable way to spell Dvořák
probably depends on who's doing the accepting.  For the computer
columnist and the keyboard layout inventor, whose names are apparently
pronounced /'dvk/ anyway, it's fine.

 Once we got into this debate, let me quote an example where
 distinguishing between diacritics as familiar and unfamiliar may
 lead to undesirable results.

 SNIP/
Interesting case, and one reason why diacritic stripping,
 although brutal, may be desireable - it doesn't pretend to be
 accurate.  Accuracy can be very hard to achieve when transcribing,
 especially since diacritics can be used to indicate very different
 things in different languages.

Desirable because it doesn't pretend to be accurate.  That's a useful
philosophy at times, but I have to admit I'm surprised to see it
expressed on the Unicode list.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Michael Everson
At 14:57 -0700 2004-07-08, Mike Ayers wrote:
When transcribing to English, however, removal of the caron (macron? 
Apologies, but I tend to forget the names of most accents) would be 
most acceptable (for American English, at least).
NOT in good typography, ever.
It gave me some insight into the European view of diacritics, which 
is very different from mine.  For instance, it seems that diacritics 
have similar effects on vowels, and that those vowels have similar 
sounds both before and after modification, across most (all?) 
European languages - am I reading correctly here?
Not really. Diacritics may affect the quantity of a vowel, the 
quality of a vowel, or simply indicate something about a word's 
history.

I think it's stupid (in general) to argue for stripping a letter of 
diacritics. If a reader is ignorant of their meaning, that can be 
cured. But if they are meaningful, stripping them is just misspelling 
the words they belong to. Why would anyone want to do that?
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Michael \(michka\) Kaplan
From: Michael Everson [EMAIL PROTECTED]

 I think it's stupid (in general) to argue for stripping a letter of
 diacritics. If a reader is ignorant of their meaning, that can be
 cured. But if they are meaningful, stripping them is just misspelling
 the words they belong to. Why would anyone want to do that?

I think it's inadvisable (in general) to call things stupid merely because
one does not see the need. On the whole, that is a better time to ask the
question than to make the judgment.

There is actually a great deal of both European and American data (in
programs like Microsoft Exchange and Outlook, as well as in web search)
showing that folding away diacritics as a part of giving full lists of
possible matches is indeed preferred by users. Now they would (also) prefer
the exact matches to have priority, but having additional matches without
the diacritics is a common request, and one that has been built into many
scenarios.
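
As a rough illustration of that behaviour (exact matches first, diacritic-folded matches after), here is a minimal Python sketch. It is not how Exchange, Outlook or any search engine implements matching; the function names and the extra case folding are assumptions made for this example.

```python
import unicodedata

def strip_marks(text):
    # Decompose, then drop nonspacing marks (general category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

def ranked_matches(query, candidates):
    # Exact matches are listed first; candidates that match only after
    # diacritic and case folding follow them.
    exact = [c for c in candidates if c == query]
    key = strip_marks(query).casefold()
    loose = [c for c in candidates
             if c != query and strip_marks(c).casefold() == key]
    return exact + loose

names = ["Dvorak", "Dvo\u0159\u00e1k", "Novak"]
print(ranked_matches("Dvo\u0159\u00e1k", names))
print(ranked_matches("Dvorak", names))
```
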

Formalizing that operation in Unicode is only a bad thing (or a stupid
thing, to use your words) if creating a standard that meets real world needs
(as opposed to ideal typographic or linguistic preferences) is considered a
bad (or stupid) thing.

As far as I know, most of the members of the Unicode Consortium have those
real world use cases as their first priority.

MichKa [MS]




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Mark Davis
 Why would anyone want to do that?

I tend to be with you on this, that it does little harm to retain accents.
However, most major periodic popular publications have this practice; for
example The Economist keeps accents for French, German, Spanish, Italian
words and names but discards others (as I recall).

In one sense, using Dvorak in English for Dvořák is little different
than using Cologne in English for Köln. Both are transcriptions into a
form that has become more or less customary.

Mark






RE: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Jony Rosenne
Transcription is useful and necessary, transliteration less so.

When transcribing from, for example, Czech into English, we should not be
misled by the fact that in Unicode both use the Latin script. In fact,
Czech uses the Czech script (= writing system, in this case), and English
uses the English script. The Czech script includes letter-diacritic
combinations that are not part of the English script or maybe have a
different meaning. To the English or American reader who does not know Czech
they are incomprehensible, so he relies on transcription. The purpose of
transcription is to copy the word into the English script.

If the reader, or all intended readers, are comfortable with the Czech
script then transcription is not necessary.

The situation is only slightly different from Russian to English
transcription. It appears to be different because the Russian script looks
different.

Now that we have moved from the world of typewriters, that imposed technical
constraints on the writer, such as being able to use only the limited set of
characters implemented, to the world of Unicode which removes this
constraint, transliteration is no longer needed or useful. Transliteration
is a one-to-one mapping between scripts, and the reader needs to be familiar
with both scripts and the transliteration rules to make sense of it.

Jony






RE: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Jony Rosenne


 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Mark Davis
 Sent: Friday, July 09, 2004 3:43 AM
 To: [EMAIL PROTECTED]; Michael Everson
 Subject: Re: Looking for transcription or transliteration 
 standards latin- arabic
 
 

...

 
 In one sense, using Dvorak in English for Dvořák is 
 little different than using Cologne in English for Köln. 
 Both are transcriptions into a form that has become more or 
 less customary.

Cologne is not a transliteration of Köln but the English name of the city, just as 
Munich, Rome, Moscow, The Hague, Longhorn, Venice, Jaffa and Jerusalem.

Why a foreign city should have an English name is an interesting philosophical 
question, but not directly concerned with Unicode. This is however common in many 
languages.

The transliteration of Köln would be Koln.


Jony

 
 Mark
 





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-08 Thread Adam Twardoch
From: Mark Davis [EMAIL PROTECTED]

 In one sense, using Dvorak in English for Dvořák is little different
 than using Cologne in English for Köln. Both are transcriptions into a
 form that has become more or less customary.

If at all, Köln is a German and Cologne is a French/English
transcription of the Latin name Colonia.

Adam




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-07 Thread Raymond Mercier

Peter Kirk writes
 This is more complicated than it looks. The Greek form Istimboli is
 impossible for the period as Greek had no [b] sound, for β was
 pronounced [v] except that later and perhaps already at that period 
 was pronounced [b] at least in foreign words. So is the Greek consonant
 cluster , or , or , or what? Also is the previous consonant
 cluster  as transliterated, or  corresponding to isthmus? And then
 what are the Greek vowels?

I was only trying to grasp the sense of Gerd's throw-away remark (which I
hope he will explain), but I appreciate the difficulties you raise,
especially the point about the Greek beta as the phoneme /v/ . That
particular difficulty at least doesn't apply to the Ottoman b, if we look
for a Turkish -bul  .

Raymond Mercier

http://ourworld.compuserve.com/homepages/RaymondM




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-07 Thread Peter Kirk
On 07/07/2004 07:08, Raymond Mercier wrote:
...
I was only trying to grasp the sense of Gerd's throw-away remark (which I
hope he will explain), but I appreciate the difficulties you raise,
especially the point about the Greek beta as the phoneme /v/ . That
particular difficulty at least doesn't apply to the Ottoman b, if we look
for a Turkish -bul  .
 

The last part is uncontroversial, I think. The uncertainty is over the 
first part of the word.

Google gives only three hits for istimboli, one of which 
(http://linguistlist.org/issues/3/3-929.html) says:

An interesting historical case is Istanbul, whose name comes from
the Greek phrase eis ten poli (to the city -- first e is epsilon,
and second e is eta).  That phrase tended to be pronounced istimboli
and with dissimilation istamboli.  So when the Turks changed the name
from Constantinople to Istanbul, they simply changed from a name with
an obvious Greek derivation to one with a nonobvious Greek derivation.
This is a possible derivation. If this is Gerd's source, he failed to 
make the point that istimboli was not a Greek name of the city but a 
colloquial pronunciation of a phrase. And the source of that may be the 
following old German text, from 
http://www.staff.ncl.ac.uk/jon.west/get/hc0144_3.htm:

Constantinopel hayssen die Chrichen Istimboli und die Thrcken 
hayssends Stambol;

And according to http://www.fotoist.8m.com/ad.htm (in Turkish) this 
information comes from the 14th-15th century German traveller Johan 
Schildtberger. But I have my suspicions about this information. The 
Greeks had no problem with initial consonant clusters but the Turks did, 
so it is much more likely that the Turks added the initial I to a Greek 
word starting with ST, just as Spanish and French add initial E before 
such clusters.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



[OT] Istanbul [was: Re: Looking for transcription or transliteration standards latin- arabic]

2004-07-07 Thread Philipp Reichmuth
Constantinopel hayssen die Chrichen Istimboli und die Thrcken 
hayssends Stambol;
The Greeks had no problem with initial consonant clusters but the Turks did, 
so it is much more likely that the Turks added the initial I to a Greek 
word starting with ST, just as Spanish and French add initial E before 
such clusters.
Are you sure about the Turks and the initial consonant clusters? I 
always thought it depends on the actual cluster structure. Modern 
Turkish at least has loanwords such as brokoli,  graten or the 
notorious spor where the problem is the word-*final* cluster, not the 
word *initial* one. While Turkic roots usually do not begin with 
consonant clusters, it appears to be OK in loans.

The situation is a bit difficult because of the Persian and Arabic 
adstrata in Ottoman Turkish. Both Arabic and Persian definitely do not 
allow word-initial consonant clusters at all, which led to a lot of 
words with auxiliary vowels in Turkish. However, these words already had 
the auxiliary vowels when

Philipp
--
Was für Japan ist der Tenno,
ist für Frankfurt Brezel-Benno.
  - Brezelverkäufer in Frankfurt/Main


Re: [OT] Istanbul [was: Re: Looking for transcription or transliteration standards latin- arabic]

2004-07-07 Thread Peter Kirk
On 07/07/2004 11:22, Philipp Reichmuth wrote:
...
Are you sure about the Turks and the initial consonant clusters? I 
always thought it depends on the actual cluster structure. Modern 
Turkish at least has loanwords such as brokoli,  graten or the 
notorious spor where the problem is the word-*final* cluster, not 
the word *initial* one. While Turkic roots usually do not begin with 
consonant clusters, it appears to be OK in loans.

There are certainly no word initial consonant clusters in native Turkic 
words. Looking at the specific ST cluster in my Turkish-English 
dictionary, there are a number of words listed, but they are all 
transparently loans from western languages and the kinds of words which 
were probably borrowed in the 20th century: stabilize, stadyum/stat, 
staj, stajyer, stand, standart, star, statik, statü, statüko, sten, 
steno(grafi), step (steppe), stepne (spare tyre), stereo(foni(k)), 
stereotip, steril(ize/izasyon), sterlin, stetoskop, steyşın (station 
wagon), stil, stilistik, stilo, stok(çu/lamak), stop, stopaj, stor, 
strateji(k), stratosfer, stratus, streptokok, streptomisin, stres 
(medical), striptiz(ci), stüdyo. But here are the corresponding words 
with word-initial added vowels: ıstampa, ıstavroz/istavroz (from Greek 
stavros), istasyon, istatistik(i), istavrit (a fish), istep (steppe), 
istim, istimbot, istiridye (? oyster), istop, usturmaça (a nautical 
term probably from storm). These words seem to be rather older loans, 
some perhaps 19th century but ıstavroz/istavroz is surely much earlier, 
also istavrit if that is a loan from Greek stavrit- as seems likely.

These are complete lists for ST but the same happens with other 
consonant clusters e.g. SP, SV.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-07 Thread Patrick Andries
Peter Kirk wrote:
On 07/07/2004 07:08, Raymond Mercier wrote:
This is a possible derivation. If this is Gerd's source, he failed to 
make the point that istimboli was not a Greek name of the city but a 
colloquial pronunciation of a phrase. And the source of that may be 
the following old German text, from 
http://www.staff.ncl.ac.uk/jon.west/get/hc0144_3.htm:

Constantinopel hayssen die Chrichen Istimboli und die Thrcken 
hayssends Stambol;

And according to http://www.fotoist.8m.com/ad.htm (in Turkish) this 
information comes from the 14th-15th century German traveller Johan 
Schildtberger. But I have my suspicions about this information. The 
Greeks had no problem with initial consonant clusters but the Turks 
did, so it is much more likely that the Turks added the initial I to a 
Greek word starting with ST, just as Spanish and French add initial E 
before such clusters.

French (for the last 5 centuries) no longer adds an initial E in front 
of ST (see: stop, start, sport (*), stage, stature, station, etc.), but 
historically (in Old French) this was true (estable [stable], estamper 
[to stamp], estat [state, station], esterlin [sterling], estrange 
[strange, stranger]). Old French predates the fall of Constantinople and 
the end of the Hundred Years' War (both in 1453, as all French-speaking 
schoolchildren learn).

Spanish still does (or at least did recently); see the recent loanwords 
esquí (ski) and esprint (sprint).

P. A.
(*) English word derived from an Old French word desport / deport  
(entertainment), see deporte in Spanish and desporto/desporte in 
Portuguese (but esporte in Brazil).
.




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-07 Thread Curtis Clark
An interesting historical case is Istanbul, whose name comes from
the Greek phrase eis ten poli (to the city -- first e is epsilon,
and second e is eta).  That phrase tended to be pronounced istimboli
and with dissimilation istamboli.  So when the Turks changed the name
from Constantinople to Istanbul, they simply changed from a name with
an obvious Greek derivation to one with a nonobvious Greek derivation.
This explanation seems rather Byzantine to me.
--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Web Coordinator, Cal Poly Pomona +1 909 979 6371
Professor, Biological Sciences   +1 909 869 4062


RE: Looking for transcription or transliteration standards latin- arabic

2004-07-07 Thread Mike Ayers






 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Anto'nio Martins-Tuva'lkin
 Sent: Tuesday, July 06, 2004 9:04 PM


 On 2004.07.07, 00:49, Mike Ayers [EMAIL PROTECTED] wrote:
 
  Are you implying that, had printers throughout the centuries put the
  effort into faithfully reproducing every obscure symbol
 
 I spell my own name with some of those obscure symbols, thank you.


 Yep. Hope you don't mind my inability to pronounce it. However, grave (and acute) accents hardly rate as obscure, so I could pronounce through them and get passably close. Even here in the cultural boondocks we know that.

 Obscure indeed -- that's the last thing I'd expect in a list such as
 this! Is internationalization a serious issue, or just a toy to kill
 off idle time?


 Oh, calm down. We were originally talking about Vietnamese diacritics, many of which definitely qualify as obscure, the rest being obscure uses of more familiar diacritics. Just because you don't like the kind of internationalization I mentioned does not mean it shouldn't be discussed.

  from every foreign language, that the modern American would accept
  words with arbitrary diacritics?
 
 Foreign? American? I obviously misunderstood the whole purpose of
 these discussions, then. Bye bye -- will be back as soon as I get my
 Green Card, señor! ;-)


 Are you just trying to kick up dirt here, or were you genuinely unaware that National Geographic is an American publication? I specified American, as opposed to English speaking in this case for that reason, also because the British are known to be more familiar with, and therefore tolerant of, various diacritics. I doubt, however, that this would have any bearing on Vietnamese, which, while it uses familiar looking diacritics, uses them in very unfamiliar (to Europeans in general, as best I understand it) ways.

 Now, in a last desperate hope to address the issue I raised: does the practice of stripping diacritics have a name?


 Thanks,


/|/|ike





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-07 Thread Peter Kirk
On 07/07/2004 17:04, Mike Ayers wrote:
...
Are you just trying to kick up dirt here, or were you 
genuinely unaware that National Geographic is an American 
publication?  I specified American, as opposed to English speaking 
in this case for that reason, also because the British are known to be 
more familiar with, and therefore tolerant of, various diacritics.  I 
doubt, however, that this would have any bearing on Vietnamese, which, 
while it uses familiar looking diacritics, uses them in very 
unfamiliar (to Europeans in general, as best I understand it) ways.

Indeed we British are more tolerant. Most of us have learned at least a 
little French and so vaguely know what e acute sounds like, perhaps also 
e grave, and that e with an accent is not silent, as in café. Other 
accents we tend to understand as marking stress and/or length, which 
works for Spanish and probably also António's Portuguese. So we do a lot 
better in guessing pronunciation than we would do if the diacritics were 
stripped off completely, even if we don't actually understand properly 
what they mean.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



FW: Looking for transcription or transliteration standards latin- arabic

2004-07-07 Thread Mike Ayers






 John notified me that he intended to CC the list, so here it is:


-Original Message-
From: John Cowan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 07, 2004 8:32 AM
To: Mike Ayers
Subject: Re: Looking for transcription or transliteration standards
latin- arabic



Mike Ayers scripsit:


  Now, in a last desperate hope to address the issue I raised: does
 the practice of stripping diacritics have a name?


The Unicode people are probably going to standardize on calling it
diacritic folding, by analogy to the term case folding.


I have provided them with a table that does diacritic folding for the
Latin, Greek, Cyrillic, and Hebrew scripts; it does not, however, remove
combining diacritics (which is easy to do on your own).
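For the "easy to do on your own" part, here is a minimal sketch in Python,
assuming the standard unicodedata module. It is not the folding table
mentioned above; the function name is made up for illustration. It
decomposes the text, drops nonspacing marks, and recomposes.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Remove nonspacing combining marks (hypothetical helper name)."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # General category "Mn" is "Mark, nonspacing": accents, carons, etc.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left so the result is in NFC.
    return unicodedata.normalize("NFC", kept)


print(strip_combining_marks("Dvořák"))   # Dvorak
print(strip_combining_marks("Hà Tĩnh"))  # Ha Tinh
print(strip_combining_marks("Łódź"))     # Łodz
```

The last line shows the limit of this approach: letters such as Ł, ø or ð
have no canonical decomposition, so they pass through unchanged, which is
where a folding table is still needed.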


-- 
There are three kinds of people in the world: John Cowan
those who can count, http://www.reutershealth.com
and those who can't. [EMAIL PROTECTED]





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Peter Kirk
On 03/07/2004 00:07, Patrick Andries wrote:
Jony Rosenne wrote:
 


   

-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of John H. Jenkins

  

Peking for Běijīng.  :-)
  

 

Or Constantinople for Istanbul.  :-)
   

Two very different political realities (before and after 1453). Cities
change names without going through transliterations, cf. Berlin
(Ontario) becoming Kitchener in 1916.
 

But Constantinople - Istanbul is not in fact this kind of name change, 
for Istanbul (that is, İstanbul) is probably a corrupted and shortened 
version of Constantinople, with the initial I added to fit Turkish 
phonology (cf. the old western version Stamboul, still used in Russian, 
also Smyrna - Izmir). (I have also heard it said that Istanbul comes 
from Greek EIS TĒN POLIN "to the city", but that seems less likely to 
me.) So the change is more like Beijing - Peking than Berlin - 
Kitchener. I guess another similar change would be Danzig - Gdansk, but 
I don't know where the initial G came from so possibly the Polish form 
is older than the German.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 06-07-2004, at 10:50 +0100, Peter Kirk wrote:

 I guess another similar change would be Danzig - Gdansk, but 
 I don't know where the initial G came from so possibly the Polish form 
 is older than the German.

A name with initial Gd is older than with D:
   http://encyclopedia.thefreedictionary.com/Gdansk
   http://en.wikipedia.org/wiki/Gda%C5%84sk#Names
but Wikipedia has now a hot dispute about how it should call the city:
   http://en.wikipedia.org/wiki/Talk:Gdansk/Naming_convention

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Patrick Andries
Peter Kirk wrote:
On 03/07/2004 00:07, Patrick Andries wrote:
Two very different political realities (before and after 1453). Cities
change names without going through transliterations, cf. Berlin
(Ontario) becoming Kitchener in 1916.
 

But Constantinople - Istanbul is not in fact this kind of name 
change, for Istanbul (that is, İstanbul) is probably a corrupted and 
shortened version of Constantinople, with the initial I added to fit 
Turkish phonology (cf. the old western version Stamboul, still used in 
Russian, also Smyrna - Izmir). (I have also heard it said that 
Istanbul comes from Greek EIS TĒN POLIN "to the city", but that seems 
less likely to me.)
Yes, I have heard this.
So the change is more like Beijing - Peking than Berlin - Kitchener. 
Without a political change Constantinople would not have changed name in 
a matter of days (at least as far as the officials were concerned). In 
any case, it is not a transliteration problem (Beijing -- Pékin).

P. A.



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Patrick Andries
Patrick Andries wrote:

So the change is more like Beijing - Peking than Berlin - Kitchener. 

Without a political change Constantinople would not have changed name 
in a matter of days (at least as far as the officials were concerned). 
In any case, it is not a transliteration problem (Beijing -- Pékin).
[PA] I wrote this a bit too fast this morning (first message !). I 
believe the origin of Istanbul is a bit too obscure to decide whether it 
is due to a transcription or a complete name change. Just to confuse 
things further Konstantaniye was apparently used by the Turkish 
administration and a Greek form Istimboli is attested in the XIVth century.

P. A.



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Peter Kirk
On 06/07/2004 13:05, Patrick Andries wrote:
Patrick Andries wrote:

So the change is more like Beijing - Peking than Berlin - Kitchener. 

Without a political change Constantinople would not have changed name 
in a matter of days (at least as far as the officials were 
concerned). In any case, it is not a transliteration problem (Beijing 
-- Pékin).

Well, did Gdansk/Danzig change its name backwards and forwards several 
times over history (thank you, Qrczak, for the interesting information 
about that), or was it simply that it had different names in different 
languages? This makes it not a transliteration problem but a translation 
problem, one which is common to many geographical names - sometimes the 
names in different languages are related, and sometimes they are not 
e.g. Turku/Åbo in Finland, or Yerushalayim/al-Quds, or Dublin/(I'll let 
Michael tell us the correct Irish form).

[PA] I wrote this a bit too fast this morning (first message !). I 
believe the origin of Istanbul is a bit too obscure to decide whether 
it is due to a transcription or a complete name change. Just to 
confuse things further Konstantaniye was apparently used by the 
Turkish administration and a Greek form Istimboli is attested in the 
XIVth century.

Thanks for this. The matter is indeed not so simple.
P. A.



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Gerd Schumacher
 Patrick Andries scripsit:

 [PA] I wrote this a bit too fast this morning (first message !). I 
 believe the origin of Istanbul is a bit too obscure to decide whether it 
 is due to a transcription or a complete name change. Just to confuse 
 things further Konstantaniye was apparently used by the Turkish 
 administration and a Greek form Istimboli is attested in the XIVth
 century.

Thanks a lot for this interesting information.
I think the underlying meaning of Istimboli must be 
town at the isthmus, which makes sense, indeed.

Gerd




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Mark Davis
 The latter problem could be solved easily by transcribing ð as dh, but
 English speakers seem really terrified of the sequence dh.

Not quite so fast. Where a d can end a syllable and an h can start one, then
it can collide with dh representing ð. The general issue is that whenever
you use a sequence of letters in the target for
transliteration/transcription, and the elements of that sequence can
individually be targets, then you can get ambiguity.

There are mechanisms to separate a sequence of letters that would otherwise
be read as a unit:

apostrophe: as in Japanese transliterations (When vowels or consonant y
follow the syllabic nasal n, ng, m, add apostrophe (') after n. Example:
ren'ai / gen'in / sin'en / kon'ya  -- Cabinet Order (Kunrei) No.1)

hyphen: as in the Korean Ministry of Education transliteration, to
distinguish jeong-eum versus jeon-geum )...

diaeresis on second element: (doesn't work very well, since it only really
sits well on vowels).
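To make the collision and the apostrophe mechanism concrete, here is a
rough sketch in Python. The mapping is a toy scheme invented for
illustration (ð to dh, þ to th), not any published standard, and the
function names are hypothetical. It also assumes the source text never
contains the separator itself.

```python
# Toy scheme (an assumption for illustration): two source letters map to digraphs.
MAPPING = {"ð": "dh", "þ": "th"}
DIGRAPHS = set(MAPPING.values())
REVERSE = {latin: source for source, latin in MAPPING.items()}


def to_latin(source: str, separator: str = "") -> str:
    """Transliterate; optionally separate letters that would read as a digraph."""
    pieces = []
    for ch in source:
        target = MAPPING.get(ch, ch)
        # A plain 'd' followed by a plain 'h' would look like the digraph 'dh'.
        if separator and pieces and pieces[-1][-1] + target[0] in DIGRAPHS:
            pieces.append(separator)
        pieces.append(target)
    return "".join(pieces)


def from_latin(text: str, separator: str = "'") -> str:
    """Reverse the scheme: digraphs become one letter, separators are dropped."""
    out = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in REVERSE:
            out.append(REVERSE[text[i:i + 2]])
            i += 2
        elif separator and text[i] == separator:
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_latin("aðal"), to_latin("adhal"))  # adhal adhal  (ambiguous)
print(to_latin("adhal", separator="'"))     # ad'hal
print(from_latin("ad'hal"))                 # adhal
print(from_latin("adhal"))                  # aðal
```

Without the separator the two source words fall together; with it, the
round trip from source to target and back is restored.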

 Transcriptions are another matter; the reader can read Tchaikovsky or
 Beijing without knowing anything at all about Cyrillic or Chinese, and
 still come close (theoretically) to the real pronunciation.

Agreed about the distinction in meaning between 'transcription' and
'transliteration'. However, the two examples of transcriptions are not
necessarily good ones, at least for English speakers: the only reason that
English speakers will read 'Tchaikovsky' reasonably is because they have
learned the word, since it doesn't follow normal English orthographic rules.

Mark
- Original Message - 
From: Doug Ewell [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Cc: Mark Davis [EMAIL PROTECTED]; Mike Ayers
[EMAIL PROTECTED]
Sent: Saturday, July 03, 2004 09:40
Subject: Re: Looking for transcription or transliteration standards latin-
arabic


 RE: Looking for transcription or transliteration standards
 latin-arabicMark Davis wrote:

  In that case, we'd call it a transcription, since it doesn't roundtrip
  from source to target back to source. It is actually quite common for
  style guides for non-academic publications to have a restricted list
  of characters and character + accent combinations, and convert all
  others. For example, the Economist style guide, as I recall,
  recommends keeping accents in French, German, Italian, and Spanish
  names and words, but dropping them otherwise; and converting
  characters like þ and ð to nearest equivalents, th.
 
  Note that the latter loses information in two ways; the obvious one is
  that the distinction between þ and ð is lost; the less obvious one is
  that the distinction between them and a *real* 't' followed by 'h' in
  the source is lost. So that loses the distinction in sounds between
  'th' in 'cathode' and 'cathouse', as well as between 'thy' and
  'thigh'.

 The latter problem could be solved easily by transcribing ð as dh, but
 English speakers seem really terrified of the sequence dh.

 The former problem is only a problem if t + h combinations (like
 cathouse) are actually used in the language.  I don't know if this is
 true for Icelandic.  It is certainly true for Old English, where þ and ð
 are also seen.

 -Doug Ewell
  Fullerton, California
  http://users.adelphia.net/~dewell/



- Original Message - 
From: Doug Ewell [EMAIL PROTECTED]
To: Unicode Mailing List [EMAIL PROTECTED]
Cc: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED]
Sent: Saturday, July 03, 2004 14:22
Subject: Re: Looking for transcription or transliteration standards latin-
arabic


 Anto'nio Martins-Tuva'lkin antonio at tuvalkin dot web dot pt wrote:

  Only specialists can make sense of them,
 
  Pray tell, why so? Is the letter  an insuperable obstacle for those
  who know only the letter a?...
 
  Can't the remove diacriticals action be performed in the reader's
  brain, instead of in the typesetter's office?

 But if the reader merely removes the diacriticals, that destroys the
 whole purpose of using a *transliteration* scheme, where 'a' and ''
 represent different letters in the source writing system.

 Jony's point (I think) was that only specialists can keep track of which
 target characters represent which source characters, especially when
 obscure diacritics or digits or other symbols are used.  At that point,
 the specialist probably knows the source characters well enough to read
 them directly, and the widespread use of Unicode enables document
 producers to use them directly.

 Transcriptions are another matter; the reader can read Tchaikovsky or
 Beijing without knowing anything at all about Cyrillic or Chinese, and
 still come close (theoretically) to the real pronunciation.

 -Doug Ewell
  Fullerton, California
  http://users.adelphia.net/~dewell/








RE: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Mike Ayers






 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Anto'nio Martins-Tuva'lkin
 Sent: Saturday, July 03, 2004 7:28 AM


 On 2004.07.02, 21:53, Mike Ayers [EMAIL PROTECTED] wrote:
 
  On the other hand, maybe Ha Tinh is just lazy typography.
 
  From National Geographic? Medoubts. This is a deliberate removal
  of the diacritics unfamiliar to English readers, and is a
  traditional way to present foreign words.
 
 It is lazy typography, then. Deliberate, traditional and lazy. ;-)


 No. Lazy implies not doing something to avoid doing the work. This is not the case here. It's an accessibility issue.

 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Anto'nio Martins-Tuva'lkin
 Sent: Saturday, July 03, 2004 12:37 PM


 Pray tell, why so? Is the letter  an insuperable obstacle for those
 who know only the letter a?...


 For some of us, at least, yes. The diacritic implies, by its very existence, that it has meaning, but I do not know what that meaning is, so I am stymied. Removing the diacritics yields a strange word, but one which I can probably absorb.

 Can't the remove diacriticals action be performed in the reader's
 brain, instead of in the typesetter's office?


 Again, for at least some of us (and I suspect this is a majority of the population unfamiliar with a given diacritic), simply ignoring diacritics is not an option, just as ignoring letters would not be.


/|/|ike





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Anto'nio Martins-Tuva'lkin
On 2004.07.06, 14:00, Peter Kirk [EMAIL PROTECTED] wrote:

 sometimes the names in different languages are related, and
 sometimes they are not e.g. Turku/Åbo in Finland, or
 Yerushalayim/al-Quds, or Dublin/

Baile Átha Cliath. (Formerly, with U+1E6B for the th.)

 This makes it not a transliteration problem but a translation
 problem,

Quite right!

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Raymond Mercier



Gerd Schumacher wrote
 I think the underlying meaning of Istimboli 
must be  "town at the isthmus", which makes sense, 
indeed.
How does that work ? Do you mean
istim 
, 
bol 
?

Raymond 
Mercier


Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread John Cowan
Peter Kirk scripsit:

 Well, did Gdansk/Danzig change its name backwards and forwards several 
 times over history (thank you, Qrczak, for the interesting information 
 about that), or was it simply that it had different names in different 
 languages?

Yes to both.  Its name in Polish is Gdan'sk, in German Danzig.  Which one is
the dominant name is determined by which power is dominant at a given time.
What foreigners call the city is influenced, though not determined, by
when the city first became important to them.

There is hardly a city in Europe that isn't like this.  What makes this
one special, though hardly unique, is the repeated changes of sovereignty.
Consider Strassburg/Strasbourg.

 This makes it not a transliteration problem but a translation 
 problem, one which is common to many geographical names - sometimes the 
 names in different languages are related, and sometimes they are not 
 e.g. Turku/Åbo in Finland, or Yerushalayim/al-Quds, or Dublin/(I'll let 
 Michael tell us the correct Irish form).

Baile Atha Cliath.  Dublin is also an Irish name, though used mostly by
Norse and English (and now by anglophone Irish, of course).

-- 
My confusion is rapidly waxing  John Cowan
For XML Schema's too taxing:[EMAIL PROTECTED]
I'd use DTDshttp://www.reutershealth.com
If they had local trees --  http://www.ccil.org/~cowan
I think I best switch to RELAX NG.



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread John Cowan
Patrick Andries scripsit:

 So the change is more like Beijing - Peking than Berlin - Kitchener. 
 
 Without a political change Constantinople would not have changed name in 
 a matter of days (at least as far as the officials were concerned). In 
 any case, it is not a transliteration problem (Beijing -- Pékin).

Not just a transliteration problem, either:  Mandarin Chinese underwent
a sound-shift in the 17th century that changed the second syllable from
ging to jing, but the English name was already set (and the change
did not affect Southern Sinitic in any case; cf. Cantonese pak king).

In addition, when it isn't the capital (bei jing = North-capital),
i.e. 1928-49, its name is Beiping (north-peace).

-- 
Here lies the Christian,John Cowan
judge, and poet Peter,  http://www.reutershealth.com
Who broke the laws of God   http://www.ccil.org/~cowan
and man and metre.  [EMAIL PROTECTED]



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Peter Kirk
On 06/07/2004 20:47, Raymond Mercier wrote:
Gerd Schumacher wrote
 I think the underlying meaning of Istimboli must be
 town at the isthmus, which makes sense, indeed.
How does that work ? Do you mean
istim , bol ?
 
Raymond Mercier

This is more complicated than it looks. The Greek form Istimboli is 
impossible for the period as Greek had no [b] sound, for β was 
pronounced [v] except that later and perhaps already at that period  
was pronounced [b] at least in foreign words. So is the Greek consonant 
cluster , or , or , or what? Also is the previous consonant 
cluster  as transliterated, or  corresponding to isthmus? And then 
what are the Greek vowels?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Looking for transcription or transliteration standards latin-arabic

2004-07-06 Thread busmanus
Mike Ayers wrote:
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
  Behalf Of Anto'nio Martins-Tuva'lkin
  Sent: Saturday, July 03, 2004 7:28 AM
  On 2004.07.02, 21:53, Mike Ayers [EMAIL PROTECTED] wrote:
 
   On the other hand, maybe Ha Tinh is just lazy typography.
  
   From National Geographic?  Medoubts.  This is a deliberate removal
   of the diacritics unfamiliar to English readers, and is a
   traditional way to present foreign words.
 
  It is lazy typography, then. Deliberate, traditional and lazy. ;-)
No.  Lazy implies not doing something to avoid doing the 
work.  This is not the case here.  It's an accessibility issue.
Perhaps it is. But then it's partly due to the lazy tradition.
  Can't the remove diacriticals action be performed in the reader's
  brain, instead of in the typesetter's office?
Again, for at least some of us (and I suspect this is a majority 
of the population unfamiliar with a given diacritic), simply ignoring 
diacritics is not an option
I don't think it's a problem with any given diacritical. It's rather
an indistinct horror of diacriticals in general in speakers of a
language without any diacriticals at all, like English. E.g.
Hungarian uses three diacriticals and Hungarian speakers make no
big deal of just ignoring the meaningless caron in Czech or the grave
and the cedilla in Roumanian names.
On the other hand, I must admit, that we also can be quite brutal
to diacriticals in some newspapers or when it comes to a language
like Vietnamese...



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-06 Thread Anto'nio Martins-Tuva'lkin
On 2004.07.07, 00:49, Mike Ayers [EMAIL PROTECTED] wrote:

 Are you implying that, had printers throughout the centuries put the
 effort into faithfully reproducing every obscure symbol

I spell my own name with some of those obscure symbols, thank you.
Obscure indeed -- that's the last thing I'd expect in a list such as
this! Is internationalization a serious issue, or just a toy to kill
off idle time?

 from every foreign language, that the modern American would accept
 words with arbitrary diacritics?

Foreign? American? I obviously misunderstood the whole purpose of
these discussions, then. Bye bye -- will be back as soon as I get my
Green Card, señor! ;-)

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




Re: Looking for transcription or transliteration standards latin-arabic

2004-07-05 Thread busmanus
busmanus wrote:
Philipp Reichmuth wrote:
If we were starting from scratch today, we'd probably do better.  (I
hope we would retain the v sound in  instead of converting
it to f.)
Except there is no v sound, only an f sound in the Russian 
pronunciation of  due to regressive assimilation. 
Just like in English or French, as far as I can perceive.
The reason for spellings like Stroganoff for Stroganov
is word-final devoicing in Russian, which is absent from
French and at least much less marked in English, so it had
to be denoted explicitly.
I was inaccurate here: word final devoicing does occur in French
sometimes, but not in the voiced member of a voiced-unvoiced pair
like /v/-/f/. In Russian it _only_ occurs in such pairs.



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-04 Thread John Cowan
Doug Ewell scripsit:

 On the contrary, untransliterated (or untranscribed) text can only be
 read by people who know the original script.  Transliterations and
 transcriptions at least give the Latin-script-only reader a fighting
 chance to pronounce the text.  

Transliterations don't work so well for that, but transliterating some
scripts to Latin is a necessity (for me, at least) to even recognize them.
I can cope with Greek, Hebrew, and Cyrillic, but an English text full
of Arabic or Chinese names presented in the usual scripts for those
languages would be hopeless -- I wouldn't be able to reliably tell one
name from another.

This is true even though I have no more Greek, Hebrew, or Russian than
I have Arabic or Chinese.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
If he has seen farther than others,
it is because he is standing on a stack of dwarves.
--Mike Champion, describing Tim Berners-Lee (adapted)



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-04 Thread Philipp Reichmuth
Doug Ewell schrieb:
Transcription does not require roundtrip. It is intended in this case
for the English speaker to be able to deliver an approximate
pronunciation adapted to his native vocal capabilities.
Except when it doesn't.  We write Tchaikovsky, not Chykoffskee.
But then, English spelling isn't really logical anyway, and the average 
English speaker will be able to produce something from Tchaikovsky that 
would be more or less recognizable by a Russian.

If we were starting from scratch today, we'd probably do better.  (I
hope we would retain the v sound in  instead of converting
it to f.)
Except there is no v sound, only an f sound in the Russian 
pronunciation of  due to regressive assimilation. 
Chykoffskee is pretty accurate, actually. I'd say Tchaikovsky is just 
a spelling taken over from French at a time when French was pretty much 
the international common language at least in diplomacy and art.

Philipp
--
Nur Miele schwrmt die Kuh Roswitha
und gibt so manchen Extra-Liter.
  - Miele-Melkmaschinenwerbung, 70er


Re: Looking for transcription or transliteration standards latin- arabic

2004-07-04 Thread Patrick Andries
Philipp Reichmuth a écrit :
Except there is no v sound, only an f sound in the Russian 
pronunciation of  due to regressive assimilation. 
Chykoffskee is pretty accurate, actually. I'd say Tchaikovsky is 
just a spelling taken over from French at a time when French was 
pretty much the international common language at least in diplomacy 
and art.
[PA] And the prevalence of French in the Russian imperial nobility.
In French it is today Tchaïkovsky (with tréma), but the v looks like an 
attempt to transliterate. Russian names written in French in the XIXth 
century would usually transcribe  as ff: boeuf Strogonoff, Michel 
Strogoff (Jules Verne), *Princesse Demidoff* née Strogonoff, Tchékoff as 
an émigré name in France [2 born in Paris between 1916 and 1940].




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-04 Thread John Cowan
Philipp Reichmuth scripsit:

 Chykoffskee is pretty accurate, actually.

Thank you.  I have long since forgotten all the (very small amount of)
Russian I ever learned, but I retain a firm grip on its phonology due to
an interesting paedagogical device.  My Russian instructor spent the first
week or so of class teaching us to speak English with a Russian accent
(and this I can do to this day).  The idea was that having mastered this,
we could then begin to speak Russian as well with a Russian accent,
which is to say, perfectly.

 I'd say Tchaikovsky is just 
 a spelling taken over from French at a time when French was pretty much 
 the international common language at least in diplomacy and art.

Doubtless.  I have even seen it spelled in German fashion in English a
time or two.

-- 
I suggest you call for help,John Cowan
or learn the difficult art of mud-breathing.[EMAIL PROTECTED]
--Great-Souled Sam  http://www.ccil.org/~cowan



Re: Looking for transcription or transliteration standards latin-arabic

2004-07-04 Thread busmanus
Philipp Reichmuth wrote:
If we were starting from scratch today, we'd probably do better.  (I
hope we would retain the v sound in  instead of converting
it to f.)

Except there is no v sound, only an f sound in the Russian 
pronunciation of  due to regressive assimilation. 
Just like in English or French, as far as I can perceive.
The reason for spellings like Stroganoff for Stroganov
is word-final devoicing in Russian, which is absent from
French and at least much less marked in English, so it had
to be denoted explicitly.



Re: Hausa: Boko-Ajami? (RE: Looking for transcription or transliteration standards latin- arabic)

2004-07-03 Thread Mark Davis
You might take a look at what we have in ICU for doing transliteration. It
is rule-based, where each of the rules can take the context of surrounding
letters into account.

For information, see
http://oss.software.ibm.com/icu/userguide/Transform.html
http://oss.software.ibm.com/icu/userguide/TransformRule.html
You can try out the rules with an interactive demo at
http://oss.software.ibm.com/cgi-bin/icu/tr
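
As a rough illustration of what a rule that looks at its context might be (plain Python, not ICU's actual rule syntax; the rule list, the tiny letter table and the function name are all made up for this sketch), here is a toy converter in which Cyrillic в becomes ff only at the end of a word, in the style of the old French spelling Stroganoff. It handles lowercase input only.

```python
import re

# Hypothetical context rules: (pattern tried at the current position, output).
# The negative lookahead lets the rule inspect what follows the letter.
CONTEXT_RULES = [
    (re.compile(r"в(?!\w)"), "ff"),  # в not followed by another letter
]

# Hypothetical context-free fallback table, covering only the demo words.
LETTERS = {
    "а": "a", "в": "v", "г": "g", "н": "n",
    "о": "o", "р": "r", "с": "s", "т": "t",
}


def convert(text):
    """Apply context rules first, then fall back to the plain letter table."""
    out = []
    pos = 0
    while pos < len(text):
        for pattern, replacement in CONTEXT_RULES:
            match = pattern.match(text, pos)
            if match:
                out.append(replacement)
                pos = match.end()
                break
        else:
            # No context rule applied: map the single letter, or pass it through.
            out.append(LETTERS.get(text[pos], text[pos]))
            pos += 1
    return "".join(out)


print(convert("строганов"))
print(convert("ворона"))
```

The same в comes out as ff in the first word and as v in the second, which is the kind of distinction a plain one-to-one table cannot make.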

Mark
- Original Message - 
From: Donald Z. Osborn [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, July 02, 2004 21:52
Subject: Hausa: Boko-Ajami? (RE: Looking for transcription or
transliteration standards latin- arabic)


 I've read selected messages in this thread (on Unicode list) and some messages
 bring to mind the thought of developing routines or standards to permit
 toggling back and forth between standard Latin and Arabic transcriptions for
 the same language, such as between the Boko and Ajami writing of Hausa. (Same
 applies to any two or three transcription systems used for particular
 languages.)

 One of the benefits of ICT is, theoretically anyway, that one can have text
 both (all) ways. Which would mean that the user has options, people using
 alternative systems are not excluded, and the society does not have to debate
 a decision of which writing system to use, etc.

 Because there is generally not a 1-to-1 character correspondence in spellings
 in different transcriptions, I wonder if you don't end up having to consider
 something that operates a bit like machine translation, analyzing the context
 of words in cases where transcription of a word in one system could be
 transliterated into something misspelled or taken as more than one word in
 the other system. Necessarily, I think, such routines would have to be
 language-specific.

 Any feedback would be appreciated. TIA...

 Don Osborn
 Bisharat.net











Re: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Anto'nio Martins-Tuva'lkin
On 2004.07.02, 21:53, Mike Ayers [EMAIL PROTECTED] wrote:

 On the other hand, maybe Ha Tinh is just lazy typography.

 From National Geographic?  Medoubts.  This is a deliberate removal
 of the diacritics unfamiliar to English readers, and is a
 traditional way to present foreign words.

It is lazy typography, then. Deliberate, traditional and lazy. ;-)

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Doug Ewell
Mike Ayers wrote:

 Trivia question: Which Vietnamese city does my atlas spell correctly,
 much to the chagrin of the Vietnamese?

Probably Saigon.  (Or is it Sai Gon?)

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Doug Ewell
Mark Davis wrote:

 In that case, we'd call it a transcription, since it doesn't roundtrip
 from source to target back to source. It is actually quite common for
 style guides for non-academic publications to have a restricted list
 of characters and character + accent combinations, and convert all
 others. For example, the Economist style guide, as I recall,
 recommends keeping accents in French, German, Italian, and Spanish
 names and words, but dropping them otherwise; and converting
 characters like þ and ð to nearest equivalents, th.

 Note that the latter loses information in two ways; the obvious one is
 that the distinction between þ and ð is lost; the less obvious one is
 that the distinction between them and a *real* 't' followed by 'h' in
 the source is lost. So that loses the distinction in sounds between
 'th' in 'cathode' and 'cathouse', as well as between 'thy' and
 'thigh'.

The loss of the distinction between þ and ð could be solved easily by
transcribing ð as dh, but English speakers seem really terrified of the
sequence dh.

The t + h problem is only a problem if t + h combinations (like
cathouse) are actually used in the language.  I don't know if this is
true for Icelandic.  It is certainly true for Old English, where þ and ð
are also seen.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Doug Ewell
Jony Rosenne rosennej at qsm dot co dot il wrote:

 And with the availability of Unicode, I think the need for
 transliteration is fading. It seems that these schemes can only be
 used by people who know the transliterated script.

On the contrary, untransliterated (or untranscribed) text can only be
read by people who know the original script.  Transliterations and
transcriptions at least give the Latin-script-only reader a fighting
chance to pronounce the text.  (Without them, those of us who can't
read Arabic would have a real struggle reading today's news: Saddam
Hussein, Al Qaeda, Osama bin Laden, etc.)

The availability of Unicode means that scores of writing systems and
orthographies can be represented in computers, all at once,
unambiguously.  It doesn't mean that humans have become capable of
reading scripts they previously couldn't read.

Sorry if this wasn't what you meant.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Doug Ewell
John Cowan jcowan at reutershealth dot com wrote:

 Jony Rosenne scripsit:
 Transcription does not require roundtrip. It is intended in this case
 for the English speaker to be able to deliver an approximate
 pronunciation adapted to his native vocal capabilities.

 Except when it doesn't.  We write Tchaikovsky, not Chykoffskee.

Approximate is the operative word here.  Like English spelling in
general, our transcription schemes for personal names have derived from
numerous sources across many years, and so are irregular.

If we were starting from scratch today, we'd probably do better.  (I
hope we would retain the v sound in  instead of converting
it to f.)

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Jony Rosenne
These are transcriptions. I was talking about transliterations, which use
various uncommon letter and diacritics combinations to achieve roundtrip
accuracy. Only specialists can make sense of them, and they can just as
easily read the original.

Jony

 -Original Message-
 From: Doug Ewell [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, July 03, 2004 7:50 PM
 To: Unicode Mailing List
 Cc: Jony Rosenne
 Subject: Re: Looking for transcription or transliteration 
 standards latin- arabic
 
 
 Jony Rosenne rosennej at qsm dot co dot il wrote:
 
  And with the availability of Unicode, I think the need for 
  transliteration is fading. It seems that these schemes can only be 
  used by people who know the transliterated script.
 
 On the contrary, untransliterated (or untranscribed) text can 
 only be read by people who know the original script.  
 Transliterations and transcriptions at least give the 
 Latin-script-only reader a fighting chance to pronounce the 
 text.  (Without them, those of us who can't read Arabic 
 would have a real struggle reading today's news: Saddam 
 Hussein, Al Qaeda, Osama bin Laden, etc.)
 
 The availability of Unicode means that scores of writing 
 systems and orthographies can be represented in computers, 
 all at once, unambiguously.  It doesn't mean that humans have 
 become capable of reading scripts they previously couldn't read.
 
 Sorry if this wasn't what you meant.
 
 -Doug Ewell
  Fullerton, California
  http://users.adelphia.net/~dewell/
 
 
 
 




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Anto'nio Martins-Tuva'lkin
On 2004.07.03, 18:02, Jony Rosenne [EMAIL PROTECTED] wrote:

 transliterations, which use various uncommon letter and diacritics
 combinations to achieve roundtrip accuracy.

OK.

 Only specialists can make sense of them,

Pray tell, why so? Is the letter â an insuperable obstacle for those
who know only the letter a?...

Can't the remove diacriticals action be performed in the reader's
brain, instead of in the typesetter's office?

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




Re: Looking for transcription or transliteration standards latin- arabic

2004-07-03 Thread Michael Everson
At 14:22 -0700 2004-07-03, Doug Ewell wrote:
Anto'nio Martins-Tuva'lkin antonio at tuvalkin dot web dot pt wrote:
 Only specialists can make sense of them,
 Pray tell, why so? Is the letter â an insuperable obstacle for those
 who know only the letter a?...
 Can't the remove diacriticals action be performed in the reader's
 brain, instead of in the typesetter's office?
But if the reader merely removes the diacriticals,
He means, I think, that the reader ignores them, not knowing what they mean.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Looking for transcription or transliteration standards latin-arabic

2004-07-02 Thread Mark Davis
Yes, transliterations are between different scripts. However, there are
often different transliterations *between the same two scripts* that vary by
language. To take your example, the transliterations customarily used
between the Greek script and the Latin script are different in the cases:

(a) for ancient Greek and English (e.g. ευ = eu)
(b) for modern Greek and English (e.g. ευ = ev, ef)

(see http://www.eki.ee/wgrs/rom1_el.pdf)

For that matter, the transliterations customarily used between Cyrillic and
Latin are different for the cases:

(a) Russian and English
(b) Russian and French
(c) Russian and German
(d) Serbian and English
...

Note: I am still speaking of transliterations (e.g. transformations that
'roundtrip'), not transcriptions (which try to match the pronunciation more
precisely, and may lose information).

Thus, for brevity, one may and does speak of a transliteration between
Russian and English, as shorthand for a transliteration between the
Cyrillic script and the Latin script following customary conventions for
Russian and English.
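
One way to picture this shorthand is a lookup keyed by the script pair plus the language pair. The sketch below is only an illustration: the registry name, the function name and the one-entry tables are hypothetical, and the keys borrow ISO 15924 script codes and ISO 639 language codes purely for convenience.

```python
from typing import Dict, Tuple

# (source script, target script, source language, target language)
Key = Tuple[str, str, str, str]

# Hypothetical registry: the same script pair carries different conventions
# depending on the language pair. Only the ευ example is filled in; the
# modern Greek "ef" variant (used in some contexts) is omitted for brevity.
CONVENTIONS: Dict[Key, Dict[str, str]] = {
    ("Grek", "Latn", "grc", "en"): {"ευ": "eu"},  # ancient Greek to English
    ("Grek", "Latn", "el", "en"): {"ευ": "ev"},   # modern Greek to English
}


def convention_for(src_script, dst_script, src_lang, dst_lang):
    """Return the mapping table for one script pair and language pair."""
    return CONVENTIONS[(src_script, dst_script, src_lang, dst_lang)]


ancient = convention_for("Grek", "Latn", "grc", "en")
modern = convention_for("Grek", "Latn", "el", "en")
print(ancient["ευ"], modern["ευ"])
```

Saying "Russian to English" then amounts to naming one key in such a registry rather than naming the script pair alone.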

Mark

- Original Message - 
From: Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, July 01, 2004 17:19
Subject: Re: Looking for transcription or transliteration standards
latin-arabic


 On 2004.07.01, 18:06, Mark Davis [EMAIL PROTECTED] wrote:

  different transliterations for different languages,

 Strictly speaking, transliterations are between two given scripts, the
 language being transparent -- I mean *real* transliterating from, say
 Greek to latin, uses the same rules for the Illiad as for cypriot or
 greek phone books or license plates...

 --.
 António MARTINS-Tuválkin |  ()|
 [EMAIL PROTECTED]||
 PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
 +351 934 821 700 carros, parelhas e montes|
 http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
 http://pagina.de/bandeiras/  a água em todas as fontes|







RE: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Mike Ayers






 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Mark Davis
 Sent: Friday, July 02, 2004 8:36 AM


 Note: I am still speaking of transliterations (e.g. 
 transformations that
 'roundtrip'), not transcriptions (which try to match the 
 pronunciation more
 precisely, and may lose information).


 OK, just because I do so love monkey wrenches, please explain what I found in my atlas:


 Vietnamese   English
 ----------   -------
 Hà Tĩnh      Ha Tinh


 In which we have a transcription/transliteration/taxonomy problem between Latin and Latin. Since this does not remotely roundtrip (Ha, for instance, has 18 Vietnamese equivalents), and no attempt is made to match pronunciation, how do we refer to it?
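
 For anyone who wants to see the loss concretely, here is a small sketch in Python. It assumes the atlas simply drops combining marks, which is my guess at the process and not something the atlas states; the function name is made up. It uses only the standard unicodedata module.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The Vietnamese name of the city comes out as the atlas spelling.
print(strip_diacritics("Hà Tĩnh"))

# Several distinct Vietnamese syllables collapse to the same bare "Ha",
# so the original cannot be recovered from the stripped form.
print({strip_diacritics(s) for s in ["Hà", "Hạ", "Hả"]})
```

 Note that a letter such as đ has no decomposition, so this approach would leave it untouched; a real style guide would need an explicit table for such cases.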


/|/|ike


Trivia question: Which Vietnamese city does my atlas spell correctly, much to the chagrin of the Vietnamese?





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Chris Harvey
 OK, just because I do so love monkey wrenches, please explain what I found
 in my atlas:

 Vietnamese   English
 ----------   -------
 Hà Tĩnh      Ha Tinh

 In which we have a transcription/transliteration/taxonomy problem between
 Latin and Latin.  Since this does not remotely roundtrip (Ha, for instance,
 has 18 Vietnamese equivalents), and no attempt is made to match
 pronunciation, how do we refer to it?

Perhaps one could think of Ha Tinh as the English word for the city, like Rome 
(English) for Roma (Italian), or Tokyo (English) for Tōkyō (English 
transliteration of Japanese), or Kahnawake (English/French) for Kahnawà:ke (Mohawk). 
In these and many other cases, place-names as used in foreign languages should not be 
considered transliterations, but linguistic borrowings, where pronunciation and 
spelling are often changed in the new language.

On the other hand, maybe Ha Tinh is just lazy typography.

Chris Harvey
languagegeek.com





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Jony Rosenne
Transcription does not require roundtrip. It is intended in this case for
the English speaker to be able to deliver an approximate pronunciation
adapted to his native vocal capabilities.

And with the availability of Unicode, I think the need for transliteration
is fading. It seems that these schemes can only be used by people who know
the transliterated script.

Jony

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Mike Ayers
Sent: Friday, July 02, 2004 8:24 PM
To: 'Mark Davis'; [EMAIL PROTECTED]
Subject: RE: Looking for transcription or transliteration standards latin-
arabic




 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Behalf Of Mark Davis 
 Sent: Friday, July 02, 2004 8:36 AM 
 Note: I am still speaking of transliterations (e.g. 
 transformations that 
 'roundtrip'), not transcriptions (which try to match the 
 pronunciation more 
 precisely, and may lose information). 
OK, just because I do so love monkey wrenches, please explain what I
found in my atlas: 
Vietnamese   English 
----------   ------- 
Hà Tĩnh      Ha Tinh 
In which we have a transcription/transliteration/taxonomy problem
between Latin and Latin.  Since this does not remotely roundtrip (Ha, for
instance, has 18 Vietnamese equivalents), and no attempt is made to match
pronunciation, how do we refer to it?


/|/|ike 
Trivia question: Which Vietnamese city does my atlas spell correctly, much
to the chagrin of the Vietnamese? 





Re: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Mark Davis



In that case, we'd call it a transcription, 
since it doesn't roundtrip from source to target back to source. It is actually 
quite common for style guides for non-academic publications to have a restricted 
list of characters and character + accent combinations, and convert all others. 
For example, the Economist style guide, as I recall, recommends keeping 
accents in French, German, Italian, and Spanish names and words, but dropping 
them otherwise; and converting characters like þ and ð to nearest equivalents, 
"th".

Note that the latter loses information in two ways; 
the obvious one is that the distinction between þ and ð is lost; the less 
obvious one is that the distinction between them and a *real* 't' followed by 
'h' in the source is lost. So that loses the distinction in sounds between 'th' 
in 'cathode' and 'cathouse', as well as between 'thy' and 
'thigh'.
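
Both kinds of loss can be shown with a few lines of Python. This is only a sketch: the table and function names are made up, and the sample strings are arbitrary letter sequences chosen to collide, not real words from any style guide.

```python
# Both special letters map to the same digraph, as in the style-guide rule.
TH_TABLE = {"þ": "th", "ð": "th"}


def transcribe(text, table=TH_TABLE):
    """Replace each mapped letter; leave every other character unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def collisions(words, table=TH_TABLE):
    """Group source strings by output and keep outputs with several sources."""
    groups = {}
    for word in words:
        groups.setdefault(transcribe(word, table), []).append(word)
    return {out: src for out, src in groups.items() if len(src) > 1}


# Arbitrary sample strings: two differ only in þ vs ð, one has a real t + h.
print(collisions(["baþ", "bað", "bath", "bat"]))
```

Three different sources end up as the same output string, so nothing downstream can tell whether the original had þ, ð, or a plain t followed by h.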

Mark

  - Original Message - 
  From: 
  Mike Ayers 
  To: 'Mark Davis' ; [EMAIL PROTECTED] 
  Sent: Friday, July 02, 2004 10:24
  Subject: RE: Looking for transcription or 
  transliteration standards latin- arabic
  
   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Mark Davis
   Sent: Friday, July 02, 2004 8:36 AM

   Note: I am still speaking of transliterations (e.g. transformations that
   'roundtrip'), not transcriptions (which try to match the pronunciation
   more precisely, and may lose information).

  OK, just because I do so love monkey wrenches, please explain what I found
  in my atlas:

  Vietnamese   English
  ----------   -------
  Hà Tĩnh      Ha Tinh

  In which we have a transcription/transliteration/taxonomy problem between
  Latin and Latin. Since this does not remotely roundtrip (Ha, for instance,
  has 18 Vietnamese equivalents), and no attempt is made to match
  pronunciation, how do we refer to it?

  /|/|ike

  Trivia question: Which Vietnamese city does my atlas spell correctly, much
  to the chagrin of the Vietnamese?



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread John H. Jenkins
 Jul 2, 2004 11:17 AM Chris Harvey 
Perhaps one could think of Ha Tinh as the English word for the city, 
like Rome (English) for Roma (Italian), or Tokyo (English) for 
Tōkyō (English transliteration of Japanese), or Kahnawake 
(English/French) for Kahnawà:ke (Mohawk).
Or Peking for Běijīng.  :-)

John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread John Cowan
Jony Rosenne scripsit:
 Transcription does not require roundtrip. It is intended in this case for
 the English speaker to be able to deliver an approximate pronunciation
 adapted to his native vocal capabilities.

Except when it doesn't.  We write Tchaikovsky, not Chykoffskee.

-- 
I could dance with you till the cows   John Cowan
come home.  On second thought, I'd  http://www.ccil.org/~cowan
rather dance with the cows when you http://www.reutershealth.com
came home.  --Rufus T. Firefly [EMAIL PROTECTED]



RE: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Mike Ayers






 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of John H. Jenkins


  Jul 2, 2004 11:17 AM Chris Harvey 
 
  Perhaps one could think of Ha Tinh as the English word 
 for the city, 
  like Rome (English) for Roma (Italian), or Tokyo (English) for 
  Tōkyō (English transliteration of Japanese), or Kahnawake 
  (English/French) for Kahnawà:ke (Mohawk).
 
 Or Peking for Běijīng. :-)


 Or either of those for 北京? Hmmm - can't really transcribe 北京, now can we? After all, it doesn't have a definitive pronunciation, various government mandates aside. We can only transcribe pronunciation, not spelling. And isn't that the real difference? I always thought it was. Transcribing is making sounds readable, whereas transliteration is making letters familiar, yes?

 I think this is a bit of a Rorschach, though - I doubt any definition or definitions would well cover all the available ground.


/|/|ike





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Jony Rosenne


 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of John H. Jenkins
 Sent: Friday, July 02, 2004 9:48 PM
 To: [EMAIL PROTECTED]
 Subject: Re: Looking for transcription or transliteration 
 standards latin- arabic
 
 
 
  Jul 2, 2004 11:17 AM ?Chris Harvey 
 
  Perhaps one could think of Ha Tinh as the English word 
 for the city,
  like Rome (English) for Roma (Italian), or Tokyo (English) for 
  Tōkyō (English transliteration of Japanese), or Kahnawake 
  (English/French) for Kahnawà:ke (Mohawk).
 
 Or Peking for Běijīng.  :-)

Or Constantinople for Istanbul.  :-)

Jony

 
 
 John H. Jenkins
 [EMAIL PROTECTED]
 [EMAIL PROTECTED]
 http://homepage.mac.com/jhjenkins/
 
 
 
 
 





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Mike Ayers






 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Chris Harvey
 Sent: Friday, July 02, 2004 11:17 AM


 Perhaps one could think of Ha Tinh as the English word for 
 the city, like Rome (English) for Roma (Italian), or 
 Tokyo (English) for Tōkyō (English transliteration of 


 Tōkyō is not an English transliteration of Japanese, as it uses diacritics not found in English. The correct English transliteration is in fact Tokyo, which does not round trip.

 Japanese), or Kahnawake (English/French) for Kahnawà:ke 


 Errr - didn't the English/French usage predate the Mohawk alphabet? Pretty perverse case there.


 (Mohawk). In these and many other cases, place-names as used 
 in foreign languages should not be considered transliterations, 
 but linguistic borrowings, where pronunciation and spelling 
 are often changed in the new language.


 In part you are correct, but this really only holds where the place name gets enough usage to develop its own name in the other language. Most famous places (Paris, New York, et al.) have language-specific names in most languages, but lesser-known ones such as Ha Tinh are unlikely to have such names.

 On the other hand, maybe Ha Tinh is just lazy typography.


 From National Geographic? Medoubts. This is a deliberate removal of the diacritics unfamiliar to English readers, and is a traditional way to present foreign words. If we're going to categorize trans-thingies, I think this deserves its own category, but since it's all relative and vague, I'm not terribly concerned. Mostly I just wondered if it did fit in anywhere. Seems it doesn't.


/|/|ike





RE: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Chris Harvey

  Tōkyō is not an English transliteration of Japanese, as it uses diacritics not 
  found in English.  The correct English transliteration is in fact Tokyo, which 
  does not round trip.

My mistake, I meant Latin/Roman transliteration. 

  or Kahnawake (English/French) for Kahnawà:ke
 Errr - didn't the English/French usage predate the Mohawk alphabet?  Pretty perverse 
 case there. 

Not as such. The previous English/French spelling of the community was Caughnawaga, 
pronounced in the local English as [kgnwg]. As society has changed somewhat, 
there has been a trend for Canadian society to go back to using the original Native 
names (which the Native people have been using all along). So what happened was, the 
government looked at the way the Mohawk name was already spelled in Mohawk, 
Kahnawà:ke [khnwke], and modified it to suit English/French orthographical 
practice. My point here was that the Mohawk language uses a grave accent and long 
vowel marker, which are discarded in English and French. Today, the local English 
speakers still by and large call the town Caughnawaga, but the English speakers call 
the golf course (which uses the new name) [knwki]. So for people living in that 
part of Québec, you could say that the word Kahnawake is treated like Paris.

Chris Harvey
languagegeek.com





[totally OT] Mohawk, Re: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Patrick Andries
Mike Ayers a écrit :
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Behalf Of Chris Harvey
 Sent: Friday, July 02, 2004 11:17 AM
 Perhaps one could think of Ha Tinh as the English word for
 the city, like Rome (English) for Roma (Italian), or
 Tokyo (English) for Tōkyō (English transliteration of
Tōkyō is not an English transliteration of Japanese, as it 
uses diacritics not found in English.  The correct English 
transliteration is in fact Tokyo, which does not round trip.

 Japanese), or Kahnawake (English/French) for Kahnawà:ke
Errr - didn't the English/French usage predate the Mohawk 
alphabet?  Pretty perverse case there.

Yes, the Mohawk alphabet certainly postdates the French transcriptions.
Just a few pieces of information about Mohawk (Agnier in its traditional 
French form) names around Montreal (Kanesatake North Shore, Kahnawake 
South Shore) :

   1) Heard one of the Mohawk leaders speak on the radio the other day 
and he pronounced the K of  Kanesatake as Kansatgu for my French ear, 
which seems to be validated by the old French spelling Canessedage 
(first attested in 1695), the name was first used apparently when the 
Agniers found refuge at the foot of Mont Royal on Montréal Island, then 
already occupied by the French for quite a time, before the Sulpicians 
moved them to another area outside Montreal. The French adopted Oka (an 
Algonquian name, if I recall properly) to designate the same place the 
Mohawk named Kanesatake.

   2) As far as Kahnawake is concerned the settlement occurred again 
while the French had settled the area (long story but the small group of 
Mohawk that had converted to Catholicism and found refuge around 
Montreal went through several settlements before settling in Kahnawake), 
at the same time the priests and French settlers that accompanied the 
Mohawk called the place (now Kahnawake) Saint-François-Xavier-du-Sault 
or simply Le Sault. In Mohawk (agnier) the present-day Kahnawake was 
successively called Kahnawake (« au rapide », "by the rapids"), in 
1676, Kahnawakon (« dans le rapide », "in the rapids"), in 1690, 
Kanatakwenke (« d'où on est parti », "whence we left"), in 1696 and 
Caughnawaga, in 1716 and many other spellings thereafter until 1980 when 
Kahnawake was chosen as the official spelling.

P. A.



Re: Looking for transcription or transliteration standards latin- arabic

2004-07-02 Thread Patrick Andries
Jony Rosenne wrote:

  

-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of John H. Jenkins



Peking for Běijīng.  :-)



Or Constantinople for Istanbul.  :-)

Two very different political realities (before and after 1453). Cities
change names without going through transliterations, cf. Berlin
(Ontario) becoming Kitchener in 1916.

In any case, it is Istamboul and Pékin.

P. A.





Hausa: Boko-Ajami? (RE: Looking for transcription or transliteration standards latin- arabic)

2004-07-02 Thread Donald Z. Osborn
I've read selected messages in this thread (on Unicode list) and some messages
bring to mind the thought of developing routines or standards to permit
toggling back and forth between standard Latin and Arabic transcriptions for
the same language, such as between the Boko and Ajami writing of Hausa. (Same
applies to any two or three transcription systems used for particular
languages.)

One of the benefits of ICT is, theoretically anyway, that one can have text both
(all) ways. Which would mean that the user has options, people using
alternative systems are not excluded, and the society does not have to debate a
decision of which writing system to use, etc.

Because there is generally not a 1-to-1 character correspondence in spellings in
different transcriptions, I wonder if you don't end up having to consider
something that operates a bit like machine translation, analyzing the context
of words in cases where transcription of a word in one system could be
transliterated into something misspelled or taken as more than one word in the
other system. Necessarily, I think, such routines would have to be
language-specific.
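
To make the concern concrete, here is a minimal Python sketch of how one might flag words whose conversion cannot be decided letter by letter. It assumes a purely hypothetical inventory of Latin-side units (`UNITS`) in which a digraph and its component letters are all valid, and the function names `segmentations` and `needs_context` are invented for this illustration; none of this reflects actual Hausa Boko or Ajami orthography.

```python
# Hypothetical Latin-side units of some transcription system. Digraphs and
# their component letters are all valid units, which is what creates the
# ambiguity discussed above.
UNITS = {"s", "h", "sh", "t", "th", "k", "kh", "a", "i", "u"}


def segmentations(word):
    """Return every way to split `word` into units taken from UNITS."""
    if not word:
        return [[]]  # exactly one way to split the empty string
    results = []
    for size in (1, 2):
        head = word[:size]
        # Only accept a full-length head that is a known unit.
        if len(head) == size and head in UNITS:
            for rest in segmentations(word[size:]):
                results.append([head] + rest)
    return results


def needs_context(word):
    """True if more than one reading exists, so a lookup table alone
    cannot pick the right spelling in the other script."""
    return len(segmentations(word)) > 1
```

With this inventory, `segmentations("sha")` yields both `["s", "h", "a"]` and `["sh", "a"]`, so `needs_context("sha")` is true, while `needs_context("kita")` is false. A real routine would then need a dictionary or some other language-specific knowledge to choose between the readings.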

Any feedback would be appreciated. TIA...

Don Osborn
Bisharat.net








Re: Looking for transcription or transliteration standards latin-arabic

2004-07-01 Thread Anto'nio Martins-Tuva'lkin
On 2004.06.30, 18:56, Jorg Knappen [EMAIL PROTECTED] wrote:

 Are there standards for transcribing or transliterating western
 languages written in Latin to Arabic?

A real transliteration should work both ways, shouldn't it?

(I managed to deeply shock a former KGB bureaucrat when applying for a
Russian residence permit by spelling my sixth brother's name, Henrique
as ...)

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Nao me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ so me invejo de quem bebe|
http://pagina.de/bandeiras/  a agua em todas as fontes|




Re: Looking for transcription or transliteration standards latin-arabic

2004-07-01 Thread Mark Davis
When we looked into this, the problem we found is that there are many
standards.

We ended up with the following in ICU (see
http://oss.software.ibm.com/cgi-bin/icu/tr for a demo,
http://oss.software.ibm.com/icu/userguide/Transform.html for descriptions).
I believe that we followed the UNGEGN conventions, with added accents to
support round-tripping. Note that while we have the ability to have
different transliterations for different languages, or for variant
transliterations, we have not added any as yet. (The '' means 'transforms
into' below).

  '.';
  ',';
  ',';
  ';';
  '?';
  '%';
  0;
  1;
  2;
  3;
  4;
  5;
  6;
  7;
  8;
  9;
  0;
  1;
  2;
  3;
  4;
  5;
  6;
  7;
  8;
  9;
  a;
  u;
  i;
  th;
  dh;
  sh;
  s;
  d;
  t;
  z;
  gh;
  t;
  zh;
  ng;
  v;
  y;
  ;
  a;
  b;
  t;
  j;
  h;
  kh;
  d;
  r;
  z;
  s;
  ;
  ;
  f;
  q;
  k;
  l;
  m;
  n;
  h;
  w;
  y;
  y;
  a;
  u;
  i;
  a;
  u;
  i;
  ;
  ;
  ;
  ;
  ;
  p;
  ch;
  v;
  g;
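
As a rough illustration of why plain digraphs alone do not round-trip, and hence why some extra mark such as an accent is needed, here is a minimal Python sketch. It is not the ICU rule set: the table is a small illustrative subset of commonly used letter-to-Latin correspondences, and the names `TO_LATIN`, `TO_ARABIC`, `to_latin`, `to_arabic` and `round_trips` are invented for this example.

```python
# Illustrative subset only; NOT the ICU Arabic-Latin rules.
TO_LATIN = {
    "\u0628": "b",   # ARABIC LETTER BEH
    "\u062A": "t",   # ARABIC LETTER TEH
    "\u062B": "th",  # ARABIC LETTER THEH
    "\u062F": "d",   # ARABIC LETTER DAL
    "\u0630": "dh",  # ARABIC LETTER THAL
    "\u0633": "s",   # ARABIC LETTER SEEN
    "\u0634": "sh",  # ARABIC LETTER SHEEN
    "\u0643": "k",   # ARABIC LETTER KAF
    "\u062E": "kh",  # ARABIC LETTER KHAH
    "\u0647": "h",   # ARABIC LETTER HEH
    "\u0644": "l",   # ARABIC LETTER LAM
    "\u0645": "m",   # ARABIC LETTER MEEM
    "\u0646": "n",   # ARABIC LETTER NOON
}
TO_ARABIC = {latin: arabic for arabic, latin in TO_LATIN.items()}


def to_latin(text):
    """Map each Arabic letter to its Latin value; pass others through."""
    return "".join(TO_LATIN.get(ch, ch) for ch in text)


def to_arabic(text):
    """Reverse mapping, trying two-letter units before single letters."""
    out, i = [], 0
    while i < len(text):
        for size in (2, 1):
            chunk = text[i:i + size]
            if chunk in TO_ARABIC:
                out.append(TO_ARABIC[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # unknown character: copy unchanged
            i += 1
    return "".join(out)


def round_trips(arabic):
    """True if Arabic -> Latin -> Arabic returns the original string."""
    return to_arabic(to_latin(arabic)) == arabic
```

Here `round_trips("\u0634\u0645\u0633")` (sheen, meem, seen) holds, but `round_trips("\u062A\u0647")` (teh followed by heh) does not: both letters come out as "th", which the reverse pass reads back as the single letter theh. Giving each Arabic letter a distinct Latin form, for example by adding an accent or a separator, removes that collision.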


Mark







Re: Looking for transcription or transliteration standards latin-arabic

2004-07-01 Thread Anto'nio Martins-Tuva'lkin
On 2004.07.01, 18:06, Mark Davis [EMAIL PROTECTED] wrote:

 different transliterations for different languages,
 
Strictly speaking, transliterations are between two given scripts, the
language being transparent -- I mean *real* transliterating from, say,
Greek to Latin, uses the same rules for the Iliad as for Cypriot or
Greek phone books or license plates...

--.
António MARTINS-Tuválkin |  ()|
[EMAIL PROTECTED]||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




Looking for transcription or transliteration standards latin-arabic

2004-06-30 Thread Jörg Knappen
Are there standards for transcribing or transliterating western languages 
written in Latin to Arabic? I am specifically interested in 
German-Arabic, but English-Arabic and French-Arabic are of interest, 
too.

--Jorg Knappen