Re: Collation (was RE: [OT] o-circumflex)

2001-09-15 Thread Christopher JS Vance

On Thu, Sep 13, 2001 at 12:40:30AM -0700, Edward Cherlin wrote:
: For example,
: 
: 1984 (Nineteen Eighty Four)
: 1066 and all that (Ten Sixty Six)
: 3001 (Three Thousand One)
: 2050 (Twenty Fifty)
: 2010 (Twenty Ten)
: 2001, A Space Odyssey (Two Thousand One)

You're missing the "and" from 3001 and 2001.  I know Merkins often
leave it out, but a number of us always use it and feel it's wrong
without it.  :-)

Putting dialect aside, you may find that 2050 and possibly 2010 will
be said "two thousand (and) whatever".

The problem here is that there's no single way to spell out numbers in
English, so no single way to alphabetise.  It's better to sort numbers
numerically, and then you only have to decide the order for negative
numbers.
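
A minimal sketch of the numeric approach, in Java. The class and method names are made up for illustration, and it assumes the number sits at the very start of the title and fits in a long. Negative numbers simply sort below zero here, which is one possible answer to the ordering question, not the only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericTitleSort {
    // Optional minus sign followed by digits at the very start of the title.
    private static final Pattern LEADING_NUMBER = Pattern.compile("^-?\\d+");

    /** Returns the leading integer of a title, or Long.MAX_VALUE if there is none. */
    static long leadingNumber(String title) {
        Matcher m = LEADING_NUMBER.matcher(title);
        return m.find() ? Long.parseLong(m.group()) : Long.MAX_VALUE;
    }

    /** Sorts by numeric value first, then by plain string order as a tie-breaker. */
    static List<String> sortByLeadingNumber(List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort((a, b) -> {
            int byNumber = Long.compare(leadingNumber(a), leadingNumber(b));
            return byNumber != 0 ? byNumber : a.compareTo(b);
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("1984", "1066 and all that", "3001",
                "2050", "2010", "2001, A Space Odyssey");
        sortByLeadingNumber(titles).forEach(System.out::println);
    }
}
```

With the six titles from the quoted list this puts "1066 and all that" first and "3001" last, whichever way the numbers are pronounced.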

-- 
Christopher Vance




Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread Edward Cherlin

English and several other languages have dozens of collations. Compare telephone 
books, library catalogs, book indexes (sic), and other sorted data. Knuth, vol. 3, 
Sorting and Searching, gives an example of a set of library sorting rules that runs to 
more than a page, and suggests programming it as an exercise. ;-) One of the rules is 
to spell out numbers. 
For example,

1984 (Nineteen Eighty Four)
1066 and all that (Ten Sixty Six)
3001 (Three Thousand One)
2050 (Twenty Fifty)
2010 (Twenty Ten)
2001, A Space Odyssey (Two Thousand One)

Bell Labs invented a whole programming language, Snobol, to deal with telephone 
listing conversions, matches, and sorts. Many phone books sort Mc- and Mac- together, 
others one after the other but separate from other names.

Edward Cherlin
Generalist
A knot! Oh, do let me help to undo it. 
Alice in Wonderland


 -Original Message-
 Behalf Of Michael (michka) Kaplan
 Sent: Mon, September 10, 2001 8:36 AM
 From: Mark Davis [EMAIL PROTECTED]
 
  Michael, that isn't the point. There is a problem even when you stick to
  one language.


 By that time, many languages may have TWO collations, since users have been
 expecting something else for the last few decades?
 
 MichKa
 
 Michael Kaplan
 Trigeminal Software, Inc.
 http://www.trigeminal.com/
 
 
 





Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread てんどうりゅうじ
Whoever invented English number words, then, had a very sick sense of humour. Why 
doesn't the word for "one" start with "a", the word for "two" with "b", etc.?


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town







Re: Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread David Gallardo

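Java's collation class has a rule-based collator that is in effect
programmable using a little language. Here is an example from Sun's API
doc for Norwegian:

// Needs java.text.RuleBasedCollator; the constructor throws java.text.ParseException.
String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
                   "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
                   "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
                   "< \u00E5=a\u030A,\u00C5=A\u030A" +      // å/Å, identical to a/A + combining ring above
                   ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";  // aa/AA as variants of å, then æ/Æ and ø/Ø
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);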

There is also syntax for things such as specifying reverse order (for French
accents for example), contraction and expansion.
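
As a rough illustration of the contraction syntax, here is a small sketch. It assumes that Collator.getInstance(Locale.ENGLISH) returns a RuleBasedCollator (true on the JDKs I know of, but not guaranteed by the API) and appends one extra rule so that "ch" sorts as a single letter after "h", along the lines of the Slovak case discussed in this thread. The class name and the word list are made up for the example.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ContractionRuleDemo {
    /** Default English rules plus one contraction: "ch" as its own letter after "h". */
    static RuleBasedCollator chAfterH() {
        // Assumption: the default collator is rule-based, so its rules can be reused.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        String extra = "& H < ch, cH, Ch, CH";
        try {
            return new RuleBasedCollator(base.getRules() + extra);
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rule", e);
        }
    }

    /** Returns a sorted copy of the words using the tailored collator. */
    static List<String> sort(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(chAfterH());
        return sorted;
    }

    public static void main(String[] args) {
        // Illustrative words only; "chata" should land after "hrad".
        sort(List.of("hrad", "chata", "cesta", "idea")).forEach(System.out::println);
    }
}
```

Appending to the default rules rather than writing a full rule string keeps the sketch short; a real tailoring would normally start from the locale's own rules.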

- David Gallardo









Re: Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread Mark Davis

In the latest ICU, we took the work we did for Java collation and extended
it substantially (and made it many times faster). It also allows arbitrary
customization at runtime.

I happen to be giving a presentation on it in a few hours at the conference.
For more information, see the draft collation chapter in the User guide, at
http://oss.software.ibm.com/icu/. The presentation (a slightly older draft)
is on my site at www.macchiato.com

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]









Re: Alternative sorting for digraphs (Was Re: [OT] o-circumflex)

2001-09-13 Thread Roozbeh Pournader

On Mon, 10 Sep 2001, Mark Davis wrote:

 A ZWNJ will break ligatures and cursive connections. While probably safe in
 Danish or Dutch, it is unclear to me that that is safe in all languages
 where this situation occurs. There are digraphs in Urdu, for example. While
 I don't know their sorting order, if they do sort separately then ZWNJ can't
 be used to express the alternative sorting, since it would give the wrong
 rendering.

:'-(

I would like to ask that we stop overusing ZWNJ. I once loved that
character... What about *renaming* the character to Zero Width
All-Purpose Everything Breaker?

roozbeh





Re: [OT] o-circumflex

2001-09-11 Thread Stefan Persson

- Original Message -
From: Keld Jørn Simonsen [EMAIL PROTECTED]
To: Stefan Persson [EMAIL PROTECTED]
Cc: Mark Davis [EMAIL PROTECTED]; Michael (michka) Kaplan
[EMAIL PROTECTED]; Keld Jørn Simonsen [EMAIL PROTECTED];
[EMAIL PROTECTED]
Sent: den 10 september 2001 22:12
Subject: Re: [OT] o-circumflex


 Where is this done for swedish? I have read both the TN and the SIS
 standard, and I dont believe these say something on sorting
 ü according to either German or Dutch sounds. Rolf Gavare does not
 say something along this either, as far as I can remember.

This is the sorting used in dictionaries, encyclopædias, phone books etc.
For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts
myskoxe/müsli/mysning.

Stefan







Re: [OT] o-circumflex

2001-09-11 Thread Stefan Persson

- Original Message -
From: Lars Marius Garshol [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: den 10 september 2001 22:45
Subject: Re: [OT] o-circumflex


 I am not sure of this, but I think 'å' is a relatively modern
 invention, and that it was originally written only as 'aa'.

FYI, "a relatively modern invention" means that it has been in use since the
Middle Ages (in Swedish).

Stefan







Re: [OT] o-circumflex

2001-09-11 Thread Keld Jørn Simonsen

On Tue, Sep 11, 2001 at 06:27:20PM +0200, Stefan Persson wrote:
 - Original Message -
 From: Keld Jørn Simonsen [EMAIL PROTECTED]
 To: Stefan Persson [EMAIL PROTECTED]
 Cc: Mark Davis [EMAIL PROTECTED]; Michael (michka) Kaplan
 [EMAIL PROTECTED]; Keld Jørn Simonsen [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Sent: den 10 september 2001 22:12
 Subject: Re: [OT] o-circumflex
 
 
  Where is this done for swedish? I have read both the TN and the SIS
  standard, and I dont believe these say something on sorting
  ü according to either German or Dutch sounds. Rolf Gavare does not
  say something along this either, as far as I can remember.
 
 This is the sorting used in dictionnaries, encyclopædias, phone books etc.
 For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts
 myskoxe/müsli/mysning.

Yes, I can understand that. In Danish we have the same rule.
But do you have examples of Dutch words
that are ordered in another way? That is, cases where you need to know the
origin of the word in order to sort it?

Kind regards
keld




Re: [OT] o-circumflex

2001-09-11 Thread Lars Marius Garshol


* Lars Marius Garshol
|
| I am not sure of this, but I think 'å' is a relatively modern
| invention, and that it was originally written only as 'aa'.

* Stefan Persson
| 
| FYI, a relatively modern invention means that is has been used
| since the Medieval (in Swedish).

I don't think that is the case in Norwegian and Danish. The Norwegian
constitution from 1814, for example, uses 'ø' and 'æ', but never 'å'.
Possibly this was a Swedish invention only adopted later by the Danes
and Norwegians.

--Lars M.





RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

John Cowan wrote:
 None of which is as weird as Leghorn for Livorno (Italy).

It's as weird as some Italian names for German cities: Aquisgrana for
Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for
München.

_ Marco




RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Carl W. Brown wrote:
 In Arabic do you include vowels or not?

Yes, and also consonants sometimes...

Traditional Arabic dictionary sorting uses the three-letter root (radical)
of a word as the primary key.  So, madrasa (school) would be under d
(because its radical is d-r-s = to learn), ignoring the ma- prefix.
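
To make the idea concrete, here is a toy sketch in Java. The word-to-root table is entirely hypothetical, uses Latin transliteration, and compares roots in plain Latin order; a real dictionary would need proper morphological data and the Arabic alphabetical order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RootKeySort {
    // Hypothetical lookup table: transliterated word -> three-letter root.
    static final Map<String, String> ROOTS = Map.of(
            "madrasa", "drs",
            "dars", "drs",
            "kitab", "ktb",
            "maktab", "ktb");

    /** Returns the root used as primary key, or the word itself if the root is unknown. */
    static String root(String word) {
        return ROOTS.getOrDefault(word, word);
    }

    /** Sorts by root first, then by the word itself within one root. */
    static List<String> sortByRoot(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> {
            int byRoot = root(a).compareTo(root(b));
            return byRoot != 0 ? byRoot : a.compareTo(b);
        });
        return sorted;
    }

    public static void main(String[] args) {
        sortByRoot(List.of("maktab", "madrasa", "kitab", "dars")).forEach(System.out::println);
    }
}
```

With this table, madrasa ends up under d together with dars, ahead of everything filed under k-t-b.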

I doubt, however, that this system is used with automatic sort orders
generated by computers.

_ Marco




RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Asmus Freytag wrote:
 But if you do this, all compound words starting with data 
 and continuing 
 with another word starting with a will be sorted incorrectly!
 
 To achieve this effect, you would have to mark which AAs are 
 A-Rings and which ones are accidental adjacencies. In Danish
 one can use the SHY (soft hyphen) [...]

Real-life sort orders often ignore these subtleties and are often based on a
small set of rules which is applied blindly, regardless of the origin,
meaning, or pronunciation of headwords.

For instance, I have noticed that Dutch telephone directories always sort
the sequence ij as if it were y, regardless of whether it actually occurs in a
Dutch word.  E.g., Beijing Chinese Restaurant would be listed after Mr. Bex.
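
A blind rule of this kind is easy to sketch as a sort key. The Java below is only an illustration under my own assumptions: the class name and the sample names (other than Bex and the restaurant) are invented, and the key simply lower-cases the entry and replaces every "ij" with "y".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class DutchIjKey {
    /** Sort key that treats every "ij" as "y", whatever the origin of the word. */
    static String key(String name) {
        return name.toLowerCase(Locale.ROOT).replace("ij", "y");
    }

    /** Returns a copy of the names sorted by the blind ij-as-y key. */
    static List<String> sort(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparing(DutchIjKey::key));
        return sorted;
    }

    public static void main(String[] args) {
        sort(List.of("Bezemer", "Beijing Chinese Restaurant", "Bex", "Beek"))
                .forEach(System.out::println);
    }
}
```

Under this key the restaurant is filed as "beying ..." and so comes after Bex and before Bezemer.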

Similarly, old Italian encyclopedias (e.g. Dizionario Enciclopedico Treccani)
equated j to i because, in Italian, the former is just a graphic variant
of the latter.  But this also applied to foreign names such as Jefferson
(which was listed between iee- and ieg-), even though, of course, the
spelling Iefferson would not be allowed.

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Michael Everson

At 18:04 +0200 2001-09-09, Stefan Persson wrote:

   well, the official spelling of the town is Aalborg.

In Sweden it has always been written Ålborg.

At one stage, in both countries, it was written Álaborg, I suspect, 
as it is in Iceland today.
-- 
Michael Everson




Re: [OT] o-circumflex

2001-09-10 Thread Michael Everson

At 18:10 -0400 2001-09-09, John Cowan wrote:
Keld Jørn Simonsen scripsit:

  Yes, foreigners call our cities many strange things:-)
  København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
   and many more.

In Iceland it is Kaupmannahöfn, I believe. In unadorned English that 
would be something like Cheapmenshaven, maybe to weaken as 
Cheapenhaven, in German Kaufenhagen
-- 
Michael Everson




Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
 Asmus Freytag wrote:
  But if you do this, all compound words starting with data 
  and continuing 
  with another word starting with a will be sorted incorrectly!
  
  To achieve this effect, you would have to mark which AAs are 
  A-Rings and which ones are accidental adjacencies. In Danish
  one can use the SHY (soft hyphen) [...]
 
 Real-life sort orders often ignore these subtleties and are often based on a
 small set of rules which is applied blindly, regardless of the origin,
 meaning, or pronunciation of headwords.
 

Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere
to these Danish rules, once you have set up your machine for Danish.

Kind regards
Keld




Re: [OT] o-circumflex

2001-09-10 Thread Michael (michka) Kaplan

From: Keld Jørn Simonsen [EMAIL PROTECTED]

 Real-life sorts, like MS Windows sorting or Linux sorting, actually
adheres
 to these Danish rules, once you have set up your machine for Danish.

And this is the *true* answer to the whole mess of attempting *multilingual*
sorts -- once the user chooses the sort they WANT, the system might handle
other language strings in a way that might be obscure to those who know the
other language but the person who expected Danish or whatever will see what
they want.

Since various sorts openly conflict with each other there is no other
general case solution which would be appropriate, anyway?

(can't believe this thread is still going on!)


MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

 On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
  Asmus Freytag wrote:
   But if you do this, all compound words starting with data 
   and continuing 
   with another word starting with a will be sorted incorrectly!
   
   To achieve this effect, you would have to mark which AAs are 
   A-Rings and which ones are accidental adjacencies. In Danish
   one can use the SHY (soft hyphen) [...]
  
  Real-life sort orders often ignore these subtleties and are 
 often based on a
  small set of rules which is applied blindly, regardless of 
 the origin,
  meaning, or pronunciation of headwords.
  
 
 Real-life sorts, like MS Windows sorting or Linux sorting, 
 actually adheres
 to these Danish rules, once you have set up your machine for Danish.

If I understand what you mean, perhaps my point was not clear.

I know that aa sorts like å, and that it should go after z.  But there
are also cases when the sequence aa is just two a's, adjacent to each
other by pure chance.

One of these cases could be the word dataarkiv, which I found in a Danish
web page
(http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Now: if your Windows or Linux collation states (correctly!) that aa
should go after z, you may have a list ordered like this:

Order A:
1. data
2. Datben, Dr. Keld
3. Datz, Mr. Marco
4. dataarkiv
5. Datåz, Dr. Asmus

But if dataarkiv was written using an invisible separator between the two
a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be
like this:

Order B:
1. data
2. dataarkiv
3. Datben, Dr. Keld
4. Datz, Mr. Marco
5. Datåz, Dr. Asmus

Asmus was arguing that List B would be the correct one (and this is
certainly true on, e.g., a dictionary) but, in order to obtain it, the
source text must be properly encoded with invisible separators inserted
where needed.
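
The two orders can be reproduced with a small Java sketch. It is only an approximation under stated assumptions: it takes the default English rules (assuming the default collator is a RuleBasedCollator), adds a rule putting the contraction aa after z, leaves out the å entry to keep the rule ASCII-only, and uses ZWNJ (U+200C) as the invisible separator. The class name is made up.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DanishAaDemo {
    static final String ZWNJ = "\u200C"; // zero width non-joiner

    /** Default English rules plus "aa" as a single letter after z. */
    static RuleBasedCollator aaAfterZ() {
        // Assumption: the default collator is rule-based, so its rules can be reused.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        try {
            return new RuleBasedCollator(base.getRules() + "& Z < aa, aA, Aa, AA");
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rule", e);
        }
    }

    /** Returns a sorted copy of the words using the tailored collator. */
    static List<String> sort(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(aaAfterZ());
        return sorted;
    }

    public static void main(String[] args) {
        // Order A: "aa" in dataarkiv is read as one letter and goes after z.
        System.out.println(sort(List.of("dataarkiv", "Datz", "data", "Datben")));
        // Order B: the separator keeps the two a's apart, so the word stays next to data.
        System.out.println(sort(List.of("data" + ZWNJ + "arkiv", "Datz", "data", "Datben")));
    }
}
```

The collator is the same in both runs; only the encoding of the data differs.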

What I was saying is that the automatic Order A is also often used, and I
brought the example of the Dutch phone directories (where Beijing is
sorted as if it was Beying), and of the Italian encyclopedia (where
Jefferson is sorted as if it was Iefferson).

Michael (michka) Kaplan wrote:
 And this is the *true* answer to the whole mess of attempting 
 *multilingual* sorts -- once the user chooses the sort they
 WANT, the system might handle other language strings in a
 way that might be obscure to those who know the other
 language but the person who expected Danish or whatever 
 will see what they want.

And this is precisely what I was trying to say, although I was not
necessarily talking about multilingual sort (dataarkiv seems a purely
Danish word, although derived from Latin roots).

For some users and some usages, the incorrect Order A may be much more
useful than the correct Order B.  If the rule says that ij goes between
x and z, a Dutchman should find the Beijing Restaurant between bex-
and bez-.

If someone wants Order B (as may be the case for the author of a
dictionary), then they should apply Asmus' suggestion in order to drive the
collation algorithm.

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Marcin 'Qrczak' Kowalczyk

Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti [EMAIL PROTECTED] pisze:

 It's as weird as some Italian names for German cities: Aquisgrana
 for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
 Baviera) for München.

Interesting that Polish names of these cities are more like Italian
than German: Akwizgran, Augsburg, Moguncja, Monachium.

Ko/benhavn is Kopenhaga, again more like other foreign forms than
Danish.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

On Mon, Sep 10, 2001 at 03:58:05PM +0200, Marco Cimarosti wrote:
  On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
   Asmus Freytag wrote:
But if you do this, all compound words starting with data 
and continuing 
with another word starting with a will be sorted incorrectly!

To achieve this effect, you would have to mark which AAs are 
A-Rings and which ones are accidental adjacencies. In Danish
one can use the SHY (soft hyphen) [...]
   
   Real-life sort orders often ignore these subtleties and are 
  often based on a
   small set of rules which is applied blindly, regardless of 
  the origin,
   meaning, or pronunciation of headwords.
   
  
  Real-life sorts, like MS Windows sorting or Linux sorting, 
  actually adheres
  to these Danish rules, once you have set up your machine for Danish.
 
 If I understand what you mean, perhaps my point was not clear.

My point was that real-life sorts nowadays are quite sophisticated,
and the major systems have adequate sorting for Danish and other
languages with that kind of complexity.

 I know that aa sorts like å, and that it should go after z.  But there
 are also cases when the sequence aa is just two a's, adjacent to each
 other by pure chance.
 
 One of these cases could be the word dataarkiv, which I found in a Danish
 web page
 (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Yes, and ekstraarbejde - extra work. I know.

 Now: if your Windows or Linux collations states (correctly!) that aa
 should go after z, you may have a list ordered like this:
 
   Order A:
   1. data
   2. Datben, Dr. Keld
   3. Datz, Mr. Marco
   4. dataarkiv
   5. Datåz, Dr. Asmus
 
 But if dataarkiv was written using an invisible separator between the two
 a's (e.g. a soft hyphen, or a zero width non joiner), the your list would be
 like this:
 
   Order B:
   1. data
   2. dataarkiv
   3. Datben, Dr. Keld
   4. Datz, Mr. Marco
   5. Datåz, Dr. Asmus
 
 Asmus was arguing that List B would be the correct one (and this is
 certainly true on, e.g., a dictionary) but, in order to obtain it, the
 source text must be properly encoded with invisible separators inserted
 where needed.

Yes, that is also my advice.

 What I was saying is that the automatic Order A is also often used, and I
 brought the example of the Dutch phone directories (where Beijing is
 sorted as if it was Beying), and of the Italian encyclopedia (where
 Jefferson is sorted as if it was Iefferson).

You have to sort it according to the expectations of the user.
A Dutch book would use Dutch rules, an Italian book would use
the Italian order. You cannot mix orderings, such that some words follow
one set of rules, and other words follow other rules. It all needs
to be comprehended by one human, the reader, and there only one ruleset
applies.

 
 Michael (michka) Kaplan wrote:
  And this is the *true* answer to the whole mess of attempting 
  *multilingual* sorts -- once the user chooses the sort they
  WANT, the system might handle other language strings in a
  way that might be obscure to those who know the other
  language but the person who expected Danish or whatever 
  will see what they want.
 
 And this is precisely what I was trying to say, although I was not
 necessarily talking about multilingual sort (dataarkiv seems a purely
 Danish word, although derived from Latin roots).
 
 For some users and some usages, the incorrect Order B may be much more
 useful than the correct Order A.  If the rules says that ij goes between
 x and z, a Dutchman should find the Beijing Restaurant between bex-
 and bez-.
 
 If someone wants Order A (as may be the case for the author of a
 dictionary), then they should apply Asmus' suggestion in order to drive the
 collation algorithm.

I think we agree, but what you call a simple set of rules I call quite complex.
I also think that the Danish rules are quite simple, as they can be formulated
in, say, 4 lines of Danish prose. But compared to ASCII sorting they are, to some
people, unbelievably complex, and I think many Danes believe that you cannot get
programs that adhere to them, although the major systems do that out of the box.

Your incorrect and correct examples use the very same sorting algorithm; the only
difference is that the data is coded differently.

But maybe you are driving for a yet more complex sorting, one that can sort
according to multiple rules? Beijing should then not be sorted as Beÿing?
As stated above I think - and other sorting experts too - that sorting
with multiple rules is a conceptual misunderstanding.

Kind regards
Keld




Re: [OT] o-circumflex

2001-09-10 Thread Michael (michka) Kaplan

From: Mark Davis [EMAIL PROTECTED]

 Michael, that isn't the point. There is a problem even when you stick to
one
 language.

 That is, there are situations where two letters in a language, e.g. ch
in
 Slovak, are normally sorted as one. However, in some exceptional
 circumstances those letters should be sorted separated. It could be
because
 they come originally from another language, or it could be because they
 happen to arise when two other words are conjoined. There is no
algorithmic
 distinction. So without some special character, it would require a
 dictionary look-up to produce the right sort

I would argue that most users of the language are not expecting this type of
thing, and that when they are looking for a word, this might be the
SECOND place they look, not the first.

There are exceptions, but they do not outnumber the general case, by
any means.

 For example, suppose that th were sorted separately in English, after Z.
 Yet people would expect the following order:

 cast
 cathouse
 caul
 cathode

 because the t and h are logically separate in cathouse.

Again, I think most people would look first in the place that does not
assume the exception -- the computer's original limitations have trained
them. The notion of a natural language processing engine that would have all
of the specific differences (with appropriate dictionaries for exceptions to
even the NLP results) is a fascinating notion, but one that no one is even
close to, yet.

We do not even have available UCA tailorings for most of the world's
languages. Though I have high hopes for the future (if not in the UCA then
in other mechanisms).

By that time, many languages may have TWO collations, since users have been
expecting something else for the last few decades?

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: [OT] o-circumflex

2001-09-10 Thread Mark Davis

Michael, that isn't the point. There is a problem even when you stick to one
language.

That is, there are situations where two letters in a language, e.g. ch in
Slovak, are normally sorted as one. However, in some exceptional
circumstances those letters should be sorted separately. It could be because
they come originally from another language, or it could be because they
happen to arise when two other words are conjoined. There is no algorithmic
distinction. So without some special character, it would require a
dictionary look-up to produce the right sort.

For example, suppose that th were sorted separately in English, after Z.
Yet people would expect the following order:

cast
cathouse
caul
cathode

because the t and h are logically separate in cathouse.

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]









Re: [OT] o-circumflex

2001-09-10 Thread John Wilcock

On Mon, 10 Sep 2001 16:42:45 +0200, Keld Jørn Simonsen wrote:
 But maybe you are driving for a yet more complex sorting, one that can sort
 according to multiple rules? Beijing should then not be sorted as Beÿing?

I haven't followed this discussion from the beginning, so apologies if
I'm missing the point, but it seems to me that the Beijing case in
Dutch is no different from the ekstraarbejde case in Danish - a SHY or
ZWNJ is all that is needed to stop Beijing sorting with Bey. 


John.

-- 
-- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/
-- Translate your technical documents and web pages- http://www.tradoc.fr/




Alternative sorting for digraphs (Was Re: [OT] o-circumflex)

2001-09-10 Thread Mark Davis

A SHY will mean that the word can break at Bei-
jing. It is not clear to me at least that that is safe in all cases for all
languages with digraphs that sort separately, although it may be a solution
for some.

A ZWNJ will break ligatures and cursive connections. While probably safe in
Danish or Dutch, it is unclear to me that that is safe in all languages
where this situation occurs. There are digraphs in Urdu, for example. While
I don't know their sorting order, if they do sort separately then ZWNJ can't
be used to express the alternative sorting, since it would give the wrong
rendering.

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]







RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

John Wilcock wrote:
 I haven't followed this discussion from the beginning, so apologies if
 I'm missing the point, but it seems to me that the Beijing case in
 Dutch is no different from the ekstraarbejde case in Danish - a SHY or
 ZWNJ is all that is needed to stop Beijing sorting with Bey. 

Yes, it is exactly the same thing.

But my point is that a Dutch reader probably *does* expect Beijing to sort
like Bey, not like Bei.  So, in some cases, a correct (i.e., expected)
behavior could rather be to *remove* all SHY/ZWNJ's before sorting.
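
Stripping the separators before building the sort key is a one-liner. A minimal Java sketch, with a made-up class name, that removes only SHY (U+00AD) and ZWNJ (U+200C):

```java
final class SeparatorStripper {
    static final String SOFT_HYPHEN = "\u00AD";
    static final String ZWNJ = "\u200C";

    /** Removes soft hyphens and zero width non-joiners; everything else is kept. */
    static String strip(String s) {
        return s.replace(SOFT_HYPHEN, "").replace(ZWNJ, "");
    }

    public static void main(String[] args) {
        // The stripped form is what would be handed to the collator or key function.
        System.out.println(strip("Bei" + SOFT_HYPHEN + "jing").equals("Beijing"));
    }
}
```

Whether to strip or to honour the separators is then a choice made per application, not something baked into the text.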

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread てんどうりゅうじ

If they can't agree on the pronunciation for these cities, can they agree on the Hanzi 
for them?
What ARE the Hanzi for these cities, anyway??

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED];
To: [EMAIL PROTECTED];
Cc: 
Date: 01/09/10 14:02
Subject: Re: [OT] o-circumflex

Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti [EMAIL PROTECTED] pisze:

 It's as weird as some Italian names for German cities: Aquisgrana
 for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
 Baviera) for München.

Interesting that Polish names of these cities are more like Italian
than German: Akwizgran, Augsburg, Moguncja, Monachium.

Ko/benhavn is Kopenhaga, again more like other foreign forms than
Danish.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: [OT] o-circumflex

2001-09-10 Thread Stefan Persson

- Original Message -
From: Marco Cimarosti [EMAIL PROTECTED]
To: 'John Wilcock' [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: den 10 september 2001 18:35
Subject: RE: [OT] o-circumflex


 John Wilcock wrote:
  I haven't followed this discussion from the beginning, so apologies if
  I'm missing the point, but it seems to me that the Beijing case in
  Dutch is no different from the ekstraarbejde case in Danish - a SHY or
  ZWNJ is all that is needed to stop Beijing sorting with Bey.

 Yes, it is exactly the same thing.

 But my point is that a Dutch reader probably *does* expect Beijing to sort
 like Bey, not like Bei.  So, in some cases, a correct (i.e., expected)
 behavior could rather be to *remove* all SHY/ZWNJ's before sorting.

I thought ij sorted after z?


_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com





RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Stefan Persson wrote:
 I thought ij sorted after z?

Not in Dutch: as far as I have seen, it sorts the same as y.  In fact, in
the telephone directory many people who had a y in their surname were listed
near people who had the same surname spelled with ij (e.g. Meyer and
Meijer).

(Anyway, next time they send me to Holland, I'll ask for a downtown hotel.
So, after dinner, I'll go sightseeing rather than spending the whole evening
looking at the collation of the phone directory:-)

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Stefan Persson

There is a similar problem with Swedish:

Our alphabet goes:

a
...
u
v & w (no difference made)
x
y
z
å
ä (the Danish/Norwegian æ is also sorted as ä)
ö (the Danish/Norwegian ø is also sorted as ö)

The German character ü is pronounced as a Swedish y, so when any
German name or loan word containing that character occurs in Swedish it
should be sorted as y. However, if any ü occurs in a Dutch loan word it
is considered as a u with umlaut and is sorted as u.

The same goes for ä and ö: If they are the Swedish/Finnish/German
letters ä and ö they are sorted after å, if they are the Dutch letters
a with umlaut and o with umlaut, they're sorted as a and o in a
Swedish encyclopædia.

In Swedish the Danish/Norwegian letter æ is sorted as ä, while the
Latin/Icelandic letter æ is sorted as ae.
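
The rules above can be sketched as a sort key in Python. This is only an
illustration under stated assumptions: the names are made up, the origin of
a ü has to be passed in by the caller (it cannot be deduced from the string
itself), and the Latin/Icelandic æ case is not handled.

```python
import unicodedata

SWEDISH_ORDER = "abcdefghijklmnopqrstuvxyzåäö"   # 'w' is folded into 'v'
RANK = {letter: i for i, letter in enumerate(SWEDISH_ORDER)}
# Danish/Norwegian letters are folded onto their Swedish counterparts.
FOLD = {"w": "v", "æ": "ä", "ø": "ö"}

def swedish_key(word, u_umlaut_as="u"):
    """Toy key; pass u_umlaut_as="y" for German names and loan words."""
    text = unicodedata.normalize("NFC", word).lower()
    key = []
    for ch in text:
        if ch == "ü":
            ch = u_umlaut_as
        ch = FOLD.get(ch, ch)
        # Characters outside the alphabet sort after every letter.
        key.append(RANK.get(ch, len(RANK)))
    return key

words = ["öl", "zebra", "ål", "äpple", "apa"]
print(sorted(words, key=swedish_key))
```

The extra parameter is the whole problem in miniature: the same character
needs a different weight depending on information that is not in the text.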

Stefan

- Original Message -
From: Mark Davis [EMAIL PROTECTED]
To: Michael (michka) Kaplan [EMAIL PROTECTED]; Keld Jørn Simonsen
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: den 10 september 2001 17:27
Subject: Re: [OT] o-circumflex


 Michael, that isn't the point. There is a problem even when you stick to
one
 language.

 That is, there are situations where two letters in a language, e.g. ch
in
 Slovak, are normally sorted as one. However, in some exceptional
 circumstances those letters should be sorted separated. It could be
because
 they come originally from another language, or it could be because they
 happen to arise when two other words are conjoined. There is no
algorithmic
 distinction. So without some special character, it would require a
 dictionary look-up to produce the right sort

 For example, suppose that th were sorted separately in English, after Z.
 Yet people would expect the following order:

 cast
 cathouse
 caul
 cathode

 because the t and h are logically separate in cathouse.

 Mark
 —

 Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
 [http://www.macchiato.com]
 - Original Message -
 From: Michael (michka) Kaplan [EMAIL PROTECTED]
 To: Keld Jørn Simonsen [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Sent: Monday, September 10, 2001 5:48 AM
 Subject: Re: [OT] o-circumflex


  From: Keld Jørn Simonsen [EMAIL PROTECTED]
 
   Real-life sorts, like MS Windows sorting or Linux sorting, actually
  adheres
   to these Danish rules, once you have set up your machine for Danish.
 
  And this is the *true* answer to the whole mess of attempting
 *multilingual*
  sorts -- once the user chooses the sort they WANT, the system might
handle
  other language strings in a way that might be obscure to those who know
 the
  other language but the person who expected Danish or whatever will see
 what
  they want.
 
  Since various sorts openly conflict with each other there is no other
  general case solution which would be appropriate, anyway?
 
  (can't believe this thread is still going on!)
 
 
  MichKa
 
  Michael Kaplan
  Trigeminal Software, Inc.
  http://www.trigeminal.com/
 
 
 
 








Re: [OT] o-circumflex

2001-09-10 Thread Thomas Chan

On Mon, 10 Sep 2001, てんどうりゅうじ wrote:

 If they can't agree on the pronunciation for these cities, can they
 agree on the Hanzi for them? What ARE the Hanzi for these cities,
 anyway??

Are you asking for the names of cities in Chinese?  Copenhagen is
ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839.  The Han characters used to write
the names of cities depends on many factors, including but not
limited to source spelling/pronunciation, language/dialect of the
rendering party, mapping rules used by the renderer, time period, etc.
For example, New York is rendered in Chinese as Mandarin niu3yue4
\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
Japanese it was at one time rendered as \u7d10\u80b2, lit.
'button-rearing'.  Asking for the hanzi (from your wording, I don't
think you are just talking about Chinese usage of Han characters) is like
asking for a single Latin script rendering.

(I think you need to get yourself an English-Chinese dictionary or
something, btw...)


Thomas Chan
[EMAIL PROTECTED]






Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

Where is this done for Swedish? I have read both the TN and the SIS
standard, and I don't believe these say anything on sorting 
ü according to either German or Dutch sounds. Rolf Gavare does not
say anything along these lines either, as far as I can remember.

Kind regards
keld

On Mon, Sep 10, 2001 at 07:09:34PM +0200, Stefan Persson wrote:
 There is a similar problem with Swedish:
 
 Our alphabet goes:
 
 a
 ...
 u
 v & w (no difference made)
 x
 y
 z
 å
 ä (the Danish/Norwegian æ is also sorted as ä)
 ö (the Danish/Norwegian ø is also sorted as ö)
 
 The German character ü is pronounced as a Swedish y, so when any
 German name or loan word containing that character occurs in Swedish it
 should be sorted as y. However, if any ü occurs in a Dutch loan word it
 is considered as a u with umlaut and is sorted as u.
 
 The same goes for ä and ö: If they are the Swedish/Finnish/German
 letters ä and ö they are sorted after å, if they are the Dutch letters
 a with umlaut and o with umlaut, they're sorted as a and o in a
 Swedish encyclopædia.
 
 In Swedish the Danish/Norwegian letter æ is sorted as ä, while the
 Latin/Icelandic letter æ is sorted as ae.
 
 Stefan




Re: [OT] o-circumflex

2001-09-10 Thread てんどうりゅうじ

I hate this sort:
Club Mix 2000
Club Mix 98
Club Mix 99

Those non Y2K compliant fools!
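
For what it's worth, the usual remedy is a "natural" sort that compares
digit runs as numbers instead of character by character. A minimal Python
sketch (the function name is my own, and it says nothing about what a
two-digit year is supposed to mean):

```python
import re

def natural_key(title):
    """Split into text and digit runs; compare the digit runs as integers."""
    parts = re.split(r"(\d+)", title)
    # re.split with one capturing group alternates text, digits, text, ...
    # so odd positions are always digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

titles = ["Club Mix 2000", "Club Mix 98", "Club Mix 99"]
print(sorted(titles))                   # plain code-point order
print(sorted(titles, key=natural_key))  # 98, 99, 2000
```

This happens to give year order here only because 98 and 99 are smaller
than 2000; a later "Club Mix 01" would still land at the front.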


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: Stefan Persson [EMAIL PROTECTED];
To: Mark Davis [EMAIL PROTECTED];"Michael (michka) Kaplan" 
[EMAIL PROTECTED];Keld Jørn Simonsen [EMAIL PROTECTED];[EMAIL PROTECTED];
Cc: 
Date: 01/09/10 17:09
Subject: Re: [OT] o-circumflex

There is a similar problem with Swedish:

Our alphabet goes:

a
...
u
v & w (no difference made)
x
y
z
å
ä (the Danish/Norwegian "æ" is also sorted as "ä")
ö (the Danish/Norwegian "ø" is also sorted as "ö")

The German character "ü" is pronounced as a Swedish "y," so when any
German name or loan word containing that character occurs in Swedish it
should be sorted as "y." However, if any "ü" occurs in a Dutch loan word it
is considered as a "u" with umlaut and is sorted as "u."

The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
letters "ä" and "ö" they are sorted after "å," if they are the Dutch letters
"a" with umlaut and "o" with umlaut, they're sorted as "a" and "o" in a
Swedish encyclopædia.

In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
Latin/Icelandic letter "æ" is sorted as "ae."

Stefan

- Original Message -
From: "Mark Davis" [EMAIL PROTECTED]
To: "Michael (michka) Kaplan" [EMAIL PROTECTED]; "Keld Jørn Simonsen"
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: den 10 september 2001 17:27
Subject: Re: [OT] o-circumflex


 Michael, that isn't the point. There is a problem even when you stick to
one
 language.

 That is, there are situations where two letters in a language, e.g. "ch"
in
 Slovak, are normally sorted as one. However, in some exceptional
 circumstances those letters should be sorted separated. It could be
because
 they come originally from another language, or it could be because they
 happen to arise when two other words are conjoined. There is no
algorithmic
 distinction. So without some special character, it would require a
 dictionary look-up to produce the right sort

 For example, suppose that "th" were sorted separately in English, after Z.
 Yet people would expect the following order:

 cast
 cathouse
 caul
 cathode

 because the "t" and "h" are logically separate in "cathouse".

 Mark
 —

 Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
 [http://www.macchiato.com]
 - Original Message -
 From: "Michael (michka) Kaplan" [EMAIL PROTECTED]
 To: "Keld Jørn Simonsen" [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Sent: Monday, September 10, 2001 5:48 AM
 Subject: Re: [OT] o-circumflex


  From: "Keld Jørn Simonsen" [EMAIL PROTECTED]
 
   Real-life sorts, like MS Windows sorting or Linux sorting, actually
  adheres
   to these Danish rules, once you have set up your machine for Danish.
 
  And this is the *true* answer to the whole mess of attempting
 *multilingual*
  sorts -- once the user chooses the sort they WANT, the system might
handle
  other language strings in a way that might be obscure to those who know
 the
  other language but the person who expected Danish or whatever will see
 what
  they want.
 
  Since various sorts openly conflict with each other there is no other
  general case solution which would be appropriate, anyway?
 
  (can't believe this thread is still going on!)
 
 
  MichKa
 
  Michael Kaplan
  Trigeminal Software, Inc.
  http://www.trigeminal.com/
 
 
 
 








Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Carl W. Brown
| 
| You are quite correct that is why Unicode support differing
| collation strengths.  Some times you only care about the actual
| letters without diacritics.  But even then letters are locale
| sensitive.  For example the Danish alphabet starts with an A and
| ends it with A ring above.  A Dane would look for Alborg near the
| end of a list of towns.

This example doesn't apply to this discussion, since Danes and
Norwegians consider Å to be a separate letter. That is, it is not A
with ring above, but Å, which is not related to A any more than E is
related to F.

What J. M. Sykes writes about the lack of established sort orders
seems right to me. I've done consulting work for Norwegian
encyclopedia publishers, which involved developing their sorting
routines. The orders for the different publishers did differ, and it
is not so surprising given that there are a number of cases to
consider, such as how to sort diacritics, what to consider as
diacritics, how to sort numbers, Roman numerals, ordinals, and
whatnot.

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Francesco Zappa Nardelli
| 
| I was in Aalborg fifteen days ago, and I have seen its name written
| both as Ålborg and as Aalborg.  Where does Aalborg appear in a list
| of towns?

At the end.

In both Danish and Norwegian 'aa' and 'å' are considered equivalent.
I am not sure of this, but I think 'å' is a relatively modern
invention, and that it was originally written only as 'aa'. 
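
As a rough illustration of that equivalence, here is a toy Python sort key.
The alphabet string and function name are assumptions of mine, and the
blanket replacement of every 'aa' is exactly the naive rule that goes wrong
for compounds such as dataarkiv.

```python
DANISH_ORDER = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: i for i, letter in enumerate(DANISH_ORDER)}

def danish_key(word):
    """Toy key: every 'aa' is treated as 'å', which sorts last."""
    text = word.lower().replace("aa", "å")
    # Characters outside the alphabet sort after every letter.
    return [RANK.get(ch, len(RANK)) for ch in text]

towns = ["Ålborg", "Odense", "Aalborg", "Århus", "Esbjerg"]
print(sorted(towns, key=danish_key))
# Ålborg and Aalborg get the same key, so they end up side by side
# after Odense and before Århus.
```

Python's sort is stable, so two spellings with identical keys simply keep
their input order relative to each other.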

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Jonathan Rosenne
|
| This is not always the right thing to do. For example, with personal
| names the person involved may decide whether he prefers the old (AA)
| spelling or the new Å. In any case they are equivalent.

This is true, but this is nothing particular to the aa/å distinction.
Many given names have a number of possible spellings, such as Astri /
Astrid, Cathrine / Katrine / Kathrine, Wenche / Venke / Venche, Espen
/ Esben, ...   In fact, given names which can be written both aa and å
are rare. I can only think of Åge offhand, and that is only rarely
written Aage in Norway (and the other way round in Denmark).

AA/Å confusion is much more common in surnames, but there there is no
choice involved.

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Keld Jørn Simonsen
|
| Yes, foreigners call our cities many strange things:-) København is
| called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more.

* Michael Everson
| 
| In Iceland it is Kaupmannahöfn, I believe. In unadorned English that
| would be something like Cheapmenshaven, maybe to weaken as
| Cheapenhaven, in German Kaufenhagen

Which makes eminent sense, given that København by this logic would
translate as Cheapenhaven. (Your German translation should be
Kaufmannshagen, I guess, to become Kaufenhagen when translated from
København.)

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Marco Cimarosti
| 
| One of these cases could be the word dataarkiv, which I found in a Danish
| web page
| (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Uh, no, you found it in a Norwegian web page. The word is the same in
Danish, though.
 
|   Order B:
|   1. data
|   2. dataarkiv
|   3. Datben, Dr. Keld
|   4. Datz, Mr. Marco
|   5. Datåz, Dr. Asmus
| 
| Asmus was arguing that List B would be the correct one (and this is
| certainly true on, e.g., a dictionary) but, in order to obtain it, the
| source text must be properly encoded with invisible separators inserted
| where needed.

Not necessarily. One solution I've seen automatically generated sort
keys from the headwords, but allowed users to adjust them where
necessary. I think users are likely to favour this solution if given a
choice. 

Of course, it depends on how important it is to get the sorting
right, and what importance the headwords have within the system
whether this solution is feasible or not. In a phone directory I guess
nobody would use it.
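
The approach can be sketched in a few lines of Python. Everything here is
hypothetical (the names, the override table, the default rule); it is meant
to show the shape of the solution, not the system I worked on.

```python
DANISH_ORDER = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: i for i, letter in enumerate(DANISH_ORDER)}

def auto_sort_string(headword):
    """Default rule: fold case and treat every 'aa' as 'å'."""
    return headword.lower().replace("aa", "å")

# Hypothetical editor-supplied corrections, keyed by headword.
# Here the two a's of the compound are kept as separate letters.
OVERRIDES = {"dataarkiv": "dataarkiv"}

def sort_key(headword, overrides=OVERRIDES):
    text = overrides.get(headword, auto_sort_string(headword))
    # Unknown characters (spaces, punctuation) sort before the letters.
    return [RANK.get(ch, -1) for ch in text]

headwords = ["Datåz", "dataarkiv", "Datz", "data", "Datben"]
print(sorted(headwords, key=sort_key))
print(sorted(headwords, key=lambda h: sort_key(h, overrides={})))
```

With the override, dataarkiv sorts directly after data, as in Order B;
without it, the automatic rule moves it down among the å entries.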
 
| And this is precisely what I was trying to say, although I was not
| necessarily talking about multilingual sort (dataarkiv seems a purely
| Danish word, although derived from Latin roots).

It's a simple concatenation of the words for 'computing' (data) and
'archive' (arkiv), meaning any electronic archive. 

This kind of construction is very common in Norwegian and Danish,
leading speakers to invent all kinds of strange new words when writing
English[1], and the Swedes to joke that we call bananas 'yellowbends'.
 
--Lars M.

[1] And, conversely, after learning English, to split apart words that
God meant us to write without spaces in them. It really ann oys to
see people write in that incon venient way.





Re: [OT] o-circumflex

2001-09-10 Thread Peter_Constable


On 09/10/2001 07:48:05 AM Michael \(michka\) Kaplan wrote:

(can't believe this thread is still going on!)

I just wanted to know about how Francophones perceive certain graphemes,
and I got that answer a long time ago.



Peter





Re: [OT] o-circumflex

2001-09-10 Thread Juliusz Chroboczek

 It's as weird as some Italian names for German cities: Aquisgrana
 for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
 Baviera) for München.

MK Interesting that Polish names of these cities are more like Italian
MK than German: Akwizgran, Augsburg, Moguncja, Monachium.

Because they're adaptations of the mediaeval Latin names.

The same is true of historically important Polish cities, by the way:
Varsovie, Cracovie in French, Varsavia, Cracovia in Italian.  English
uses the German names instead (Warsaw, Cracow).

Juliusz




RE: [OT] o-circumflex

2001-09-10 Thread Otmar Permentier

Marco,

When you're in Holland you may want to check some dictionaries too. You'll notice in 
dictionaries 'ij' is considered to consist of two letters 'i' and 'j', so the word 
'ijs' sorts between 'iets' and 'ik'.
You're right the PTT doesn't make the distinction between 'ij' and 'y', so in the 
phone book 'Meyer' and 'Meijer' are indeed near each other. I suspected they would at 
least first list all Meijers, then all Meyers, but when I just checked they appeared 
to be intermingled. On closer inspection it turned out the Meijers and Meyers are 
further sorted by street name! 
By the way, in crossword puzzles and the like, 'ij' always occupies one box (but isn't 
considered the same as 'y' I believe)
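
In Python terms, the two conventions look roughly like this. The function
names and the street names are made up for the example; the real PTT rules
are surely more involved.

```python
def dictionary_key(word):
    """Dictionary convention: 'ij' is simply 'i' followed by 'j'."""
    return word.lower()

def phonebook_key(surname, street):
    """Phone-book convention: 'ij' equals 'y', then order by street."""
    return (surname.lower().replace("ij", "y"), street.lower())

print(sorted(["ik", "ijs", "iets"], key=dictionary_key))

entries = [
    ("Meyer", "Dorpsstraat"),
    ("Meijer", "Kerkstraat"),
    ("Meijer", "Akkerweg"),
    ("Meyer", "Beatrixlaan"),
]
for surname, street in sorted(entries, key=lambda e: phonebook_key(*e)):
    print(surname, street)
```

Because both spellings produce the same primary key, the street decides the
order, which is why the Meijers and Meyers look intermingled.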

Regards,

Otmar Permentier

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Marco Cimarosti
 Sent: maandag 10 september 2001 19:59
 To: 'Stefan Persson'; 'John Wilcock'; [EMAIL PROTECTED]
 Subject: RE: [OT] o-circumflex
 
 
 Stefan Persson wrote:
  I thought ij sorted after z?
 
 Not in Dutch: as far as I have seen it sorts the same as y.  In fact, in
 the telephone directory many people who had an y in their surname listed
 near people who had the same surname spelled with ij (e.g. Meyer and
 Meijer).
 
 (Anyway, next time they send me to Holland, I'll ask for a downtown hotel.
 So, after dinner, I'll go sightseeing rather than spending the 
 whole evening
 looking at the collation of the phone directory:-)
 
 _ Marco
 
 





Re: [OT] o-circumflex

2001-09-10 Thread てんどうりゅうじ

AAARRRGGHHH

I give up!

I was hoping that there is SOME system that would give these cities UNIQUE names... 
postal codes???


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: Thomas Chan [EMAIL PROTECTED];
To: [EMAIL PROTECTED];
Cc: 
Date: 01/09/10 19:59
Subject: Re: [OT] o-circumflex

On Mon, 10 Sep 2001, てんどうりゅうじ wrote:

 If they can't agree on the pronunciation for these cities, can they
 agree on the Hanzi for them? What ARE the Hanzi for these cities,
 anyway??

Are you asking for the names of cities in Chinese?  Copenhagen is
ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839.  The Han characters used to write
the names of cities depends on many factors, including but not
limited to source spelling/pronunciation, language/dialect of the
rendering party, mapping rules used by the renderer, time period, etc.
For example, New York is rendered in Chinese as Mandarin niu3yue4
\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
Japanese it was at one time rendered as \u7d10\u80b2, lit.
'button-rearing'.  Asking for the "hanzi" (from your wording, I don't
think you are just talking about Chinese usage of Han characters) is like
asking for a single Latin script rendering.

(I think you need to get yourself an English-Chinese dictionary or
something, btw...)


Thomas Chan
[EMAIL PROTECTED]






Re: [OT] o-circumflex

2001-09-10 Thread Kenneth Whistler

Wy OT by now...

 AAARRRGGHHH
 
 I give up!
 
 I was hoping that there is SOME system that would give these cities UNIQUE names... 
postal codes???

Ain't reality a bitch?

What you're looking for doesn't exist in the world of natural language
names -- it can only exist in artificially constructed global
geographic databases, where people may have assigned unique keys
to cities. And even there, the geographic experts are going to
argue over the exact meaning of terms. Is Los Angeles the
incorporated city presided over by the mayor or does it include
all the other small cities that Los Angeles surrounds and engulfs,
or does it include unincorporated parts of Los Angeles county,
or does it refer to Greater Los Angeles, the metropolitan area,
or is it related to Los Angeles county?

Not such a simple distinction, sometimes. San Francisco is
a city *and* a county, and the mayor of the city is also mayor
of the county. The mayor of New York is mayor of half a
dozen boroughs, the moral equivalent of counties.

Is Stonyford, California (population 150), a city? It isn't
incorporated as a city, or even a town, but it is an independent
geographic location that occurs as a town on maps. Where do
you draw the line between named localities and cities? Do you
depend on legally incorporated city status? But what if the
laws don't match up between different countries? How am I going
to know that cities in Burkina Faso match the same criteria
I use to designate cities in the United States or Japan?

Some cities have multiple postal codes, and some postal codes
cover multiple cities. And while postal codes are subject to
international treaty, how countries divide their territories
up and use the codes is still up to them.

--Ken





Re: [OT] o-circumflex/Spanish sorting

2001-09-09 Thread Tex Texin

David,
I also don't know if the other countries have academies, but my
understanding is Latin American countries haven't accepted the modern
sort. Having said that, there is a lot of software that does not
implement the traditional sort, so acceptance is moot.
(The reason the Real Academia Española did away with the sorting of ch
and ll is that a majority of software wasn't implementing sorts that
way.)

tex

David Gallardo wrote:
 
 Hi -
 
 I know the Real Academia Española decided to do away with ch and ll in
 1994, but do you know if the other Spanish speaking countries' corresponding
 academies done the same?
 
 - David Gallardo

-- 
-
Tex Texin                   Director, International Business
mailto:[EMAIL PROTECTED]    Tel: +1-781-280-4271
the Progress Company        Fax: +1-781-280-4655
-




Re: [OT] o-circumflex

2001-09-09 Thread Keld Jørn Simonsen

On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote:
 Asmus,
 
 If you are entering Danish city names then enter it as Ålborg.  You should
 only use Aalborg where the font does not support Å.  For matching logic you
 can equate Å to Aa then the issue of compound words goes away.

Well, the official spelling of the town is Aalborg.

Keld




Re: [OT] o-circumflex

2001-09-09 Thread Stefan Persson

- Original Message -
From: Keld Jørn Simonsen [EMAIL PROTECTED]
To: Carl W. Brown [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: den 9 september 2001 14:21
Subject: Re: [OT] o-circumflex


 On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote:
  Asmus,
 
  If you are entering Danish city names then enter it as Ålborg.  You
should
  only use Aalborg where the font does not support Å.  For matching logic
you
  can equate Å to Aa then the issue of compound words goes away.

 well, the official spelling of the town is Aalborg.

In Sweden it has always been written Ålborg.







Re: [OT] o-circumflex/Spanish sorting

2001-09-09 Thread David Gallardo

I received a private email stating that ch and ll were abolished by
the 10th Congress of the 12 academies of the various Spanish speaking
countries in 1994, not just the RAE.  (There are, in addition to the
obvious, also academies for Puerto Rico, North America and the Philippines.)

However, it was also my understanding that the modern sort wasn't accepted
outside of Spain, but it's never been clear to me if this is just a matter
of popular or academic opinion, or if there has been formal resistance as
well.

Now I wonder if the various academies have the same authority in their
country that the Royal Academy has in Spain, or if there are other national
standards bodies with which they compete or cooperate.

- David Gallardo

- Original Message -
From: Tex Texin [EMAIL PROTECTED]
To: David Gallardo [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Sunday, September 09, 2001 2:15 AM
Subject: Re: [OT] o-circumflex/Spanish sorting


 David,
 I also don't know if the other countries have academies, but my
 understanding is Latin American countries haven't accepted the modern
 sort. Having said that, there is a lot of software that does not
 implement the traditional sort, so acceptance is moot.
 (The reason the Real Academia Española did away with the sorting of ch
 and ll is that a majority of software wasn't implementing sorts that
 way.)

 tex

 David Gallardo wrote:
 
  Hi -
 
  I know the Real Academia Española decided to do away with ch and ll
in
  1994, but do you know if the other Spanish speaking countries'
corresponding
  academies done the same?
 
  - David Gallardo

 --
 -
 Tex Texin                   Director, International Business
 mailto:[EMAIL PROTECTED]    Tel: +1-781-280-4271
 the Progress Company        Fax: +1-781-280-4655
 -






Re: [OT] o-circumflex

2001-09-09 Thread Keld Jørn Simonsen

On Sun, Sep 09, 2001 at 06:04:30PM +0200, Stefan Persson wrote:
 - Original Message -
 From: Keld Jørn Simonsen [EMAIL PROTECTED]
 To: Carl W. Brown [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Sent: den 9 september 2001 14:21
 Subject: Re: [OT] o-circumflex
 
 
  On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote:
   Asmus,
  
   If you are entering Danish city names then enter it as Ålborg.  You
 should
   only use Aalborg where the font does not support Å.  For matching logic
 you
   can equate Å to Aa then the issue of compound words goes away.
 
  well, the official spelling of the town is Aalborg.
 
 In Sweden it has always been written Ålborg.

Yes, foreigners call our cities many strange things:-)
København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
and many more. Helsingør is called Elsinore. 
Well, Ålborg is sometimes spelled Ålborg, but the official spelling, as
defined by zip and postal addresses is 9100 Aalborg, and the kommune is called
Aalborg kommune, viz www.aalborg.dk . 

Århus is however almost always spelled Århus in Danish.

Kind regards
Keld




Re: [OT] o-circumflex

2001-09-09 Thread John Cowan

Keld Jørn Simonsen scripsit:

 Yes, foreigners call our cities many strange things:-)
 København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
 and many more. Helsingør is called Elsinore. 

None of which is as weird as Leghorn for Livorno (Italy).

-- 
John Cowan   http://www.ccil.org/~cowan  [EMAIL PROTECTED]
Please leave your values|   Check your assumptions.  In fact,
   at the front desk.   |  check your assumptions at the door.
 --sign in Paris hotel  |--Miles Vorkosigan




Re: [OT] o-circumflex

2001-09-08 Thread DougEwell2

In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  You are quite correct that is why Unicode support differing collation
  strengths.  Some times you only care about the actual letters without
  diacritics.  But even then letters are locale sensitive.  For example the
  Danish alphabet starts with an A and ends it with A ring above.  A Dane
  would look for Alborg near the end of a list of towns.  It is like having
  the Spanish ch follow cz.

That would be Ålborg, right?

I hasten to add that Carl's Spanish example is for the so-called traditional 
sort, in contrast to the modern sort in which ch sorts simply as c 
followed by h.  In many Spanish-speaking communities, particularly here in 
Alta California, the simplified modern sort is by far the more common of 
the two.
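
For anyone who wants to see the difference, here is a small Python sketch
of both orders. The alphabets and helper names are my own assumptions, and
accented vowels are deliberately left out to keep it short.

```python
import re

MODERN = list("abcdefghijklmnñopqrstuvwxyz")
# Traditional order: "ch" and "ll" are letters in their own right.
TRADITIONAL = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
               + list("mnñopqrstuvwxyz"))

def make_key(alphabet):
    """Build a sort-key function for the given list of letters."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    digraphs = [letter for letter in alphabet if len(letter) > 1]
    # Try the digraphs first, then fall back to any single character.
    tokenizer = re.compile("|".join(digraphs + ["."]), re.DOTALL)

    def key(word):
        tokens = tokenizer.findall(word.lower())
        # Unknown characters sort after every letter.
        return [rank.get(tok, len(rank)) for tok in tokens]

    return key

words = ["chico", "cola", "cuna", "llama", "lobo", "luz"]
print(sorted(words, key=make_key(MODERN)))
print(sorted(words, key=make_key(TRADITIONAL)))
```

In the traditional order chico follows cuna and llama follows luz; in the
modern order they sort as plain c + h and l + l.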

-Doug Ewell
 Fullerton, California




RE: [OT] o-circumflex

2001-09-08 Thread Carl W. Brown

Doug,

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of [EMAIL PROTECTED]
 Sent: Friday, September 07, 2001 10:52 PM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: [OT] o-circumflex


 In a message dated 2001-09-07 17:19:49 Pacific Daylight Time,
 [EMAIL PROTECTED] writes:

   You are quite correct that is why Unicode support differing collation
   strengths.  Some times you only care about the actual letters without
   diacritics.  But even then letters are locale sensitive.  For
 example the
   Danish alphabet starts with an A and ends it with A ring above.  A Dane
   would look for Alborg near the end of a list of towns.  It is
 like having
   the Spanish ch follow cz.

 That would be Ålborg, right?

That is right.  I am concerned that not everyone can view special
characters.  I think that having an alphabet that goes from A to Å must be
due to the Danish sense of humor.

I also did not use the İ in İstanbul.


 I hasten to add that Carl's Spanish example is for the so-called
 traditional
 sort, in contrast to the modern sort in which ch sorts simply as c
 followed by h.  In many Spanish-speaking communities,
 particularly here in
 Alta California, the simplified modern sort is by far the more
 common of
 the two.

Again correct; they also use the modern sort here in Muy Alta California, as
does most of the Spanish speaking world.

There are also the differences between ASCII and EBCDIC sorting.  Talk about
people who are worlds apart.  ;-}

Carl W. Brown
Lafayette, CA






Re: [OT] o-circumflex

2001-09-08 Thread Francesco Zappa Nardelli

Hello.

 For example the Danish alphabet starts with an A and ends it with A
 ring above.  A Dane would look for Alborg near the end of a list of
 towns.  

I was in Aalborg fifteen days ago, and I have seen its name written
both as Ålborg and as Aalborg.  Where does Aalborg appear in a list of
towns?

-francesco




Re: [OT] o-circumflex

2001-09-08 Thread Asmus Freytag

At 09:04 PM 9/7/01 -0700, Mark Davis wrote:
I disagree. What you want is a merged database field. See
http://www.macchiato.com/slides/icu_collation.ppt

Mark

Mark,

David took the remainder of our discussion off the alias. I won't repeat it 
here, just to note that we've agreed that merged database fields are the 
answer to (some) of the scenarios that we've discussed, but that there are 
cases (like indexing a mixed corpus where both naive and naïve occur) where 
it might indeed make sense to ignore accent differences altogether - 
although, as is often the case, dictionary-based pre- or post processing or 
manual adjustments might give better results yet.

Thanks for your pointer to the presentation.

A./








Re: [OT] o-circumflex

2001-09-08 Thread Mark Davis

If you use a Danish tailoring of the UCA that equates Å and AA (at least at
a primary and secondary level), then they will sort the same way. A string
search that uses the same tailoring will also find Ålborg when given
Aalborg (and vice versa).
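
A very rough Python sketch of the idea follows. It is not the ICU API and
not the UCA; the fold function is a hand-written stand-in for a primary-level
collation key, and all names are assumptions made for the example.

```python
def primary_fold(text):
    """Stand-in for a primary-strength key: ignore case, equate aa and å."""
    return text.lower().replace("aa", "å")

def find_matches(needle, candidates):
    """Return the candidates that contain the needle under the folding."""
    target = primary_fold(needle)
    return [c for c in candidates if target in primary_fold(c)]

places = ["Ålborg Kommune", "Aalborg Universitet", "Århus"]
print(find_matches("Aalborg", places))
print(find_matches("ålborg", places))
```

Both searches return the same two entries, because sorting and searching
share one folding. A real implementation would compare collation elements
rather than rewritten strings.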

Mark

BTW, internationalized string search is one of the features of ICU 2.0 (see
http://www-124.ibm.com/icu/develop/tasks.html). There are a number of
exceptional cases that have to be handled, due to issues with ignorable
characters, Thai & Lao boundaries, canonical equivalence and contractions
(see
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/searchproposal.html).

—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]







Re: [OT] o-circumflex

2001-09-08 Thread DougEwell2

In a message dated 2001-09-08 12:00:43 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  I know the Real Academia Española decided to do away with ch and ll in
  1994, but do you know if the other Spanish-speaking countries'
  corresponding academies have done the same?

I have no idea.  I don't know which, if any, even have a language academy.

-Doug Ewell
 Fullerton, California




Re: [OT] o-circumflex

2001-09-08 Thread Asmus Freytag

At 02:45 PM 9/8/01 -0700, Mark Davis wrote:
If you use a Danish tailoring of the UCA that equates Å and AA (at least at
a primary and secondary level), then they will sort the same way. A string
search that uses the same tailoring will also find Ålborg when given
Aalborg (and vice versa).

But if you do this, all compound words starting with data and continuing 
with another word starting with a will be sorted incorrectly!

To achieve this effect, you would have to mark which AAs are A-Rings and 
which ones are accidental adjacencies. In Danish one can use the SHY (soft 
hyphen) to break the latter, as these accidental pairs occur at legal word 
break points. In fact, that's the recommended solution, but it requires 
that the input data are in a specific form.
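
A small sketch of how the soft hyphen could work as such a marker, assuming a naive fold that rewrites every "aa" as "å". The `fold_aa` helper and the compound "dataarkiv" are hypothetical examples, not taken from any real tailoring.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN (SHY)

def fold_aa(word):
    """Map 'aa' to 'å', except where a soft hyphen separates the two a's."""
    folded = word.lower().replace("aa", "å")
    # The soft hyphen has done its job; drop it so it does not affect order.
    return folded.replace(SOFT_HYPHEN, "")

print(fold_aa("Aalborg"))                        # genuine A-ring spelling
print(fold_aa("data" + SOFT_HYPHEN + "arkiv"))   # marked compound: left alone
print(fold_aa("dataarkiv"))                      # unmarked compound: wrongly folded
```

The last call shows the failure mode: without the marker in the input data, the accidental pair is indistinguishable from a real AA.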

A./




RE: [OT] o-circumflex

2001-09-08 Thread Carl W. Brown

Asmus,

This discussion reminds me of my ill fated efforts to produce a manageable
set of rules to do automatic title casing starting with French text.  It
would have required either special dictionaries or entering the text in a
special way.  If special text was used, one could enter it in the proper
title case to begin with.

If you are entering Danish city names, then enter it as Ålborg.  You should
only use Aalborg where the font does not support Å.  For matching logic you
can equate Å to Aa; then the issue of compound words goes away.

Carl







RE: [OT] o-circumflex

2001-09-08 Thread Jonathan Rosenne

This is not always the right thing to do. For example, with personal names the
person involved may decide whether he prefers the old (AA) spelling or the new
Å. In any case they are equivalent.

Jony

 








Re: [OT] o-circumflex

2001-09-07 Thread Bertrand Laidain

I would say it is a variant of o we just called it... o with a circumflex
accent (o avec un accent circonflex). The difference between o and ô
is normally audible (for a French speaker). The relationship is the same
than with any other letter which sometimes have accents (e.g. a and à,
e and è, etc.).

o avec un accent circonflexe, with an e at the end. According to the Petit
Robert (a French dictionary), the circumflex marks a long vowel
(e.g. île, from Old French isle) or avoids confusion between two
words (e.g. du and dû). The pronunciation of ô is closed (o fermé),
as opposed to o without an accent. But Thierry is right: it's a letter with an
accent, like à and è, not a distinct grapheme.

Bertrand

The only little thing to know about French and diacritical mark is that when
doing a sort diacritical mark are evaluated from right to left.  (e.g.
cote < côte < coté vs the English order cote < coté < côte ).
Cheers,
Thierry


How do Francophones view the o-circumflex ô in relation to the letter o?
Is it a distinct grapheme, or is it considered a variant of o?
- Peter





RE: [OT] o-circumflex

2001-09-07 Thread James E. Agenbroad

On Thu, 6 Sep 2001, Ayers, Mike wrote:

 
  From: David Starner [mailto:[EMAIL PROTECTED]] 
  Sent: Thursday, September 06, 2001 01:40 PM
 
  On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
   The only little thing to know about French and diacritical 
  mark is that when
   doing a sort diacritical mark are evaluated from right to 
  left.  (e.g.
   cote < côte < coté vs the English order cote < 
  coté < côte ).
  
  I'm not sure there is an established English sort order. It's not a 
  problem that comes up much in English. 
 
   I believe that there is an established sort order in English, which
 is to sort without regard to diacritics, or else we'd never find the words!
 In English (American English more than British English), diacritics are
 considered optional, and it is common to see naïve written naive, San
 José written San Jose, etc.  Especially amongst Americans, the two are
 considered equivalent, and I know of no word pair in all of English which is
 separated only by a diacritic.
 
 Friday, September 7, 2001
Librarians have *filing* rules--the American Library Association (ALA) and
the Library of Congress (LC) each issued some in, I think, 1980.  I
believe they both say to ignore diacritics because Americans do not
recognize that they have an order.  These days filing in vendor software
for libraries tends to follow neither one very closely--the phrase
more honored in the breach than the observance comes to mind.  I may be
wrong but I do not believe there is an established U.S. standard for
sorting/filing.  A few years ago a National Information Standards
Organization (NISO) committee drafted one but it didn't get the
votes needed to become an accepted standard.  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





Re: [OT] o-circumflex

2001-09-07 Thread J M Sykes


 I believe that there is an established sort order in English, which
 is to sort without regard to diacritics, or else we'd never find the
words!
 In English (American English more than British English), diacritics are
 considered optional, and it is common to see naïve written naive, San
 José written San Jose, etc.  Especially amongst Americans, the two are
 considered equivalent, and I know of no word pair in all of English which
is
 separated only by a diacritic.

That depends what you mean by 'established' ;-)

The classic example is 'resume' and 'résumé'. These are, by now, two quite
distinct words, and the fact that there is no 'established' order is shown
by the fact that the New Shorter Oxford English Dictionary (Version: 1.0.4,
Data version: 02.10.96s, January 1997, on disk) has them in the order:
'résumé', 'resume' while the New Oxford Dictionary of English (Clarendon
Press, 1998) has 'resume', 'resumé'. The Concise Oxford Dictionary (of
Current English, Clarendon Press, 1982, edited, as it happens, by a second
cousin of mine) also has 'resume', 'résumé'.

Evidently, we see here evidence that the diacritic on the first 'e' has
become optional since 1982, though not that on the second, presumably
because that 'e' might otherwise be supposed to be silent.

Reverting to the question of order, the 'Guide to the New SOED' (a.k.a. Help)
reveals that:

quote
Entries are accessed in strict alphabetical order. ... ; a headword with an
accent or diacritic over a letter follows one consisting of the same
sequence of letters without. ...

The order of headwords which are spelled the same way but have different
parts of speech is as follows:

noun (abbreviated n.)
pronoun (abbreviated pron.)
adjective (abbreviated a.)
verb (abbreviated v.)
...
/quote

And scrutiny of the two entries of interest reveals that 'résumé' is both a
noun and a verb, whereas  'resume' is only a verb.

Perhaps the ordering of 'résumé' before 'resume' is a mistake; perhaps not.
I can't ask my aforesaid second cousin, because he's no longer with us.

Who'd be a lexicographer?

Mike.

***

J M Sykes  Email: [EMAIL PROTECTED]
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire   SK8 3SN
UK         Tel: (44) 161 437 5413

***






Re: [OT] o-circumflex

2001-09-07 Thread てんどうりゅうじ

There is also no word pair separated only by the I/J distinction (in English), right?

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


 I know of no word pair in all of English which is
 separated only by a diacritic.



Re: [OT] o-circumflex

2001-09-07 Thread てんどうりゅうじ



じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town



Who'd be a lexicographer?


私？ (Me?)











RE: [OT] o-circumflex

2001-09-07 Thread Ayers, Mike


 From: J M Sykes [mailto:[EMAIL PROTECTED]] 
 Sent: Friday, September 07, 2001 07:50 AM

 The classic example is 'resume' and 'résumé'. These are, by 
 now, two quite
 distinct words, and the fact that there is no 'established' 
 order is shown

I spell both resume and have never been corrected.  Words with the
same spelling and different pronunciation are uncommon but exist in English,
the classic example being read and its own past tense.  Since there are no
diacritics in English proper, the two resumes tend to fall into this
category.  The diacritics which often appear on one of them really only
serve to mark it as a loan word, since it is very difficult to come up with
a sentence in which the two could be confused.


/|/|ike




Re: [OT] o-circumflex

2001-09-07 Thread David Gallardo

As a practical matter, you need to take the diacritics into account when
sorting, even in English where they (may or may not) have linguistic
significance, otherwise you'll get nondeterministic behaviour. In other
words, résumé and resume should fall together, but always in the same order.
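
One way to picture this, as a sketch only: the helper names and the word lists below are invented for illustration. With an accent-insensitive key alone, the relative order of résumé and resume depends on the input; adding the original spelling as a tie-breaker makes the result the same every time.

```python
import unicodedata

def strip_accents(word):
    """Remove combining marks, e.g. 'résumé' -> 'resume'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def insensitive_key(word):
    # Accents ignored entirely: 'résumé' and 'resume' compare equal.
    return strip_accents(word)

def tiebreak_key(word):
    # Same primary key, with the original spelling as a tie-breaker.
    return (strip_accents(word), word)

a = ["résumé", "resume", "result"]
b = ["resume", "résumé", "result"]

# The two inputs give different orders for the equal pair...
print(sorted(a, key=insensitive_key), sorted(b, key=insensitive_key))
# ...but the same order once the tie-breaker is part of the key.
print(sorted(a, key=tiebreak_key), sorted(b, key=tiebreak_key))
```

The tie-breaker here is plain code point order, which is an arbitrary choice; the point is only that some fixed secondary rule removes the dependence on input order.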

Someone in another message mentioned ñ. This is a different case in
principle, because in Spanish it's not a case of a letter modified by a
diacritic; it's an entirely different letter. (It used to be written as two
side-by-side ns and then they got stacked.)  Again, as a practical matter, in
English it's most common to ignore the greater distinction (because we
have only 26 letters in our alphabet) and to treat it as a letter +
diacritic for the same considerations as above.







RE: [OT] o-circumflex

2001-09-07 Thread Timothy Greenwood
 There is also no word pair separated only by the I/J 
 distinction (in English), right?

iamb - as in iambic pentameter
jamb - as in a door jamb


RE: [OT] o-circumflex

2001-09-07 Thread Ayers, Mike


 From: David Gallardo [mailto:[EMAIL PROTECTED]] 
 Sent: Friday, September 07, 2001 10:07 AM

 As a practical matter, you need to take the diacritics into 
 account when
 sorting, even in English where they (may or may not) have linguistic
 significance, otherwise you'll get nondeterministic 
 behaviour. In other
 words, résumé and resume should fall together, but always in 
 the same order.

Why?  This may be of interest and benefit to programmers, but not
necessarily to end-users.  The computer should serve the human, not the
other way around, and it is not particularly challenging to come up with
search and sort algorithms which understand the concept of terminal sets
which need to be iterated over to find the final entity as opposed to
terminal entities.  Recall Mike Sykes' post concerning sort order:

MikeS
Reverting to the question of order, the 'Guide to the New SOED' (a.k.a. Help)
reveals that:

quote
Entries are accessed in strict alphabetical order. ... ; a headword with an
accent or diacritic over a letter follows one consisting of the same
sequence of letters without. ...

The order of headwords which are spelled the same way but have different
parts of speech is as follows:

noun (abbreviated n.)
pronoun (abbreviated pron.)
adjective (abbreviated a.)
verb (abbreviated v.)
...
/quote
/MikeS

This explicit ordering will still be insufficient if we choose to
include verb tenses in our word list, whence we get the two reads.  If
someone has a reason why these two words need to be in the same order in
everyone's word list, I'll listen...


/|/|ike




Re: [OT] o-circumflex

2001-09-07 Thread Michael \(michka\) Kaplan

From: David Gallardo [EMAIL PROTECTED]

 As a practical matter, you need to take the diacritics into account when
 sorting, even in English where they (may or may not) have linguistic
 significance, otherwise you'll get nondeterministic behaviour. In other
 words, résumé and resume should fall together, but always in the same
order.

Well, sort of. The issue remains that if one chooses, for a particular
purpose, to ignore case (for example), then there is literally no
difference between Aa and aA. Since the two are considered equivalent in
a case-insensitive comparison, you cannot claim that a sorting algorithm
is in error if it returns one before the other in one run and after it in
another.

For a real-world example, this can happen with algorithms where the bottom
item and the anchor are always reordered if b  a and thus you could see
different ordering of items depending on their placement in the list.

A similar thing happens with accent-insensitive sorts: if you literally
treat ee and éé as identical because the sort is accent-insensitive,
then the ordering is NOT deterministic, nor is it supposed to be. And there
is nothing invalid about non-deterministic ordering of equivalent items:
putting ee before éé in one case and after it in another is a legitimate
result.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






RE: [OT] o-circumflex

2001-09-07 Thread Asmus Freytag

At 11:50 AM 9/7/01 -0500, Ayers, Mike wrote:
Words with the
same spelling and different pronunciation are uncommon but exist in English,
the classic example being read and its own past tense.

Actually, this is a bit more common than you think, since the pronunciation 
of vowels in English depends somewhat systematically on stress, and verb 
and noun forms of many words are stressed differently.

A./




Re: [OT] o-circumflex

2001-09-07 Thread Asmus Freytag

At 01:06 PM 9/7/01 -0400, David Gallardo wrote:
As a practical matter, you need to take the diacritics into account when
sorting, even in English where they (may or may not) have linguistic
significance, otherwise you'll get nondeterministic behaviour. In other
words, résumé and resume should fall together, but always in the same order.

Stated absolutely, this is patent, but oft-repeated nonsense. For example, 
it does not always make sense for lists of names. An old friend of mine, Jon 
Proppe, who is an Icelandic art critic, spells his name with an accent 
grave on the first o and an acute accent on the e. In a campus directory of 
the US university he attended (assuming it did not strip the accents), it 
would make no sense to have his name show up after all the Proppes, or all 
the Jons without an accent (depending on whether it's sorted by first or 
last name).

If I sort a list of single words which contains non-unique entries, a 
stable sort would sort the non-unique subsets in the order of their 
appearance in the input. If it's not important to distinguish between naive 
and naïve (e.g. in a machine-generated index that spans multiple documents 
with differences in the use of accents), it's hard to see what's gained in 
splitting the list in two for this case.

On the other hand, if San Jose and San José are correctly and consistently 
distinguished in my input, they should probably sort separately.

The two cases of resume are different yet again, as noted, since one could 
be a verb form.

It all depends not on whether a distinction can be made, but whether it is 
meaningful in the context of the list being sorted.

A./








RE: [OT] o-circumflex

2001-09-07 Thread Carl W. Brown

Asmus,

You are quite correct; that is why Unicode supports differing collation
strengths.  Sometimes you only care about the actual letters without
diacritics.  But even then letters are locale sensitive.  For example, the
Danish alphabet starts with A and ends with A with ring above (Å).  A Dane
would look for Ålborg near the end of a list of towns.  It is like having
the Spanish ch follow cz.

By providing for different types of collation one can meet the user's
expectations.
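
As a loose illustration of what "strength" means, here is a toy three-level key in Python. It is only a sketch: the `levels`, `collation_key` and `equal_at` helpers are invented names, and the levels are a crude approximation (base letters, then accents, then case), not the real UCA weights.

```python
import unicodedata

def levels(word):
    """Split a word into primary, secondary and tertiary parts."""
    decomposed = unicodedata.normalize("NFD", word)
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]
    primary = letters.lower()                     # base letters only
    secondary = marks                             # accents (positions ignored)
    tertiary = [ch.isupper() for ch in letters]   # case
    return primary, secondary, tertiary

def collation_key(word, strength=3):
    """Keep only the first `strength` levels (1, 2 or 3)."""
    return levels(word)[:strength]

def equal_at(a, b, strength):
    """True if a and b are indistinguishable at the given strength."""
    return collation_key(a, strength) == collation_key(b, strength)

print(equal_at("resume", "Résumé", 1))  # letters only
print(equal_at("resume", "Résumé", 2))  # letters and accents
print(equal_at("resume", "Resume", 3))  # letters, accents and case
```

Picking the strength per task (search, index, directory) is what lets the same data satisfy different expectations.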

Then of course you have search, display and sort differences.  If I am
looking for Istanbul it is probably OK even for Turkish locales to match it
to the Turkish spelling which uses a dotted capital I.
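
A minimal sketch of such a loose search match, assuming a fold that drops combining marks and then case-folds. The `search_fold` name is made up, and this is a deliberately locale-blind shortcut, not proper Turkish casing.

```python
import unicodedata

def search_fold(text):
    """Loose search key: drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# U+0130 (capital I with dot above) decomposes to I + combining dot above,
# so both spellings fold to the same search key.
print(search_fold("İstanbul") == search_fold("Istanbul"))
```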

With languages with multiple diacritics like Vietnamese you have another set
of rules and had better have normalized data.

In Arabic do you include vowels or not?

I remember your discussions of Greek where there are other considerations.

Carl












Re: [OT] o-circumflex

2001-09-07 Thread Mark Davis

As a percentage of words in English, it is quite small, but there are still
plenty of homographs, such as:

BASS
BOW(S)
BUFFET
COAX
CLOSE
COMPOUND(S)
CONVERSE
DESERT
DIVERS
DOES
DOVE
ENTRANCE(S)
EXCISE
HARE
INTIMATE
INVALID
LAME
LEAD
LUGER(S)
MANES
MARE(S)
MINUTE
OBJECT(S)
PATENT
POLISH
PRESENT
PRIMER(S)
PROJECT(S)
PUSSY
PUTTING
RAVEN
RE
REFUSE
RESIGN(S)
RESUME(S)
ROW(S)
SEWER(S)
SHOWER(S)
SLAVER
SOW(S)
SYNDICATE(S)
TAXIS
TEAR(S)
TIER(S)
TOWER(S)
VIOLA(S)
WIND(S)
WOUND
ABSENT
ABSTRACT
ABUSE(S)
ADDRESS(ES)
ADVOCATE(S)
AGGREGATE
APPROPRIATE
APPROXIMATE
ARTICULATE
ASSOCIATE(S)
ATTRIBUTE(S)
COMBAT
COMBINE(S)
COMPACT(S)
COMPLEX
CONDUCT
CONFINES
CONFLICT(S)
CONSORT
CONSTRUCT(S)
CONTENT
CONTEST(S)
CONTRACT(S)
CONSUMMATE
CONVERT(S)
CONVICT(S)
COORDINATE(S)
DECREASE(S)
DEFECT(S)
DEGENERATE(S)
DELEGATE(S)
DELIBERATE
DISCHARGE
DOGGED
EJACULATE
ELABORATE
ESCORT(S)
EXCUSE(S)
ESTIMATE(S)
EXTRACT(S)
GRADUATE(S)
HOUSE(S)
IMPLANT(S)
IMPORT(S)
INCLINE(S)
LAMINATE(S)
LEARNED
LEGITIMATE
LIVE(S)
[-]LIVED
MEDIATE(S)
MOBILE (3)
MODERATE(S)
MOUTH
OFFENSE(S)
PERFECT
PERMIT(S)
PREDICATE(S)
PRODUCE
PROGRESS
PROTEST(S)
READ (mis-, proof-)
RECALL(S)
RECORD(S)
REDRESS
REJECT(S)
RETARD(S)
RETREAD(S)
ROUTE(S)
SEPARATE
SUBJECT(S)
SUSPECT(S)
TORMENT(S)
UPSET(S)
USE(S)



—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]







Re: [OT] o-circumflex

2001-09-07 Thread Mark Davis

I disagree. What you want is a merged database field. See
http://www.macchiato.com/slides/icu_collation.ppt

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]











Re: [OT] o-circumflex

2001-09-06 Thread Thierry Sourbier

 Is it a distinct grapheme, or is it considered a variant of o?

I would say it is a variant of o; we just call it... o with a circumflex
accent (o avec un accent circonflex). The difference between o and ô
is normally audible (for a French speaker). The relationship is the same
as with any other letter which sometimes has accents (e.g. a and à,
e and è, etc.).

The only little thing to know about French and diacritical marks is that when
doing a sort, diacritical marks are evaluated from right to left (e.g.
cote < côte < coté vs the English order cote < coté < côte).

I'm just talking as a French Francophone, not a linguist. Maybe someone on
this list knows why diacritical marks are sorted in French in such a funky
way :).
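
The right-to-left rule can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names are invented, and different accent marks are ranked by plain code point order, which is not how real collation tables weight them. Base letters are compared first; accents only break ties, read left to right for the "English" order and right to left for the French one.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, list of accent marks per letter) for a word."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch      # attach the mark to the preceding letter
        else:
            bases.append(ch)
            accents.append("")
    return "".join(bases), accents

def english_key(word):
    base, accents = split_accents(word)
    return (base, accents)          # accents compared left to right

def french_key(word):
    base, accents = split_accents(word)
    return (base, accents[::-1])    # accents compared right to left

words = ["côte", "coté", "cote", "côté"]
print(sorted(words, key=english_key))  # cote, coté, côte, côté
print(sorted(words, key=french_key))   # cote, côte, coté, côté
```

In the French order the accent nearest the end of the word decides first, so the unaccented final e of côte puts it ahead of coté.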

Cheers,
Thierry


www.i18ngurus.com - Open Internationalization Resources Directory

- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, September 06, 2001 3:08 PM
Subject: [OT] o-circumflex



How do Francophones view the o-circumflex ô in relation to the letter o?
Is it a distinct grapheme, or is it considered a variant of o?


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]





Re: [OT] o-circumflex

2001-09-06 Thread David Starner

On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
 The only little thing to know about French and diacritical mark is that when
 doing a sort diacritical mark are evaluated from right to left.  (e.g.
 cote < côte < coté vs the English order cote < coté < côte ).

I'm not sure there is an established English sort order. It's not a 
problem that comes up much in English. 

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored. - Joseph_Greg




Re: [OT] o-circumflex

2001-09-06 Thread Alex Bochannek

My impression is that, at least in U.S. states that are more heavily
populated by native Spanish speakers, the one diacritic that English
speakers frequently view as non-optional for differentiating
two words (specifically proper names) is the tilde as used for the
eñe. There is a college in Redwood City, CA, called Cañada
College, which is off Cañada Road. I haven't checked
thoroughly, but I believe most road signs there use the eñe. I do know
of one highway exit in the area, though, which spells it Canada
College.

Alex.




Re: [OT] o-circumflex

2001-09-06 Thread James Kass


David Starner wrote:

 Yes, but I mean for cote, côte, and coté. How would you
 sort those three in English? I'd probably sort it by some
 extra-lingual information:  i.e. page number, date of birth
 or the like.

Store them as UTF-8, do a DOS sort, and call the results
the new World order?

Best regards,

James Kass.







RE: [OT] o-circumflex

2001-09-06 Thread てんどうりゅうじ
Sorry about the kana. My mailer is Japanese.


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: "Ayers, Mike" [EMAIL PROTECTED];
To: 'David Starner' [EMAIL PROTECTED];[EMAIL PROTECTED];
Cc: 
Date: 01/09/06 21:12
Subject: RE: [OT] o-circumflex


 From: David Starner [mailto:[EMAIL PROTECTED]] 
 Sent: Thursday, September 06, 2001 01:40 PM

 On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
  The only little thing to know about French and diacritical 
 mark is that when
  doing a sort diacritical mark are evaluated from right to 
 left.  (e.g.
  "cote" < "côte" < "coté" vs the English order "cote" < 
 "coté" < "côte" ).
 
 I'm not sure there is an established English sort order. It's not a 
 problem that comes up much in English. 

   I believe that there is an established sort order in English, which
is to sort without regard to diacritics, or else we'd never find the words!
In English (American English more than British English), diacritics are
considered optional, and it is common to see "naïve" written "naive", "San
José" written "San Jose", etc.  Especially amongst Americans, the two are
considered equivalent, and I know of no word pair in all of English which is
separated only by a diacritic.

I believe that the origin of the problem is the typewriter / word-processor. The 
English typewriter / word-processor is only designed to handle 26 letters (52 if you 
count case). Diacritics are impossible on a typewriter and very difficult on a word 
processor. In handwriting, the problem is non-existent.

Think of Tendou Kasumi getting the medical scholarship she always wanted, and getting 
to study abroad. She would likely e-mail her old friends / family in romaji, but 
snail-mail them in kana / kanji.

I like the freedom of a pen, so I can write kana and even draw.

As for your word pair:

1. To continue after a pause

2. Curriculum vitae


If only technology did not change the way we write like it does.

And why should not "o with accent" be considered as different from "o" as either is, 
say, from "u"? If that is the case:
"R" is "P with stroke"
(hiragana) "Ho" is "ha with stroke"
"Ru" is "Ro with loop"
(Thai) "five" is "four with loop"
and... my favorite... Latin "G" is "C" with stroke, and history WILL back me on that 
one!




/|/|ike