Re: Collation (was RE: [OT] o-circumflex)

2001-09-15 Thread Christopher JS Vance

On Thu, Sep 13, 2001 at 12:40:30AM -0700, Edward Cherlin wrote:
: For example,
: 
: 1984 (Nineteen Eighty Four)
: 1066 and all that (Ten Sixty Six)
: 3001 (Three Thousand One)
: 2050 (Twenty Fifty)
: 2010 (Twenty Ten)
: 2001, A Space Odyssey (Two Thousand One)

You're missing the "and" from 3001 and 2001.  I know Merkins often
leave it out, but a number of us always use it and feel it's wrong
without.  :-)

Putting dialect aside, you may find that 2050 and possibly 2010 will
be said "two thousand (and) whatever".

The problem here is that there's no single way to spell out numbers in
English, so no single way to alphabetise.  It's better to sort numbers
numerically, and then you only have to decide the order for negative
numbers.
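
A minimal sketch of what "sort numbers numerically" could look like for titles like the ones quoted above, in Java (the language used elsewhere in this thread). The class name NaturalOrder and the rule "a run of digits compares by value, everything else by character" are assumptions made for illustration; negative numbers and spelled-out forms are deliberately left out.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: compares strings chunk by chunk, treating each
// run of ASCII digits as a number and everything else as plain text.
final class NaturalOrder implements Comparator<String> {

    @Override
    public int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            if (isDigit(a.charAt(i)) && isDigit(b.charAt(j))) {
                int endA = digitRunEnd(a, i);
                int endB = digitRunEnd(b, j);
                // BigInteger avoids overflow on very long digit runs.
                int byValue = new BigInteger(a.substring(i, endA))
                        .compareTo(new BigInteger(b.substring(j, endB)));
                if (byValue != 0) return byValue;
                i = endA;
                j = endB;
            } else {
                // Assumption: non-digit characters compare by code unit.
                int byChar = Character.compare(a.charAt(i), b.charAt(j));
                if (byChar != 0) return byChar;
                i++;
                j++;
            }
        }
        // When one string is a prefix of the other, the shorter sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static int digitRunEnd(String s, int start) {
        int end = start;
        while (end < s.length() && isDigit(s.charAt(end))) end++;
        return end;
    }

    // Convenience wrapper: returns a sorted copy of the given titles.
    static List<String> sorted(String... titles) {
        List<String> list = new ArrayList<>(Arrays.asList(titles));
        list.sort(new NaturalOrder());
        return list;
    }
}
```

With this comparator, NaturalOrder.sorted("1984", "1066 and all that", "3001", "2050", "2010", "2001, A Space Odyssey") puts 1066 first and 3001 last, with no decision needed about how anybody pronounces the years.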

-- 
Christopher Vance




Re: Alternative sorting for digraphs (Was Re: [OT] o-circumflex)

2001-09-13 Thread Roozbeh Pournader

On Mon, 10 Sep 2001, Mark Davis wrote:

> A ZWNJ will break ligatures and cursive connections. While probably safe in
> Danish or Dutch, it is unclear to me that that is safe in all languages
> where this situation occurs. There are digraphs in Urdu, for example. While
> I don't know their sorting order, if they do sort separately then ZWNJ can't
> be used to express the alternative sorting, since it would give the wrong
> rendering.

:'-(

I would like to ask that we stop overusing ZWNJ. I once loved that
character... What about *renaming* the character to "Zero Width
All-Purpose Everything Breaker"?

roozbeh





Re: Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread Mark Davis

In the latest ICU, we took the work we did for Java collation and extended
it substantially (and made it many times faster). It also allows arbitrary
customization at runtime.

I happen to be giving a presentation on it in a few hours at the conference.
For more information, see the draft collation chapter in the User Guide, at
http://oss.software.ibm.com/icu/. The presentation (a slightly older draft)
is on my site at www.macchiato.com.

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]





Re: Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread David Gallardo

Java's collation class has a rule-based collator that is in effect
programmable using a little language. Here is an example from Sun's API
doc for Norwegian:

// Needs: import java.text.RuleBasedCollator;
String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
 "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
 "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
 "< å=a\u030A,Å=A\u030A" +  // å is identical to a + combining ring above
 ";aa,AA< æ,Æ< ø,Ø";
// The constructor throws java.text.ParseException on malformed rules.
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);

There is also syntax for things such as specifying reverse order (for French
accents for example), contraction and expansion.
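
As a rough illustration of how such a rule string is used, here is a small, self-contained sketch. The rule string quoted above places å before æ and ø; this sketch instead assumes the order æ, ø, å after z, with "aa" as a secondary variant of å. The class name NorwegianRulesDemo and the sample words are made up for the example, and the words are kept lower-case because the rules only list the contractions "aa" and "AA".

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NorwegianRulesDemo {
    // Rule syntax: '<' primary difference, ';' secondary, ',' tertiary,
    // '=' identical. Assumption: z is followed by ae-ligature, o-slash and
    // a-ring, in that order, and "aa" differs from a-ring only secondarily.
    static final String RULES =
          "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J"
        + "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T"
        + "< u,U< v,V< w,W< x,X< y,Y< z,Z"
        + "< \u00E6,\u00C6< \u00F8,\u00D8"
        + "< \u00E5=a\u030A,\u00C5=A\u030A;aa,AA";

    // Returns a copy of the words sorted with the tailored collator.
    static List<String> sorted(String... words) {
        RuleBasedCollator collator;
        try {
            collator = new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            // Only reachable if RULES itself is malformed.
            throw new IllegalStateException("bad collation rules", e);
        }
        List<String> list = new ArrayList<>(Arrays.asList(words));
        list.sort(collator); // Collator is a Comparator<Object>
        return list;
    }
}
```

The intent is that NorwegianRulesDemo.sorted("zebra", "aalborg", "oslo", "\u00E5lesund", "bergen", "\u00E6r\u00F8") files "aalborg" together with the å words after "zebra", rather than at the front with the other a words.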

- David Gallardo







Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread

Whoever invented English number words, then, had a very sick sense of humour. Why 
doesn't the word for "one" start with "a", the word for "two" with "b", etc.?


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town




Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread Edward Cherlin

English and several other languages have dozens of collations. Compare telephone 
books, library catalogs, book indexes (sic), and other sorted data. Knuth vol. 3 
Sorting and Searching gives an example of a set of library sorting rules that runs to 
more than a page, and suggests programming it as an exercise. ;-) Among the rules are 
to spell out numbers. 
For example,

1984 (Nineteen Eighty Four)
1066 and all that (Ten Sixty Six)
3001 (Three Thousand One)
2050 (Twenty Fifty)
2010 (Twenty Ten)
2001, A Space Odyssey (Two Thousand One)

Bell Labs invented a whole programming language, Snobol, to deal with telephone 
listing conversions, matches, and sorts. Many phone books sort Mc- and Mac- together, 
others one after the other but separate from other names.

Edward Cherlin
Generalist
"A knot! Oh, do let me help to undo it." 
Alice in Wonderland


> -Original Message-
> Behalf Of Michael (michka) Kaplan
> Sent: Mon, September 10, 2001 8:36 AM
> From: "Mark Davis" <[EMAIL PROTECTED]>
> 
> > Michael, that isn't the point. There is a problem even when you stick to
> > one language.


> By that time, many languages may have TWO collations, since users have been
> expecting something else for the last few decades?
> 
> MichKa
> 
> Michael Kaplan
> Trigeminal Software, Inc.
> http://www.trigeminal.com/
> 
> 
> 





Re: [OT] o-circumflex

2001-09-11 Thread Lars Marius Garshol


* Lars Marius Garshol
|
| I am not sure of this, but I think 'å' is a relatively modern
| invention, and that it was originally written only as 'aa'.

* Stefan Persson
| 
| FYI, "a relatively modern invention" means that it has been used
| since the Middle Ages (in Swedish).

I don't think that is the case in Norwegian and Danish. The Norwegian
constitution from 1814, for example, uses 'ø' and 'æ', but never 'å'.
Possibly this was a Swedish invention only adopted later by the Danes
and Norwegians.

--Lars M.





Re: [OT] o-circumflex

2001-09-11 Thread Keld Jørn Simonsen

On Tue, Sep 11, 2001 at 06:27:20PM +0200, Stefan Persson wrote:
> - Original Message -
> From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
> To: "Stefan Persson" <[EMAIL PROTECTED]>
> Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Michael (michka) Kaplan"
> <[EMAIL PROTECTED]>; "Keld Jørn Simonsen" <[EMAIL PROTECTED]>;
> <[EMAIL PROTECTED]>
> Sent: den 10 september 2001 22:12
> Subject: Re: [OT] o-circumflex
> 
> 
> > Where is this done for Swedish? I have read both the TN and the SIS
> > standard, and I don't believe these say anything about sorting
> > ü according to either German or Dutch sounds. Rolf Gavare does not
> > say anything along these lines either, as far as I can remember.
> 
> This is the sorting used in dictionaries, encyclopædias, phone books etc.
> For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts
> "myskoxe/müsli/mysning."

Yes, I can understand that. In Danish we have the same rule.
But do you have examples of Dutch words
that are ordered in another way? That is, you need to know the
origin of the word to sort it.

Kind regards
keld




Re: [OT] o-circumflex

2001-09-11 Thread Stefan Persson

- Original Message -
From: "Lars Marius Garshol" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: den 10 september 2001 22:45
Subject: Re: [OT] o-circumflex


> I am not sure of this, but I think 'å' is a relatively modern
> invention, and that it was originally written only as 'aa'.

FYI, "a relatively modern invention" means that it has been used since the
Middle Ages (in Swedish).

Stefan







Re: [OT] o-circumflex

2001-09-11 Thread Stefan Persson

- Original Message -
From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
To: "Stefan Persson" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Michael (michka) Kaplan"
<[EMAIL PROTECTED]>; "Keld Jørn Simonsen" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: den 10 september 2001 22:12
Subject: Re: [OT] o-circumflex


> Where is this done for Swedish? I have read both the TN and the SIS
> standard, and I don't believe these say anything about sorting
> ü according to either German or Dutch sounds. Rolf Gavare does not
> say anything along these lines either, as far as I can remember.

This is the sorting used in dictionaries, encyclopædias, phone books etc.
For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts
"myskoxe/müsli/mysning."

Stefan







Re: [OT] o-circumflex

2001-09-10 Thread Kenneth Whistler

Way OT by now...

> AAARRRGGHHH
> 
> I give up!
> 
> I was hoping that there is SOME system that would give these cities UNIQUE
> names... postal codes???

Ain't reality a bitch?

What you're looking for doesn't exist in the world of natural language
names -- it can only exist in artificially constructed global
geographic databases, where people may have assigned unique keys
to cities. And even there, the geographic experts are going to
argue over the exact meaning of terms. Is "Los Angeles" the
incorporated city presided over by the mayor or does it include
all the other small cities that Los Angeles surrounds and engulfs,
or does it include unincorporated parts of Los Angeles county,
or does it refer to Greater Los Angeles, the metropolitan area,
or is it related to Los Angeles county?

Not such a simple distinction, sometimes. San Francisco is
a city *and* a county, and the mayor of the city is also mayor
of the county. The mayor of New York is mayor of five
boroughs, each of which is coextensive with a county.

Is Stonyford, California (population 150), a "city"? It isn't
incorporated as a city, or even a town, but it is an independent
geographic location that occurs as a "town" on maps. Where do
you draw the line between named localities and cities? Do you
depend on legally incorporated city status? But what if the
laws don't match up between different countries? How am I going
to know that "cities" in Burkina Faso match the same criteria
I use to designate "cities" in the United States or Japan?

Some cities have multiple postal codes, and some postal codes
cover multiple cities. And while postal codes are subject to
international treaty, how countries divide their territories
up and use the codes is still up to them.

--Ken





Re: [OT] o-circumflex

2001-09-10 Thread

AAARRRGGHHH

I give up!

I was hoping that there is SOME system that would give these cities UNIQUE names... 
postal codes???


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town




RE: [OT] o-circumflex

2001-09-10 Thread Otmar Permentier

Marco,

When you're in Holland you may want to check some dictionaries too. You'll notice in 
dictionaries 'ij' is considered to consist of two letters 'i' and 'j', so the word 
'ijs' sorts between 'iets' and 'ik'.
You're right that the PTT doesn't make the distinction between 'ij' and 'y', so in the 
phone book 'Meyer' and 'Meijer' are indeed near each other. I suspected they would at 
least first list all Meijers, then all Meyers, but when I just checked they appeared 
to be intermingled. On closer inspection it turned out the Meijers and Meyers are 
further sorted by street name! 
By the way, in crossword puzzles and the like, 'ij' always occupies one box (but isn't 
considered the same as 'y', I believe).

Regards,

Otmar Permentier






Re: [OT] o-circumflex

2001-09-10 Thread Juliusz Chroboczek

>> It's as weird as some Italian names for German cities: Aquisgrana
>> for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
>> Baviera) for München.

MK> Interesting that Polish names of these cities are more like Italian
MK> than German: Akwizgran, Augsburg, Moguncja, Monachium.

Because they're adaptations of the mediaeval Latin names.

The same is true of historically important Polish cities, by the way:
Varsovie, Cracovie in French, Varsavia, Cracovia in Italian.  English
uses the German names instead (Warsaw, Cracow).

Juliusz




Re: [OT] o-circumflex

2001-09-10 Thread Peter_Constable


On 09/10/2001 07:48:05 AM Michael (michka) Kaplan wrote:

>(can't believe this thread is still going on!)

I just wanted to know about how Francophones perceive certain graphemes,
and I got that answer a long time ago.



Peter





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Marco Cimarosti
| 
| One of these cases could be the word "dataarkiv", which I found in a Danish
| web page
| (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Uh, no, you found it in a Norwegian web page. The word is the same in
Danish, though.
 
|   Order B:
|   1. data
|   2. dataarkiv
|   3. Datben, Dr. Keld
|   4. Datz, Mr. Marco
|   5. Datåz, Dr. Asmus
| 
| Asmus was arguing that List B would be the correct one (and this is
| certainly true on, e.g., a dictionary) but, in order to obtain it, the
| source text must be properly encoded with invisible separators inserted
| where needed.

Not necessarily. One solution I've seen automatically generated sort
keys from the headwords, but allowed users to adjust them where
necessary. I think users are likely to favour this solution if given a
choice. 

Of course, it depends on how important it is to get the sorting
right, and what importance the headwords have within the system
whether this solution is feasible or not. In a phone directory I guess
nobody would use it.
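
A minimal sketch of that "generate the key automatically, let the user adjust it" idea, using the headwords from the list quoted above. Everything here is illustrative: the class name HeadwordSort, the æ, ø, å alphabet order, the blanket assumption that "aa" stands for å, and the choice to ignore spaces and punctuation are all assumptions, not a description of any particular system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class HeadwordSort {
    // Assumed alphabet order for this sketch: a-z, then ae-ligature,
    // o-slash, a-ring.
    private static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5";

    // Turns the letters of the text into a string whose plain comparison
    // follows ALPHABET; spaces and punctuation are skipped.
    static String encode(String text) {
        StringBuilder key = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            int position = ALPHABET.indexOf(c);
            if (position >= 0) key.append((char) ('A' + position));
        }
        return key.toString();
    }

    // Automatic key: every "aa" is assumed to stand for a-ring.
    static String autoKey(String headword) {
        return encode(headword.toLowerCase(Locale.ROOT).replace("aa", "\u00E5"));
    }

    // Sorts by the manually adjusted key where one exists,
    // otherwise by the automatic key.
    static List<String> sorted(List<String> headwords,
                               Map<String, String> adjustedKeys) {
        List<String> result = new ArrayList<>(headwords);
        result.sort(Comparator.comparing(
                (String h) -> adjustedKeys.getOrDefault(h, autoKey(h))));
        return result;
    }
}
```

With no adjustments, "dataarkiv" is keyed as if it were "datårkiv" and lands after "Datz, Mr. Marco". Registering the single adjustment adjustedKeys.put("dataarkiv", HeadwordSort.encode("dataarkiv")) moves it back to just after "data", which is Order B, without touching the stored text.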
 
| And this is precisely what I was trying to say, although I was not
| necessarily talking about multilingual sort ("dataarkiv" seems a purely
| Danish word, although derived from Latin roots).

It's a simple concatenation of the words for 'computing' (data) and
'archive' (arkiv), meaning any electronic archive. 

This kind of construction is very common in Norwegian and Danish,
leading speakers to invent all kinds of strange new words when writing
English[1], and the Swedes to joke that we call bananas 'yellowbends'.
 
--Lars M.

[1] And, conversely, after learning English, to split apart words that
God meant us to write without spaces in them. It really ann oys to
see people write in that incon venient way.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Keld Jørn Simonsen
|
| Yes, foreigners call our cities many strange things:-) København is
| called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more.

* Michael Everson
| 
| In Iceland it is Kaupmannahöfn, I believe. In unadorned English that
| would be something like Cheapmenshaven, maybe to weaken as
| Cheapenhaven, in German Kaufenhagen

Which makes eminent sense, given that København by this logic would
translate as Cheapenhaven. (Your German translation should be
Kaufmannshagen, I guess, to become Kaufenhagen when translated from
København.)

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Jonathan Rosenne
|
| This is not always the right thing to do. For example, with personal
| names the person involved may decide whether he prefers the old (AA)
| spelling or the new Å. In any case they are equivalent.

This is true, but this is nothing particular to the aa/å distinction.
Many given names have a number of possible spellings, such as Astri /
Astrid, Cathrine / Katrine / Kathrine, Wenche / Venke / Venche, Espen
/ Esben, ...   In fact, given names which can be written both aa and å
are rare. I can only think of Åge offhand, and that is only rarely
written Aage in Norway (and the other way round in Denmark).

AA/Å confusion is much more common in surnames, but there there is no
choice involved.

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Francesco Zappa Nardelli
| 
| I was in Aalborg fifteen days ago, and I have seen its name written
| both as Ålborg and as Aalborg.  Where does Aalborg appear in a list
| of towns?

At the end.

In both Danish and Norwegian 'aa' and 'å' are considered equivalent.
I am not sure of this, but I think 'å' is a relatively modern
invention, and that it was originally written only as 'aa'. 

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread Lars Marius Garshol


* Carl W. Brown
| 
| You are quite correct; that is why Unicode supports differing
| collation strengths.  Sometimes you only care about the actual
| letters without diacritics.  But even then letters are locale
| sensitive.  For example the Danish alphabet starts with an A and
| ends with A ring above.  A Dane would look for Ålborg near the
| end of a list of towns.

This example doesn't apply to this discussion, since Danes and
Norwegians consider Å to be a separate letter. That is, it is not A
with ring above, but Å, which is not related to A any more than E is
related to F.

What J. M. Sykes writes about the lack of established sort orders
seems right to me. I've done consulting work for Norwegian
encyclopedia publishers, which involved developing their sorting
routines. The orders for the different publishers did differ, and it
is not so surprising given that there are a number of cases to
consider, such as how to sort diacritics, what to consider as
diacritics, how to sort numbers, Roman numerals, ordinals, and
whatnot.

--Lars M.





Re: [OT] o-circumflex

2001-09-10 Thread

I hate this sort:
Club Mix 2000
Club Mix 98
Club Mix 99

Those non Y2K compliant fools!


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town




Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

Where is this done for Swedish? I have read both the TN and the SIS
standard, and I don't believe these say anything about sorting 
ü according to either German or Dutch sounds. Rolf Gavare does not
say anything along these lines either, as far as I can remember.

Kind regards
keld

On Mon, Sep 10, 2001 at 07:09:34PM +0200, Stefan Persson wrote:
> There is a similar problem with Swedish:
> 
> Our alphabet goes:
> 
> a
> ...
> u
> v & w (no difference made)
> x
> y
> z
> å
> ä (the Danish/Norwegian "æ" is also sorted as "ä")
> ö (the Danish/Norwegian "ø" is also sorted as "ö")
> 
> The German character "ü" is pronounced like a Swedish "y," so when any
> German name or loan word containing that character occurs in Swedish it
> should be sorted as "y." However, if any "ü" occurs in a Dutch loan word it
> is considered a "u" with umlaut and is sorted as "u."
> 
> The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
> letters "ä" and "ö" they are sorted after "å," if they are the Dutch letters
> "a" with umlaut and "o" with umlaut, they're sorted as "a" and "o" in a
> Swedish encyclopædia.
> 
> In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
> Latin/Icelandic letter "æ" is sorted as "ae."
> 
> Stefan




Re: [OT] o-circumflex

2001-09-10 Thread Thomas Chan

On Mon, 10 Sep 2001, てんどうりゅうじ wrote:

> If they can't agree on the pronunciation for these cities, can they
> agree on the Hanzi for them? What ARE the Hanzi for these cities,
> anyway??

Are you asking for the names of cities in Chinese?  Copenhagen is
ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839.  The Han characters used to write
the names of cities depend on many factors, including but not
limited to source spelling/pronunciation, language/dialect of the
rendering party, mapping rules used by the renderer, time period, etc.
For example, New York is rendered in Chinese as Mandarin niu3yue4
\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
Japanese it was at one time rendered as \u7d10\u80b2, lit.
'button-rearing'.  Asking for the "hanzi" (from your wording, I don't
think you are just talking about Chinese usage of Han characters) is like
asking for a single Latin script rendering.

(I think you need to get yourself an English<->Chinese dictionary or
something, btw...)


Thomas Chan
[EMAIL PROTECTED]






Re: [OT] o-circumflex

2001-09-10 Thread Stefan Persson

There is a similar problem with Swedish:

Our alphabet goes:

a
...
u
v & w (no difference made)
x
y
z
å
ä (the Danish/Norwegian "æ" is also sorted as "ä")
ö (the Danish/Norwegian "ø" is also sorted as "ö")

The German character "ü" is pronounced like a Swedish "y," so when any
German name or loan word containing that character occurs in Swedish it
should be sorted as "y." However, if any "ü" occurs in a Dutch loan word it
is considered a "u" with umlaut and is sorted as "u."

The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
letters "ä" and "ö" they are sorted after "å," if they are the Dutch letters
"a" with umlaut and "o" with umlaut, they're sorted as "a" and "o" in a
Swedish encyclopædia.

In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
Latin/Icelandic letter "æ" is sorted as "ae."
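
A rough sketch of why this cannot be done from the spelling alone: the sort key has to be told where the word comes from. The class name SwedishLoanKey, the Origin enum and the reduced rule set (only w, ü, ä and ö are handled; æ and ø are left out) are assumptions made to keep the example short.

```java
import java.util.Locale;

final class SwedishLoanKey {
    // Hypothetical tag that an editor would have to supply per headword.
    enum Origin { SWEDISH, GERMAN, DUTCH }

    // Assumed Swedish order with w filed under v:
    // a-z without w, then a-ring, a-umlaut, o-umlaut.
    private static final String ALPHABET =
            "abcdefghijklmnopqrstuvxyz\u00E5\u00E4\u00F6";

    // Rewrites the letters whose treatment depends on the word's origin.
    static String fold(String word, Origin origin) {
        String s = word.toLowerCase(Locale.ROOT).replace('w', 'v');
        if (origin == Origin.DUTCH) {
            // Dutch loan words: the umlaut counts as a mark on the base vowel.
            s = s.replace('\u00FC', 'u')
                 .replace('\u00E4', 'a')
                 .replace('\u00F6', 'o');
        } else {
            // Swedish or German origin: u-umlaut files as y, while
            // a-umlaut and o-umlaut stay letters of their own after z.
            s = s.replace('\u00FC', 'y');
        }
        return s;
    }

    // Key whose plain string comparison follows ALPHABET.
    static String key(String word, Origin origin) {
        StringBuilder key = new StringBuilder();
        for (char c : fold(word, origin).toCharArray()) {
            int position = ALPHABET.indexOf(c);
            if (position >= 0) key.append((char) ('A' + position));
        }
        return key.toString();
    }
}
```

Tagged as German, "müsli" gets the same key as "mysli" and falls between "myskoxe" and "mysning". Tagged as Dutch, the same spelling would be keyed as "musli" and move up among the mu- words.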

Stefan

- Original Message -
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>; "Keld Jørn Simonsen"
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: den 10 september 2001 17:27
Subject: Re: [OT] o-circumflex


> Michael, that isn't the point. There is a problem even when you stick to
> one language.
>
> That is, there are situations where two letters in a language, e.g. "ch"
> in Slovak, are normally sorted as one. However, in some exceptional
> circumstances those letters should be sorted separately. It could be
> because they come originally from another language, or it could be because
> they happen to arise when two other words are conjoined. There is no
> algorithmic distinction. So without some special character, it would
> require a dictionary look-up to produce the right sort.
>
> For example, suppose that "th" were sorted separately in English, after Z.
> Yet people would expect the following order:
>
> cast
> cathouse
> caul
> cathode
>
> because the "t" and "h" are logically separate in "cathouse".
>
> Mark
> —————
>
> Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
>πάντα — Όμήρου Μαργίτῃ
> [http://www.macchiato.com]
> - Original Message -
> From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
> To: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Monday, September 10, 2001 5:48 AM
> Subject: Re: [OT] o-circumflex
>
>
> > From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
> >
> > > Real-life sorts, like MS Windows sorting or Linux sorting, actually
> > > adhere to these Danish rules, once you have set up your machine for
> > > Danish.
> >
> > And this is the *true* answer to the whole mess of attempting
> > *multilingual* sorts -- once the user chooses the sort they WANT, the
> > system might handle other language strings in a way that might be obscure
> > to those who know the other language but the person who expected Danish
> > or whatever will see what they want.
> >
> > Since various sorts openly conflict with each other there is no other
> > general case solution which would be appropriate, anyway?
> >
> > (can't believe this thread is still going on!)
> >
> >
> > MichKa
> >
> > Michael Kaplan
> > Trigeminal Software, Inc.
> > http://www.trigeminal.com/
> >
> >
> >
> >
>







RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Stefan Persson wrote:
> I thought "ij" sorted after "z?"

Not in Dutch: as far as I have seen it sorts the same as "y".  In fact, in
the telephone directory many people who had a "y" in their surname were listed
near people who had the same surname spelled with "ij" (e.g. "Meyer" and
"Meijer").

(Anyway, next time they send me to Holland, I'll ask for a downtown hotel.
So, after dinner, I'll go sightseeing rather than spending the whole evening
looking at the collation of the phone directory:-)

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Stefan Persson

- Original Message -
From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: "'John Wilcock'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: den 10 september 2001 18:35
Subject: RE: [OT] o-circumflex


> John Wilcock wrote:
> > I haven't followed this discussion from the beginning, so apologies if
> > I'm missing the point, but it seems to me that the Beijing case in
> > Dutch is no different from the ekstraarbejde case in Danish - a SHY or
> > ZWNJ is all that is needed to stop Beijing sorting with Bey.
>
> Yes, it is exactly the same thing.
>
> But my point is that a Dutch reader probably *does* expect Beijing to sort
> like Bey, not like Bei.  So, in some cases, a "correct" (i.e., expected)
> behavior could rather be to *remove* all SHY/ZWNJ's before sorting.

I thought "ij" sorted after "z?"







Re: [OT] o-circumflex

2001-09-10 Thread

If they can't agree on the pronunciation for these cities, can they agree on the Hanzi 
for them?
What ARE the Hanzi for these cities, anyway??

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]>;
To: [EMAIL PROTECTED];
Cc: 
Date: 01/09/10 14:02
Subject: Re: [OT] o-circumflex

>Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti <[EMAIL PROTECTED]> pisze:
>
>> It's as weird as some Italian names for German cities: Aquisgrana
>> for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
>> Baviera) for München.
>
>Interesting that Polish names of these cities are more like Italian
>than German: Akwizgran, Augsburg, Moguncja, Monachium.
>
>Ko/benhavn is Kopenhaga, again more like other foreign forms than
>Danish.
>
>-- 
> __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
> \__/
>  ^^  SYGNATURA ZASTĘPCZA
>QRCZAK
>
>
>


RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

John Wilcock wrote:
> I haven't followed this discussion from the beginning, so apologies if
> I'm missing the point, but it seems to me that the Beijing case in
> Dutch is no different from the ekstraarbejde case in Danish - a SHY or
> ZWNJ is all that is needed to stop Beijing sorting with Bey. 

Yes, it is exactly the same thing.

But my point is that a Dutch reader probably *does* expect Beijing to sort
like Bey, not like Bei.  So, in some cases, a "correct" (i.e., expected)
behavior could rather be to *remove* all SHY/ZWNJ's before sorting.
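
Removing the invisible separators before sorting, as suggested above, could
be sketched roughly as follows in Python. This is only an illustration: the
names strip_separators and directory_key are made up, and the blanket "ij"
to "y" fold is the phone-directory behaviour described in this thread, not
an official Dutch collation.

```python
# U+00AD SOFT HYPHEN and U+200C ZERO WIDTH NON-JOINER, mapped to None so
# that str.translate() deletes them.
INVISIBLE_SEPARATORS = {0x00AD: None, 0x200C: None}


def strip_separators(text):
    """Remove SHY and ZWNJ so they cannot block a digraph rule."""
    return text.translate(INVISIBLE_SEPARATORS)


def directory_key(name):
    """Hypothetical phone-directory key: 'ij' always sorts like 'y'."""
    return strip_separators(name).lower().replace("ij", "y")


# "Bei<ZWNJ>jing" still ends up next to the "Bey..." entries.
names = ["Beyer", "Bei\u200cjing", "Bex"]
print(sorted(names, key=directory_key))
```

Keeping the separators in place instead (skipping strip_separators) would
give the other behaviour, where "Beijing" sorts with "Bei".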

_ Marco




Alternative sorting for digraphs (Was Re: [OT] o-circumflex)

2001-09-10 Thread Mark Davis

A SHY will mean that the word can break at "Bei-
jing". It is not clear to me at least that that is safe in all cases for all
languages with digraphs that sort separately, although it may be a solution
for some.

A ZWNJ will break ligatures and cursive connections. While probably safe in
Danish or Dutch, it is unclear to me that that is safe in all languages
where this situation occurs. There are digraphs in Urdu, for example. While
I don't know their sorting order, if they do sort separately then ZWNJ can't
be used to express the alternative sorting, since it would give the wrong
rendering.

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]
- Original Message -
From: "John Wilcock" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, September 10, 2001 8:39 AM
Subject: Re: [OT] o-circumflex


> On Mon, 10 Sep 2001 16:42:45 +0200, Keld Jørn Simonsen wrote:
> > But maybe you are driving for a yet more complex sorting, one that can
> > sort according to multiple rules? Beijing should then not be sorted as
> > Beÿing?
>
> I haven't followed this discussion from the beginning, so apologies if
> I'm missing the point, but it seems to me that the Beijing case in
> Dutch is no different from the ekstraarbejde case in Danish - a SHY or
> ZWNJ is all that is needed to stop Beijing sorting with Bey.
>
>
> John.
>
> --
> -- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/
> -- Translate your technical documents and web pages- http://www.tradoc.fr/
>
>





Re: [OT] o-circumflex

2001-09-10 Thread John Wilcock

On Mon, 10 Sep 2001 16:42:45 +0200, Keld Jørn Simonsen wrote:
> But maybe you are driving for a yet more complex sorting, one that can sort
> according to multiple rules? Beijing should then not be sorted as Beÿing?

I haven't followed this discussion from the beginning, so apologies if
I'm missing the point, but it seems to me that the Beijing case in
Dutch is no different from the ekstraarbejde case in Danish - a SHY or
ZWNJ is all that is needed to stop Beijing sorting with Bey. 


John.

-- 
-- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/
-- Translate your technical documents and web pages- http://www.tradoc.fr/




Re: [OT] o-circumflex

2001-09-10 Thread Mark Davis

Michael, that isn't the point. There is a problem even when you stick to one
language.

That is, there are situations where two letters in a language, e.g. "ch" in
Slovak, are normally sorted as one. However, in some exceptional
circumstances those letters should be sorted separately. It could be because
they come originally from another language, or it could be because they
happen to arise when two other words are conjoined. There is no algorithmic
distinction. So without some special character, it would require a
dictionary look-up to produce the right sort.

For example, suppose that "th" were sorted separately in English, after Z.
Yet people would expect the following order:

cast
cathouse
caul
cathode

because the "t" and "h" are logically separate in "cathouse".
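
The hypothetical "th" rule can be sketched in a few lines of Python. This is
an illustration only: the rank 27 for "th", the function name th_key, and the
choice of SOFT HYPHEN as the "special character" are all assumptions made
for the example.

```python
SHY = "\u00ad"   # SOFT HYPHEN, used here as the invisible separator
TH_RANK = 27     # hypothetical rank for the digraph "th": just after z (26)


def th_key(word):
    """Sort key for an imaginary English alphabet where 'th' follows 'z'."""
    key = []
    text = word.lower()
    i = 0
    while i < len(text):
        if text.startswith("th", i):
            key.append(TH_RANK)       # the digraph counts as one letter
            i += 2
        elif text[i] == SHY:
            i += 1                    # ignored, but it kept 't' and 'h' apart
        else:
            key.append(ord(text[i]) - ord("a") + 1)
            i += 1
    return key


# With the separator, "cat<SHY>house" sorts under plain "t".
marked = sorted(["cathode", "caul", "cat\u00adhouse", "cast"], key=th_key)

# Without it, the rule is applied blindly and "cathouse" lands after "cathode".
unmarked = sorted(["cathode", "caul", "cathouse", "cast"], key=th_key)
```

The key function itself has no way to tell the two cases apart; the
distinction has to come from the data or from a dictionary.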

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]
- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, September 10, 2001 5:48 AM
Subject: Re: [OT] o-circumflex


> From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
>
> > Real-life sorts, like MS Windows sorting or Linux sorting, actually
> > adhere to these Danish rules, once you have set up your machine for Danish.
>
> And this is the *true* answer to the whole mess of attempting
> *multilingual* sorts -- once the user chooses the sort they WANT, the
> system might handle other language strings in a way that might be obscure
> to those who know the other language but the person who expected Danish or
> whatever will see what they want.
>
> Since various sorts openly conflict with each other there is no other
> general case solution which would be appropriate, anyway?
>
> (can't believe this thread is still going on!)
>
>
> MichKa
>
> Michael Kaplan
> Trigeminal Software, Inc.
> http://www.trigeminal.com/
>
>
>
>





Re: [OT] o-circumflex

2001-09-10 Thread Michael \(michka\) Kaplan

From: "Mark Davis" <[EMAIL PROTECTED]>

> Michael, that isn't the point. There is a problem even when you stick to
> one language.
>
> That is, there are situations where two letters in a language, e.g. "ch"
> in Slovak, are normally sorted as one. However, in some exceptional
> circumstances those letters should be sorted separately. It could be
> because they come originally from another language, or it could be because
> they happen to arise when two other words are conjoined. There is no
> algorithmic distinction. So without some special character, it would
> require a dictionary look-up to produce the right sort.

I would argue that most users of the language are not expecting this type of
thing, and that when they are looking for a word that this might be the
SECOND place they look, not the first.

There are exceptions, but they do not outnumber the general case, by
any means.

> For example, suppose that "th" were sorted separately in English, after Z.
> Yet people would expect the following order:
>
> cast
> cathouse
> caul
> cathode
>
> because the "t" and "h" are logically separate in "cathouse".

Again, I think most people would look first in the place that does not
assume the exception -- the computer's original limitations have trained
them. The notion of a natural language processing engine that would have all
of the specific differences (with appropriate dictionaries for exceptions to
even the NLP results) is a fascinating notion, but one that no one is even
close to, yet.

We do not even have available UCA tailorings for most of the world's
languages. Though I have high hopes for the future (if not in the UCA then
in other mechanisms).

By that time, many languages may have TWO collations, since users have been
expecting something else for the last few decades?

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

On Mon, Sep 10, 2001 at 03:58:05PM +0200, Marco Cimarosti wrote:
> > On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > > Asmus Freytag wrote:
> > > > But if you do this, all compound words starting with "data" 
> > > > and continuing 
> > > > with another word starting with "a" will be sorted incorrectly!
> > > > 
> > > > To achieve this effect, you would have to mark which AAs are 
> > > > A-Rings and which ones are accidental adjacencies. In Danish
> > > > one can use the SHY (soft hyphen) [...]
> > > 
> > > Real-life sort orders often ignore these subtleties and are often
> > > based on a small set of rules which is applied blindly, regardless of
> > > the origin, meaning, or pronunciation of headwords.
> > > 
> > 
> > Real-life sorts, like MS Windows sorting or Linux sorting, actually
> > adhere to these Danish rules, once you have set up your machine for Danish.
> 
> If I understand what you mean, perhaps my point was not clear.

My point was that real-life sorts nowadays are quite sophisticated,
and the major systems have adequate sorting for Danish and other
languages with that kind of complexity.

> I know that "aa" sorts like "å", and that it should go after "z".  But there
> are also cases when the sequence "aa" is just two a's, adjacent to each
> other by pure chance.
> 
> One of these cases could be the word "dataarkiv", which I found in a Danish
> web page
> (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Yes, and ekstraarbejde - extra work. I know.

> Now: if your Windows or Linux collation states (correctly!) that "aa"
> should go after "z", you may have a list ordered like this:
> 
>   Order A:
>   1. data
>   2. Datben, Dr. Keld
>   3. Datz, Mr. Marco
>   4. dataarkiv
>   5. Datåz, Dr. Asmus
> 
> But if "dataarkiv" was written using an invisible separator between the two
> a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be
> like this:
> 
>   Order B:
>   1. data
>   2. dataarkiv
>   3. Datben, Dr. Keld
>   4. Datz, Mr. Marco
>   5. Datåz, Dr. Asmus
> 
> Asmus was arguing that List B would be the correct one (and this is
> certainly true on, e.g., a dictionary) but, in order to obtain it, the
> source text must be properly encoded with invisible separators inserted
> where needed.

Yes, that is also my advice.

> What I was saying is that the "automatic" Order A is also often used, and I
> brought the example of the Dutch phone directories (where "Beijing" is
> sorted as if it was "Beying"), and of the Italian encyclopedia (where
> "Jefferson" is sorted as if it was "Iefferson").

You have to sort it according to the expectations of the user.
A Dutch book would use Dutch rules, an Italian book would use
the Italian order. You cannot mix orderings, such that some words follow
one set of rules, and other words follow other rules. It all needs
to be comprehended by one human, the reader, and for that reader only one
ruleset applies.

> 
> Michael (michka) Kaplan wrote:
> > And this is the *true* answer to the whole mess of attempting 
> > *multilingual* sorts -- once the user chooses the sort they
> > WANT, the system might handle other language strings in a
> > way that might be obscure to those who know the other
> > language but the person who expected Danish or whatever 
> > will see what they want.
> 
> And this is precisely what I was trying to say, although I was not
> necessarily talking about multilingual sort ("dataarkiv" seems a purely
> Danish word, although derived from Latin roots).
> 
> For some users and some usages, the "incorrect" Order A may be much more
> useful than the "correct" Order B.  If the rules say that "ij" goes between
> "x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-"
> and "bez-".
> 
> If someone wants Order B (as may be the case for the author of a
> dictionary), then they should apply Asmus' suggestion in order to drive the
> collation algorithm.

I think we agree, but what you call "simple set of rules" I call "quite complex".
I also think that the Danish rules are quite simple, as they can be formulated
in, say, 4 lines of Danish prose. But compared to ASCII sorting they are to some
people unbelievably complex, and I think many Danes believe that you cannot get
programs that adhere to them, although the major systems do that out of the box.

Your incorrect and correct examples use the very same sorting algorithm; the only
difference is that the data is coded differently.

But maybe you are driving for a yet more complex sorting, one that can sort
according to multiple rules? Beijing should then not be sorted as Beÿing?
As stated above I think - and other sorting experts too - that sorting
with multiple rules is a conceptual misunderstanding.

Kind regards
Keld




Re: [OT] o-circumflex

2001-09-10 Thread Marcin 'Qrczak' Kowalczyk

Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti <[EMAIL PROTECTED]> pisze:

> It's as weird as some Italian names for German cities: Aquisgrana
> for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di
> Baviera) for München.

Interesting that Polish names of these cities are more like Italian
than German: Akwizgran, Augsburg, Moguncja, Monachium.

Ko/benhavn is Kopenhaga, again more like other foreign forms than
Danish.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

> On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > Asmus Freytag wrote:
> > > But if you do this, all compound words starting with "data" 
> > > and continuing 
> > > with another word starting with "a" will be sorted incorrectly!
> > > 
> > > To achieve this effect, you would have to mark which AAs are 
> > > A-Rings and which ones are accidental adjacencies. In Danish
> > > one can use the SHY (soft hyphen) [...]
> > 
> > Real-life sort orders often ignore these subtleties and are often
> > based on a small set of rules which is applied blindly, regardless of
> > the origin, meaning, or pronunciation of headwords.
> > 
> 
> Real-life sorts, like MS Windows sorting or Linux sorting, actually
> adhere to these Danish rules, once you have set up your machine for Danish.

If I understand what you mean, perhaps my point was not clear.

I know that "aa" sorts like "å", and that it should go after "z".  But there
are also cases when the sequence "aa" is just two a's, adjacent to each
other by pure chance.

One of these cases could be the word "dataarkiv", which I found in a Danish
web page
(http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Now: if your Windows or Linux collation states (correctly!) that "aa"
should go after "z", you may have a list ordered like this:

Order A:
1. data
2. Datben, Dr. Keld
3. Datz, Mr. Marco
4. dataarkiv
5. Datåz, Dr. Asmus

But if "dataarkiv" was written using an invisible separator between the two
a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be
like this:

Order B:
1. data
2. dataarkiv
3. Datben, Dr. Keld
4. Datz, Mr. Marco
5. Datåz, Dr. Asmus

Asmus was arguing that List B would be the correct one (and this is
certainly true on, e.g., a dictionary) but, in order to obtain it, the
source text must be properly encoded with invisible separators inserted
where needed.
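
For illustration, both orders can be reproduced with one and the same sort
key, sketched here in Python. The sketch makes several assumptions: the key
function danish_key is made up, it works on a single level only (no case or
accent tie-breaking), it ignores spaces and punctuation, and it is not how
any real Windows or Linux collation is implemented.

```python
SHY = "\u00ad"    # SOFT HYPHEN
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER

# Danish alphabet order: the three extra letters come after z.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def danish_key(entry):
    """Simplified one-level key: adjacent 'aa' counts as 'å'."""
    text = entry.lower()
    text = text.replace("aa", "å")                    # fold the digraph first
    text = text.replace(SHY, "").replace(ZWNJ, "")    # then drop separators
    return [RANK[ch] for ch in text if ch in RANK]    # skip spaces, commas...


plain = ["Datåz, Dr. Asmus", "dataarkiv", "Datz, Mr. Marco",
         "Datben, Dr. Keld", "data"]
order_a = sorted(plain, key=danish_key)

# Same list, but "dataarkiv" carries a soft hyphen between the two a's.
marked = [e.replace("dataarkiv", "data" + SHY + "arkiv") for e in plain]
order_b = sorted(marked, key=danish_key)
```

The algorithm is identical in both runs; only the encoding of "dataarkiv"
differs, because the separator prevents the "aa" fold from matching.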

What I was saying is that the "automatic" Order A is also often used, and I
brought the example of the Dutch phone directories (where "Beijing" is
sorted as if it was "Beying"), and of the Italian encyclopedia (where
"Jefferson" is sorted as if it was "Iefferson").

Michael (michka) Kaplan wrote:
> And this is the *true* answer to the whole mess of attempting 
> *multilingual* sorts -- once the user chooses the sort they
> WANT, the system might handle other language strings in a
> way that might be obscure to those who know the other
> language but the person who expected Danish or whatever 
> will see what they want.

And this is precisely what I was trying to say, although I was not
necessarily talking about multilingual sort ("dataarkiv" seems a purely
Danish word, although derived from Latin roots).

For some users and some usages, the "incorrect" Order A may be much more
useful than the "correct" Order B.  If the rules say that "ij" goes between
"x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-"
and "bez-".

If someone wants Order B (as may be the case for the author of a
dictionary), then they should apply Asmus' suggestion in order to drive the
collation algorithm.

_ Marco




Re: [OT] o-circumflex

2001-09-10 Thread Michael \(michka\) Kaplan

From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>

> Real-life sorts, like MS Windows sorting or Linux sorting, actually
> adhere to these Danish rules, once you have set up your machine for Danish.

And this is the *true* answer to the whole mess of attempting *multilingual*
sorts -- once the user chooses the sort they WANT, the system might handle
other language strings in a way that might be obscure to those who know the
other language but the person who expected Danish or whatever will see what
they want.

Since various sorts openly conflict with each other there is no other
general case solution which would be appropriate, anyway?

(can't believe this thread is still going on!)


MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: [OT] o-circumflex

2001-09-10 Thread Keld Jørn Simonsen

On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> Asmus Freytag wrote:
> > But if you do this, all compound words starting with "data" 
> > and continuing 
> > with another word starting with "a" will be sorted incorrectly!
> > 
> > To achieve this effect, you would have to mark which AAs are 
> > A-Rings and which ones are accidental adjacencies. In Danish
> > one can use the SHY (soft hyphen) [...]
> 
> Real-life sort orders often ignore these subtleties and are often based on a
> small set of rules which is applied blindly, regardless of the origin,
> meaning, or pronunciation of headwords.
> 

Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere
to these Danish rules, once you have set up your machine for Danish.

Kind regards
Keld




Re: [OT] o-circumflex

2001-09-10 Thread Michael Everson

At 18:10 -0400 2001-09-09, John Cowan wrote:
>Keld Jørn Simonsen scripsit:
>
>>  Yes, foreigners call our cities many strange things:-)
>>  København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
>>  and many more.

In Iceland it is Kaupmannahöfn, I believe. In unadorned English that 
would be something like Cheapmenshaven, maybe to weaken as 
Cheapenhaven, in German Kaufenhagen
-- 
Michael Everson




Re: [OT] o-circumflex

2001-09-10 Thread Michael Everson

At 18:04 +0200 2001-09-09, Stefan Persson wrote:

>  > well, the official spelling of the town is Aalborg.
>
>In Sweden it has always been written "Ålborg."

At one stage, in both countries, it was written Álaborg, I suspect, 
as it is in Iceland today.
-- 
Michael Everson




RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Asmus Freytag wrote:
> But if you do this, all compound words starting with "data" 
> and continuing 
> with another word starting with "a" will be sorted incorrectly!
> 
> To achieve this effect, you would have to mark which AAs are 
> A-Rings and which ones are accidental adjacencies. In Danish
> one can use the SHY (soft hyphen) [...]

Real-life sort orders often ignore these subtleties and are often based on a
small set of rules which is applied blindly, regardless of the origin,
meaning, or pronunciation of headwords.

For instance, I have noticed that Dutch telephone directories always sort
the sequence "ij" as if it was "y", regardless of whether it actually occurs
in a Dutch word.  E.g., Beijing Chinese Restaurant would be listed after
Mr. Bex.

Similarly, old Italian encyclopedias (e.g. Dizionario Enciclopedico Treccani)
equated "j" to "i" because, in Italian, the former is just a graphic variant
of the latter.  But this also applied to foreign names such as "Jefferson"
(which was listed between "iee-" and "ieg-"), even though, of course,
spelling it "Iefferson" would not be allowed.
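
Both conventions amount to a blind substitution before comparison, which
can be sketched as follows in Python. The function names and the sample
headwords ("Bezem", "iee", "ieg") are invented for the example, and the keys
deliberately ignore everything else a real collation would handle.

```python
def dutch_directory_key(name):
    """Phone-directory style: every 'ij' is filed as if it were 'y'."""
    return name.lower().replace("ij", "y")


def old_italian_key(headword):
    """Old encyclopedia style: 'j' is filed as a graphic variant of 'i'."""
    return headword.lower().replace("j", "i")


# "Beijing" is filed as "beying", so it comes after "Bex".
dutch = sorted(["Bezem", "Beijing", "Bex"], key=dutch_directory_key)

# "Jefferson" is filed as "iefferson", between the "iee" and "ieg" entries.
italian = sorted(["ieg", "Jefferson", "iee"], key=old_italian_key)
```

Neither key looks at the origin or pronunciation of the word, which is
exactly the "applied blindly" behaviour described above.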

_ Marco




RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

Carl W. Brown wrote:
> In Arabic do you include vowels or not?

Yes, and also consonants sometimes...

Traditional Arabic dictionary sorting uses the three-letter root ("radical")
of a word as the primary key.  So, "madrasa" (school) would be under "d"
(because its radical is "d-r-s" = to learn), ignoring the "ma-" prefix.

I doubt, however, that this system is used with automatic sort orders
generated by computers.

_ Marco




RE: [OT] o-circumflex

2001-09-10 Thread Marco Cimarosti

John Cowan wrote:
> None of which is as weird as Leghorn for Livorno (Italy).

It's as weird as some Italian names for German cities: Aquisgrana for
Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for
München.

_ Marco




Re: [OT] o-circumflex

2001-09-09 Thread
What would these cities be called in Hanzi?


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: Keld Jørn Simonsen <[EMAIL PROTECTED]>;
To: Stefan Persson <[EMAIL PROTECTED]>;
Cc: Keld Jørn Simonsen <[EMAIL PROTECTED]>;"Carl W. Brown" 
<[EMAIL PROTECTED]>;[EMAIL PROTECTED];
Date: 01/09/09 19:31
Subject: Re: [OT] o-circumflex

>On Sun, Sep 09, 2001 at 06:04:30PM +0200, Stefan Persson wrote:
>> - Original Message -
>> From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
>> To: "Carl W. Brown" <[EMAIL PROTECTED]>
>> Cc: <[EMAIL PROTECTED]>
>> Sent: den 9 september 2001 14:21
>> Subject: Re: [OT] o-circumflex
>> 
>> 
>> > On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote:
>> > > Asmus,
>> > >
>> > > If you are entering Danish city names then enter it as Ålborg.  You
>> > > should only use Aalborg where the font does not support Å.  For
>> > > matching logic you can equate Å to Aa then the issue of compound words
>> > > goes away.
>> >
>> > well, the official spelling of the town is Aalborg.
>> 
>> In Sweden it has always been written "Ålborg."
>
>Yes, foreigners call our cities many strange things:-)
>København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
>and many more. Helsingør is called Elsinore. 
>Well, Aalborg is sometimes spelled Ålborg, but the official spelling, as
>defined by zip and postal addresses is 9100 Aalborg, and the kommune is called
>Aalborg kommune, viz www.aalborg.dk . 
>
>Århus is however almost always spelled Århus in Danish.
>
>Kind regards
>Keld
>
>


Re: [OT] o-circumflex

2001-09-09 Thread John Cowan

Keld Jørn Simonsen scripsit:

> Yes, foreigners call our cities many strange things:-)
> København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
> and many more. Helsingør is called Elsinore. 

None of which is as weird as Leghorn for Livorno (Italy).

-- 
John Cowan   http://www.ccil.org/~cowan  [EMAIL PROTECTED]
Please leave your values|   Check your assumptions.  In fact,
   at the front desk.   |  check your assumptions at the door.
 --sign in Paris hotel  |--Miles Vorkosigan




Re: [OT] o-circumflex

2001-09-09 Thread Keld Jørn Simonsen

On Sun, Sep 09, 2001 at 06:04:30PM +0200, Stefan Persson wrote:
> - Original Message -
> From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
> To: "Carl W. Brown" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: den 9 september 2001 14:21
> Subject: Re: [OT] o-circumflex
> 
> 
> > On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote:
> > > Asmus,
> > >
> > > If you are entering Danish city names then enter it as Ålborg.  You
> should
> > > only use Aalborg where the font does not support Å.  For matching logic
> you
> > > can equate Å to Aa then the issue of compound words goes away.
> >
> > well, the official spelling of the town is Aalborg.
> 
> In Sweden it has always been written "Ålborg."

Yes, foreigners call our cities many strange things:-)
København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague,
and many more. Helsingør is called Elsinore. 
Well, Aalborg is sometimes spelled Ålborg, but the official spelling, as
defined by zip and postal addresses is 9100 Aalborg, and the kommune is called
Aalborg kommune, viz www.aalborg.dk . 

Århus is however almost always spelled Århus in Danish.

Kind regards
Keld




Re: [OT] o-circumflex/Spanish sorting

2001-09-09 Thread David Gallardo

I received a private email stating that "ch" and "ll" were abolished by
the 10th Congress of the 12 academies of the various Spanish speaking
countries in 1994, not just the RAE.  (There are, in addition to the
obvious, also academies for Puerto Rico, North America and the Philippines.)

However, it was also my understanding that the modern sort wasn't accepted
outside of Spain, but it's never been clear to me if this is just a matter
of popular or academic opinion, or if there has been formal resistance as
well.

Now I wonder if the various academies have the same authority in their
country that the Royal Academy has in Spain, or if there are other national
standards bodies with which they compete or cooperate.

- David Gallardo

- Original Message -
From: "Tex Texin" <[EMAIL PROTECTED]>
To: "David Gallardo" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Sunday, September 09, 2001 2:15 AM
Subject: Re: [OT] o-circumflex/Spanish sorting


> David,
> I also don't know if the other countries have academies, but my
> understanding is Latin American countries haven't accepted the modern
> sort. Having said that, there is a lot of software that does not
> implement the traditional sort, so "acceptance" is moot.
> (The reason the Real Academia Española did away with the sorting of ch
> and ll is that a majority of software wasn't implementing sorts that
> way.)
>
> tex
>
> David Gallardo wrote:
> >
> > Hi -
> >
> > I know the Real Academia Española decided to do away with "ch" and "ll"
> > in 1994, but do you know if the other Spanish speaking countries'
> > corresponding academies have done the same?
> >
> > - David Gallardo
>
> --
> -
> Tex TexinDirector, International Business
> mailto:[EMAIL PROTECTED]Tel: +1-781-280-4271
> the Progress Company Fax: +1-781-280-4655
> -
>





Re: [OT] o-circumflex

2001-09-09 Thread Stefan Persson

- Original Message -
From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
To: "Carl W. Brown" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: den 9 september 2001 14:21
Subject: Re: [OT] o-circumflex


> On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote:
> > Asmus,
> >
> > If you are entering Danish city names then enter it as Ålborg.  You
> > should only use Aalborg where the font does not support Å.  For matching
> > logic you can equate Å to Aa then the issue of compound words goes away.
>
> well, the official spelling of the town is Aalborg.

In Sweden it has always been written "Ålborg."







Re: [OT] o-circumflex

2001-09-09 Thread Keld Jørn Simonsen

On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote:
> Asmus,
> 
> If you are entering Danish city names then enter it as Ålborg.  You should
> only use Aalborg where the font does not support Å.  For matching logic you
> can equate Å to Aa then the issue of compound words goes away.

well, the official spelling of the town is Aalborg.

Keld




Re: [OT] o-circumflex/Spanish sorting

2001-09-08 Thread Tex Texin

David,
I also don't know if the other countries have academies, but my
understanding is Latin American countries haven't accepted the modern
sort. Having said that, there is a lot of software that does not
implement the traditional sort, so "acceptance" is moot.
(The reason the Real Academia Española did away with the sorting of ch
and ll is that a majority of software wasn't implementing sorts that
way.)

tex

David Gallardo wrote:
> 
> Hi -
> 
> I know the Real Academia Española decided to do away with "ch" and "ll" in
> 1994, but do you know if the other Spanish speaking countries' corresponding
> academies have done the same?
> 
> - David Gallardo

-- 
-
Tex TexinDirector, International Business
mailto:[EMAIL PROTECTED]Tel: +1-781-280-4271
the Progress Company Fax: +1-781-280-4655
-




RE: [OT] o-circumflex

2001-09-08 Thread Jonathan Rosenne

This is not always the right thing to do. For example, with personal names the
person involved may decide whether he prefers the old (AA) spelling or the new
Å. In any case they are equivalent.

Jony

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Carl W. Brown
> Sent: Sunday, September 09, 2001 4:39 AM
> To: [EMAIL PROTECTED]
> Subject: RE: [OT] o-circumflex
>
>
> Asmus,
>
> This discussion reminds me of my ill fated efforts to produce a manageable
> set of rules to do automatic title casing starting with French text.  It
> would have required either special dictionaries or entering the text in a
> special way.  If special text was used, one could enter it in the proper
> title case to begin with.
>
> If you are entering Danish city names then enter it as Ålborg.  You should
> only use Aalborg where the font does not support Å.  For matching logic you
> can equate Å to Aa then the issue of compound words goes away.
>
> Carl
>
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> > Behalf Of Asmus Freytag
> > Sent: Saturday, September 08, 2001 5:56 PM
> > To: Mark Davis; [EMAIL PROTECTED]; Francesco Zappa Nardelli
> > Subject: Re: [OT] o-circumflex
> >
> >
> > At 02:45 PM 9/8/01 -0700, Mark Davis wrote:
> > >If you use a Danish tailoring of the UCA that equates Å and AA
> > (at least at
> > >a primary and secondary level), then they will sort the same
> > way. A string
> > >search that uses the same tailoring will also find "Ålborg" when given
> > >"Aalborg" (and vice versa).
> >
> > But if you do this, all compound words starting with "data" and
> > continuing
> > with another word starting with "a" will be sorted incorrectly!
> >
> > To achieve this effect, you would have to mark which AAs are A-Rings and
> > which ones are accidental adjacencies. In Danish one can use the
> > SHY (soft
> > hyphen) to break the latter, as these accidental pairs occur at
> > legal word
> > break points. In fact, that's the recommended solution, but it requires
> > that the input data are in a specific form.
> >
> > A./
> >
>
>
>





RE: [OT] o-circumflex

2001-09-08 Thread Carl W. Brown

Asmus,

This discussion reminds me of my ill-fated efforts to produce a manageable
set of rules to do automatic title casing starting with French text.  It
would have required either special dictionaries or entering the text in a
special way.  If special text was used, one could enter it in the proper
title case to begin with.

If you are entering Danish city names then enter it as Ålborg.  You should
only use Aalborg where the font does not support Å.  For matching logic you
can equate Å to Aa; then the issue of compound words goes away.

Carl

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Asmus Freytag
> Sent: Saturday, September 08, 2001 5:56 PM
> To: Mark Davis; [EMAIL PROTECTED]; Francesco Zappa Nardelli
> Subject: Re: [OT] o-circumflex
>
>
> At 02:45 PM 9/8/01 -0700, Mark Davis wrote:
> >If you use a Danish tailoring of the UCA that equates Å and AA
> (at least at
> >a primary and secondary level), then they will sort the same
> way. A string
> >search that uses the same tailoring will also find "Ålborg" when given
> >"Aalborg" (and vice versa).
>
> But if you do this, all compound words starting with "data" and
> continuing
> with another word starting with "a" will be sorted incorrectly!
>
> To achieve this effect, you would have to mark which AAs are A-Rings and
> which ones are accidental adjacencies. In Danish one can use the
> SHY (soft
> hyphen) to break the latter, as these accidental pairs occur at
> legal word
> break points. In fact, that's the recommended solution, but it requires
> that the input data are in a specific form.
>
> A./
>





Re: [OT] o-circumflex

2001-09-08 Thread Asmus Freytag

At 02:45 PM 9/8/01 -0700, Mark Davis wrote:
>If you use a Danish tailoring of the UCA that equates Å and AA (at least at
>a primary and secondary level), then they will sort the same way. A string
>search that uses the same tailoring will also find "Ålborg" when given
>"Aalborg" (and vice versa).

But if you do this, all compound words starting with "data" and continuing 
with another word starting with "a" will be sorted incorrectly!

To achieve this effect, you would have to mark which AAs are A-Rings and 
which ones are accidental adjacencies. In Danish one can use the SHY (soft 
hyphen) to break the latter, as these accidental pairs occur at legal word 
break points. In fact, that's the recommended solution, but it requires 
that the input data are in a specific form.
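
A rough Python sketch of the two points above, under loudly stated assumptions: a toy Danish alphabet (a-z, then æ, ø, å), "aa" always contracted to å, and U+00AD SOFT HYPHEN both breaking the contraction and being ignored otherwise. The names `danish_key`, `RANK` and `SHY` are made up for illustration; this is not the UCA or ICU implementation, and "dataanalyse" is only an example compound.

```python
import unicodedata

# Toy Danish alphabet: å sorts last, after æ and ø (assumption for this sketch).
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: position for position, letter in enumerate(DANISH_ALPHABET)}
SHY = "\u00ad"  # SOFT HYPHEN, used here to mark an accidental "a" + "a"


def danish_key(word):
    """Return a list of letter ranks, treating "aa" as å unless SHY splits it."""
    text = unicodedata.normalize("NFC", word).lower()
    weights = []
    i = 0
    while i < len(text):
        if text.startswith("aa", i):
            weights.append(RANK["å"])  # contraction: aa sorts as å
            i += 2
        elif text[i] == SHY:
            i += 1  # ignorable: it only prevents the contraction
        else:
            # Unknown characters are placed after the alphabet.
            weights.append(RANK.get(text[i], len(RANK)))
            i += 1
    return weights


towns = sorted(["Ålborg", "Odense", "Aalborg", "Århus"], key=danish_key)
```

With this key, "Ålborg" and "Aalborg" compare equal and land near the end of the list, "dataanalyse" is wrongly pulled after "database", and "data" + SHY + "analyse" sorts where a reader would expect it.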

A./




Re: [OT] o-circumflex

2001-09-08 Thread DougEwell2

In a message dated 2001-09-08 12:00:43 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  I know the Real Academia Española decided to do away with "ch" and "ll" in
>  1994, but do you know if the other Spanish speaking countries' corresponding
>  academies have done the same?

I have no idea.  I don't know which, if any, even have a language academy.

-Doug Ewell
 Fullerton, California




Re: [OT] o-circumflex

2001-09-08 Thread Mark Davis

If you use a Danish tailoring of the UCA that equates Å and AA (at least at
a primary and secondary level), then they will sort the same way. A string
search that uses the same tailoring will also find "Ålborg" when given
"Aalborg" (and vice versa).

Mark

BTW, internationalized string search is one of the features of ICU 2.0 (see
http://www-124.ibm.com/icu/develop/tasks.html). There are a number of
exceptional cases that have to be handled, due to issues with ignorable
characters, Thai & Lao boundaries, canonical equivalence and contractions
(see
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/searchproposal.html).

—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]
- Original Message -
From: "Francesco Zappa Nardelli" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, September 08, 2001 10:51 AM
Subject: Re: [OT] o-circumflex


> Hello.
>
> >> For example the Danish alphabet starts with an A and ends it with A
> >> ring above.  A Dane would look for Alborg near the end of a list of
> >> towns.
>
> I was in Aalborg fifteen days ago, and I have seen its name written
> both as Ålborg and as Aalborg.  Where does Aalborg appear in a list of
> towns?
>
> -francesco
>
>





Re: [OT] o-circumflex

2001-09-08 Thread David Gallardo

Hi -

I know the Real Academia Española decided to do away with "ch" and "ll" in
1994, but do you know if the other Spanish speaking countries' corresponding
academies have done the same?

- David Gallardo

- Original Message -
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Saturday, September 08, 2001 1:51 AM
Subject: Re: [OT] o-circumflex


> In a message dated 2001-09-07 17:19:49 Pacific Daylight Time,
> [EMAIL PROTECTED] writes:
>
> >  You are quite correct that is why Unicode support differing collation
> >  strengths.  Some times you only care about the actual letters without
> >  diacritics.  But even then letters are locale sensitive.  For example
the
> >  Danish alphabet starts with an A and ends it with A ring above.  A Dane
> >  would look for Alborg near the end of a list of towns.  It is like
having
> >  the Spanish ch follow cz.
>
> That would be Ålborg, right?
>
> I hasten to add that Carl's Spanish example is for the so-called
"traditional
> sort," in contrast to the "modern sort" in which "ch" sorts simply as "c"
> followed by "h".  In many Spanish-speaking communities, particularly here
in
> Alta California, the simplified "modern" sort is by far the more common of
> the two.
>
> -Doug Ewell
>  Fullerton, California
>





Re: [OT] o-circumflex

2001-09-08 Thread Asmus Freytag

At 09:04 PM 9/7/01 -0700, Mark Davis wrote:
>I disagree. What you want is a merged database field. See
>http://www.macchiato.com/slides/icu_collation.ppt
>
>Mark

Mark,

David took the remainder of our discussion off the alias. I won't repeat it 
here, just to note that we've agreed that merged database fields are the 
answer to (some) of the scenarios that we've discussed, but that there are 
cases (like indexing a mixed corpus where both naive and naïve occur) where 
it might indeed make sense to ignore accent differences altogether - 
although, as is often the case, dictionary-based pre- or post processing or 
manual adjustments might give better results yet.

Thanks for your pointer to the presentation.

A./








Re: [OT] o-circumflex

2001-09-08 Thread Francesco Zappa Nardelli

Hello.

>> For example the Danish alphabet starts with an A and ends it with A
>> ring above.  A Dane would look for Alborg near the end of a list of
>> towns.  

I was in Aalborg fifteen days ago, and I have seen its name written
both as Ålborg and as Aalborg.  Where does Aalborg appear in a list of
towns?

-francesco




RE: [OT] o-circumflex

2001-09-08 Thread Carl W. Brown

Doug,

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of [EMAIL PROTECTED]
> Sent: Friday, September 07, 2001 10:52 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: [OT] o-circumflex
>
>
> In a message dated 2001-09-07 17:19:49 Pacific Daylight Time,
> [EMAIL PROTECTED] writes:
>
> >  You are quite correct that is why Unicode support differing collation
> >  strengths.  Some times you only care about the actual letters without
> >  diacritics.  But even then letters are locale sensitive.  For
> example the
> >  Danish alphabet starts with an A and ends it with A ring above.  A Dane
> >  would look for Alborg near the end of a list of towns.  It is
> like having
> >  the Spanish ch follow cz.
>
> That would be Ålborg, right?

That is right.  I am concerned that not everyone can view special
characters.  I think that having an alphabet that goes from A to Å must be
due to the Danish sense of humor.

I also did not use the İ in İstanbul.

>
> I hasten to add that Carl's Spanish example is for the so-called
> "traditional
> sort," in contrast to the "modern sort" in which "ch" sorts simply as "c"
> followed by "h".  In many Spanish-speaking communities,
> particularly here in
> Alta California, the simplified "modern" sort is by far the more
> common of
> the two.
>
Again correct: they also use the modern sort here in Muy Alta California, as
well as in most of the Spanish-speaking world.

There are also the differences between ASCII and EBCDIC sorting.  Talk about
people who are worlds apart.  ;-}

Carl W. Brown
Lafayette, CA






Re: [OT] o-circumflex

2001-09-07 Thread DougEwell2

In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  You are quite correct that is why Unicode support differing collation
>  strengths.  Some times you only care about the actual letters without
>  diacritics.  But even then letters are locale sensitive.  For example the
>  Danish alphabet starts with an A and ends it with A ring above.  A Dane
>  would look for Alborg near the end of a list of towns.  It is like having
>  the Spanish ch follow cz.

That would be Ålborg, right?

I hasten to add that Carl's Spanish example is for the so-called "traditional 
sort," in contrast to the "modern sort" in which "ch" sorts simply as "c" 
followed by "h".  In many Spanish-speaking communities, particularly here in 
Alta California, the simplified "modern" sort is by far the more common of 
the two.

-Doug Ewell
 Fullerton, California




Re: [OT] o-circumflex

2001-09-07 Thread Mark Davis

As a percentage of words in English, it is quite small, but there are still
plenty of homographs, such as:

BASS
BOW(S)
BUFFET
COAX
CLOSE
COMPOUND(S)
CONVERSE
DESERT
DIVERS
DOES
DOVE
ENTRANCE(S)
EXCISE
HARE
INTIMATE
INVALID
LAME
LEAD
LUGER(S)
MANES
MARE(S)
MINUTE
OBJECT(S)
PATENT
POLISH
PRESENT
PRIMER(S)
PROJECT(S)
PUSSY
PUTTING
RAVEN
RE
REFUSE
RESIGN(S)
RESUME(S)
ROW(S)
SEWER(S)
SHOWER(S)
SLAVER
SOW(S)
SYNDICATE(S)
TAXIS
TEAR(S)
TIER(S)
TOWER(S)
VIOLA(S)
WIND(S)
WOUND
ABSENT
ABSTRACT
ABUSE(S)
ADDRESS(ES)
ADVOCATE(S)
AGGREGATE
APPROPRIATE
APPROXIMATE
ARTICULATE
ASSOCIATE(S)
ATTRIBUTE(S)
COMBAT
COMBINE(S)
COMPACT(S)
COMPLEX
CONDUCT
CONFINES
CONFLICT(S)
CONSORT
CONSTRUCT(S)
CONTENT
CONTEST(S)
CONTRACT(S)
CONSUMMATE
CONVERT(S)
CONVICT(S)
COORDINATE(S)
DECREASE(S)
DEFECT(S)
DEGENERATE(S)
DELEGATE(S)
DELIBERATE
DISCHARGE
DOGGED
EJACULATE
ELABORATE
ESCORT(S)
EXCUSE(S)
ESTIMATE(S)
EXTRACT(S)
GRADUATE(S)
HOUSE(S)
IMPLANT(S)
IMPORT(S)
INCLINE(S)
LAMINATE(S)
LEARNED
LEGITIMATE
LIVE(S)
[-]LIVED
MEDIATE(S)
MOBILE (3)
MODERATE(S)
MOUTH
OFFENSE(S)
PERFECT
PERMIT(S)
PREDICATE(S)
PRODUCE
PROGRESS
PROTEST(S)
READ (mis-, proof-)
RECALL(S)
RECORD(S)
REDRESS
REJECT(S)
RETARD(S)
RETREAD(S)
ROUTE(S)
SEPARATE
SUBJECT(S)
SUSPECT(S)
TORMENT(S)
UPSET(S)
USE(S)



—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]
- Original Message -
From: "Asmus Freytag" <[EMAIL PROTECTED]>
To: "Ayers, Mike" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, September 07, 2001 11:52
Subject: RE: [OT] o-circumflex


> At 11:50 AM 9/7/01 -0500, Ayers, Mike wrote:
> >Words with the
> >same spelling and different pronunciation are uncommon but exist in
English,
> >the classic example being "read" and its own past tense.
>
> Actually, this is a bit more common than you think, since the
pronunciation
> of vowels in English depends somewhat systematically on stress, and verb
> and noun forms of many words are stressed differently.
>
> A./
>
>





Re: [OT] o-circumflex

2001-09-07 Thread Mark Davis

I disagree. What you want is a merged database field. See
http://www.macchiato.com/slides/icu_collation.ppt

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]
- Original Message -
From: "Asmus Freytag" <[EMAIL PROTECTED]>
To: "David Gallardo" <[EMAIL PROTECTED]>; "Ayers, Mike"
<[EMAIL PROTECTED]>; "'David Starner'" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: Friday, September 07, 2001 11:50
Subject: Re: [OT] o-circumflex


> At 01:06 PM 9/7/01 -0400, David Gallardo wrote:
> >As a practical matter, you need to take the diacritics into account when
> >sorting, even in English where they (may or may not) have linguistic
> >significance, otherwise you'll get nondeterministic behaviour. In other
> >words, résumé and resume should fall together, but always in the same
order.
>
> Stated absolutely, this is patent, but oft-repeated nonsense. For example,
> it does not always make sense for list of names. An old friend of mine,
Jon
> Proppe, who is an Icelandic art critic, spells his name with an accent
> grave on the first o and an acute accent on the e. In a campus directory
of
> the US university he attended (assuming it did not strip the accents), it
> would make no sense to have his name show up after all the Proppes, or all
> the Jons without an accent (depending on whether its sorted by first or
> last name).
>
> If I sort a list of single words which contains non-unique entries, a
> stable sort would sort the non-unique subsets in the order of their
> appearance in the input. If its not important to distinguish between naive
> and naïve (e.g. in a machine generated index that spans multiple documents
> with differences in the use of accents) its hard to see what's gained in
> splitting the list in two for this case.
>
> On the other hand, if San Jose and San José are correctly and consistently
> distinguished in my input, they should probably sort separately.
>
> The two cases of resume are different yet again, as noted, since one could
> be a verb form.
>
> It all depends not on whether a distinction can be made, but whether it is
> meaningful in the context of the list being sorted.
>
> A./
>
>
>
>
>
>





RE: [OT] o-circumflex

2001-09-07 Thread Carl W. Brown

Asmus,

You are quite correct; that is why Unicode supports differing collation
strengths.  Sometimes you only care about the actual letters without
diacritics.  But even then letters are locale sensitive.  For example, the
Danish alphabet starts with A and ends with A ring above.  A Dane
would look for Alborg near the end of a list of towns.  It is like having
the Spanish ch follow cz.

By providing for different types of collation one can meet the user's
expectations.
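
To make the "ch follows cz" point concrete, here is a small Python sketch that builds a sort key from an alphabet given as a list of letters, where a "letter" may be a digraph. The alphabets and the helper `make_key` are assumptions for illustration only, not any library's Spanish tailoring, and accented vowels are not handled.

```python
import unicodedata

# Traditional Spanish alphabet, with "ch" and "ll" as letters of their own.
TRADITIONAL = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
               "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
               "v", "w", "x", "y", "z"]
# Modern sort: the same alphabet without the digraphs.
MODERN = [letter for letter in TRADITIONAL if len(letter) == 1]


def make_key(alphabet):
    """Build a sort-key function for the given alphabet (hypothetical helper)."""
    rank = {letter: position for position, letter in enumerate(alphabet)}
    digraphs = {letter for letter in alphabet if len(letter) == 2}

    def key(word):
        text = unicodedata.normalize("NFC", word).lower()
        weights = []
        i = 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in digraphs:
                weights.append(rank[pair])  # digraph counts as one letter
                i += 2
            else:
                # Characters outside the alphabet sort after it.
                weights.append(rank.get(text[i], len(alphabet)))
                i += 1
        return weights

    return key


words = ["cuna", "chico", "czar", "color"]
traditional_order = sorted(words, key=make_key(TRADITIONAL))
modern_order = sorted(words, key=make_key(MODERN))
```

The same word list comes out in two different orders depending on which alphabet the user expects, which is the whole argument for selectable collation.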

Then of course you have search, display and sort differences.  If I am
looking for Istanbul it is probably OK even for Turkish locales to match it
to the Turkish spelling which uses a dotted capital I.
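
As a sketch of that kind of loose search match, the Python below folds both strings before comparing them: decompose, drop combining marks, then case-fold. `search_fold` is a hypothetical name, and the approach is an assumption about what "OK to match" might mean, not a description of any product's search. Note that it does not touch dotless ı (U+0131), which has no decomposition.

```python
import unicodedata


def search_fold(text):
    """Fold text for loose matching: strip combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    # U+0130 (İ) decomposes to "I" + U+0307, so the dot above is dropped here.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def loosely_equal(left, right):
    return search_fold(left) == search_fold(right)
```

Folding like this is fine for finding a record, but the result should not be used for display or for a Turkish-aware sort, where I and İ are different letters.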

With languages with multiple diacritics like Vietnamese you have another set
of rules and had better have normalized data.

In Arabic do you include vowels or not?

I remember your discussions of Greek where there are other considerations.

Carl


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Asmus Freytag
> Sent: Friday, September 07, 2001 11:51 AM
> To: David Gallardo; Ayers, Mike; 'David Starner'; [EMAIL PROTECTED]
> Subject: Re: [OT] o-circumflex
>
>
> At 01:06 PM 9/7/01 -0400, David Gallardo wrote:
> >As a practical matter, you need to take the diacritics into account when
> >sorting, even in English where they (may or may not) have linguistic
> >significance, otherwise you'll get nondeterministic behaviour. In other
> >words, résumé and resume should fall together, but always in
> the same order.
>
> Stated absolutely, this is patent, but oft-repeated nonsense. For
> example,
> it does not always make sense for list of names. An old friend of
> mine, Jon
> Proppe, who is an Icelandic art critic, spells his name with an accent
> grave on the first o and an acute accent on the e. In a campus
> directory of
> the US university he attended (assuming it did not strip the accents), it
> would make no sense to have his name show up after all the
> Proppes, or all
> the Jons without an accent (depending on whether its sorted by first or
> last name).
>
> If I sort a list of single words which contains non-unique entries, a
> stable sort would sort the non-unique subsets in the order of their
> appearance in the input. If its not important to distinguish
> between naive
> and naïve (e.g. in a machine generated index that spans multiple
> documents
> with differences in the use of accents) its hard to see what's gained in
> splitting the list in two for this case.
>
> On the other hand, if San Jose and San José are correctly and
> consistently
> distinguished in my input, they should probably sort separately.
>
> The two cases of resume are different yet again, as noted, since
> one could
> be a verb form.
>
> It all depends not on whether a distinction can be made, but
> whether it is
> meaningful in the context of the list being sorted.
>
> A./
>
>
>
>
>





RE: [OT] o-circumflex

2001-09-07 Thread Asmus Freytag

At 11:50 AM 9/7/01 -0500, Ayers, Mike wrote:
>Words with the
>same spelling and different pronunciation are uncommon but exist in English,
>the classic example being "read" and its own past tense.

Actually, this is a bit more common than you think, since the pronunciation 
of vowels in English depends somewhat systematically on stress, and verb 
and noun forms of many words are stressed differently.

A./




Re: [OT] o-circumflex

2001-09-07 Thread Asmus Freytag

At 01:06 PM 9/7/01 -0400, David Gallardo wrote:
>As a practical matter, you need to take the diacritics into account when
>sorting, even in English where they (may or may not) have linguistic
>significance, otherwise you'll get nondeterministic behaviour. In other
>words, résumé and resume should fall together, but always in the same order.

Stated absolutely, this is patent, but oft-repeated nonsense. For example, 
it does not always make sense for a list of names. An old friend of mine, Jon 
Proppe, who is an Icelandic art critic, spells his name with an accent 
grave on the first o and an acute accent on the e. In a campus directory of 
the US university he attended (assuming it did not strip the accents), it 
would make no sense to have his name show up after all the Proppes, or all 
the Jons without an accent (depending on whether it's sorted by first or 
last name).

If I sort a list of single words which contains non-unique entries, a 
stable sort would sort the non-unique subsets in the order of their 
appearance in the input. If it's not important to distinguish between naive 
and naïve (e.g. in a machine-generated index that spans multiple documents 
with differences in the use of accents) it's hard to see what's gained in 
splitting the list in two for this case.
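
A short Python sketch of that behaviour, assuming an accent-insensitive key made by stripping combining marks (the name `primary_key` is mine, and this only approximates a primary-strength comparison). Python's `sorted` is a stable sort, so entries with equal keys keep their input order.

```python
import unicodedata


def primary_key(word):
    """Accent- and case-insensitive key: base letters only."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# Same entries, different input order.
index_a = sorted(["naïve", "mobile", "naive"], key=primary_key)
index_b = sorted(["naive", "mobile", "naïve"], key=primary_key)
```

Here naive and naïve stay together as one group, and their relative order simply follows the order in which they appeared in the input.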

On the other hand, if San Jose and San José are correctly and consistently 
distinguished in my input, they should probably sort separately.

The two cases of resume are different yet again, as noted, since one could 
be a verb form.

It all depends not on whether a distinction can be made, but whether it is 
meaningful in the context of the list being sorted.

A./








Re: [OT] o-circumflex

2001-09-07 Thread Michael \(michka\) Kaplan

From: "David Gallardo" <[EMAIL PROTECTED]>

> As a practical matter, you need to take the diacritics into account when
> sorting, even in English where they (may or may not) have linguistic
> significance, otherwise you'll get nondeterministic behaviour. In other
> words, résumé and resume should fall together, but always in the same
order.

Well, sort of. The issue remains that if one is choosing for their
particular purpose to ignore case (for example) then there is literally no
difference between "Aa" and "aA". Since the two are considered equivalent in
the "case insensitive" comparison, you cannot claim that a sorting algorithm
has errored if it arbitrarily returns one before the other because it
happens to return them in different order.

For a real-world example, this can happen with algorithms where the bottom
item and the anchor are always reordered if b < a and thus you could see
different ordering of items depending on their placement in the list.

A similar thing happens with accent-insensitive sorts -- if you literally
treat "ee" and "éé" as identical because you are using an accent-insensitive
sort, then the ordering is NOT deterministic, nor is it supposed to be. And
there is nothing invalid about non-deterministic ordering of equivalent
items, any more than it would be invalid to put one "ee" before another "ee"
in one case and after it in another.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






RE: [OT] o-circumflex

2001-09-07 Thread Ayers, Mike


> From: David Gallardo [mailto:[EMAIL PROTECTED]] 
> Sent: Friday, September 07, 2001 10:07 AM

> As a practical matter, you need to take the diacritics into 
> account when
> sorting, even in English where they (may or may not) have linguistic
> significance, otherwise you'll get nondeterministic 
> behaviour. In other
> words, résumé and resume should fall together, but always in 
> the same order.

Why?  This may be of interest and benefit to programmers, but not
necessarily to end-users.  The computer should serve the human, not the
other way around, and it is not particularly challenging to come up with
search and sort algorithms which understand the concept of terminal sets
which need to be iterated over to find the final entity as opposed to
terminal entities.  Recall Mike Sykes' post concerning sort order:


Reverting to the question of order, the 'Guide to the New SOED' (a.k.a. Help)
reveals that:


Entries are accessed in strict alphabetical order. ... ; a headword with an
accent or diacritic over a letter follows one consisting of the same
sequence of letters without. ...

The order of headwords which are spelled the same way but have different
parts of speech is as follows:

noun (abbreviated n.)
pronoun (abbreviated pron.)
adjective (abbreviated a.)
verb (abbreviated v.)
...



This explicit ordering will still be insufficient if we choose to
include verb tenses in our word list, whence we get the two "read"s.  If
someone has a reason why these two words need to be in the same order in
everyone's word list, I'll listen...


/|/|ike




RE: [OT] o-circumflex

2001-09-07 Thread Timothy Greenwood

> There is also no word pair separated only by the I/J 
> distinction (in English), right?

iamb - as in iambic pentameter
jamb - as in a door jamb


Re: [OT] o-circumflex

2001-09-07 Thread David Gallardo

As a practical matter, you need to take the diacritics into account when
sorting, even in English where they (may or may not) have linguistic
significance, otherwise you'll get nondeterministic behaviour. In other
words, résumé and resume should fall together, but always in the same order.
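
One way to get that in code is a two-part key: the base letters first, then a tie-breaker that takes the accents into account. The Python below is only a sketch; `base_letters` and `deterministic_key` are hypothetical names, and using the NFC string's code point order as the tie-breaker is an assumption standing in for real secondary collation weights.

```python
import unicodedata


def base_letters(word):
    """Letters with combining marks removed, case-folded."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def deterministic_key(word):
    # First level: base letters, so accented and plain forms fall together.
    # Second level: the normalized string itself, so ties always break the same way.
    return (base_letters(word), unicodedata.normalize("NFC", word))


one_way = sorted(["résumé", "resume", "result"], key=deterministic_key)
other_way = sorted(["result", "resume", "résumé"], key=deterministic_key)
```

Whatever order the input arrives in, the unaccented form comes out just before the accented one.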

Someone in another message mentioned "ñ". This is a different case in
principle, because in Spanish it's not a case of a letter modified by a
diacritic--it's an entirely different letter. (It used to be written as two
side-by-side "n"s and then they got stacked.)  Again, as a practical matter, in
English, it's most common to ignore the greater distinction (because we
have only 26 letters in our alphabet), and to treat it as a letter +
diacritic for the same considerations as above.

- Original Message -
From: "Ayers, Mike" <[EMAIL PROTECTED]>
To: "'David Starner'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, September 06, 2001 5:12 PM
Subject: RE: [OT] o-circumflex


>
> > From: David Starner [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, September 06, 2001 01:40 PM
>
> > On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
> > > The only little thing to know about French and diacritical
> > mark is that when
> > > doing a sort diacritical mark are evaluated from right to
> > left.  (e.g.
> > > "cote" < "côte" < "coté" vs the English order "cote" <
> > "coté" < "côte" ).
> >
> > I'm not sure there is an established English sort order. It's not a
> > problem that comes up much in English.
>
> I believe that there is an established sort order in English, which
> is to sort without regard to diacritics, or else we'd never find the
words!
> In English (American English more than British English), diacritics are
> considered optional, and it is common to see "naïve" written "naive", "San
> José" written "San Jose", etc.  Especially amongst Americans, the two are
> considered equivalent, and I know of no word pair in all of English which
is
> separated only by a diacritic.
>
>
> /|/|ike
>





RE: [OT] o-circumflex

2001-09-07 Thread Ayers, Mike


> From: J M Sykes [mailto:[EMAIL PROTECTED]] 
> Sent: Friday, September 07, 2001 07:50 AM

> The classic example is 'resume' and 'résumé'. These are, by 
> now, two quite
> distinct words, and the fact that there is no 'established' 
> order is shown

I spell both "resume" and have never been corrected.  Words with the
same spelling and different pronunciation are uncommon but exist in English,
the classic example being "read" and its own past tense.  Since there are no
diacritics in English proper, the two "resume"s tend to fall into this
category.  The diacritics which often appear on one of them really only
serve to mark it as a loan word, since it is very difficult to come up with
a sentence in which the two could be confused.


/|/|ike




Re: [OT] o-circumflex

2001-09-07 Thread



じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


>
>Who'd be a lexicographer?


Me?




>
>Mike.
>
>***
>
>J M Sykes  Email: [EMAIL PROTECTED]
>97 Oakdale Drive
>Heald Green
>CHEADLE
>Cheshire   SK8 3SN
>UKTel: (44) 161 437 5413
>
>***
>
>
>
>


Re: [OT] o-circumflex

2001-09-07 Thread

There is also no word pair separated only by the I/J distinction (in English), right?

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


I know of no word pair in all of English which
>is
>> separated only by a diacritic.
>


Re: [OT] o-circumflex

2001-09-07 Thread J M Sykes

>
> I believe that there is an established sort order in English, which
> is to sort without regard to diacritics, or else we'd never find the
words!
> In English (American English more than British English), diacritics are
> considered optional, and it is common to see "naïve" written "naive", "San
> José" written "San Jose", etc.  Especially amongst Americans, the two are
> considered equivalent, and I know of no word pair in all of English which
is
> separated only by a diacritic.

That depends what you mean by 'established' ;-)

The classic example is 'resume' and 'résumé'. These are, by now, two quite
distinct words, and the fact that there is no 'established' order is shown
by the fact that the New Shorter Oxford English Dictionary (Version: 1.0.4,
Data version: 02.10.96s, January 1997, on disk) has them in the order:
'résumé', 'resume' while the New Oxford Dictionary of English (Clarendon
Press, 1998) has 'resume', 'resumé'. The Concise Oxford Dictionary (of
Current English, Clarendon Press, 1982, edited, as it happens, by a second
cousin of mine) also has 'resume', 'résumé'.

Evidently, we see here evidence that the diacritic on the first 'e' has
become optional since 1982, though not that on the second, presumably
because that 'e' might otherwise be supposed to be silent.

Reverting to the question of order, the 'Guide to the New SOED' (a.k.a. Help)
reveals that:


Entries are accessed in strict alphabetical order. ... ; a headword with an
accent or diacritic over a letter follows one consisting of the same
sequence of letters without. ...

The order of headwords which are spelled the same way but have different
parts of speech is as follows:

noun (abbreviated n.)
pronoun (abbreviated pron.)
adjective (abbreviated a.)
verb (abbreviated v.)
...


And scrutiny of the two entries of interest reveals that 'résumé' is both a
noun and a verb, whereas  'resume' is only a verb.

Perhaps the ordering of 'résumé' before 'resume' is a mistake; perhaps not.
I can't ask my aforesaid second cousin, because he's no longer with us.

Who'd be a lexicographer?

Mike.

***

J M Sykes  Email: [EMAIL PROTECTED]
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire   SK8 3SN
UKTel: (44) 161 437 5413

***






RE: [OT] o-circumflex

2001-09-07 Thread James E. Agenbroad

On Thu, 6 Sep 2001, Ayers, Mike wrote:

> 
> > From: David Starner [mailto:[EMAIL PROTECTED]] 
> > Sent: Thursday, September 06, 2001 01:40 PM
> 
> > On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
> > > The only little thing to know about French and diacritical 
> > mark is that when
> > > doing a sort diacritical mark are evaluated from right to 
> > left.  (e.g.
> > > "cote" < "côte" < "coté" vs the English order "cote" <  
> > "coté" < "côte" ).
> > 
> > I'm not sure there is an established English sort order. It's not a 
> > problem that comes up much in English. 
> 
>   I believe that there is an established sort order in English, which
> is to sort without regard to diacritics, or else we'd never find the words!
> In English (American English more than British English), diacritics are
> considered optional, and it is common to see "naïve" written "naive", "San
> José" written "San Jose", etc.  Especially amongst Americans, the two are
> considered equivalent, and I know of no word pair in all of English which is
> separated only by a diacritic.
> 
 Friday, September 7, 2001
Librarians have *filing* rules--the American Library Association (ALA) and
the Library of Congress (LC) each issued some in, I think, 1980.  I
believe they both say to ignore diacritics because Americans do not
recognize that they have an order.  These days filing in vendor software
for libraries tends to follow neither one very closely--the phrase
"more honored in the breach than the observance" comes to mind.  I may be
wrong but I do not believe there is an established U.S. standard for
sorting/filing.  A few years ago a National Information Standards
Organization (NISO) committee drafted one but it didn't get the
votes needed to become an accepted standard.  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





Re: [OT] o-circumflex

2001-09-07 Thread Bertrand Laidain

>I would say it is a variant of "o" we just called it... "o with a circumflex
>accent" ("o avec un accent circonflex"). The difference between "o" and "ô"
>is normally audible (for a French speaker). The relationship is the same
>than with any other letter which sometimes have accents (e.g. "a" and "à",
>"e" and "è", etc.).

"o" avec un accent circonflexe, with an "e" at the end. According to the
"Petit Robert" (a French dictionary), the circumflex marks a long vowel
(e.g. île for isle (Old French)) or avoids confusion between two
words (e.g. du and dû). The pronunciation of "ô" is closed (o fermé),
as opposed to "o" without the accent. But Thierry is right: it's a letter
with an accent, like à and è, not a distinct grapheme.

Bertrand

>The only little thing to know about French and diacritical mark is that when
>doing a sort diacritical mark are evaluated from right to left.  (e.g.
>"cote" < "côte" < "coté" vs the English order "cote" <  "coté" < "côte" ).
>Cheers,
>Thierry


>How do Francophones view the o-circumflex "ô" in relation to the letter "o"?
>Is it a distinct grapheme, or is it considered a variant of "o"?
>- Peter





RE: [OT] o-circumflex

2001-09-06 Thread
Sorry about the kana. My mailer is Japanese.


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town


--- Original Message ---
From: "Ayers, Mike" <[EMAIL PROTECTED]>;
To: 'David Starner' <[EMAIL PROTECTED]>;[EMAIL PROTECTED];
Cc: 
Date: 01/09/06 21:12
Subject: RE: [OT] o-circumflex

>
>> From: David Starner [mailto:[EMAIL PROTECTED]] 
>> Sent: Thursday, September 06, 2001 01:40 PM
>
>> On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
>> > The only little thing to know about French and diacritical 
>> mark is that when
>> > doing a sort diacritical mark are evaluated from right to 
>> left.  (e.g.
>> > "cote" < "côte" < "coté" vs the English order "cote" <  
>> "coté" < "côte" ).
>> 
>> I'm not sure there is an established English sort order. It's not a 
>> problem that comes up much in English. 
>
>   I believe that there is an established sort order in English, which
>is to sort without regard to diacritics, or else we'd never find the words!
>In English (American English more than British English), diacritics are
>considered optional, and it is common to see "naïve" written "naive", "San
>José" written "San Jose", etc.  Especially amongst Americans, the two are
>considered equivalent, and I know of no word pair in all of English which is
>separated only by a diacritic.

I believe that the origin of the problem is the typewriter / word-processor. The 
English typewriter / word-processor is only designed to handle 26 letters (52 if you 
count case). Diacritics are impossible on a typewriter and very difficult on a word 
processor. In handwriting, the problem is non-existent.

Think of Tendou Kasumi getting the medical scholarship she always wanted, and getting 
to study abroad. She would likely e-mail her old friends / family in romaji, but 
snail-mail them in kana / kanji.

I like the freedom of a pen, so I can write kana and even draw.

As for your word pair:

1. To continue after a pause

2. Curriculum vitae


If only technology did not change the way we write like it does.

And why should not "o with accent" be considered as different from "o" as either is, 
say, from "u"? If that is the case:
"R" is "P with stroke"
(hiragana) "Ho" is "ha with stroke"
"Ru" is "Ro with loop"
(Thai) "five" is "four with loop"
and... my favorite... Latin "G" is "C" with stroke, and history WILL back me on that 
one!


>
>
>/|/|ike
>
>


Re: [OT] o-circumflex

2001-09-06 Thread James Kass


David Starner wrote:

> Yes, but I mean for "cote", "côte", and "coté". How would you
> sort those three in English? I'd probably sort it by some
> extra-lingual information:  i.e. page number, date of birth
> or the like.

Store them as UTF-8, do a DOS sort, and call the results
"the new World order"?

Best regards,

James Kass.







Re: [OT] o-circumflex

2001-09-06 Thread Alex Bochannek

My impression is that, at least in U.S. states with large populations
of native Spanish speakers, the one diacritic that English speakers
frequently view as non-optional for differentiating two words
(specifically proper names) is the tilde, as used for the eñe. There is
a college in Redwood City, CA, called Cañada College, which is off
Cañada Road. I haven't checked thoroughly, but I believe most road
signs there use the eñe. I do know of one highway exit in the area,
though, which spells it "Canada College".

Alex.




RE: [OT] o-circumflex

2001-09-06 Thread Ayers, Mike


> From: David Starner [mailto:[EMAIL PROTECTED]] 
> Sent: Thursday, September 06, 2001 01:40 PM

> On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
> > The only little thing to know about French and diacritical 
> mark is that when
> > doing a sort diacritical mark are evaluated from right to 
> left.  (e.g.
> > "cote" < "côte" < "coté" vs the English order "cote" <  
> "coté" < "côte" ).
> 
> I'm not sure there is an established English sort order. It's not a 
> problem that comes up much in English. 

I believe that there is an established sort order in English, which
is to sort without regard to diacritics, or else we'd never find the words!
In English (American English more than British English), diacritics are
considered optional, and it is common to see "naïve" written "naive", "San
José" written "San Jose", etc.  Especially amongst Americans, the two are
considered equivalent, and I know of no word pair in all of English which is
separated only by a diacritic.
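
A minimal sketch of sorting "without regard to diacritics", assuming that
stripping Unicode combining marks after canonical decomposition is an
acceptable stand-in for that practice. The function names are illustrative
only and do not come from any standard or library:

```python
import unicodedata

def strip_diacritics(text):
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def english_key(text):
    """Sort key that ignores both diacritics and case (an assumption)."""
    return strip_diacritics(text).casefold()

names = ["San José", "naive", "San Jose", "naïve"]

# Python's sort is stable, so spellings that differ only by a diacritic
# keep their input order relative to each other.
ordered = sorted(names, key=english_key)
print(ordered)  # ['naive', 'naïve', 'San José', 'San Jose']
```

With this key the accented and unaccented spellings compare as equal, so
neither is placed consistently before the other.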


/|/|ike




Re: [OT] o-circumflex

2001-09-06 Thread David Starner

On Thu, Sep 06, 2001 at 04:12:28PM -0500, Ayers, Mike wrote:
> 
> > From: David Starner [mailto:[EMAIL PROTECTED]] 
> > Sent: Thursday, September 06, 2001 01:40 PM
> 
> > On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
> > > The only little thing to know about French and diacritical 
> > mark is that when
> > > doing a sort diacritical mark are evaluated from right to 
> > left.  (e.g.
> > > "cote" < "côte" < "coté" vs the English order "cote" <  
> > "coté" < "côte" ).
> > 
> > I'm not sure there is an established English sort order. It's not a 
> > problem that comes up much in English. 
> 
>   I believe that there is an established sort order in English, which
> is to sort without regard to diacritics, or else we'd never find the words!

Yes, but I mean for "cote", "côte", and "coté". How would you sort those
three in English? I'd probably sort it by some extra-lingual information:
i.e. page number, date of birth or the like.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg




Re: [OT] o-circumflex

2001-09-06 Thread David Starner

On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
> The only little thing to know about French and diacritical mark is that when
> doing a sort diacritical mark are evaluated from right to left.  (e.g.
> "cote" < "côte" < "coté" vs the English order "cote" <  "coté" < "côte" ).

I'm not sure there is an established English sort order. It's not a 
problem that comes up much in English. 

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg




Re: [OT] o-circumflex

2001-09-06 Thread Thierry Sourbier

> Is it a distinct grapheme, or is it considered a variant of "o"?

I would say it is a variant of "o"; we just call it... "o with a circumflex
accent" ("o avec un accent circonflex"). The difference between "o" and "ô"
is normally audible (for a French speaker). The relationship is the same
as with any other letter that sometimes takes an accent (e.g. "a" and "à",
"e" and "è", etc.).

The only little thing to know about French and diacritical marks is that, when
sorting, diacritical marks are evaluated from right to left (e.g.
"cote" < "côte" < "coté" vs. the English order "cote" < "coté" < "côte").

I'm just talking as a French Francophone, not a linguist. Maybe someone on
this list knows why diacritical marks are sorted in French in such a funky
way :).
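
A minimal sketch of the two orders, assuming accents act only as a
tie-breaker once the unaccented letters are equal, and that ranking the
marks by code point is good enough for this example (real collation tables
assign their own weights). The function names are illustrative only:

```python
import unicodedata

def split_levels(word):
    """Split a word into its unaccented letters and a per-letter accent list."""
    letters, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch       # attach the mark to the preceding letter
        else:
            letters.append(ch.lower())
            accents.append("")      # no accent seen yet on this letter
    return "".join(letters), accents

def left_to_right_key(word):
    """Letters first, then accents compared from the start of the word."""
    letters, accents = split_levels(word)
    return letters, accents

def french_key(word):
    """Letters first, then accents compared from the end of the word."""
    letters, accents = split_levels(word)
    return letters, accents[::-1]

words = ["coté", "côte", "cote"]
print(sorted(words, key=left_to_right_key))  # ['cote', 'coté', 'côte']
print(sorted(words, key=french_key))         # ['cote', 'côte', 'coté']
```

The only difference between the two keys is the reversal of the accent
list, which is what makes the last accent in the word the most significant.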

Cheers,
Thierry

<><><><><><><><><><><><><><><><><><><><><><>
www.i18ngurus.com - Open Internationalization Resources Directory

- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, September 06, 2001 3:08 PM
Subject: [OT] o-circumflex



How do Francophones view the o-circumflex "ô" in relation to the letter "o"?
Is it a distinct grapheme, or is it considered a variant of "o"?


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>