Re: Collation (was RE: [OT] o-circumflex)
On Thu, Sep 13, 2001 at 12:40:30AM -0700, Edward Cherlin wrote:
: For example,
:
: 1984 (Nineteen Eighty Four)
: 1066 and all that (Ten Sixty Six)
: 3001 (Three Thousand One)
: 2050 (Twenty Fifty)
: 2010 (Twenty Ten)
: 2001, A Space Odyssey (Two Thousand One)

You're missing the "and" from 3001 and 2001. I know Merkins often leave
it out, but a number of us always use it and feel it's wrong without. :-)

Putting dialect aside, you may find that 2050 and possibly 2010 will be
said "two thousand (and) whatever".

The problem here is that there's no single way to spell out numbers in
English, so no single way to alphabetise. It's better to sort numbers
numerically, and then you only have to decide the order for negative
numbers.

-- 
Christopher Vance
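Sorting numerically, as suggested above, might look roughly like the following sketch. The class name NumericTitleOrder and its helpers are invented for illustration and come from no library. It assumes that only runs of ASCII digits count as numbers, compares everything else by plain character value, and leaves the question of negative numbers open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not from any library): orders strings so that runs of
// ASCII digits compare by numeric value instead of character by character.
class NumericTitleOrder implements Comparator<String> {

    @Override
    public int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                int endA = digitRunEnd(a, i), endB = digitRunEnd(b, j);
                int cmp = compareDigitRuns(a.substring(i, endA), b.substring(j, endB));
                if (cmp != 0) return cmp;
                i = endA;
                j = endB;
            } else if (ca != cb) {
                return Character.compare(ca, cb);
            } else {
                i++;
                j++;
            }
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static int digitRunEnd(String s, int from) {
        int k = from;
        while (k < s.length() && isDigit(s.charAt(k))) k++;
        return k;
    }

    // Compares two digit strings by value without parsing, so length is unbounded.
    private static int compareDigitRuns(String x, String y) {
        x = stripLeadingZeros(x);
        y = stripLeadingZeros(y);
        if (x.length() != y.length()) return Integer.compare(x.length(), y.length());
        return x.compareTo(y);
    }

    private static String stripLeadingZeros(String s) {
        int k = 0;
        while (k < s.length() - 1 && s.charAt(k) == '0') k++;
        return s.substring(k);
    }

    public static void main(String[] args) {
        List<String> titles = new ArrayList<>(Arrays.asList(
                "1984", "1066 and all that", "3001", "2050", "2010",
                "2001, A Space Odyssey"));
        titles.sort(new NumericTitleOrder());
        titles.forEach(System.out::println);
    }
}
```

Comparing the digit runs as strings (length first, then character by character) avoids overflow for very long numbers, which seemed a safer default for catalogue data than parsing into an int.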
Re: Alternative sorting for digraphs (Was Re: [OT] o-circumflex)
On Mon, 10 Sep 2001, Mark Davis wrote:

> A ZWNJ will break ligatures and cursive connections. While probably safe in
> Danish or Dutch, it is unclear to me that that is safe in all languages
> where this situation occurs. There are digraphs in Urdu, for example. While
> I don't know their sorting order, if they do sort separately then ZWNJ can't
> be used to express the alternative sorting, since it would give the wrong
> rendering.

:'-(

I would like to ask people to stop overusing ZWNJ. I once loved that
character...

What about *renaming* the character to "Zero Width All-Purpose Everything
Breaker"?

roozbeh
Re: Collation (was RE: [OT] o-circumflex)
In the latest ICU, we took the work we did for Java collation and extended
it substantially (and made it many times faster). It also allows arbitrary
customization at runtime. I happen to be giving a presentation on it in a
few hours at the conference.

For more information, see the draft collation chapter in the User Guide at
http://oss.software.ibm.com/icu/. The presentation (a slightly older draft)
is on my site at www.macchiato.com.

Mark
-----
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα - Όμήρου Μαργίτῃ
[http://www.macchiato.com]

----- Original Message -----
From: "David Gallardo" <[EMAIL PROTECTED]>
To: "Edward Cherlin" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, September 13, 2001 8:35 AM
Subject: Re: Collation (was RE: [OT] o-circumflex)

> Java's collation class has a rule-based collator that is in effect
> programmable using a little language. Here is an example from Sun's API
> doc, for Norwegian:
>
>     String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
>         "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
>         "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
>         "< å=a\u030A,Å=A\u030A" +
>         ";aa,AA< æ,Æ< ø,Ø";
>     RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
>
> There is also syntax for things such as specifying reverse order (for
> French accents, for example), contraction and expansion.
>
> - David Gallardo
Re: Collation (was RE: [OT] o-circumflex)
Java's collation class has a rule-based collator that is in effect
programmable using a little language. Here is an example from Sun's API
doc, for Norwegian:

    String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
        "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
        "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
        "< å=a\u030A,Å=A\u030A" +
        ";aa,AA< æ,Æ< ø,Ø";
    RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);

There is also syntax for things such as specifying reverse order (for
French accents, for example), contraction and expansion.

- David Gallardo

----- Original Message -----
From: "Edward Cherlin" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, September 13, 2001 3:40 AM
Subject: Collation (was RE: [OT] o-circumflex)

> English and several other languages have dozens of collations. Compare
> telephone books, library catalogs, book indexes (sic), and other sorted
> data. Knuth vol. 3, Sorting and Searching, gives an example of a set of
> library sorting rules that runs to more than a page, and suggests
> programming it as an exercise. ;-) Among the rules are to spell out
> numbers.
>
> For example,
>
> 1984 (Nineteen Eighty Four)
> 1066 and all that (Ten Sixty Six)
> 3001 (Three Thousand One)
> 2050 (Twenty Fifty)
> 2010 (Twenty Ten)
> 2001, A Space Odyssey (Two Thousand One)
>
> Bell Labs invented a whole programming language, Snobol, to deal with
> telephone listing conversions, matches, and sorts. Many phone books sort
> Mc- and Mac- together, others one after the other but separate from other
> names.
>
> Edward Cherlin
> Generalist
> "A knot! Oh, do let me help to undo it."
> Alice in Wonderland
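To see such a rule string at work, here is a minimal sketch using java.text.RuleBasedCollator. It is not the Sun example itself: the rules are built in a loop, the extra letters are placed in the order æ, ø, å (the order the Norwegian alphabet ends in), and an "Aa" spelling is added on the assumption that capitalised town names should be caught too. The class name and the town list are made up for the demonstration.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

class NorwegianRulesSketch {

    static RuleBasedCollator build() {
        StringBuilder rules = new StringBuilder();
        // "< a, A < b, B ... < z, Z": each letter is a primary step, case is tertiary.
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ")
                 .append(Character.toUpperCase(c)).append(' ');
        }
        rules.append("< \u00E6, \u00C6 ");                      // ae ligature
        rules.append("< \u00F8, \u00D8 ");                      // o with stroke
        rules.append("< \u00E5 = a\u030A, \u00C5 = A\u030A ");  // a with ring above
        rules.append("; aa, Aa, AA");                           // digraph spelling of that letter
        try {
            return new RuleBasedCollator(rules.toString());
        } catch (ParseException e) {
            throw new IllegalStateException("rule string did not parse", e);
        }
    }

    public static void main(String[] args) {
        String[] towns = { "Zwolle", "Aalborg", "Oslo", "\u00C5lesund", "Bergen" };
        Arrays.sort(towns, build());
        System.out.println(String.join(", ", towns));
    }
}
```

Because "aa" is declared as a contraction, the collator treats the pair as one collation element, which is what lets Aalborg land after Zwolle rather than at the top of the list.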
Collation (was RE: [OT] o-circumflex)
Whoever invented English number words, then, had a very sick sense of
humour. Why doesn't the word for "one" start with "a", the word for "two"
with "b", etc.?

じゅういっちゃん (Juuitchan)

Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town

--- Original Message ---
From: Edward Cherlin <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc:
Date: 01/09/13 7:40
Subject: Collation (was RE: [OT] o-circumflex)

>English and several other languages have dozens of collations. Compare
>telephone books, library catalogs, book indexes (sic), and other sorted
>data. Knuth vol. 3, Sorting and Searching, gives an example of a set of
>library sorting rules that runs to more than a page, and suggests
>programming it as an exercise. ;-) Among the rules are to spell out
>numbers.
>
>For example,
>
>1984 (Nineteen Eighty Four)
>1066 and all that (Ten Sixty Six)
>3001 (Three Thousand One)
>2050 (Twenty Fifty)
>2010 (Twenty Ten)
>2001, A Space Odyssey (Two Thousand One)
>
>Bell Labs invented a whole programming language, Snobol, to deal with
>telephone listing conversions, matches, and sorts. Many phone books sort
>Mc- and Mac- together, others one after the other but separate from other
>names.
>
>Edward Cherlin
>Generalist
>"A knot! Oh, do let me help to undo it."
>Alice in Wonderland
Collation (was RE: [OT] o-circumflex)
English and several other languages have dozens of collations. Compare
telephone books, library catalogs, book indexes (sic), and other sorted
data. Knuth vol. 3, Sorting and Searching, gives an example of a set of
library sorting rules that runs to more than a page, and suggests
programming it as an exercise. ;-) Among the rules are to spell out
numbers.

For example,

1984 (Nineteen Eighty Four)
1066 and all that (Ten Sixty Six)
3001 (Three Thousand One)
2050 (Twenty Fifty)
2010 (Twenty Ten)
2001, A Space Odyssey (Two Thousand One)

Bell Labs invented a whole programming language, Snobol, to deal with
telephone listing conversions, matches, and sorts. Many phone books sort
Mc- and Mac- together, others one after the other but separate from other
names.

Edward Cherlin
Generalist
"A knot! Oh, do let me help to undo it."
Alice in Wonderland

> -----Original Message-----
> Behalf Of Michael (michka) Kaplan
> Sent: Mon, September 10, 2001 8:36 AM
>
> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> > Michael, that isn't the point. There is a problem even when you stick
> > to one language.
>
> By that time, many languages may have TWO collations, since users have
> been expecting something else for the last few decades?
>
> MichKa
>
> Michael Kaplan
> Trigeminal Software, Inc.
> http://www.trigeminal.com/
Re: [OT] o-circumflex
* Lars Marius Garshol
|
| I am not sure of this, but I think 'å' is a relatively modern
| invention, and that it was originally written only as 'aa'.

* Stefan Persson
|
| FYI, "a relatively modern invention" means that it has been used
| since the Middle Ages (in Swedish).

I don't think that is the case in Norwegian and Danish. The Norwegian
constitution from 1814, for example, uses 'ø' and 'æ', but never 'å'.
Possibly this was a Swedish invention only adopted later by the Danes
and Norwegians.

--Lars M.
Re: [OT] o-circumflex
On Tue, Sep 11, 2001 at 06:27:20PM +0200, Stefan Persson wrote:
> ----- Original Message -----
> From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
> To: "Stefan Persson" <[EMAIL PROTECTED]>
> Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Michael (michka) Kaplan"
> <[EMAIL PROTECTED]>; "Keld Jørn Simonsen" <[EMAIL PROTECTED]>;
> <[EMAIL PROTECTED]>
> Sent: 10 September 2001 22:12
> Subject: Re: [OT] o-circumflex
>
> > Where is this done for Swedish? I have read both the TN and the SIS
> > standard, and I don't believe these say anything about sorting
> > ü according to either German or Dutch sounds. Rolf Gavare does not
> > say anything along these lines either, as far as I can remember.
>
> This is the sorting used in dictionaries, encyclopædias, phone books etc.
> For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts
> "myskoxe/müsli/mysning."

Yes, I can understand that. In Danish we have the same rule.
But do you have examples of Dutch words that are ordered in another way?
That is, you need to know the origin of the word to sort it.

Kind regards
keld
Re: [OT] o-circumflex
----- Original Message -----
From: "Lars Marius Garshol" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: 10 September 2001 22:45
Subject: Re: [OT] o-circumflex

> I am not sure of this, but I think 'å' is a relatively modern
> invention, and that it was originally written only as 'aa'.

FYI, "a relatively modern invention" means that it has been used since
the Middle Ages (in Swedish).

Stefan
Re: [OT] o-circumflex
----- Original Message -----
From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
To: "Stefan Persson" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Michael (michka) Kaplan"
<[EMAIL PROTECTED]>; "Keld Jørn Simonsen" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: 10 September 2001 22:12
Subject: Re: [OT] o-circumflex

> Where is this done for Swedish? I have read both the TN and the SIS
> standard, and I don't believe these say anything about sorting
> ü according to either German or Dutch sounds. Rolf Gavare does not
> say anything along these lines either, as far as I can remember.

This is the sorting used in dictionaries, encyclopædias, phone books etc.
For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts
"myskoxe/müsli/mysning."

Stefan
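The SAOL ordering quoted above can be reproduced with a small rule table in which ü shares the primary weight of y and differs only at the accent level. The sketch below uses java.text.RuleBasedCollator. The class name is invented, and the table deliberately leaves out å, ä and ö and the German/Dutch distinction, so it illustrates the one rule only and is not a Swedish collation.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

class UmlautAsYSketch {

    static RuleBasedCollator build() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ")
                 .append(Character.toUpperCase(c)).append(' ');
            if (c == 'y') {
                // u-umlaut gets y's primary weight and differs only as an accent variant.
                rules.append("; \u00FC = u\u0308, \u00DC = U\u0308 ");
            }
        }
        try {
            return new RuleBasedCollator(rules.toString());
        } catch (ParseException e) {
            throw new IllegalStateException("rule string did not parse", e);
        }
    }

    public static void main(String[] args) {
        String[] words = { "mysning", "m\u00FCsli", "myskoxe" };
        Arrays.sort(words, build());
        System.out.println(String.join("/", words));
    }
}
```

Since the accent difference is only secondary, the following letters (k, l, n) decide the order before the y/ü distinction is ever consulted.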
Re: [OT] o-circumflex
Way OT by now...

> AAARRRGGHHH
>
> I give up!
>
> I was hoping that there is SOME system that would give these cities
> UNIQUE names... postal codes???

Ain't reality a bitch?

What you're looking for doesn't exist in the world of natural language
names -- it can only exist in artificially constructed global geographic
databases, where people may have assigned unique keys to cities. And even
there, the geographic experts are going to argue over the exact meaning
of terms.

Is "Los Angeles" the incorporated city presided over by the mayor, or does
it include all the other small cities that Los Angeles surrounds and
engulfs, or does it include unincorporated parts of Los Angeles county, or
does it refer to Greater Los Angeles, the metropolitan area, or is it
related to Los Angeles county? Not such a simple distinction, sometimes.
San Francisco is a city *and* a county, and the mayor of the city is also
mayor of the county. The mayor of New York is mayor of five boroughs, the
moral equivalent of counties.

Is Stonyford, California (population 150), a "city"? It isn't incorporated
as a city, or even a town, but it is an independent geographic location
that occurs as a "town" on maps. Where do you draw the line between named
localities and cities? Do you depend on legally incorporated city status?
But what if the laws don't match up between different countries? How am I
going to know that "cities" in Burkina Faso match the same criteria I use
to designate "cities" in the United States or Japan?

Some cities have multiple postal codes, and some postal codes cover
multiple cities. And while postal codes are subject to international
treaty, how countries divide their territories up and use the codes is
still up to them.

--Ken
Re: [OT] o-circumflex
AAARRRGGHHH

I give up!

I was hoping that there is SOME system that would give these cities
UNIQUE names... postal codes???

じゅういっちゃん (Juuitchan)

Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town

--- Original Message ---
From: Thomas Chan <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc:
Date: 01/09/10 19:59
Subject: Re: [OT] o-circumflex

>On Mon, 10 Sep 2001, てんどうりゅうじ wrote:
>
>> If they can't agree on the pronunciation for these cities, can they
>> agree on the Hanzi for them? What ARE the Hanzi for these cities,
>> anyway??
>
>Are you asking for the names of cities in Chinese? Copenhagen is
>ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839. The Han characters used to write
>the names of cities depend on many factors, including but not
>limited to source spelling/pronunciation, language/dialect of the
>rendering party, mapping rules used by the renderer, time period, etc.
>For example, New York is rendered in Chinese as Mandarin niu3yue4
>\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
>Japanese it was at one time rendered as \u7d10\u80b2, lit.
>'button-rearing'. Asking for the "hanzi" (from your wording, I don't
>think you are just talking about Chinese usage of Han characters) is like
>asking for a single Latin script rendering.
>
>(I think you need to get yourself an English<->Chinese dictionary or
>something, btw...)
>
>Thomas Chan
>[EMAIL PROTECTED]
RE: [OT] o-circumflex
Marco,

When you're in Holland you may want to check some dictionaries too. You'll
notice that in dictionaries 'ij' is considered to consist of two letters,
'i' and 'j', so the word 'ijs' sorts between 'iets' and 'ik'.

You're right that the PTT doesn't make the distinction between 'ij' and
'y', so in the phone book 'Meyer' and 'Meijer' are indeed near each other.
I suspected they would at least first list all Meijers, then all Meyers,
but when I just checked they appeared to be intermingled. On closer
inspection it turned out the Meijers and Meyers are further sorted by
street name!

By the way, in crossword puzzles and the like, 'ij' always occupies one
box (but isn't considered the same as 'y', I believe).

Regards,
Otmar Permentier

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> Marco Cimarosti
> Sent: Monday, 10 September 2001 19:59
> To: 'Stefan Persson'; 'John Wilcock'; [EMAIL PROTECTED]
> Subject: RE: [OT] o-circumflex
>
> Stefan Persson wrote:
> > I thought "ij" sorted after "z"?
>
> Not in Dutch: as far as I have seen it sorts the same as "y". In fact, in
> the telephone directory many people who had a "y" in their surname were
> listed near people who had the same surname spelled with "ij" (e.g.
> "Meyer" and "Meijer").
>
> (Anyway, next time they send me to Holland, I'll ask for a downtown hotel.
> So, after dinner, I'll go sightseeing rather than spending the whole
> evening looking at the collation of the phone directory :-)
>
> _ Marco
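The phone-book behaviour described above amounts to a two-level comparison: a surname key in which "ij" and "y" coincide, then the street name as tie-breaker. A minimal sketch follows, with an invented PhoneBookSketch class and made-up street names. The real directory rules are not known here, so the folding rule is an assumption drawn only from the observation in this message.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class PhoneBookSketch {

    // One directory line; both fields are plain strings for simplicity.
    static final class Entry {
        final String surname;
        final String street;

        Entry(String surname, String street) {
            this.surname = surname;
            this.street = street;
        }

        @Override
        public String toString() {
            return surname + ", " + street;
        }
    }

    // Folds the digraph "ij" to "y" so that Meijer and Meyer share one key.
    static String surnameKey(String surname) {
        return surname.toLowerCase(Locale.ROOT).replace("ij", "y");
    }

    // Surname key first, street second: the spelling variants intermingle.
    static final Comparator<Entry> ORDER =
            Comparator.comparing((Entry e) -> surnameKey(e.surname))
                      .thenComparing(e -> e.street.toLowerCase(Locale.ROOT));

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("Meyer", "Dorpsstraat"));   // hypothetical addresses
        entries.add(new Entry("Meijer", "Kerkstraat"));
        entries.add(new Entry("Meijer", "Achterweg"));
        entries.add(new Entry("Meier", "Zandpad"));
        entries.sort(ORDER);
        entries.forEach(System.out::println);
    }
}
```

A dictionary-style order, where 'ij' stays two letters, would simply skip the replace step in surnameKey.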
Re: [OT] o-circumflex
>> It's as weird as some Italian names for German cities: Aquisgrana >> for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di >> Baviera) for München. MK> Interesting that Polish names of these cities are more like Italian MK> than German: Akwizgran, Augsburg, Moguncja, Monachium. Because they're adaptations of the mediaeval Latin names. The same is true of historically important Polish cities, by the way: Varsovie, Cracovie in French, Varsavia, Cracovia in Italian. English uses the German names instead (Warsaw, Cracow). Juliusz
Re: [OT] o-circumflex
On 09/10/2001 07:48:05 AM Michael (michka) Kaplan wrote:

>(can't believe this thread is still going on!)

I just wanted to know about how Francophones perceive certain graphemes,
and I got that answer a long time ago.

Peter
Re: [OT] o-circumflex
* Marco Cimarosti
|
| One of these cases could be the word "dataarkiv", which I found in a
| Danish web page
| (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Uh, no, you found it in a Norwegian web page. The word is the same in
Danish, though.

| Order B:
| 1. data
| 2. dataarkiv
| 3. Datben, Dr. Keld
| 4. Datz, Mr. Marco
| 5. Datåz, Dr. Asmus
|
| Asmus was arguing that List B would be the correct one (and this is
| certainly true on, e.g., a dictionary) but, in order to obtain it, the
| source text must be properly encoded with invisible separators inserted
| where needed.

Not necessarily. One solution I've seen automatically generated sort keys
from the headwords, but allowed users to adjust them where necessary. I
think users are likely to favour this solution if given a choice.

Of course, whether this solution is feasible depends on how important it
is to get the sorting right, and on what importance the headwords have
within the system. In a phone directory I guess nobody would use it.

| And this is precisely what I was trying to say, although I was not
| necessarily talking about multilingual sort ("dataarkiv" seems a purely
| Danish word, although derived from Latin roots).

It's a simple concatenation of the words for 'computing' (data) and
'archive' (arkiv), meaning any electronic archive. This kind of
construction is very common in Norwegian and Danish, leading speakers to
invent all kinds of strange new words when writing English[1], and the
Swedes to joke that we call bananas 'yellowbends'.

--Lars M.

[1] And, conversely, after learning English, to split apart words that
    God meant us to write without spaces in them. It really ann oys to
    see people write in that incon venient way.
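The "generated keys with manual adjustment" approach mentioned above could be sketched as follows. Everything here is hypothetical: the HeadwordKeys class, the rule that a generated key reads "aa" as å, and the idea that an editor's override is just a replacement key string. The system Lars describes is not documented in this thread, so this shows only the general shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HeadwordKeys {
    // Danish/Norwegian alphabet; a letter's index is its sort weight.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5";

    // Editor-supplied keys that replace the generated key for a headword.
    private final Map<String, String> overrides = new HashMap<>();

    void override(String headword, String key) {
        overrides.put(headword, key);
    }

    String keyFor(String headword) {
        String manual = overrides.get(headword);
        return manual != null ? manual : generate(headword);
    }

    // Default key: lowercase, keep only alphabet letters, read "aa" as a-ring.
    static String generate(String headword) {
        StringBuilder sb = new StringBuilder();
        for (char c : headword.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ALPHABET.indexOf(c) >= 0) sb.append(c);
        }
        return sb.toString().replace("aa", "\u00E5");
    }

    Comparator<String> order() {
        return (a, b) -> compareKeys(keyFor(a), keyFor(b));
    }

    private static int compareKeys(String x, String y) {
        int n = Math.min(x.length(), y.length());
        for (int i = 0; i < n; i++) {
            int d = ALPHABET.indexOf(x.charAt(i)) - ALPHABET.indexOf(y.charAt(i));
            if (d != 0) return d;
        }
        return Integer.compare(x.length(), y.length());
    }

    public static void main(String[] args) {
        HeadwordKeys keys = new HeadwordKeys();
        List<String> entries = new ArrayList<>(Arrays.asList(
                "Dat\u00E5z, Dr. Asmus", "dataarkiv", "Datz, Mr. Marco",
                "data", "Datben, Dr. Keld"));
        entries.sort(keys.order());
        System.out.println(entries);               // generated keys only
        keys.override("dataarkiv", "dataarkiv");   // editor says: two separate a's
        entries.sort(keys.order());
        System.out.println(entries);               // with the editor's correction
    }
}
```

The point of the design is that the text itself stays free of invisible separators; the correction lives next to the headword as data.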
Re: [OT] o-circumflex
* Keld Jørn Simonsen
|
| Yes, foreigners call our cities many strange things :-) København is
| called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more.

* Michael Everson
|
| In Iceland it is Kaupmannahöfn, I believe. In unadorned English that
| would be something like Cheapmenshaven, maybe to weaken as
| Cheapenhaven, in German Kaufenhagen

Which makes eminent sense, given that København by this logic would
translate as Cheapenhaven. (Your German translation should be
Kaufmannshagen, I guess, to become Kaufenhagen when translated from
København.)

--Lars M.
Re: [OT] o-circumflex
* Jonathan Rosenne
|
| This is not always the right thing to do. For example, with personal
| names the person involved may decide whether he prefers the old (AA)
| spelling or the new Å. In any case they are equivalent.

This is true, but this is nothing particular to the aa/å distinction.
Many given names have a number of possible spellings, such as
Astri / Astrid, Cathrine / Katrine / Kathrine, Wenche / Venke / Venche,
Espen / Esben, ...

In fact, given names which can be written with both aa and å are rare. I
can only think of Åge offhand, and that is only rarely written Aage in
Norway (and the other way round in Denmark).

AA/Å confusion is much more common in surnames, but there, there is no
choice involved.

--Lars M.
Re: [OT] o-circumflex
* Francesco Zappa Nardelli
|
| I was in Aalborg fifteen days ago, and I have seen its name written
| both as Ålborg and as Aalborg. Where does Aalborg appear in a list
| of towns?

At the end. In both Danish and Norwegian 'aa' and 'å' are considered
equivalent.

I am not sure of this, but I think 'å' is a relatively modern
invention, and that it was originally written only as 'aa'.

--Lars M.
Re: [OT] o-circumflex
* Carl W. Brown
|
| You are quite correct; that is why Unicode supports differing
| collation strengths. Sometimes you only care about the actual
| letters without diacritics. But even then letters are locale
| sensitive. For example, the Danish alphabet starts with an A and
| ends with A ring above. A Dane would look for Alborg near the
| end of a list of towns.

This example doesn't apply to this discussion, since Danes and
Norwegians consider Å to be a separate letter. That is, it is not A with
ring above, but Å, which is not related to A any more than E is related
to F.

What J. M. Sykes writes about the lack of established sort orders seems
right to me. I've done consulting work for Norwegian encyclopedia
publishers, which involved developing their sorting routines. The orders
for the different publishers did differ, and that is not so surprising
given that there are a number of cases to consider, such as how to sort
diacritics, what to consider as diacritics, how to sort numbers, Roman
numerals, ordinals, and whatnot.

--Lars M.
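The notion of collation strengths mentioned in the quoted text can be tried out with java.text.RuleBasedCollator. The sketch below uses a tiny made-up rule table covering only four letters, so it shows the mechanism (base letter, accent, case) and is in no way the Unicode default ordering or any real locale. The class name StrengthSketch is invented.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

class StrengthSketch {
    // Made-up table for the letters used below: '<' starts a new base letter,
    // ',' adds a case variant, ';' adds an accent variant.
    static final String RULES =
            "< e, E < l, L < o, O ; \u00F4 = o\u0302, \u00D4 = O\u0302 < r, R";

    // Compares a and b, looking only at differences up to the given strength.
    static int compareAt(int strength, String a, String b) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(strength);
            return collator.compare(a, b);
        } catch (ParseException e) {
            throw new IllegalStateException("rule string did not parse", e);
        }
    }

    public static void main(String[] args) {
        // PRIMARY: base letters only, so accent and case are ignored.
        System.out.println(compareAt(Collator.PRIMARY, "role", "r\u00F4le"));
        // SECONDARY: accents count, case still does not.
        System.out.println(compareAt(Collator.SECONDARY, "role", "r\u00F4le"));
        // TERTIARY: case counts as well.
        System.out.println(compareAt(Collator.TERTIARY, "role", "ROLE"));
    }
}
```

Whether a character such as Å is an accent variant or a base letter of its own is decided entirely by the table, which is the point Lars makes about Danish and Norwegian.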
Re: [OT] o-circumflex
I hate this sort:

Club Mix 2000
Club Mix 98
Club Mix 99

Those non Y2K compliant fools!

じゅういっちゃん (Juuitchan)

Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town

--- Original Message ---
From: Stefan Persson <[EMAIL PROTECTED]>
To: Mark Davis <[EMAIL PROTECTED]>; "Michael (michka) Kaplan"
<[EMAIL PROTECTED]>; Keld Jørn Simonsen <[EMAIL PROTECTED]>;
[EMAIL PROTECTED]
Cc:
Date: 01/09/10 17:09
Subject: Re: [OT] o-circumflex

>There is a similar problem with Swedish:
>
>Our alphabet goes:
>
>a
>...
>u
>v & w (no difference made)
>x
>y
>z
>å
>ä (the Danish/Norwegian "æ" is also sorted as "ä")
>ö (the Danish/Norwegian "ø" is also sorted as "ö")
>
>The German character "ü" is pronounced as a Swedish "y," so when any
>German name or loan word containing that character occurs in Swedish it
>should be sorted as "y." However, if any "ü" occurs in a Dutch loan word
>it is considered as a "u" with umlaut and is sorted as "u."
>
>The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
>letters "ä" and "ö" they are sorted after "å," if they are the Dutch
>letters "a" with umlaut and "o" with umlaut, they're sorted as "a" and
>"o" in a Swedish encyclopædia.
>
>In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
>Latin/Icelandic letter "æ" is sorted as "ae."
>
>Stefan
Re: [OT] o-circumflex
Where is this done for Swedish? I have read both the TN and the SIS
standard, and I don't believe these say anything about sorting
ü according to either German or Dutch sounds. Rolf Gavare does not
say anything along these lines either, as far as I can remember.

Kind regards
keld

On Mon, Sep 10, 2001 at 07:09:34PM +0200, Stefan Persson wrote:
> There is a similar problem with Swedish:
>
> Our alphabet goes:
>
> a
> ...
> u
> v & w (no difference made)
> x
> y
> z
> å
> ä (the Danish/Norwegian "æ" is also sorted as "ä")
> ö (the Danish/Norwegian "ø" is also sorted as "ö")
>
> The German character "ü" is pronounced as a Swedish "y," so when any
> German name or loan word containing that character occurs in Swedish it
> should be sorted as "y." However, if any "ü" occurs in a Dutch loan word
> it is considered as a "u" with umlaut and is sorted as "u."
>
> The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
> letters "ä" and "ö" they are sorted after "å," if they are the Dutch
> letters "a" with umlaut and "o" with umlaut, they're sorted as "a" and
> "o" in a Swedish encyclopædia.
>
> In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
> Latin/Icelandic letter "æ" is sorted as "ae."
>
> Stefan
Re: [OT] o-circumflex
On Mon, 10 Sep 2001, てんどうりゅうじ wrote:

> If they can't agree on the pronunciation for these cities, can they
> agree on the Hanzi for them? What ARE the Hanzi for these cities,
> anyway??

Are you asking for the names of cities in Chinese? Copenhagen is
ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839. The Han characters used to write
the names of cities depend on many factors, including but not
limited to source spelling/pronunciation, language/dialect of the
rendering party, mapping rules used by the renderer, time period, etc.
For example, New York is rendered in Chinese as Mandarin niu3yue4
\u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in
Japanese it was at one time rendered as \u7d10\u80b2, lit.
'button-rearing'. Asking for the "hanzi" (from your wording, I don't
think you are just talking about Chinese usage of Han characters) is like
asking for a single Latin script rendering.

(I think you need to get yourself an English<->Chinese dictionary or
something, btw...)

Thomas Chan
[EMAIL PROTECTED]
Re: [OT] o-circumflex
There is a similar problem with Swedish:

Our alphabet goes:

a
...
u
v & w (no difference made)
x
y
z
å
ä (the Danish/Norwegian "æ" is also sorted as "ä")
ö (the Danish/Norwegian "ø" is also sorted as "ö")

The German character "ü" is pronounced as a Swedish "y," so when any
German name or loan word containing that character occurs in Swedish it
should be sorted as "y." However, if any "ü" occurs in a Dutch loan word
it is considered as a "u" with umlaut and is sorted as "u."

The same goes for "ä" and "ö": If they are the Swedish/Finnish/German
letters "ä" and "ö" they are sorted after "å," if they are the Dutch
letters "a" with umlaut and "o" with umlaut, they're sorted as "a" and
"o" in a Swedish encyclopædia.

In Swedish the Danish/Norwegian letter "æ" is sorted as "ä," while the
Latin/Icelandic letter "æ" is sorted as "ae."

Stefan

----- Original Message -----
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>; "Keld Jørn Simonsen"
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: 10 September 2001 17:27
Subject: Re: [OT] o-circumflex

> Michael, that isn't the point. There is a problem even when you stick to
> one language.
>
> That is, there are situations where two letters in a language, e.g. "ch"
> in Slovak, are normally sorted as one. However, in some exceptional
> circumstances those letters should be sorted separately. It could be
> because they come originally from another language, or it could be
> because they happen to arise when two other words are conjoined. There
> is no algorithmic distinction. So without some special character, it
> would require a dictionary look-up to produce the right sort.
>
> For example, suppose that "th" were sorted separately in English, after
> Z. Yet people would expect the following order:
>
> cast
> cathouse
> caul
> cathode
>
> because the "t" and "h" are logically separate in "cathouse".
>
> Mark
> -----
> Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα - Όμήρου Μαργίτῃ
> [http://www.macchiato.com]
>
> ----- Original Message -----
> From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
> To: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Monday, September 10, 2001 5:48 AM
> Subject: Re: [OT] o-circumflex
>
> > From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>
> >
> > > Real-life sorts, like MS Windows sorting or Linux sorting, actually
> > > adhere to these Danish rules, once you have set up your machine for
> > > Danish.
> >
> > And this is the *true* answer to the whole mess of attempting
> > *multilingual* sorts -- once the user chooses the sort they WANT, the
> > system might handle other language strings in a way that might be
> > obscure to those who know the other language, but the person who
> > expected Danish or whatever will see what they want.
> >
> > Since various sorts openly conflict with each other there is no other
> > general case solution which would be appropriate, anyway?
> >
> > (can't believe this thread is still going on!)
> >
> > MichKa
> >
> > Michael Kaplan
> > Trigeminal Software, Inc.
> > http://www.trigeminal.com/
RE: [OT] o-circumflex
Stefan Persson wrote:
> I thought "ij" sorted after "z"?

Not in Dutch: as far as I have seen it sorts the same as "y". In fact, in
the telephone directory many people who had a "y" in their surname were
listed near people who had the same surname spelled with "ij" (e.g.
"Meyer" and "Meijer").

(Anyway, next time they send me to Holland, I'll ask for a downtown hotel.
So, after dinner, I'll go sightseeing rather than spending the whole
evening looking at the collation of the phone directory :-)

_ Marco
Re: [OT] o-circumflex
- Original Message - From: "Marco Cimarosti" <[EMAIL PROTECTED]> To: "'John Wilcock'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: 10 September 2001 18:35 Subject: RE: [OT] o-circumflex > John Wilcock wrote: > > I haven't followed this discussion from the beginning, so apologies if > > I'm missing the point, but it seems to me that the Beijing case in > > Dutch is no different from the ekstraarbejde case in Danish - a SHY or > > ZWNJ is all that is needed to stop Beijing sorting with Bey. > > Yes, it is exactly the same thing. > > But my point is that a Dutch reader probably *does* expect Beijing to sort > like Bey, not like Bei. So, in some cases, a "correct" (i.e., expected) > behavior could rather be to *remove* all SHY/ZWNJ's before sorting. I thought "ij" sorted after "z"?
Re: [OT] o-circumflex
If they can't agree on the pronunciation for these cities, can they agree on the Hanzi for them? What ARE the Hanzi for these cities, anyway?? じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town --- Original Message --- From: Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]>; To: [EMAIL PROTECTED]; Cc: Date: 01/09/10 14:02 Subject: Re: [OT] o-circumflex >Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti <[EMAIL PROTECTED]> writes: > >> It's as weird as some Italian names for German cities: Aquisgrana >> for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di >> Baviera) for München. > >Interesting that Polish names of these cities are more like Italian >than German: Akwizgran, Augsburg, Moguncja, Monachium. > >København is Kopenhaga, again more like other foreign forms than >Danish. > >-- > __("< Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ > \__/ > ^^ SYGNATURA ZASTĘPCZA >QRCZAK
RE: [OT] o-circumflex
John Wilcock wrote: > I haven't followed this discussion from the beginning, so apologies if > I'm missing the point, but it seems to me that the Beijing case in > Dutch is no different from the ekstraarbejde case in Danish - a SHY or > ZWNJ is all that is needed to stop Beijing sorting with Bey. Yes, it is exactly the same thing. But my point is that a Dutch reader probably *does* expect Beijing to sort like Bey, not like Bei. So, in some cases, a "correct" (i.e., expected) behavior could rather be to *remove* all SHY/ZWNJ's before sorting. _ Marco
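The two behaviours discussed here (honouring an invisible separator versus removing it before sorting) can be sketched in a few lines. This is only an illustration under stated assumptions: `directory_key` is a hypothetical helper, the rule "ij files as y" is the phone-directory convention described in this thread, and the restaurant and surnames are sample data, not taken from a real directory.

```python
SHY = "\u00ad"   # SOFT HYPHEN
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Translation table that deletes both invisible separators.
DROP_SEPARATORS = {ord(SHY): None, ord(ZWNJ): None}


def directory_key(name, strip_first):
    """Hypothetical Dutch phone-directory key in which "ij" files as "y"."""
    text = name.lower()
    if strip_first:
        # Remove the separators *before* looking for "ij", so that a marked
        # "i<SHY>j" is contracted like any other "ij".
        text = text.translate(DROP_SEPARATORS)
    text = text.replace("ij", "y")
    # Any separators still present carry no weight of their own.
    return text.translate(DROP_SEPARATORS)


names = ["Bex", "Bei" + SHY + "jing Restaurant", "Beier"]

# Separator honoured: "Beijing" files under "bei".
honoured = sorted(names, key=lambda n: directory_key(n, strip_first=False))

# Separator removed first: "Beijing" files under "bey", after "Bex".
stripped = sorted(names, key=lambda n: directory_key(n, strip_first=True))
```

Which of the two results is "right" depends on what the reader of the list expects, which is the point being argued in this message.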
Alternative sorting for digraphs (Was Re: [OT] o-circumflex)
A SHY will mean that the word can break at "Bei- jing". It is not clear to me at least that that is safe in all cases for all languages with digraphs that sort separately, although it may be a solution for some. A ZWNJ will break ligatures and cursive connections. While probably safe in Danish or Dutch, it is unclear to me that that is safe in all languages where this situation occurs. There are digraphs in Urdu, for example. While I don't know their sorting order, if they do sort separately then ZWNJ can't be used to express the alternative sorting, since it would give the wrong rendering. Mark - Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα - Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: "John Wilcock" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Monday, September 10, 2001 8:39 AM Subject: Re: [OT] o-circumflex > On Mon, 10 Sep 2001 16:42:45 +0200, Keld Jørn Simonsen wrote: > > But maybe you are driving for a yet more complex sorting, one that can sort > > according to multiple rules? Beijing should then not be sorted as Beÿing? > > I haven't followed this discussion from the beginning, so apologies if > I'm missing the point, but it seems to me that the Beijing case in > Dutch is no different from the ekstraarbejde case in Danish - a SHY or > ZWNJ is all that is needed to stop Beijing sorting with Bey. > > > John. > > -- > -- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/ > -- Translate your technical documents and web pages- http://www.tradoc.fr/ > >
Re: [OT] o-circumflex
On Mon, 10 Sep 2001 16:42:45 +0200, Keld Jørn Simonsen wrote: > But maybe you are driving for a yet more complex sorting, one that can sort > according to multiple rules? Beijing should then not be sorted as Beÿing? I haven't followed this discussion from the beginning, so apologies if I'm missing the point, but it seems to me that the Beijing case in Dutch is no different from the ekstraarbejde case in Danish - a SHY or ZWNJ is all that is needed to stop Beijing sorting with Bey. John. -- -- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/ -- Translate your technical documents and web pages- http://www.tradoc.fr/
Re: [OT] o-circumflex
Michael, that isn't the point. There is a problem even when you stick to one language. That is, there are situations where two letters in a language, e.g. "ch" in Slovak, are normally sorted as one. However, in some exceptional circumstances those letters should be sorted separately. It could be because they come originally from another language, or it could be because they happen to arise when two other words are conjoined. There is no algorithmic distinction. So without some special character, it would require a dictionary look-up to produce the right sort. For example, suppose that "th" were sorted as a separate letter in English, after Z. Yet people would expect the following order: cast cathouse caul cathode because the "t" and "h" are logically separate in "cathouse". Mark ----- Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα - Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]> To: "Keld Jørn Simonsen" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Monday, September 10, 2001 5:48 AM Subject: Re: [OT] o-circumflex > From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]> > > > Real-life sorts, like MS Windows sorting or Linux sorting, actually > adheres > > to these Danish rules, once you have set up your machine for Danish. > > And this is the *true* answer to the whole mess of attempting *multilingual* > sorts -- once the user chooses the sort they WANT, the system might handle > other language strings in a way that might be obscure to those who know the > other language but the person who expected Danish or whatever will see what > they want. > > Since various sorts openly conflict with each other there is no other > general case solution which would be appropriate, anyway? > > (can't believe this thread is still going on!) > > > MichKa > > Michael Kaplan > Trigeminal Software, Inc. > http://www.trigeminal.com/ > > > >
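The hypothetical "th after Z" example can be tried out with a small sort key. This is a sketch, not how any real collation library works: `th_key` is an invented helper, it handles only the 26 basic letters, and it assumes the text marks accidental "t" + "h" pairs with a soft hyphen or ZWNJ as discussed in this thread.

```python
SHY = "\u00ad"   # SOFT HYPHEN
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Rank of each basic letter; the invented letter "th" ranks after "z".
RANK = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
TH = len(RANK)


def th_key(word):
    """Sort key for a made-up English in which "th" is one letter after "z"."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text.startswith("th", i):
            # Contraction: the pair is weighted as a single letter.
            key.append(TH)
            i += 2
        else:
            # Characters without a rank (including SHY and ZWNJ) are skipped,
            # but a separator between "t" and "h" has already prevented the
            # contraction above from matching.
            if text[i] in RANK:
                key.append(RANK[text[i]])
            i += 1
    return key


marked = sorted(["cathode", "caul", "cat" + SHY + "house", "cast"], key=th_key)
unmarked = sorted(["cathode", "caul", "cathouse", "cast"], key=th_key)
```

With the separator in place, "cathouse" lands between "cast" and "caul"; without it, the key cannot tell the compound from a real "th" and files it after "cathode".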
Re: [OT] o-circumflex
From: "Mark Davis" <[EMAIL PROTECTED]> > Michael, that isn't the point. There is a problem even when you stick to one > language. > > That is, there are situations where two letters in a language, e.g. "ch" in > Slovak, are normally sorted as one. However, in some exceptional > circumstances those letters should be sorted separated. It could be because > they come originally from another language, or it could be because they > happen to arise when two other words are conjoined. There is no algorithmic > distinction. So without some special character, it would require a > dictionary look-up to produce the right sort I would argue that most users of the language are not expecting this type of thing, and that when they are looking for a word that this might be the SECOND place they look, not the first. There are exceptions, but they are not outnumbered by the general case, by any means. > For example, suppose that "th" were sorted separately in English, after Z. > Yet people would expect the following order: > > cast > cathouse > caul > cathode > > because the "t" and "h" are logically separate in "cathouse". Again, I think most people would look first in the place that does not assume the exception -- the computer's original limitations have trained them. The notion of a natural language processing engine that would have all of the specific differences (with appropriate dictionaries for exceptions to even the NLP results) is a fascinating notion, but one that no one is even close to, yet. We do not even have available UCA tailorings for most of the world's languages. Though I have high hopes for the future (if not in the UCA then in other mechanisms). By that time, many languages may have TWO collations, since users have been expecting something else for the last few decades? MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: [OT] o-circumflex
On Mon, Sep 10, 2001 at 03:58:05PM +0200, Marco Cimarosti wrote: > > On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote: > > > Asmus Freytag wrote: > > > > But if you do this, all compound words starting with "data" > > > > and continuing > > > > with another word starting with "a" will be sorted incorrectly! > > > > > > > > To achieve this effect, you would have to mark which AAs are > > > > A-Rings and which ones are accidental adjacencies. In Danish > > > > one can use the SHY (soft hyphen) [...] > > > > > > Real-life sort orders often ignore these subtleties and are > > often based on a > > > small set of rules which is applied blindly, regardless of > > the origin, > > > meaning, or pronunciation of headwords. > > > > > > > Real-life sorts, like MS Windows sorting or Linux sorting, > > actually adheres > > to these Danish rules, once you have set up your machine for Danish. > > If I understand what you mean, perhaps my point was not clear. My point was that real-life sorts nowadays are quite sophisticated, and the major systems have adequate sorting for Danish and other languages with that kind of complexity. > I know that "aa" sorts like "å", and that it should go after "z". But there > are also cases when the sequence "aa" is just two a's, adjacent to each > other by pure chance. > > One of these cases could be the word "dataarkiv", which I found in a Danish > web page > (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html). Yes, and ekstraarbejde - extra work. I know. > Now: if your Windows or Linux collations states (correctly!) that "aa" > should go after "z", you may have a list ordered like this: > > Order A: > 1. data > 2. Datben, Dr. Keld > 3. Datz, Mr. Marco > 4. dataarkiv > 5. Datåz, Dr. Asmus > > But if "dataarkiv" was written using an invisible separator between the two > a's (e.g. a soft hyphen, or a zero width non joiner), the your list would be > like this: > > Order B: > 1. data > 2. dataarkiv > 3. Datben, Dr. 
Keld > 4. Datz, Mr. Marco > 5. Datåz, Dr. Asmus > > Asmus was arguing that List B would be the correct one (and this is > certainly true on, e.g., a dictionary) but, in order to obtain it, the > source text must be properly encoded with invisible separators inserted > where needed. Yes, that is also my advice. > What I was saying is that the "automatic" Order A is also often used, and I > brought the example of the Dutch phone directories (where "Beijing" is > sorted as if it was "Beying"), and of the Italian encyclopedia (where > "Jefferson" is sorted as if it was "Iefferson"). You have to sort it according to the expectations of the user. A Dutch book would use Dutch rules, an Italian book would use the Italian order. You cannot mix orderings, such that some words follow one set of rules, and other words follow other rules. It all needs to be comprehended by one human, the reader, and there only one ruleset applies. > > Michael (michka) Kaplan wrote: > > And this is the *true* answer to the whole mess of attempting > > *multilingual* sorts -- once the user chooses the sort they > > WANT, the system might handle other language strings in a > > way that might be obscure to those who know the other > > language but the person who expected Danish or whatever > > will see what they want. > > And this is precisely what I was trying to say, although I was not > necessarily talking about multilingual sort ("dataarkiv" seems a purely > Danish word, although derived from Latin roots). > > For some users and some usages, the "incorrect" Order A may be much more > useful than the "correct" Order B. If the rule says that "ij" goes between > "x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-" > and "bez-". > > If someone wants Order B (as may be the case for the author of a > dictionary), then they should apply Asmus' suggestion in order to drive the > collation algorithm. I think we agree, but what you call "simple set of rules" I call "quite complex".
I also think that the Danish rules are quite simple, as they can be formulated in, say, 4 lines of Danish prose. But compared to ASCII sorting they are, to some people, unbelievably complex, and I think many Danes believe that you cannot get programs that adhere to them, although the major systems do so out of the box. Your incorrect and correct examples use the very same sorting algorithm; the only difference is that the data is coded differently. But maybe you are driving for a yet more complex sorting, one that can sort according to multiple rules? Beijing should then not be sorted as Beÿing? As stated above I think - and other sorting experts too - that sorting with multiple rules is a conceptual misunderstanding. Kind regards Keld
Re: [OT] o-circumflex
Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti <[EMAIL PROTECTED]> writes: > It's as weird as some Italian names for German cities: Aquisgrana > for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di > Baviera) for München. Interesting that Polish names of these cities are more like Italian than German: Akwizgran, Augsburg, Moguncja, Monachium. København is Kopenhaga, again more like other foreign forms than Danish. -- __("< Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/ \__/ ^^ SYGNATURA ZASTĘPCZA QRCZAK
RE: [OT] o-circumflex
> On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote: > > Asmus Freytag wrote: > > > But if you do this, all compound words starting with "data" > > > and continuing > > > with another word starting with "a" will be sorted incorrectly! > > > > > > To achieve this effect, you would have to mark which AAs are > > > A-Rings and which ones are accidental adjacencies. In Danish > > > one can use the SHY (soft hyphen) [...] > > > > Real-life sort orders often ignore these subtleties and are > often based on a > > small set of rules which is applied blindly, regardless of > the origin, > > meaning, or pronunciation of headwords. > > > > Real-life sorts, like MS Windows sorting or Linux sorting, > actually adheres > to these Danish rules, once you have set up your machine for Danish. If I understand what you mean, perhaps my point was not clear. I know that "aa" sorts like "å", and that it should go after "z". But there are also cases when the sequence "aa" is just two a's, adjacent to each other by pure chance. One of these cases could be the word "dataarkiv", which I found in a Danish web page (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html). Now: if your Windows or Linux collation states (correctly!) that "aa" should go after "z", you may have a list ordered like this: Order A: 1. data 2. Datben, Dr. Keld 3. Datz, Mr. Marco 4. dataarkiv 5. Datåz, Dr. Asmus But if "dataarkiv" was written using an invisible separator between the two a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be like this: Order B: 1. data 2. dataarkiv 3. Datben, Dr. Keld 4. Datz, Mr. Marco 5. Datåz, Dr. Asmus Asmus was arguing that List B would be the correct one (and this is certainly true on, e.g., a dictionary) but, in order to obtain it, the source text must be properly encoded with invisible separators inserted where needed.
What I was saying is that the "automatic" Order A is also often used, and I brought the example of the Dutch phone directories (where "Beijing" is sorted as if it was "Beying"), and of the Italian encyclopedia (where "Jefferson" is sorted as if it was "Iefferson"). Michael (michka) Kaplan wrote: > And this is the *true* answer to the whole mess of attempting > *multilingual* sorts -- once the user chooses the sort they > WANT, the system might handle other language strings in a > way that might be obscure to those who know the other > language but the person who expected Danish or whatever > will see what they want. And this is precisely what I was trying to say, although I was not necessarily talking about multilingual sort ("dataarkiv" seems a purely Danish word, although derived from Latin roots). For some users and some usages, the "incorrect" Order A may be much more useful than the "correct" Order B. If the rule says that "ij" goes between "x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-" and "bez-". If someone wants Order B (as may be the case for the author of a dictionary), then they should apply Asmus' suggestion in order to drive the collation algorithm. _ Marco
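Order A and Order B can be reproduced with one and the same key function; only the encoding of "dataarkiv" differs. The following is a rough sketch under assumptions: `danish_key` is a hypothetical helper, the alphabet is a to z followed by æ, ø, å, every "aa" is treated as "å", and punctuation, spaces and case are simply ignored. Real Danish collation has more levels than this.

```python
SHY = "\u00ad"  # SOFT HYPHEN, used here to mark an accidental "a" + "a"

# Danish alphabet order: the three extra letters come after "z".
RANK = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyzæøå")}


def danish_key(text):
    """Simplified Danish sort key in which "aa" is weighted like "å"."""
    s = text.lower()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("aa", i):
            key.append(RANK["å"])
            i += 2
        else:
            # Unranked characters (spaces, commas, full stops, SHY) are skipped.
            if s[i] in RANK:
                key.append(RANK[s[i]])
            i += 1
    return key


entries = ["Datåz, Dr. Asmus", "Datz, Mr. Marco", "data", "Datben, Dr. Keld"]

# "dataarkiv" typed plainly: the "aa" is taken for "å" (Order A).
order_a = sorted(entries + ["dataarkiv"], key=danish_key)

# "dataarkiv" typed with a soft hyphen between the two a's (Order B).
order_b = sorted(entries + ["data" + SHY + "arkiv"], key=danish_key)
```

The sorting rule is identical in both runs, which matches the observation elsewhere in the thread that the difference lies in how the data is coded rather than in the algorithm.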
Re: [OT] o-circumflex
From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]> > Real-life sorts, like MS Windows sorting or Linux sorting, actually adheres > to these Danish rules, once you have set up your machine for Danish. And this is the *true* answer to the whole mess of attempting *multilingual* sorts -- once the user chooses the sort they WANT, the system might handle other language strings in a way that might be obscure to those who know the other language but the person who expected Danish or whatever will see what they want. Since various sorts openly conflict with each other there is no other general case solution which would be appropriate, anyway? (can't believe this thread is still going on!) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: [OT] o-circumflex
On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote: > Asmus Freytag wrote: > > But if you do this, all compound words starting with "data" > > and continuing > > with another word starting with "a" will be sorted incorrectly! > > > > To achieve this effect, you would have to mark which AAs are > > A-Rings and which ones are accidental adjacencies. In Danish > > one can use the SHY (soft hyphen) [...] > > Real-life sort orders often ignore these subtleties and are often based on a > small set of rules which is applied blindly, regardless of the origin, > meaning, or pronunciation of headwords. > Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere to these Danish rules, once you have set up your machine for Danish. Kind regards Keld
Re: [OT] o-circumflex
At 18:10 -0400 2001-09-09, John Cowan wrote: >Keld Jørn Simonsen scripsit: > >> Yes, foreigners call our cities many strange things:-) >> København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, > > and many more. In Iceland it is Kaupmannahöfn, I believe. In unadorned English that would be something like Cheapmenshaven, maybe to weaken as Cheapenhaven, in German Kaufenhagen -- Michael Everson
Re: [OT] o-circumflex
At 18:04 +0200 2001-09-09, Stefan Persson wrote: > > well, the official spelling of the town is Aalborg. > >In Sweden it has always been written "Ålborg." At one stage, in both countries, it was written Álaborg, I suspect, as it is in Iceland today. -- Michael Everson
RE: [OT] o-circumflex
Asmus Freytag wrote: > But if you do this, all compound words starting with "data" > and continuing > with another word starting with "a" will be sorted incorrectly! > > To achieve this effect, you would have to mark which AAs are > A-Rings and which ones are accidental adjacencies. In Danish > one can use the SHY (soft hyphen) [...] Real-life sort orders often ignore these subtleties and are often based on a small set of rules which is applied blindly, regardless of the origin, meaning, or pronunciation of headwords. For instance, I have noticed that Dutch telephone directories always sort the sequence "ij" as if it was "y", regardless of whether it actually occurs in a Dutch word. E.g., Beijing Chinese Restaurant would be listed after Mr. Bex. Similarly, old Italian encyclopedias (e.g. Dizionario Enciclopedico Treccani) equated "j" to "i" because, in Italian, the former is just a graphic variant of the latter. But this also applied to foreign names such as "Jefferson" (which was listed between "iee-" and "ieg-"), even though, of course, spelling it "Iefferson" would not be allowed. _ Marco
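A "small set of rules applied blindly" amounts to a handful of string substitutions performed before an ordinary comparison. The sketch below assumes exactly the two conventions mentioned in this message; `BLIND_RULES` and `blind_key` are invented names, and "Bezem" is a made-up surname used only as sample data.

```python
# Substitutions applied to every headword, whatever its origin.
BLIND_RULES = {
    "dutch_directory": [("ij", "y")],      # "ij" files as "y"
    "italian_encyclopedia": [("j", "i")],  # "j" files as "i"
}


def blind_key(headword, convention):
    """Return the filing form of a headword under one blind convention."""
    text = headword.lower()
    for old, new in BLIND_RULES[convention]:
        text = text.replace(old, new)
    return text


dutch = sorted(
    ["Bezem", "Beijing Chinese Restaurant", "Bex"],
    key=lambda name: blind_key(name, "dutch_directory"),
)
jefferson = blind_key("Jefferson", "italian_encyclopedia")
```

Because the rules never consult a dictionary, "Beijing" files as "beying" and "Jefferson" as "iefferson", just as in the printed works described above.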
RE: [OT] o-circumflex
Carl W. Brown wrote: > In Arabic do you include vowels or not? Yes, and also consonants sometimes... Traditional Arabic dictionary sorting uses the three-letter root ("radical") of a word as the primary key. So, "madrasa" (school) would be under "d" (because its radical is "d-r-s" = to learn), ignoring the "ma-" prefix. I doubt, however, that this system is used with automatic sort orders generated by computers. _ Marco
RE: [OT] o-circumflex
John Cowan wrote: > None of which is as weird as Leghorn for Livorno (Italy). It's as weird as some Italian names for German cities: Aquisgrana for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for München. _ Marco
Re: [OT] o-circumflex
What would these cities be called in Hanzi? じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town --- Original Message --- From: Keld Jørn Simonsen <[EMAIL PROTECTED]>; To: Stefan Persson <[EMAIL PROTECTED]>; Cc: Keld Jørn Simonsen <[EMAIL PROTECTED]>;"Carl W. Brown" <[EMAIL PROTECTED]>;[EMAIL PROTECTED]; Date: 01/09/09 19:31 Subject: Re: [OT] o-circumflex >On Sun, Sep 09, 2001 at 06:04:30PM +0200, Stefan Persson wrote: >> - Original Message - >> From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]> >> To: "Carl W. Brown" <[EMAIL PROTECTED]> >> Cc: <[EMAIL PROTECTED]> >> Sent: 9 September 2001 14:21 >> Subject: Re: [OT] o-circumflex >> >> >> > On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote: >> > > Asmus, >> > > >> > > If you are entering Danish city names then enter it as Ålborg. You >> should >> > > only use Aalborg where the font does not support Å. For matching logic >> you >> > > can equate Å to Aa then the issue of compound words goes away. >> > >> > well, the official spelling of the town is Aalborg. >> >> In Sweden it has always been written "Ålborg." > >Yes, foreigners call our cities many strange things:-) >København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, >and many more. Helsingør is called Elsinore. >Well, Ålborg is sometimes spelled Ålborg, but the official spelling, as >defined by zip and postal addresses is 9100 Aalborg, and the kommune is called >Aalborg kommune, viz www.aalborg.dk . > >Århus is however almost always spelled Århus in Danish. > >Kind regards >Keld > >
Re: [OT] o-circumflex
Keld Jørn Simonsen scripsit: > Yes, foreigners call our cities many strange things:-) > København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, > and many more. Helsingør is called Elsinore. None of which is as weird as Leghorn for Livorno (Italy). -- John Cowan http://www.ccil.org/~cowan [EMAIL PROTECTED] Please leave your values| Check your assumptions. In fact, at the front desk. | check your assumptions at the door. --sign in Paris hotel |--Miles Vorkosigan
Re: [OT] o-circumflex
On Sun, Sep 09, 2001 at 06:04:30PM +0200, Stefan Persson wrote: > - Original Message - > From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]> > To: "Carl W. Brown" <[EMAIL PROTECTED]> > Cc: <[EMAIL PROTECTED]> > Sent: den 9 september 2001 14:21 > Subject: Re: [OT] o-circumflex > > > > On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote: > > > Asmus, > > > > > > If you are entering Danish city names then enter it as Ålborg. You > should > > > only use Aalborg where the font does not support Å. For matching logic > you > > > can equate Å to Aa then the issue of compound words goes away. > > > > well, the official spelling of the town is Aalborg. > > In Sweden it has always been written "Ålborg." Yes, foreigners call our cities many strange things:-) København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more. Helsingør is called Elsinore. Well, Ålborg is sometimes spelled Ålborg, but the official spelling, as defined by zip and postal addresses is 9100 Aalborg, and the kommune is called Aalborg kommune, viz www.aalborg.dk . Århus is however almost always spelled Århus in Danish. Kind regards Keld
Re: [OT] o-circumflex/Spanish sorting
I received a private email stating that "ch" and "ll" were abolished by the 10th Congress of the 12 academies of the various Spanish speaking countries in 1994, not just the RAE. (There are, in addition to the obvious, also academies for Puerto Rico, North America and the Philippines.) However, it was also my understanding that the modern sort wasn't accepted outside of Spain, but it's never been clear to me if this is just a matter of popular or academic opinion, or if there has been formal resistance as well. Now I wonder if the various academies have the same authority in their country that the Royal Academy has in Spain, or if there are other national standards bodies with which they compete or cooperate. - David Gallardo - Original Message - From: "Tex Texin" <[EMAIL PROTECTED]> To: "David Gallardo" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Sunday, September 09, 2001 2:15 AM Subject: Re: [OT] o-circumflex/Spanish sorting > David, > I also don't know if the other countries have academies, but my > understanding is Latin American countries haven't accepted the modern > sort. Having said that, there is a lot of software that does not > implement the traditional sort, so "acceptance" is moot. > (The reason the Real Academia Española did away with the sorting of ch > and ll is that a majority of software wasn't implementing sorts that > way.) > > tex > > David Gallardo wrote: > > > > Hi - > > > > I know the Real Academia Española decided to do away with "ch" and "ll" in > > 1994, but do you know if the other Spanish speaking countries' corresponding > > academies have done the same? > > > > - David Gallardo > > -- > - > Tex Texin Director, International Business > mailto:[EMAIL PROTECTED] Tel: +1-781-280-4271 > the Progress Company Fax: +1-781-280-4655 > - >
Re: [OT] o-circumflex
- Original Message - From: "Keld Jørn Simonsen" <[EMAIL PROTECTED]> To: "Carl W. Brown" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: 9 September 2001 14:21 Subject: Re: [OT] o-circumflex > On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote: > > Asmus, > > > > If you are entering Danish city names then enter it as Ålborg. You should > > only use Aalborg where the font does not support Å. For matching logic you > > can equate Å to Aa then the issue of compound words goes away. > > well, the official spelling of the town is Aalborg. In Sweden it has always been written "Ålborg."
Re: [OT] o-circumflex
On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote: > Asmus, > > If you are entering Danish city names then enter it as Ålborg. You should > only use Aalborg where the font does not support Å. For matching logic you > can equate Å to Aa then the issue of compound words goes away. well, the official spelling of the town is Aalborg. Keld
Re: [OT] o-circumflex/Spanish sorting
David, I also don't know if the other countries have academies, but my understanding is Latin American countries haven't accepted the modern sort. Having said that, there is a lot of software that does not implement the traditional sort, so "acceptance" is moot. (The reason the Real Academia Española did away with the sorting of ch and ll is that a majority of software wasn't implementing sorts that way.) tex David Gallardo wrote: > > Hi - > > I know the Real Academia Española decided to do away with "ch" and "ll" in > 1994, but do you know if the other Spanish speaking countries' corresponding > academies have done the same? > > - David Gallardo -- - Tex Texin Director, International Business mailto:[EMAIL PROTECTED] Tel: +1-781-280-4271 the Progress Company Fax: +1-781-280-4655 -
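For readers unfamiliar with the difference, the traditional Spanish sort treats "ch" and "ll" as single letters filed after "c" and "l", while the modern sort treats them as ordinary letter pairs. The sketch below is an illustration only: `make_key` and the word list are invented for this example, and accented vowels are not handled (any character missing from the alphabet is simply skipped).

```python
# Modern alphabet: "ñ" follows "n"; no digraph letters.
MODERN = list("abcdefghijklmnñopqrstuvwxyz")

# Traditional alphabet: "ch" follows "c" and "ll" follows "l".
TRADITIONAL = []
for letter in MODERN:
    TRADITIONAL.append(letter)
    if letter == "c":
        TRADITIONAL.append("ch")
    elif letter == "l":
        TRADITIONAL.append("ll")


def make_key(letters):
    """Build a sort-key function for the given alphabet (digraphs allowed)."""
    rank = {letter: i for i, letter in enumerate(letters)}

    def key(word):
        text = word.lower()
        out = []
        i = 0
        while i < len(text):
            pair = text[i:i + 2]
            if len(pair) == 2 and pair in rank:
                # A digraph that counts as one letter in this alphabet.
                out.append(rank[pair])
                i += 2
            else:
                if text[i] in rank:
                    out.append(rank[text[i]])
                i += 1
        return out

    return key


words = ["llama", "cuna", "luz", "chico", "dama"]
traditional = sorted(words, key=make_key(TRADITIONAL))
modern = sorted(words, key=make_key(MODERN))
```

Software that compares letter by letter gets the modern order for free, which is consistent with the reason given above for dropping the traditional rule.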
RE: [OT] o-circumflex
This is not always the right thing to do. For example, with personal names the person involved may decide whether he prefers the old (AA) spelling or the new Å. In any case they are equivalent. Jony > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED]]On Behalf Of Carl W. Brown > Sent: Sunday, September 09, 2001 4:39 AM > To: [EMAIL PROTECTED] > Subject: RE: [OT] o-circumflex > > > Asmus, > > This discussion reminds me of my ill fated efforts to produce a manageable > set of rules to do automatic title casing starting with French text. It > would have required either special dictionaries or entering the text in a > special way. If special text was used, one could enter it in the proper > title case to begin with. > > If you are entering Danish city names then enter it as Ålborg. You should > only use Aalborg where the font does not support Å. For matching logic you > can equate Å to Aa then the issue of compound words goes away. > > Carl > > > -Original Message- > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On > > Behalf Of Asmus Freytag > > Sent: Saturday, September 08, 2001 5:56 PM > > To: Mark Davis; [EMAIL PROTECTED]; Francesco Zappa Nardelli > > Subject: Re: [OT] o-circumflex > > > > > > At 02:45 PM 9/8/01 -0700, Mark Davis wrote: > > >If you use a Danish tailoring of the UCA that equates Å and AA > > (at least at > > >a primary and secondary level), then they will sort the same > > way. A string > > >search that uses the same tailoring will also find "Ålborg" when given > > >"Aalborg" (and vice versa). > > > > But if you do this, all compound words starting with "data" and > > continuing > > with another word starting with "a" will be sorted incorrectly! > > > > To achieve this effect, you would have to mark which AAs are A-Rings and > > which ones are accidental adjacencies. In Danish one can use the > > SHY (soft > > hyphen) to break the latter, as these accidental pairs occur at > > legal word > > break points.
In fact, that's the recommended solution, but it requires > > that the input data are in a specific form. > > > > A./ > > > > >
RE: [OT] o-circumflex
Asmus, This discussion reminds me of my ill fated efforts to produce a manageable set of rules to do automatic title casing starting with French text. It would have required either special dictionaries or entering the text in a special way. If special text was used, one could enter it in the proper title case to begin with. If you are entering Danish city names then enter it as Ålborg. You should only use Aalborg where the font does not support Å. For matching logic you can equate Å to Aa then the issue of compound words goes away. Carl > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On > Behalf Of Asmus Freytag > Sent: Saturday, September 08, 2001 5:56 PM > To: Mark Davis; [EMAIL PROTECTED]; Francesco Zappa Nardelli > Subject: Re: [OT] o-circumflex > > > At 02:45 PM 9/8/01 -0700, Mark Davis wrote: > >If you use a Danish tailoring of the UCA that equates Å and AA > (at least at > >a primary and secondary level), then they will sort the same > way. A string > >search that uses the same tailoring will also find "Ålborg" when given > >"Aalborg" (and vice versa). > > But if you do this, all compound words starting with "data" and > continuing > with another word starting with "a" will be sorted incorrectly! > > To achieve this effect, you would have to mark which AAs are A-Rings and > which ones are accidental adjacencies. In Danish one can use the > SHY (soft > hyphen) to break the latter, as these accidental pairs occur at > legal word > break points. In fact, that's the recommended solution, but it requires > that the input data are in a specific form. > > A./ >
Re: [OT] o-circumflex
At 02:45 PM 9/8/01 -0700, Mark Davis wrote: >If you use a Danish tailoring of the UCA that equates Å and AA (at least at >a primary and secondary level), then they will sort the same way. A string >search that uses the same tailoring will also find "Ålborg" when given >"Aalborg" (and vice versa). But if you do this, all compound words starting with "data" and continuing with another word starting with "a" will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) to break the latter, as these accidental pairs occur at legal word break points. In fact, that's the recommended solution, but it requires that the input data are in a specific form. A./
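The trade-off described in this message (equating AA with Å, and using SHY to protect accidental "aa" pairs in compounds) can be sketched as follows. This is only an illustration under stated assumptions: `danish_key` is a hypothetical helper, not the UCA or any ICU API, the alphabet order (a-z, then æ, ø, å) is assumed, and only a single comparison level is modelled.

```python
import unicodedata

SHY = "\u00ad"  # SOFT HYPHEN, used here to mark a legal break inside a compound
# Assumed Danish alphabet order: a-z, then æ, ø, å.
ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}
A_RING = RANK["\u00e5"]

def danish_key(word):
    """Hypothetical single-level sort key: 'aa' counts as 'å' unless split by SHY."""
    text = unicodedata.normalize("NFC", word).lower()
    key = []
    i = 0
    while i < len(text):
        if text.startswith("aa", i):
            key.append(A_RING)      # adjacent "aa" is treated as a-ring
            i += 2
        elif text[i] == SHY:
            i += 1                  # SHY itself is ignored, but it has split the pair
        else:
            # Characters outside the assumed alphabet sort after å (a simplification).
            key.append(RANK.get(text[i], len(ALPHABET)))
            i += 1
    return key

towns = ["Aalborg", "Odense", "Aarhus", "Esbjerg"]
print(sorted(towns, key=danish_key))
```

With this key, "Aalborg" and "Ålborg" compare equal, "dataanalyse" without a SHY is wrongly filed under å, and "data" + SHY + "analyse" sorts among the other "data..." words, which is the input-form requirement the message points out.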
Re: [OT] o-circumflex
In a message dated 2001-09-08 12:00:43 Pacific Daylight Time, [EMAIL PROTECTED] writes: > I know the Real Academia Española decided to do away with "ch" and "ll" in > 1994, but do you know if the other Spanish speaking countries' corresponding > academies done the same? I have no idea. I don't know which, if any, even have a language academy. -Doug Ewell Fullerton, California
Re: [OT] o-circumflex
If you use a Danish tailoring of the UCA that equates Å and AA (at least at a primary and secondary level), then they will sort the same way. A string search that uses the same tailoring will also find "Ålborg" when given "Aalborg" (and vice versa). Mark BTW, internationalized string search is one of the features of ICU 2.0 (see http://www-124.ibm.com/icu/develop/tasks.html). There are a number of exceptional cases that have to be handled, due to issues with ignorable characters, Thai & Lao boundaries, canonical equivalence and contractions (see http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/searchproposal.html). — Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: "Francesco Zappa Nardelli" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Saturday, September 08, 2001 10:51 AM Subject: Re: [OT] o-circumflex > Hello. > > >> For example the Danish alphabet starts with an A and ends it with A > >> ring above. A Dane would look for Alborg near the end of a list of > >> towns. > > I was in Aalborg fifteen days ago, and I have seen its name written > both as Ålborg and as Aalborg. Where does Aalborg appear in a list of > towns? > > -francesco > >
Re: [OT] o-circumflex
Hi - I know the Real Academia Española decided to do away with "ch" and "ll" in 1994, but do you know if the other Spanish speaking countries' corresponding academies done the same? - David Gallardo - Original Message - From: <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Saturday, September 08, 2001 1:51 AM Subject: Re: [OT] o-circumflex > In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, > [EMAIL PROTECTED] writes: > > > You are quite correct that is why Unicode support differing collation > > strengths. Some times you only care about the actual letters without > > diacritics. But even then letters are locale sensitive. For example the > > Danish alphabet starts with an A and ends it with A ring above. A Dane > > would look for Alborg near the end of a list of towns. It is like having > > the Spanish ch follow cz. > > That would be Ålborg, right? > > I hasten to add that Carl's Spanish example is for the so-called "traditional > sort," in contrast to the "modern sort" in which "ch" sorts simply as "c" > followed by "h". In many Spanish-speaking communities, particularly here in > Alta California, the simplified "modern" sort is by far the more common of > the two. > > -Doug Ewell > Fullerton, California >
Re: [OT] o-circumflex
At 09:04 PM 9/7/01 -0700, Mark Davis wrote: >I disagree. What you want is a merged database field. See >http://www.macchiato.com/slides/icu_collation.ppt > >Mark Mark, David took the remainder of our discussion off the alias. I won't repeat it here, just to note that we've agreed that merged database fields are the answer to (some) of the scenarios that we've discussed, but that there are cases (like indexing a mixed corpus where both naive and naïve occur) where it might indeed make sense to ignore accent differences altogether - although, as is often the case, dictionary-based pre- or post processing or manual adjustments might give better results yet. Thanks for your pointer to the presentation. A./
Re: [OT] o-circumflex
Hello. >> For example the Danish alphabet starts with an A and ends it with A >> ring above. A Dane would look for Alborg near the end of a list of >> towns. I was in Aalborg fifteen days ago, and I have seen its name written both as Ålborg and as Aalborg. Where does Aalborg appear in a list of towns? -francesco
RE: [OT] o-circumflex
Doug, > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On > Behalf Of [EMAIL PROTECTED] > Sent: Friday, September 07, 2001 10:52 PM > To: [EMAIL PROTECTED] > Cc: [EMAIL PROTECTED] > Subject: Re: [OT] o-circumflex > > > In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, > [EMAIL PROTECTED] writes: > > > You are quite correct that is why Unicode support differing collation > > strengths. Some times you only care about the actual letters without > > diacritics. But even then letters are locale sensitive. For > example the > > Danish alphabet starts with an A and ends it with A ring above. A Dane > > would look for Alborg near the end of a list of towns. It is > like having > > the Spanish ch follow cz. > > That would be Ålborg, right? That is right. I am concerned that not everyone can view special characters. I think that having an alphabet that goes from A to Å must be due to the Danish sense of humor. I also did not use the İ in İstanbul. > > I hasten to add that Carl's Spanish example is for the so-called > "traditional > sort," in contrast to the "modern sort" in which "ch" sorts simply as "c" > followed by "h". In many Spanish-speaking communities, > particularly here in > Alta California, the simplified "modern" sort is by far the more > common of > the two. > Again correct; they also use the modern sort here in Muy Alta California, as well as in most of the Spanish-speaking world. There are also the differences between ASCII and EBCDIC sorting. Talk about people who are worlds apart. ;-} Carl W. Brown Lafayette, CA
Re: [OT] o-circumflex
In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, [EMAIL PROTECTED] writes: > You are quite correct that is why Unicode support differing collation > strengths. Some times you only care about the actual letters without > diacritics. But even then letters are locale sensitive. For example the > Danish alphabet starts with an A and ends it with A ring above. A Dane > would look for Alborg near the end of a list of towns. It is like having > the Spanish ch follow cz. That would be Ålborg, right? I hasten to add that Carl's Spanish example is for the so-called "traditional sort," in contrast to the "modern sort" in which "ch" sorts simply as "c" followed by "h". In many Spanish-speaking communities, particularly here in Alta California, the simplified "modern" sort is by far the more common of the two. -Doug Ewell Fullerton, California
Re: [OT] o-circumflex
As a percentage of words in English, it is quite small, but there are still plenty of homographs, such as: BASS BOW(S) BUFFET COAX CLOSE COMPOUND(S) CONVERSE DESERT DIVERS DOES DOVE ENTRANCE(S) EXCISE HARE INTIMATE INVALID LAME LEAD LUGER(S) MANES MARE(S) MINUTE OBJECT(S) PATENT POLISH PRESENT PRIMER(S) PROJECT(S) PUSSY PUTTING RAVEN RE REFUSE RESIGN(S) RESUME(S) ROW(S) SEWER(S) SHOWER(S) SLAVER SOW(S) SYNDICATE(S) TAXIS TEAR(S) TIER(S) TOWER(S) VIOLA(S) WIND(S) WOUND ABSENT ABSTRACT ABUSE(S) ADDRESS(ES) ADVOCATE(S) AGGREGATE APPROPRIATE APPROXIMATE ARTICULATE ASSOCIATE(S) ATTRIBUTE(S) COMBAT COMBINE(S) COMPACT(S) COMPLEX CONDUCT CONFINES CONFLICT(S) CONSORT CONSTRUCT(S) CONTENT CONTEST(S) CONTRACT(S) CONSUMMATE CONVERT(S) CONVICT(S) COORDINATE(S) DECREASE(S) DEFECT(S) DEGENERATE(S) DELEGATE(S) DELIBERATE DISCHARGE DOGGED EJACULATE ELABORATE ESCORT(S) EXCUSE(S) ESTIMATE(S) EXTRACT(S) GRADUATE(S) HOUSE(S) IMPLANT(S) IMPORT(S) INCLINE(S) LAMINATE(S) LEARNED LEGITIMATE LIVE(S) [-]LIVED MEDIATE(S) MOBILE (3) MODERATE(S) MOUTH OFFENSE(S) PERFECT PERMIT(S) PREDICATE(S) PRODUCE PROGRESS PROTEST(S) READ (mis-, proof-) RECALL(S) RECORD(S) REDRESS REJECT(S) RETARD(S) RETREAD(S) ROUTE(S) SEPARATE SUBJECT(S) SUSPECT(S) TORMENT(S) UPSET(S) USE(S) — Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: "Asmus Freytag" <[EMAIL PROTECTED]> To: "Ayers, Mike" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Friday, September 07, 2001 11:52 Subject: RE: [OT] o-circumflex > At 11:50 AM 9/7/01 -0500, Ayers, Mike wrote: > >Words with the > >same spelling and different pronunciation are uncommon but exist in English, > >the classic example being "read" and its own past tense. > > Actually, this is a bit more common than you think, since the pronunciation > of vowels in English depends somewhat systematically on stress, and verb > and noun forms of many words are stressed differently. > > A./ > >
Re: [OT] o-circumflex
I disagree. What you want is a merged database field. See http://www.macchiato.com/slides/icu_collation.ppt Mark — Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: "Asmus Freytag" <[EMAIL PROTECTED]> To: "David Gallardo" <[EMAIL PROTECTED]>; "Ayers, Mike" <[EMAIL PROTECTED]>; "'David Starner'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Friday, September 07, 2001 11:50 Subject: Re: [OT] o-circumflex > At 01:06 PM 9/7/01 -0400, David Gallardo wrote: > >As a practical matter, you need to take the diacritics into account when > >sorting, even in English where they (may or may not) have linguistic > >significance, otherwise you'll get nondeterministic behaviour. In other > >words, résumé and resume should fall together, but always in the same order. > > Stated absolutely, this is patent, but oft-repeated nonsense. For example, > it does not always make sense for list of names. An old friend of mine, Jon > Proppe, who is an Icelandic art critic, spells his name with an accent > grave on the first o and an acute accent on the e. In a campus directory of > the US university he attended (assuming it did not strip the accents), it > would make no sense to have his name show up after all the Proppes, or all > the Jons without an accent (depending on whether its sorted by first or > last name). > > If I sort a list of single words which contains non-unique entries, a > stable sort would sort the non-unique subsets in the order of their > appearance in the input. If its not important to distinguish between naive > and naïve (e.g. in a machine generated index that spans multiple documents > with differences in the use of accents) its hard to see what's gained in > splitting the list in two for this case. > > On the other hand, if San Jose and San José are correctly and consistently > distinguished in my input, they should probably sort separately. 
> > The two cases of resume are different yet again, as noted, since one could > be a verb form. > > It all depends not on whether a distinction can be made, but whether it is > meaningful in the context of the list being sorted. > > A./ > > > > > >
RE: [OT] o-circumflex
Asmus, You are quite correct that is why Unicode support differing collation strengths. Some times you only care about the actual letters without diacritics. But even then letters are locale sensitive. For example the Danish alphabet starts with an A and ends it with A ring above. A Dane would look for Alborg near the end of a list of towns. It is like having the Spanish ch follow cz. By providing for different types of collation one can meet the user's expectations. Then of course you have search, display and sort differences. If I am looking for Istanbul it is probably OK even for Turkish locales to match it to the Turkish spelling which uses a dotted capital I. With languages with multiple diacritics like Vietnamese you have another set of rules and had better have normalized data. In Arabic do you include vowels or not? I remember your discussions of Greek where there are other considerations. Carl > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On > Behalf Of Asmus Freytag > Sent: Friday, September 07, 2001 11:51 AM > To: David Gallardo; Ayers, Mike; 'David Starner'; [EMAIL PROTECTED] > Subject: Re: [OT] o-circumflex > > > At 01:06 PM 9/7/01 -0400, David Gallardo wrote: > >As a practical matter, you need to take the diacritics into account when > >sorting, even in English where they (may or may not) have linguistic > >significance, otherwise you'll get nondeterministic behaviour. In other > >words, résumé and resume should fall together, but always in > the same order. > > Stated absolutely, this is patent, but oft-repeated nonsense. For > example, > it does not always make sense for list of names. An old friend of > mine, Jon > Proppe, who is an Icelandic art critic, spells his name with an accent > grave on the first o and an acute accent on the e. 
In a campus > directory of > the US university he attended (assuming it did not strip the accents), it > would make no sense to have his name show up after all the > Proppes, or all > the Jons without an accent (depending on whether its sorted by first or > last name). > > If I sort a list of single words which contains non-unique entries, a > stable sort would sort the non-unique subsets in the order of their > appearance in the input. If its not important to distinguish > between naive > and naïve (e.g. in a machine generated index that spans multiple > documents > with differences in the use of accents) its hard to see what's gained in > splitting the list in two for this case. > > On the other hand, if San Jose and San José are correctly and > consistently > distinguished in my input, they should probably sort separately. > > The two cases of resume are different yet again, as noted, since > one could > be a verb form. > > It all depends not on whether a distinction can be made, but > whether it is > meaningful in the context of the list being sorted. > > A./ > > > > >
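The "collation strengths" mentioned at the top of this message can be sketched roughly as below. This is a simplified illustration, not the Unicode Collation Algorithm: `collation_key` and `strip_marks` are hypothetical helpers, and the levels are approximated with Unicode decomposition and case folding rather than real collation weights.

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collation_key(text, strength):
    """Hypothetical key: 1 = letters only, 2 = plus accents, 3 = plus case."""
    primary = strip_marks(text).casefold()
    if strength == 1:
        return (primary,)
    secondary = unicodedata.normalize("NFD", text).casefold()
    if strength == 2:
        return (primary, secondary)
    # At the third level the original string breaks remaining ties
    # (by code point, a simplification of real tertiary weights).
    return (primary, secondary, text)

# At strength 1 these three compare equal; higher strengths separate them.
words = ["résumé", "Resume", "resume"]
for strength in (1, 2, 3):
    print(strength, sorted(words, key=lambda w: collation_key(w, strength)))
```

Which strength to use is exactly the judgment call discussed in the thread: strength 1 suits a merged index of "naive" and "naïve", while a higher strength keeps "San Jose" and "San José" apart when the input distinguishes them reliably.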
RE: [OT] o-circumflex
At 11:50 AM 9/7/01 -0500, Ayers, Mike wrote: >Words with the >same spelling and different pronunciation are uncommon but exist in English, >the classic example being "read" and its own past tense. Actually, this is a bit more common than you think, since the pronunciation of vowels in English depends somewhat systematically on stress, and verb and noun forms of many words are stressed differently. A./
Re: [OT] o-circumflex
At 01:06 PM 9/7/01 -0400, David Gallardo wrote: >As a practical matter, you need to take the diacritics into account when >sorting, even in English where they (may or may not) have linguistic >significance, otherwise you'll get nondeterministic behaviour. In other >words, résumé and resume should fall together, but always in the same order. Stated absolutely, this is patent, but oft-repeated nonsense. For example, it does not always make sense for a list of names. An old friend of mine, Jon Proppe, who is an Icelandic art critic, spells his name with an accent grave on the first o and an acute accent on the e. In a campus directory of the US university he attended (assuming it did not strip the accents), it would make no sense to have his name show up after all the Proppes, or all the Jons without an accent (depending on whether it's sorted by first or last name). If I sort a list of single words which contains non-unique entries, a stable sort would sort the non-unique subsets in the order of their appearance in the input. If it's not important to distinguish between naive and naïve (e.g. in a machine-generated index that spans multiple documents with differences in the use of accents), it's hard to see what's gained in splitting the list in two for this case. On the other hand, if San Jose and San José are correctly and consistently distinguished in my input, they should probably sort separately. The two cases of resume are different yet again, as noted, since one could be a verb form. It all depends not on whether a distinction can be made, but whether it is meaningful in the context of the list being sorted. A./
Re: [OT] o-circumflex
From: "David Gallardo" <[EMAIL PROTECTED]> > As a practical matter, you need to take the diacritics into account when > sorting, even in English where they (may or may not) have linguistic > significance, otherwise you'll get nondeterministic behaviour. In other > words, résumé and resume should fall together, but always in the same order. Well, sort of. The issue remains that if one is choosing for their particular purpose to ignore case (for example) then there is literally no difference between "Aa" and "aA". Since the two are considered equivalent in the "case insensitive" comparison, you cannot claim that a sorting algorithm has erred if it arbitrarily returns one before the other, or returns them in a different order on different runs. For a real-world example, this can happen with algorithms where the bottom item and the anchor are always reordered if b < a, and thus you could see different ordering of items depending on their placement in the list. A similar thing happens with accent-insensitive sorts -- if you literally treat "ee" and "éé" as identical due to using an accent-insensitive sort, then the ordering is NOT deterministic, nor is it supposed to be. And there is nothing invalid about non-deterministic ordering of equivalent items, any more than it would be invalid to put "ee" before "ee" in one case and after it in another. MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
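A small sketch of the point being argued here, assuming Python's built-in `sorted` (which is stable) stands in for "the sorting algorithm"; the function names are made up for illustration. Under a case-insensitive key, "Aa" and "aA" are equal, so their relative order simply follows the input; adding the original string as a tie-breaker gives the fixed order David asks for.

```python
def insensitive_sort(words):
    """Case-insensitive sort: equal keys keep their input order (stable sort)."""
    return sorted(words, key=str.casefold)

def deterministic_sort(words):
    """Same primary order, but ties are broken by the original string."""
    return sorted(words, key=lambda w: (w.casefold(), w))

first = ["aA", "Aa", "b"]
second = ["Aa", "aA", "b"]

# Same items, different input order: the insensitive results differ...
print(insensitive_sort(first), insensitive_sort(second))
# ...while the tie-broken results agree.
print(deterministic_sort(first), deterministic_sort(second))
```

Neither output of the insensitive sort is wrong under its own definition of equality; the tie-breaker is an extra requirement layered on top, and the code-point order it uses ("A" before "a") is just one arbitrary choice.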
RE: [OT] o-circumflex
> From: David Gallardo [mailto:[EMAIL PROTECTED]] > Sent: Friday, September 07, 2001 10:07 AM > As a practical matter, you need to take the diacritics into > account when > sorting, even in English where they (may or may not) have linguistic > significance, otherwise you'll get nondeterministic > behaviour. In other > words, résumé and resume should fall together, but always in > the same order. Why? This may be of interest and benefit to programmers, but not necessarily to end-users. The computer should serve the human, not the other way around, and it is not particularly challenging to come up with search and sort algorithms which understand the concept of terminal sets which need to be iterated over to find the final entity as opposed to terminal entities. Recall Mike Sykes' post concerning sort order: Reverting the question of order, the 'Guide to the New SOED' (a.k.a. Help) reveals that: Entries are accessed in strict alphabetical order. ... ; a headword with an accent or diacritic over a letter follows one consisting of the same sequence of letters without. ... The order of headwords which are spelled the same way but have different parts of speech is as follows: noun (abbreviated n.) pronoun (abbreviated pron.) adjective (abbreviated a.) verb (abbreviated v.) ... This explicit ordering will still be insufficient if we choose to include verb tenses in our word list, whence we get the two "read"s. If someone has a reason why these two words need to be in the same order in everyone's word list, I'll listen... /|/|ike
RE: [OT] o-circumflex
> There is also no word pair separated only by the I/J > distinction (in English), right? iamb - as in iambic pentameter jamb - as in a door jamb
Re: [OT] o-circumflex
As a practical matter, you need to take the diacritics into account when sorting, even in English where they (may or may not) have linguistic significance, otherwise you'll get nondeterministic behaviour. In other words, résumé and resume should fall together, but always in the same order. Someone in another message mentioned "ñ". This is a different case in principle, because in Spanish it's not a case of a letter modified by a diacritic--it's an entirely different letter. (It used to be written as two side-by-side "n"s and then they got stacked.) Again, as a practical matter, in English it's most common to ignore the greater distinction (because we have only 26 letters in our alphabet) and to treat it as a letter + diacritic, for the same considerations as above. - Original Message - From: "Ayers, Mike" <[EMAIL PROTECTED]> To: "'David Starner'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Thursday, September 06, 2001 5:12 PM Subject: RE: [OT] o-circumflex > > > From: David Starner [mailto:[EMAIL PROTECTED]] > > Sent: Thursday, September 06, 2001 01:40 PM > > > On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: > > > The only little thing to know about French and diacritical > > mark is that when > > > doing a sort diacritical mark are evaluated from right to > > left. (e.g. > > > "cote" < "côte" < "coté" vs the English order "cote" < > > "coté" < "côte" ). > > > > I'm not sure there is an established English sort order. It's not a > > problem that comes up much in English. > > I believe that there is an established sort order in English, which > is to sort without regard to diacritics, or else we'd never find the words! > In English (American English more than British English), diacritics are > considered optional, and it is common to see "naïve" written "naive", "San > José" written "San Jose", etc. 
Especially amongst Americans, the two are > considered equivalent, and I know of no word pair in all of English which is > separated only by a diacritic. > > > /|/|ike >
RE: [OT] o-circumflex
> From: J M Sykes [mailto:[EMAIL PROTECTED]] > Sent: Friday, September 07, 2001 07:50 AM > The classic example is 'resume' and 'résumé'. These are, by > now, two quite > distinct words, and the fact that there is no 'established' > order is shown I spell both "resume" and have never been corrected. Words with the same spelling and different pronunciation are uncommon but exist in English, the classic example being "read" and its own past tense. Since there are no diacritics in English proper, the two "resume"s tend to fall into this category. The diacritics which often appear on one of them really only serve to mark it as a loan word, since it is very difficult to come up with a sentence in which the two could be confused. /|/|ike
Re: [OT] o-circumflex
じゅういっちゃん(Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town > >Who'd be a lexicographer? Me? > >Mike. > >*** > >J M Sykes Email: [EMAIL PROTECTED] >97 Oakdale Drive >Heald Green >CHEADLE >Cheshire SK8 3SN >UK Tel: (44) 161 437 5413 > >*** > > > >
Re: [OT] o-circumflex
There is also no word pair separated only by the I/J distinction (in English), right? じゅういっちゃん(Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town I know of no word pair in all of English which >is >> separated only by a diacritic. >
Re: [OT] o-circumflex
> > I believe that there is an established sort order in English, which > is to sort without regard to diacritics, or else we'd never find the words! > In English (American English more than British English), diacritics are > considered optional, and it is common to see "naїve" written "naive", "San > José" written "San Jose", etc. Especially amongst Americans, the two are > considered equivalent, and I know of no word pair in all of English which is > separated only by a diacritic. That depends what you mean by 'established' ;-) The classic example is 'resume' and 'résumé'. These are, by now, two quite distinct words, and the fact that there is no 'established' order is shown by the fact that the New Shorter Oxford English Dictionary (Version: 1.0.4, Data version: 02.10.96s, January 1997, on disk) has them in the order: 'résumé', 'resume' while the New Oxford Dictionary of English (Clarendon Press, 1998) has 'resume', 'resumé'. The Concise Oxford Dictionary (of Current English, Clarendon Press, 1982, edited, as it happens, by a second cousin of mine) also has 'resume', 'résumé'. Evidently, we see here evidence that the diacritic on the first 'e' has become optional since 1982, though not that on the second, presumably because that 'e' might otherwise be supposed to be silent. Reverting the question of order, the 'Guide to the New SOED' (a.k.a. Help) reveals that: Entries are accessed in strict alphabetical order. ... ; a headword with an accent or diacritic over a letter follows one consisting of the same sequence of letters without. ... The order of headwords which are spelled the same way but have different parts of speech is as follows: noun (abbreviated n.) pronoun (abbreviated pron.) adjective (abbreviated a.) verb (abbreviated v.) ... And scrutiny of the two entries of interest reveals that 'résumé' is both a noun and a verb, whereas 'resume' is only a verb. Perhaps the ordering of 'résumé' before 'resume' is a mistake; perhaps not. 
I can't ask my aforesaid second cousin, because he's no longer with us. Who'd be a lexicographer? Mike. *** J M Sykes Email: [EMAIL PROTECTED] 97 Oakdale Drive Heald Green CHEADLE Cheshire SK8 3SN UK Tel: (44) 161 437 5413 ***
RE: [OT] o-circumflex
On Thu, 6 Sep 2001, Ayers, Mike wrote: > > > From: David Starner [mailto:[EMAIL PROTECTED]] > > Sent: Thursday, September 06, 2001 01:40 PM > > > On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: > > > The only little thing to know about French and diacritical > > mark is that when > > > doing a sort diacritical mark are evaluated from right to > > left. (e.g. > > > "cote" < "côte" < "coté" vs the English order "cote" < > > "coté" < "côte" ). > > > > I'm not sure there is an established English sort order. It's not a > > problem that comes up much in English. > > I believe that there is an established sort order in English, which > is to sort without regard to diacritics, or else we'd never find the words! > In English (American English more than British English), diacritics are > considered optional, and it is common to see "naїve" written "naive", "San > José" written "San Jose", etc. Especially amongst Americans, the two are > considered equivalent, and I know of no word pair in all of English which is > separated only by a diacritic. > Friday, September 7, 2001 Librarians have *filing* rules--the American Library Association (ALA) and the Library of Congress (LC) each issued some in, I think, 1980. I believe they both say to ignore diacritics because Americans do not recognize that they have an order. These days filing in vendor software for libraries tends to follow neither one very closely--the phrase "more honored in the breach than the observance" comes to mind. I may be wrong but I do not believe there is an established U.S. standard for sorting/filing. A few years ago a National Information Standards Organization (NISO) committee drafted one but it didn't get the votes needed to become an accepted standard. Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. 
Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Re: [OT] o-circumflex
>I would say it is a variant of "o" we just called it... "o with a circumflex >accent" ("o avec un accent circonflex"). The difference between "o" and "ô" >is normally audible (for a French speaker). The relationship is the same >than with any other letter which sometimes have accents (e.g. "a" and "à", >"e" and "è", etc.). "o" avec un accent circonflexe, with an "e" at the end. According to the "Petit Robert" (a French dictionary), the circumflex marks a long vowel (e.g. île for isle in old French) or avoids confusion between two words (e.g. du and dû). The pronunciation of "ô" is closed (o fermé), as opposed to "o" without an accent. But Thierry is right: it's a letter with an accent, like à and è, not a distinct grapheme. Bertrand >The only little thing to know about French and diacritical mark is that when >doing a sort diacritical mark are evaluated from right to left. (e.g. >"cote" < "côte" < "coté" vs the English order "cote" < "coté" < "côte" ). >Cheers, >Thierry >How do Francophones view the o-circumflex "ô" in relation to the letter "o"? >Is it a distinct grapheme, or is it considered a variant of "o"? >- Peter
RE: [OT] o-circumflex
Sorry about the kana. My mailer is Japanese. じゅういっちゃん(Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town --- Original Message --- From: "Ayers, Mike" <[EMAIL PROTECTED]>; To: 'David Starner' <[EMAIL PROTECTED]>;[EMAIL PROTECTED]; Cc: Date: 01/09/06 21:12 Subject: RE: [OT] o-circumflex > >> From: David Starner [mailto:[EMAIL PROTECTED]] >> Sent: Thursday, September 06, 2001 01:40 PM > >> On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: >> > The only little thing to know about French and diacritical >> mark is that when >> > doing a sort diacritical mark are evaluated from right to >> left. (e.g. >> > "cote" < "côte" < "coté" vs the English order "cote" < >> "coté" < "côte" ). >> >> I'm not sure there is an established English sort order. It's not a >> problem that comes up much in English. > > I believe that there is an established sort order in English, which >is to sort without regard to diacritics, or else we'd never find the words! >In English (American English more than British English), diacritics are >considered optional, and it is common to see "naïve" written "naive", "San >José" written "San Jose", etc. Especially amongst Americans, the two are >considered equivalent, and I know of no word pair in all of English which is >separated only by a diacritic. I believe that the origin of the problem is the typewriter / word-processor. The English typewriter / word-processor is only designed to handle 26 letters (52 if you count case). Diacritics are impossible on a typewriter and very difficult on a word processor. In handwriting, the problem is non-existent. Think of Tendou Kasumi getting the medical scholarship she always wanted, and getting to study abroad. She would likely e-mail her old friends / family in romaji, but snail-mail them in kana / kanji. 
I like the freedom of a pen, so I can write kana and even draw. As for your word pair: 1. To continue after a pause 2. Curriculum vitae If only technology did not change the way we write like it does. And why should not "o with accent" be considered as different from "o" as either is, say, from "u"? If that is the case: "R" is "P with stroke" (hiragana) "Ho" is "ha with stroke" "Ru" is "Ro with loop" (Thai) "five" is "four with loop" and... my favorite... Latin "G" is "C" with stroke, and history WILL back me on that one! > > >/|/|ike > >
Re: [OT] o-circumflex
David Starner wrote: > Yes, but I mean for "cote", "côte", and "coté". How would you > sort those three in English? I'd probably sort it by some > extra-lingual information: i.e. page number, date of birth > or the like. Store them as UTF-8, do a DOS sort, and call the results "the new World order"? Best regards, James Kass.
Re: [OT] o-circumflex
My impression is that, at least in U.S. states that are more heavily populated by native Spanish speakers, the one diacritic frequently viewed by English speakers as non-optional for differentiating two words (specifically proper names) is the tilde, as used for the eñe. There is a college in Redwood City, CA, called Cañada College, which is off Cañada Road. I haven't checked thoroughly, but I believe most road signs there use the eñe. I do know of one highway exit in the area, though, that spells it "Canada College". Alex.
RE: [OT] o-circumflex
> From: David Starner [mailto:[EMAIL PROTECTED]] > Sent: Thursday, September 06, 2001 01:40 PM > On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: > > The only little thing to know about French and diacritical > mark is that when > > doing a sort diacritical mark are evaluated from right to > left. (e.g. > > "cote" < "côte" < "coté" vs the English order "cote" < > "coté" < "côte" ). > > I'm not sure there is an established English sort order. It's not a > problem that comes up much in English. I believe that there is an established sort order in English, which is to sort without regard to diacritics, or else we'd never find the words! In English (American English more than British English), diacritics are considered optional, and it is common to see "naïve" written "naive", "San José" written "San Jose", etc. Especially amongst Americans, the two are considered equivalent, and I know of no word pair in all of English which is separated only by a diacritic. /|/|ike
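One way to try out "sorting without regard to diacritics" is to strip the combining marks before comparing. The following is a minimal sketch, not an established English collation: the helper names `strip_diacritics` and `ignore_diacritics_key` are made up for illustration, and it only handles accents that Unicode NFD decomposition separates from the base letter (letters such as "ø" or "ł" are left untouched).

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after NFD decomposition, e.g. 'José' -> 'Jose'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ignore_diacritics_key(text):
    """Sort key that treats accented and unaccented spellings as equal."""
    return strip_diacritics(text).casefold()


names = ["San José", "naive", "San Jose", "naïve", "Santa Cruz"]
print(sorted(names, key=ignore_diacritics_key))
# ['naive', 'naïve', 'San José', 'San Jose', 'Santa Cruz']
```

Because the key makes "San José" and "San Jose" compare equal, the stable sort simply leaves such pairs in their input order. That is exactly the gap raised elsewhere in the thread: this rule alone says nothing about how to order words that differ only by a diacritic.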
Re: [OT] o-circumflex
On Thu, Sep 06, 2001 at 04:12:28PM -0500, Ayers, Mike wrote: > > > From: David Starner [mailto:[EMAIL PROTECTED]] > > Sent: Thursday, September 06, 2001 01:40 PM > > > On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: > > > The only little thing to know about French and diacritical > > mark is that when > > > doing a sort diacritical mark are evaluated from right to > > left. (e.g. > > > "cote" < "côte" < "coté" vs the English order "cote" < > > "coté" < "côte" ). > > > > I'm not sure there is an established English sort order. It's not a > > problem that comes up much in English. > > I believe that there is an established sort order in English, which > is to sort without regard to diacritics, or else we'd never find the words! Yes, but I mean for "cote", "côte", and "coté". How would you sort those three in English? I'd probably sort it by some extra-lingual information: i.e. page number, date of birth or the like. -- David Starner - [EMAIL PROTECTED] Pointless website: http://dvdeug.dhis.org "I don't care if Bill personally has my name and reads my email and laughs at me. In fact, I'd be rather honored." - Joseph_Greg
Re: [OT] o-circumflex
On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: > The only little thing to know about French and diacritical mark is that when > doing a sort diacritical mark are evaluated from right to left. (e.g. > "cote" < "côte" < "coté" vs the English order "cote" < "coté" < "côte" ). I'm not sure there is an established English sort order. It's not a problem that comes up much in English. -- David Starner - [EMAIL PROTECTED] Pointless website: http://dvdeug.dhis.org "I don't care if Bill personally has my name and reads my email and laughs at me. In fact, I'd be rather honored." - Joseph_Greg
Re: [OT] o-circumflex
> Is it a distinct grapheme, or is it considered a variant of "o"? I would say it is a variant of "o"; we just call it "o with a circumflex accent" ("o avec un accent circonflexe"). The difference between "o" and "ô" is normally audible (for a French speaker). The relationship is the same as with any other letter that sometimes takes an accent (e.g. "a" and "à", "e" and "è", etc.). The only little thing to know about French and diacritical marks is that when doing a sort, diacritical marks are evaluated from right to left (e.g. "cote" < "côte" < "coté" vs. the English order "cote" < "coté" < "côte"). I'm just talking as a French Francophone, not a linguist. Maybe someone on this list knows why diacritical marks are sorted in French in such a funky way :). Cheers, Thierry <><><><><><><><><><><><><><><><><><><><><><> www.i18ngurus.com - Open Internationalization Resources Directory - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, September 06, 2001 3:08 PM Subject: [OT] o-circumflex How do Francophones view the o-circumflex "ô" in relation to the letter "o"? Is it a distinct grapheme, or is it considered a variant of "o"? - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: <[EMAIL PROTECTED]>
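The right-to-left rule Thierry describes can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions, not how any real collation library works: the helper names (`split_accents`, `english_key`, `french_key`) are made up for illustration, accents are recognised purely through Unicode NFD decomposition, and the relative order of two different accents is simply their code point order.

```python
import unicodedata


def split_accents(word):
    """Return (base letters, list of accent strings, one per base letter)."""
    bases = []
    accents = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # A combining mark belongs to the most recent base letter.
            if accents:
                accents[-1] += ch
        else:
            bases.append(ch)
            accents.append("")
    return "".join(bases), accents


def english_key(word):
    """Base letters first; ties broken by accents read left to right."""
    base, accents = split_accents(word)
    return (base, accents)


def french_key(word):
    """Base letters first; ties broken by accents read right to left."""
    base, accents = split_accents(word)
    return (base, accents[::-1])


words = ["coté", "côte", "cote", "côté"]
print(sorted(words, key=english_key))  # ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=french_key))   # ['cote', 'côte', 'coté', 'côté']
```

Both keys compare the unaccented letters first, so accents only matter as a tie-breaker. The only difference is the direction in which the accent list is read: reversing it makes the accent on the final "é" of "coté" count before the accent on the "ô" of "côte", which yields the French order from the message above.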