Re: Collation (was RE: [OT] o-circumflex)
On Thu, Sep 13, 2001 at 12:40:30AM -0700, Edward Cherlin wrote:
: For example,
:
: 1984 (Nineteen Eighty Four)
: 1066 and all that (Ten Sixty Six)
: 3001 (Three Thousand One)
: 2050 (Twenty Fifty)
: 2010 (Twenty Ten)
: 2001, A Space Odyssey (Two Thousand One)

You're missing the "and" from 3001 and 2001. I know Merkins often leave it out, but a number of us always use it and feel it's wrong without. :-)

Putting dialect aside, you may find that 2050 and possibly 2010 will be said "two thousand (and) whatever".

The problem here is that there's no single way to spell out numbers in English, so no single way to alphabetise them. It's better to sort numbers numerically, and then you only have to decide the order for negative numbers.

--
Christopher Vance
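A minimal sketch of the "sort numbers numerically" suggestion, in Python. It assumes that a title starting with a run of digits should be ordered by the value of that number and filed ahead of titles that start with letters. The name `numeric_first_key` and the extra titles "99 bottles" and "Emma" are illustrative only; negative numbers are deliberately left undecided, as in the message.

```python
import re

def numeric_first_key(title):
    """Sort key: a leading integer compares by value, the rest compares as text."""
    match = re.match(r"\s*(\d+)", title)
    if match:
        # Group 0: titles with a leading number, ordered by numeric value,
        # then by whatever text follows the number.
        return (0, int(match.group(1)), title[match.end():].casefold())
    # Group 1: everything else, ordered by case-folded text.
    return (1, 0, title.casefold())

titles = [
    "3001", "1984", "2001, A Space Odyssey", "1066 and all that",
    "2050", "2010", "99 bottles", "Emma",
]
ordered = sorted(titles, key=numeric_first_key)
for title in ordered:
    print(title)
```

A plain string sort would file "99 bottles" after "3001" because it compares digit by digit; the key above compares 99 with 3001 as numbers instead.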
Collation (was RE: [OT] o-circumflex)
English and several other languages have dozens of collations. Compare telephone books, library catalogs, book indexes (sic), and other sorted data. Knuth vol. 3, Sorting and Searching, gives an example of a set of library sorting rules that runs to more than a page, and suggests programming it as an exercise. ;-)

Among the rules is one to spell out numbers. For example,

1984 (Nineteen Eighty Four)
1066 and all that (Ten Sixty Six)
3001 (Three Thousand One)
2050 (Twenty Fifty)
2010 (Twenty Ten)
2001, A Space Odyssey (Two Thousand One)

Bell Labs invented a whole programming language, Snobol, to deal with telephone listing conversions, matches, and sorts.

Many phone books sort Mc- and Mac- together, others one after the other but separate from other names.

Edward Cherlin
Generalist
"A knot! Oh, do let me help to undo it."
Alice in Wonderland

-Original Message-
On Behalf Of Michael (michka) Kaplan
Sent: Mon, September 10, 2001 8:36 AM

> From: Mark Davis [EMAIL PROTECTED]
> > Michael, that isn't the point. There is a problem even when you stick to one language.
>
> By that time, many languages may have TWO collations, since users have been expecting something else for the last few decades?
>
> MichKa
> Michael Kaplan
> Trigeminal Software, Inc.
> http://www.trigeminal.com/
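The Mc-/Mac- convention can be sketched as a sort key. This is only an illustration of the "sort them together" variant, assuming every surname that begins with "Mc" is to be filed as if it were spelled "Mac"; the function name `mac_key` and the sample surnames are made up for the example.

```python
import re

def mac_key(surname):
    """Key that files 'Mc...' as if spelled 'Mac...' (one phone-book convention)."""
    folded = surname.casefold()
    # Assumption: any name starting with "mc" is a Mc-/Mac- name.
    folded = re.sub(r"^mc", "mac", folded)
    # The original spelling breaks ties between otherwise identical keys.
    return (folded, surname)

names = ["Mabry", "McDonald", "Macdonald", "MacArthur", "Madison", "McAdam"]
together = sorted(names, key=mac_key)
plain = sorted(names, key=str.casefold)
print(together)
print(plain)
```

With the key, McAdam lands next to MacArthur; with a plain case-folded sort, the Mc- names are pushed past Madison.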
Collation (was RE: [OT] o-circumflex)
Whoever invented English number words, then, had a very sick sense of humour. Why doesn't the word for "one" start with "a", the word for "two" with "b", etc.?

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
- White Town

--- Original Message ---
From: Edward Cherlin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc:
Date: 01/09/13 7:40
Subject: Collation (was RE: [OT] o-circumflex)

> [...] Among the rules are to spell out numbers. [...]
Re: Collation (was RE: [OT] o-circumflex)
Java's collation class has a rule-based collator that is in effect programmable using a little language. Here is an example from Sun's API doc for Norwegian:

    // import java.text.RuleBasedCollator;
    // In the rule string, '<' starts a new primary (letter) difference,
    // ';' a secondary difference, ',' a tertiary (case) difference,
    // and '=' marks sequences that compare as identical.
    String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
                       "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
                       "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
                       "< å=a\u030A,Å=A\u030A" +
                       ";aa,AA< æ,Æ< ø,Ø";
    // The constructor throws java.text.ParseException if the rules are malformed.
    RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);

There is also syntax for things such as specifying reverse order (for French accents, for example), contraction and expansion.

- David Gallardo

- Original Message -
From: Edward Cherlin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, September 13, 2001 3:40 AM
Subject: Collation (was RE: [OT] o-circumflex)

> [...] Knuth vol. 3 Sorting and Searching gives an example of a set of library sorting rules that runs to more than a page, and suggests programming it as an exercise. ;-) [...]
Re: Collation (was RE: [OT] o-circumflex)
In the latest ICU, we took the work we did for Java collation and extended it substantially (and made it many times faster). It also allows arbitrary customization at runtime. I happen to be giving a presentation on it in a few hours at the conference.

For more information, see the draft collation chapter in the User guide, at http://oss.software.ibm.com/icu/. The presentation (a slightly older draft) is on my site at www.macchiato.com

Mark
—
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]

- Original Message -
From: David Gallardo [EMAIL PROTECTED]
To: Edward Cherlin [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Thursday, September 13, 2001 8:35 AM
Subject: Re: Collation (was RE: [OT] o-circumflex)

> Java's collation class has a rule-based collator that is in effect programmable using a little language. [...]
Re: Alternative sorting for digraphs (Was Re: [OT] o-circumflex)
On Mon, 10 Sep 2001, Mark Davis wrote:
> A ZWNJ will break ligatures and cursive connections. While probably safe in Danish or Dutch, it is unclear to me that that is safe in all languages where this situation occurs. There are digraphs in Urdu, for example. While I don't know their sorting order, if they do sort separately then ZWNJ can't be used to express the alternative sorting, since it would give the wrong rendering.

:'-(

I would like to ask that the overuse of ZWNJ be stopped. I once loved that character...

What about *renaming* the character to Zero Width All-Purpose Everything Breaker?

roozbeh
Re: [OT] o-circumflex
- Original Message -
From: Keld Jørn Simonsen [EMAIL PROTECTED]
To: Stefan Persson [EMAIL PROTECTED]
Cc: Mark Davis [EMAIL PROTECTED]; Michael (michka) Kaplan [EMAIL PROTECTED]; Keld Jørn Simonsen [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: 10 September 2001 22:12
Subject: Re: [OT] o-circumflex

> Where is this done for Swedish? I have read both the TN and the SIS standard, and I don't believe these say anything about sorting ü according to either German or Dutch sounds. Rolf Gavare does not say anything along these lines either, as far as I can remember.

This is the sorting used in dictionaries, encyclopædias, phone books etc. For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts myskoxe/müsli/mysning.

Stefan

_
Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
Re: [OT] o-circumflex
- Original Message -
From: Lars Marius Garshol [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: 10 September 2001 22:45
Subject: Re: [OT] o-circumflex

> I am not sure of this, but I think 'å' is a relatively modern invention, and that it was originally written only as 'aa'.

FYI, "a relatively modern invention" here means that it has been used since the Middle Ages (in Swedish).

Stefan
Re: [OT] o-circumflex
On Tue, Sep 11, 2001 at 06:27:20PM +0200, Stefan Persson wrote:
> - Original Message -
> From: Keld Jørn Simonsen [EMAIL PROTECTED]
> To: Stefan Persson [EMAIL PROTECTED]
> Cc: Mark Davis [EMAIL PROTECTED]; Michael (michka) Kaplan [EMAIL PROTECTED]; Keld Jørn Simonsen [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: 10 September 2001 22:12
> Subject: Re: [OT] o-circumflex
>
> > Where is this done for Swedish? I have read both the TN and the SIS standard, and I don't believe these say anything about sorting ü according to either German or Dutch sounds. Rolf Gavare does not say anything along these lines either, as far as I can remember.
>
> This is the sorting used in dictionaries, encyclopædias, phone books etc. For example, SAOL (Svenska Akademiens ordlista över svenska språket) sorts myskoxe/müsli/mysning.

Yes, I can understand that. In Danish we have the same rule. But do you have examples of Dutch words that are ordered in another way? That is, cases where you need to know the origin of the word in order to sort it.

Kind regards
keld
Re: [OT] o-circumflex
* Lars Marius Garshol
|
| I am not sure of this, but I think 'å' is a relatively modern
| invention, and that it was originally written only as 'aa'.

* Stefan Persson
|
| FYI, a relatively modern invention means that it has been used
| since the Middle Ages (in Swedish).

I don't think that is the case in Norwegian and Danish. The Norwegian constitution from 1814, for example, uses 'ø' and 'æ', but never 'å'. Possibly this was a Swedish invention only adopted later by the Danes and Norwegians.

--Lars M.
RE: [OT] o-circumflex
John Cowan wrote: None of which is as weird as Leghorn for Livorno (Italy). It's as weird as some Italian names for German cities: Aquisgrana for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for München. _ Marco
RE: [OT] o-circumflex
Carl W. Brown wrote: In Arabic do you include vowels or not? Yes, and also consonants sometimes... Traditional Arabic dictionary sorting uses the three-letter root (radical) of a word as the primary key. So, madrasa (school) would be under d (because its radical is d-r-s = to learn), ignoring the ma- prefix. I doubt, however, that this system is used with automatic sort orders generated by computers. _ Marco
RE: [OT] o-circumflex
Asmus Freytag wrote:
> But if you do this, all compound words starting with data and continuing with another word starting with a will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) [...]

Real-life sort orders often ignore these subtleties and are often based on a small set of rules which is applied blindly, regardless of the origin, meaning, or pronunciation of headwords.

For instance, I have noticed that Dutch telephone directories always sort the sequence ij as if it were y, regardless of whether it actually occurs in a Dutch word. E.g., Beijing Chinese Restaurant would be listed after Mr. Bex.

Similarly, old Italian encyclopedias (e.g. Dizionario Enciclopedico Treccani) equated j to i because, in Italian, the former is just a graphic variant of the latter. But this also applied to foreign names such as Jefferson (which was listed between iee- and ieg-), even though, of course, it would not be allowed to spell it Iefferson.

_ Marco
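The "small set of rules applied blindly" idea can be sketched for the Dutch directory case. This is an assumption-laden illustration, not the rule any real directory publishes: the key simply treats every "ij" (and the single-character ligature U+0133) as "y", whatever the origin of the word. The function name and the entries other than those mentioned above are invented for the example.

```python
def dutch_directory_key(entry):
    """Blind rule: every 'ij' sorts as 'y', regardless of the word's origin."""
    folded = entry.casefold()
    # U+0133 is the single-character ij ligature; treat it the same way.
    return folded.replace("\u0133", "y").replace("ij", "y")

entries = ["Bezem", "Beijing Chinese Restaurant", "Bex", "Bever", "Meijer", "Meyer"]
listing = sorted(entries, key=dutch_directory_key)
for entry in listing:
    print(entry)
```

Because the rule is blind, Beijing ends up between Bex and Bezem, and Meijer and Meyer get identical keys, so they stay next to each other in their input order.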
Re: [OT] o-circumflex
At 18:04 +0200 2001-09-09, Stefan Persson wrote: well, the official spelling of the town is Aalborg. In Sweden it has always been written Ålborg. At one stage, in both countries, it was written Álaborg, I suspect, as it is in Iceland today. -- Michael Everson
Re: [OT] o-circumflex
At 18:10 -0400 2001-09-09, John Cowan wrote: Keld Jørn Simonsen scripsit: Yes, foreigners call our cities many strange things:-) København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more. In Iceland it is Kaupmannahöfn, I believe. In unadorned English that would be something like Cheapmenshaven, maybe to weaken as Cheapenhaven, in German Kaufenhagen -- Michael Everson
Re: [OT] o-circumflex
On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> Asmus Freytag wrote:
> > But if you do this, all compound words starting with data and continuing with another word starting with a will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) [...]
>
> Real-life sort orders often ignore these subtleties and are often based on a small set of rules which is applied blindly, regardless of the origin, meaning, or pronunciation of headwords.

Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere to these Danish rules, once you have set up your machine for Danish.

Kind regards
Keld
Re: [OT] o-circumflex
From: Keld Jørn Simonsen [EMAIL PROTECTED] Real-life sorts, like MS Windows sorting or Linux sorting, actually adheres to these Danish rules, once you have set up your machine for Danish. And this is the *true* answer to the whole mess of attempting *multilingual* sorts -- once the user chooses the sort they WANT, the system might handle other language strings in a way that might be obscure to those who know the other language but the person who expected Danish or whatever will see what they want. Since various sorts openly conflict with each other there is no other general case solution which would be appropriate, anyway? (can't believe this thread is still going on!) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
RE: [OT] o-circumflex
> On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > Asmus Freytag wrote:
> > > But if you do this, all compound words starting with data and continuing with another word starting with a will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) [...]
> >
> > Real-life sort orders often ignore these subtleties and are often based on a small set of rules which is applied blindly, regardless of the origin, meaning, or pronunciation of headwords.
>
> Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere to these Danish rules, once you have set up your machine for Danish.

If I understand what you mean, perhaps my point was not clear.

I know that aa sorts like å, and that it should go after z. But there are also cases when the sequence aa is just two a's, adjacent to each other by pure chance. One of these cases could be the word dataarkiv, which I found in a Danish web page (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Now: if your Windows or Linux collation states (correctly!) that aa should go after z, you may have a list ordered like this:

Order A:
1. data
2. Datben, Dr. Keld
3. Datz, Mr. Marco
4. dataarkiv
5. Datåz, Dr. Asmus

But if dataarkiv was written using an invisible separator between the two a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be like this:

Order B:
1. data
2. dataarkiv
3. Datben, Dr. Keld
4. Datz, Mr. Marco
5. Datåz, Dr. Asmus

Asmus was arguing that Order B would be the correct one (and this is certainly true in, e.g., a dictionary) but, in order to obtain it, the source text must be properly encoded with invisible separators inserted where needed.

What I was saying is that the automatic Order A is also often used, and I brought the example of the Dutch phone directories (where Beijing is sorted as if it was Beying), and of the Italian encyclopedia (where Jefferson is sorted as if it was Iefferson).

Michael (michka) Kaplan wrote:
> And this is the *true* answer to the whole mess of attempting *multilingual* sorts -- once the user chooses the sort they WANT, the system might handle other language strings in a way that might be obscure to those who know the other language but the person who expected Danish or whatever will see what they want.

And this is precisely what I was trying to say, although I was not necessarily talking about multilingual sort (dataarkiv seems a purely Danish word, although derived from Latin roots).

For some users and some usages, the incorrect Order A may be much more useful than the correct Order B. If the rules say that ij goes between x and z, a Dutchman should find the Beijing Restaurant between bex- and bez-.

If someone wants Order B (as may be the case for the author of a dictionary), then they should apply Asmus' suggestion in order to drive the collation algorithm.

_ Marco
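The two orders can be reproduced with a small sketch of a Danish-style key. It is written under several assumptions: the alphabet is a to z followed by æ, ø, å; every adjacent "aa" is treated as å; a soft hyphen (U+00AD) or zero width non-joiner (U+200C) between the two a's blocks that treatment and is otherwise ignored; spaces and punctuation are skipped. The name `danish_key` is hypothetical, and this is not how Windows, Linux, or any real collator is implemented.

```python
import unicodedata

ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}
SEPARATORS = {"\u00ad", "\u200c"}  # SOFT HYPHEN, ZERO WIDTH NON-JOINER

def danish_key(word):
    """List of letter ranks; 'aa' counts as 'å' unless a separator splits it."""
    text = unicodedata.normalize("NFC", word).casefold()
    ranks = []
    i = 0
    while i < len(text):
        if text.startswith("aa", i):
            ranks.append(RANK["å"])      # contraction: aa sorts like å
            i += 2
        elif text[i] in SEPARATORS:
            i += 1                       # invisible; it has already split the "aa"
        elif text[i] in RANK:
            ranks.append(RANK[text[i]])
            i += 1
        else:
            i += 1                       # simplification: skip spaces, punctuation
    return ranks

plain = ["Datz, Mr. Marco", "dataarkiv", "data", "Datåz, Dr. Asmus", "Datben, Dr. Keld"]
marked = [w.replace("dataarkiv", "data\u00adarkiv") for w in plain]

order_a = sorted(plain, key=danish_key)   # "aa" taken as å, so dataarkiv goes after Datz
order_b = sorted(marked, key=danish_key)  # separator present, so dataarkiv follows data
print(order_a)
print(order_b)
```

The same key function produces both lists; only the encoding of the data differs.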
Re: [OT] o-circumflex
Mon, 10 Sep 2001 10:47:48 +0200, Marco Cimarosti [EMAIL PROTECTED] wrote:
> It's as weird as some Italian names for German cities: Aquisgrana for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for München.

Interesting that Polish names of these cities are more like Italian than German: Akwizgran, Augsburg, Moguncja, Monachium.

København is Kopenhaga, again more like other foreign forms than Danish.

--
 __(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^ SYGNATURA ZASTĘPCZA
QRCZAK
Re: [OT] o-circumflex
On Mon, Sep 10, 2001 at 03:58:05PM +0200, Marco Cimarosti wrote:
> > On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > > Asmus Freytag wrote:
> > > > But if you do this, all compound words starting with data and continuing with another word starting with a will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) [...]
> > >
> > > Real-life sort orders often ignore these subtleties and are often based on a small set of rules which is applied blindly, regardless of the origin, meaning, or pronunciation of headwords.
> >
> > Real-life sorts, like MS Windows sorting or Linux sorting, actually adhere to these Danish rules, once you have set up your machine for Danish.
>
> If I understand what you mean, perhaps my point was not clear.

My point was that real-life sorts nowadays are quite sophisticated, and the major systems have adequate sorting for Danish and other languages with that kind of complexity.

> I know that aa sorts like å, and that it should go after z. But there are also cases when the sequence aa is just two a's, adjacent to each other by pure chance. One of these cases could be the word dataarkiv, which I found in a Danish web page (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Yes, and ekstraarbejde - extra work. I know.

> Now: if your Windows or Linux collation states (correctly!) that aa should go after z, you may have a list ordered like this:
>
> Order A:
> 1. data
> 2. Datben, Dr. Keld
> 3. Datz, Mr. Marco
> 4. dataarkiv
> 5. Datåz, Dr. Asmus
>
> But if dataarkiv was written using an invisible separator between the two a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be like this:
>
> Order B:
> 1. data
> 2. dataarkiv
> 3. Datben, Dr. Keld
> 4. Datz, Mr. Marco
> 5. Datåz, Dr. Asmus
>
> Asmus was arguing that Order B would be the correct one (and this is certainly true in, e.g., a dictionary) but, in order to obtain it, the source text must be properly encoded with invisible separators inserted where needed.

Yes, that is also my advice.

> What I was saying is that the automatic Order A is also often used, and I brought the example of the Dutch phone directories (where Beijing is sorted as if it was Beying), and of the Italian encyclopedia (where Jefferson is sorted as if it was Iefferson).

You have to sort it according to the expectations of the user. A Dutch book would use Dutch rules, an Italian book would use the Italian order. You cannot mix orderings, such that some words follow one set of rules and other words follow other rules. It all needs to be comprehended by one human, the reader, and there only one ruleset applies.

> Michael (michka) Kaplan wrote:
> > And this is the *true* answer to the whole mess of attempting *multilingual* sorts -- once the user chooses the sort they WANT, the system might handle other language strings in a way that might be obscure to those who know the other language but the person who expected Danish or whatever will see what they want.
>
> And this is precisely what I was trying to say, although I was not necessarily talking about multilingual sort (dataarkiv seems a purely Danish word, although derived from Latin roots).
>
> For some users and some usages, the incorrect Order A may be much more useful than the correct Order B. If the rules say that ij goes between x and z, a Dutchman should find the Beijing Restaurant between bex- and bez-.
>
> If someone wants Order B (as may be the case for the author of a dictionary), then they should apply Asmus' suggestion in order to drive the collation algorithm.

I think we agree, but what you call a simple set of rules I call quite complex. I also think that the Danish rules are quite simple, as they can be formulated in, say, 4 lines of Danish prose. But compared to ASCII sorting they are to some people unbelievably complex, and I think many Danes believe that you cannot get programs that adhere to them, although the major systems do that out of the box.

Your incorrect and correct examples use the very same sorting algorithm; the only difference is that the data is coded differently.

But maybe you are driving at a yet more complex sorting, one that can sort according to multiple rules? Beijing should then not be sorted as Beÿing? As stated above, I think - and other sorting experts do too - that sorting with multiple rules is a conceptual misunderstanding.

Kind regards
Keld
Re: [OT] o-circumflex
From: Mark Davis [EMAIL PROTECTED]

> Michael, that isn't the point. There is a problem even when you stick to one language.
>
> That is, there are situations where two letters in a language, e.g. ch in Slovak, are normally sorted as one. However, in some exceptional circumstances those letters should be sorted separately. It could be because they come originally from another language, or it could be because they happen to arise when two other words are conjoined. There is no algorithmic distinction. So without some special character, it would require a dictionary look-up to produce the right sort.

I would argue that most users of the language are not expecting this type of thing, and that when they are looking for a word this might be the SECOND place they look, not the first. There are exceptions, but they are not outnumbered by the general case, by any means.

> For example, suppose that th were sorted separately in English, after Z. Yet people would expect the following order:
>
> cast
> cathouse
> caul
> cathode
>
> because the t and h are logically separate in cathouse.

Again, I think most people would look first in the place that does not assume the exception -- the computer's original limitations have trained them.

The notion of a natural language processing engine that would have all of the specific differences (with appropriate dictionaries for exceptions to even the NLP results) is a fascinating notion, but one that no one is even close to, yet. We do not even have available UCA tailorings for most of the world's languages. Though I have high hopes for the future (if not in the UCA then in other mechanisms).

By that time, many languages may have TWO collations, since users have been expecting something else for the last few decades?

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
Re: [OT] o-circumflex
Michael, that isn't the point. There is a problem even when you stick to one language.

That is, there are situations where two letters in a language, e.g. ch in Slovak, are normally sorted as one. However, in some exceptional circumstances those letters should be sorted separately. It could be because they come originally from another language, or it could be because they happen to arise when two other words are conjoined. There is no algorithmic distinction. So without some special character, it would require a dictionary look-up to produce the right sort.

For example, suppose that th were sorted separately in English, after Z. Yet people would expect the following order:

cast
cathouse
caul
cathode

because the t and h are logically separate in cathouse.

Mark
—
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]

- Original Message -
From: Michael (michka) Kaplan [EMAIL PROTECTED]
To: Keld Jørn Simonsen [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Monday, September 10, 2001 5:48 AM
Subject: Re: [OT] o-circumflex

> [...] And this is the *true* answer to the whole mess of attempting *multilingual* sorts -- once the user chooses the sort they WANT, the system might handle other language strings in a way that might be obscure to those who know the other language but the person who expected Danish or whatever will see what they want. [...]
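The hypothetical "th after Z" example can be sketched with the dictionary look-up described above. Everything here is an assumption made for illustration: the invented alphabet with "th" as a 27th letter, the `EXCEPTIONS` table, and the "|" marker used inside it to show where a word is a compound. No real English collation works this way.

```python
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + ["th"]  # hypothetical: th after z
RANK = {letter: index for index, letter in enumerate(ALPHABET)}

# Hypothetical exception dictionary: "|" marks a boundary where t and h
# merely happen to meet and must not be treated as the letter "th".
EXCEPTIONS = {"cathouse": "cat|house"}

def th_key(word, exceptions=EXCEPTIONS):
    """Letter ranks with 'th' as one letter, except where the dictionary splits it."""
    folded = word.casefold()
    text = exceptions.get(folded, folded)
    ranks = []
    i = 0
    while i < len(text):
        if text.startswith("th", i):
            ranks.append(RANK["th"])
            i += 2
        elif text[i] in RANK:
            ranks.append(RANK[text[i]])
            i += 1
        else:
            i += 1  # the "|" boundary marker (or any other non-letter) is skipped
    return ranks

words = ["cathode", "caul", "cathouse", "cast"]
with_lookup = sorted(words, key=th_key)
without_lookup = sorted(words, key=lambda w: th_key(w, exceptions={}))
print(with_lookup)
print(without_lookup)
```

With the look-up the result matches the expected order in the message; with an empty exception table, cathouse is filed after caul together with cathode.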
Re: [OT] o-circumflex
On Mon, 10 Sep 2001 16:42:45 +0200, Keld Jørn Simonsen wrote:
> But maybe you are driving for a yet more complex sorting, one that can sort according to multiple rules? Beijing should then not be sorted as Beÿing?

I haven't followed this discussion from the beginning, so apologies if I'm missing the point, but it seems to me that the Beijing case in Dutch is no different from the ekstraarbejde case in Danish - a SHY or ZWNJ is all that is needed to stop Beijing sorting with Bey.

John.
--
-- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/
-- Translate your technical documents and web pages - http://www.tradoc.fr/
Alternative sorting for digraphs (Was Re: [OT] o-circumflex)
A SHY will mean that the word can break at Bei-jing. It is not clear to me at least that that is safe in all cases for all languages with digraphs that sort separately, although it may be a solution for some.

A ZWNJ will break ligatures and cursive connections. While probably safe in Danish or Dutch, it is unclear to me that that is safe in all languages where this situation occurs. There are digraphs in Urdu, for example. While I don't know their sorting order, if they do sort separately then ZWNJ can't be used to express the alternative sorting, since it would give the wrong rendering.

Mark
—
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]

- Original Message -
From: John Wilcock [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, September 10, 2001 8:39 AM
Subject: Re: [OT] o-circumflex

> [...] it seems to me that the Beijing case in Dutch is no different from the ekstraarbejde case in Danish - a SHY or ZWNJ is all that is needed to stop Beijing sorting with Bey.
RE: [OT] o-circumflex
John Wilcock wrote: I haven't followed this discussion from the beginning, so apologies if I'm missing the point, but it seems to me that the Beijing case in Dutch is no different from the ekstraarbejde case in Danish - a SHY or ZWNJ is all that is needed to stop Beijing sorting with Bey. Yes, it is exactly the same thing. But my point is that a Dutch reader probably *does* expect Beijing to sort like Bey, not like Bei. So, in some cases, a correct (i.e., expected) behavior could rather be to *remove* all SHY/ZWNJ's before sorting. _ Marco
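A short sketch of the "remove all SHY/ZWNJ's before sorting" option, assuming a directory-style Dutch key in which "ij" sorts as "y". The function name, the `ignore_separators` flag, and the entry "Beitel" are made up for the example; the entry for Beijing is assumed to have been typed with a soft hyphen between i and j.

```python
SHY, ZWNJ = "\u00ad", "\u200c"
DROP_INVISIBLE = str.maketrans("", "", SHY + ZWNJ)  # table that deletes both characters

def dutch_key(entry, ignore_separators):
    """'ij' sorts as 'y'; optionally strip SHY/ZWNJ first so they cannot block that."""
    text = entry.casefold()
    if ignore_separators:
        text = text.translate(DROP_INVISIBLE)
    return text.replace("ij", "y")

entries = ["Bezem", "Bei\u00adjing", "Bex", "Beitel"]
with_stripping = sorted(entries, key=lambda e: dutch_key(e, True))
without_stripping = sorted(entries, key=lambda e: dutch_key(e, False))
print(with_stripping)
print(without_stripping)
```

When the separators are stripped, Beijing is found after Bex, where a reader of the directory would look; when they are kept, it stays with the bei- entries.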
Re: [OT] o-circumflex
If they can't agree on the pronunciation for these cities, can they agree on the Hanzi for them? What ARE the Hanzi for these cities, anyway??

じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
- White Town

--- Original Message ---
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc:
Date: 01/09/10 14:02
Subject: Re: [OT] o-circumflex

> Interesting that Polish names of these cities are more like Italian than German: Akwizgran, Augsburg, Moguncja, Monachium. [...]
Re: [OT] o-circumflex
- Original Message -
From: Marco Cimarosti [EMAIL PROTECTED]
To: 'John Wilcock' [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: 10 September 2001 18:35
Subject: RE: [OT] o-circumflex

> John Wilcock wrote:
> > I haven't followed this discussion from the beginning, so apologies if I'm missing the point, but it seems to me that the Beijing case in Dutch is no different from the ekstraarbejde case in Danish - a SHY or ZWNJ is all that is needed to stop Beijing sorting with Bey.
>
> Yes, it is exactly the same thing. But my point is that a Dutch reader probably *does* expect Beijing to sort like Bey, not like Bei.
>
> So, in some cases, a correct (i.e., expected) behavior could rather be to *remove* all SHY/ZWNJ's before sorting.

I thought ij sorted after z?
RE: [OT] o-circumflex
Stefan Persson wrote:
> I thought ij sorted after z?

Not in Dutch: as far as I have seen it sorts the same as y. In fact, in the telephone directory many people who had a y in their surname were listed near people who had the same surname spelled with ij (e.g. Meyer and Meijer).

(Anyway, next time they send me to Holland, I'll ask for a downtown hotel. So, after dinner, I'll go sightseeing rather than spending the whole evening looking at the collation of the phone directory :-)

_ Marco
Re: [OT] o-circumflex
There is a similar problem with Swedish: Our alphabet goes: a ... u v w (no difference made) x y z å ä (the Danish/Norwegian æ is also sorted as ä) ö (the Danish/Norwegian ø is also sorted as ö) The German character ü is pronounced as a Swedish y, so when any German name or loan word containing that character occurs in Swedish it should be sorted as y. However, if any ü occurs in a Dutch loan word it is considered as a u with umlaut and is sorted as u. The same goes for ä and ö: If they are the Swedish/Finnish/German letters ä and ö they are sorted after å, if they are the Dutch letters a with umlaut and o with umlaut, they're sorted as a and o in a Swedish encyclopædia. In Swedish the Danish/Norwegian letter æ is sorted as ä, while the Latin/Icelandic letter æ is sorted as ae. Stefan - Original Message - From: Mark Davis [EMAIL PROTECTED] To: Michael (michka) Kaplan [EMAIL PROTECTED]; Keld Jørn Simonsen [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: den 10 september 2001 17:27 Subject: Re: [OT] o-circumflex Michael, that isn't the point. There is a problem even when you stick to one language. That is, there are situations where two letters in a language, e.g. ch in Slovak, are normally sorted as one. However, in some exceptional circumstances those letters should be sorted separately. It could be because they come originally from another language, or it could be because they happen to arise when two other words are conjoined. There is no algorithmic distinction. So without some special character, it would require a dictionary look-up to produce the right sort. For example, suppose that th were sorted separately in English, after Z. Yet people would expect the following order: cast, cathouse, caul, cathode, because the t and h are logically separate in cathouse.
Mark — Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: Michael (michka) Kaplan [EMAIL PROTECTED] To: Keld Jørn Simonsen [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Monday, September 10, 2001 5:48 AM Subject: Re: [OT] o-circumflex From: Keld Jørn Simonsen [EMAIL PROTECTED] Real-life sorts, like MS Windows sorting or Linux sorting, actually adheres to these Danish rules, once you have set up your machine for Danish. And this is the *true* answer to the whole mess of attempting *multilingual* sorts -- once the user chooses the sort they WANT, the system might handle other language strings in a way that might be obscure to those who know the other language but the person who expected Danish or whatever will see what they want. Since various sorts openly conflict with each other there is no other general case solution which would be appropriate, anyway? (can't believe this thread is still going on!) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/ _ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
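[Editor's sketch] Mark's th example can be tried out in a few lines of Python. This is a toy, not the UCA: the made-up rule that th sorts after z comes from his message, while the function name toy_key, the numeric weights, and the choice of ZWNJ (U+200C) as the "special character" are assumptions made for illustration. It only handles lowercase a-z.

```python
ZWNJ = "\u200c"  # zero width non-joiner, standing in for the "special character"

def toy_key(word):
    """Sort key for a made-up English in which the digraph 'th' sorts after 'z'."""
    key = []
    i = 0
    while i < len(word):
        if word[i] == ZWNJ:
            i += 1              # the separator itself carries no weight
        elif word.startswith("th", i):
            key.append(27)      # contraction: one unit placed after 'z' (26)
            i += 2
        else:
            key.append(ord(word[i]) - ord("a") + 1)  # a=1 ... z=26
            i += 1
    return key

# Without a separator, "cathouse" is treated as ca + th + ouse.
unmarked = sorted(["cathode", "cathouse", "caul", "cast"], key=toy_key)
# -> ['cast', 'caul', 'cathode', 'cathouse']

# With a ZWNJ between t and h, the contraction is broken.
marked = sorted(["cathode", "cat" + ZWNJ + "house", "caul", "cast"], key=toy_key)
visible = [w.replace(ZWNJ, "") for w in marked]
# -> ['cast', 'cathouse', 'caul', 'cathode']
```

The point of the sketch is the one Mark makes: the key function has no way to know that cathouse is cat + house unless the data says so, either through a marker character or through a dictionary look-up.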
Re: [OT] o-circumflex
On Mon, 10 Sep 2001, てんどうりゅうじ wrote: If they can't agree on the pronunciation for these cities, can they agree on the Hanzi for them? What ARE the Hanzi for these cities, anyway?? Are you asking for the names of cities in Chinese? Copenhagen is ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839. The Han characters used to write the names of cities depend on many factors, including but not limited to source spelling/pronunciation, language/dialect of the rendering party, mapping rules used by the renderer, time period, etc. For example, New York is rendered in Chinese as Mandarin niu3yue4 \u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in Japanese it was at one time rendered as \u7d10\u80b2, lit. 'button-rearing'. Asking for the hanzi (from your wording, I don't think you are just talking about Chinese usage of Han characters) is like asking for a single Latin script rendering. (I think you need to get yourself an English-Chinese dictionary or something, btw...) Thomas Chan [EMAIL PROTECTED]
Re: [OT] o-circumflex
Where is this done for Swedish? I have read both the TN and the SIS standard, and I don't believe these say anything about sorting ü according to either German or Dutch sounds. Rolf Gavare does not say anything along these lines either, as far as I can remember. Kind regards keld On Mon, Sep 10, 2001 at 07:09:34PM +0200, Stefan Persson wrote: There is a similar problem with Swedish: Our alphabet goes: a ... u v w (no difference made) x y z å ä (the Danish/Norwegian æ is also sorted as ä) ö (the Danish/Norwegian ø is also sorted as ö) The German character ü is pronounced as a Swedish y, so when any German name or loan word containing that character occurs in Swedish it should be sorted as y. However, if any ü occurs in a Dutch loan word it is considered as a u with umlaut and is sorted as u. The same goes for ä and ö: If they are the Swedish/Finnish/German letters ä and ö they are sorted after å, if they are the Dutch letters a with umlaut and o with umlaut, they're sorted as a and o in a Swedish encyclopædia. In Swedish the Danish/Norwegian letter æ is sorted as ä, while the Latin/Icelandic letter æ is sorted as ae. Stefan
Re: [OT] o-circumflex
I hate this sort: Club Mix 2000 Club Mix 98 Club Mix 99 Those non Y2K compliant fools! じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town --- Original Message --- From: Stefan Persson [EMAIL PROTECTED]; To: Mark Davis [EMAIL PROTECTED]; Michael (michka) Kaplan [EMAIL PROTECTED]; Keld Jørn Simonsen [EMAIL PROTECTED]; [EMAIL PROTECTED]; Cc: Date: 01/09/10 17:09 Subject: Re: [OT] o-circumflex
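[Editor's sketch] The Club Mix ordering is what a plain code-point sort gives, because the character 2 comes before 9. A common workaround is to split each title into text and digit runs and compare the digit runs by value. The following Python is a minimal sketch of that idea; the name natural_key is made up here, and it assumes all titles follow a text, number, text pattern.

```python
import re

def natural_key(title):
    """Split a title into alternating text and digit runs; compare digits by value."""
    # With a capturing group, re.split returns text at even indexes and
    # digit runs at odd indexes, so the two kinds never meet in a comparison.
    parts = re.split(r"(\d+)", title)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]

titles = ["Club Mix 2000", "Club Mix 98", "Club Mix 99"]

plain = sorted(titles)
# -> ['Club Mix 2000', 'Club Mix 98', 'Club Mix 99']

natural = sorted(titles, key=natural_key)
# -> ['Club Mix 98', 'Club Mix 99', 'Club Mix 2000']
```

Here 98 and 99 land before 2000 only because they are smaller numbers; the key knows nothing about two-digit years, so a hypothetical Club Mix 01 would sort first.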
Re: [OT] o-circumflex
* Carl W. Brown | | You are quite correct that is why Unicode support differing | collation strengths. Some times you only care about the actual | letters without diacritics. But even then letters are locale | sensitive. For example the Danish alphabet starts with an A and | ends it with A ring above. A Dane would look for Alborg near the | end of a list of towns. This example doesn't apply to this discussion, since Danes and Norwegians consider Å to be a separate letter. That is, it is not A with ring above, but Å, which is not related to A any more than E is related to F. What J. M. Sykes writes about the lack of established sort orders seems right to me. I've done consulting work for Norwegian encyclopedia publishers, which involved developing their sorting routines. The orders for the different publishers did differ, and it is not so surprising given that there are a number of cases to consider, such as how to sort diacritics, what to consider as diacritics, how to sort numbers, Roman numerals, ordinals, and whatnot. --Lars M.
Re: [OT] o-circumflex
* Francesco Zappa Nardelli | | I was in Aalborg fifteen days ago, and I have seen its name written | both as Ålborg and as Aalborg. Where does Aalborg appear in a list | of towns? At the end. In both Danish and Norwegian 'aa' and 'å' are considered equivalent. I am not sure of this, but I think 'å' is a relatively modern invention, and that it was originally written only as 'aa'. --Lars M.
Re: [OT] o-circumflex
* Jonathan Rosenne | | This is not always the right thing to do. For example, with personal | names the person involved may decide whether he prefers the old (AA) | spelling or the new Å. In any case they are equivalent. This is true, but this is nothing particular to the aa/å distinction. Many given names have a number of possible spellings, such as Astri / Astrid, Cathrine / Katrine / Kathrine, Wenche / Venke / Venche, Espen / Esben, ... In fact, given names which can be written both aa and å are rare. I can only think of Åge offhand, and that is only rarely written Aage in Norway (and the other way round in Denmark). AA/Å confusion is much more common in surnames, but there there is no choice involved. --Lars M.
Re: [OT] o-circumflex
* Keld Jørn Simonsen | | Yes, foreigners call our cities many strange things:-) København is | called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more. * Michael Everson | | In Iceland it is Kaupmannahöfn, I believe. In unadorned English that | would be something like Cheapmenshaven, maybe to weaken as | Cheapenhaven, in German Kaufenhagen Which makes eminent sense, given that København by this logic would translate as Cheapenhaven. (Your German translation should be Kaufmannshagen, I guess, to become Kaufenhagen when translated from København.) --Lars M.
Re: [OT] o-circumflex
* Marco Cimarosti | | One of these cases could be the word dataarkiv, which I found in a Danish | web page | (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html). Uh, no, you found it in a Norwegian web page. The word is the same in Danish, though. | Order B: | 1. data | 2. dataarkiv | 3. Datben, Dr. Keld | 4. Datz, Mr. Marco | 5. Datåz, Dr. Asmus | | Asmus was arguing that List B would be the correct one (and this is | certainly true on, e.g., a dictionary) but, in order to obtain it, the | source text must be properly encoded with invisible separators inserted | where needed. Not necessarily. One solution I've seen automatically generated sort keys from the headwords, but allowed users to adjust them where necessary. I think users are likely to favour this solution if given a choice. Of course, it depends on how important it is to get the sorting right, and what importance the headwords have within the system whether this solution is feasible or not. In a phone directory I guess nobody would use it. | And this is precisely what I was trying to say, although I was not | necessarily talking about multilingual sort (dataarkiv seems a purely | Danish word, although derived from Latin roots). It's a simple concatenation of the words for 'computing' (data) and 'archive' (arkiv), meaning any electronic archive. This kind of construction is very common in Norwegian and Danish, leading speakers to invent all kinds of strange new words when writing English[1], and the Swedes to joke that we call bananas 'yellowbends'. --Lars M. [1] And, conversely, after learning English, to split apart words that God meant us to write without spaces in them. It really ann oys to see people write in that incon venient way.
Re: [OT] o-circumflex
On 09/10/2001 07:48:05 AM Michael \(michka\) Kaplan wrote: (can't believe this thread is still going on!) I just wanted to know about how Francophones perceive certain graphemes, and I got that answer a long time ago. Peter
Re: [OT] o-circumflex
It's as weird as some Italian names for German cities: Aquisgrana for Aachen, Augusta for Augsburg, Magonza for Mainz, Monaco (di Baviera) for München. MK Interesting that Polish names of these cities are more like Italian MK than German: Akwizgran, Augsburg, Moguncja, Monachium. Because they're adaptations of the mediaeval Latin names. The same is true of historically important Polish cities, by the way: Varsovie, Cracovie in French, Varsavia, Cracovia in Italian. English uses the German names instead (Warsaw, Cracow). Juliusz
RE: [OT] o-circumflex
Marco, When you're in Holland you may want to check some dictionaries too. You'll notice in dictionaries 'ij' is considered to consist of two letters 'i' and 'j', so the word 'ijs' sorts between 'iets' and 'ik'. You're right the PTT doesn't make the distinction between 'ij' and 'y', so in the phone book 'Meyer' and 'Meijer' are indeed near each other. I suspected they would at least first list all Meijers, then all Meyers, but when I just checked they appeared to be intermingled. On closer inspection it turned out the Meijers and Meyers are further sorted by street name! By the way, in crossword puzzles and the like, 'ij' always occupies one box (but isn't considered the same as 'y' I believe) Regards, Otmar Permentier -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Marco Cimarosti Sent: maandag 10 september 2001 19:59 To: 'Stefan Persson'; 'John Wilcock'; [EMAIL PROTECTED] Subject: RE: [OT] o-circumflex Stefan Persson wrote: I thought ij sorted after z? Not in Dutch: as far as I have seen it sorts the same as y. In fact, in the telephone directory many people who had an y in their surname listed near people who had the same surname spelled with ij (e.g. Meyer and Meijer). (Anyway, next time they send me to Holland, I'll ask for a downtown hotel. So, after dinner, I'll go sightseeing rather than spending the whole evening looking at the collation of the phone directory:-) _ Marco
Re: [OT] o-circumflex
AAARRRGGHHH I give up! I was hoping that there is SOME system that would give these cities UNIQUE names... postal codes??? じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town --- Original Message --- From: Thomas Chan [EMAIL PROTECTED]; To: [EMAIL PROTECTED]; Cc: Date: 01/09/10 19:59 Subject: Re: [OT] o-circumflex On Mon, 10 Sep 2001, てんどうりゅうじ wrote: If they can't agree on the pronunciation for these cities, can they agree on the Hanzi for them? What ARE the Hanzi for these cities, anyway?? Are you asking for the names of cities in Chinese? Copenhagen is ge1ben3ha1gen1 \u54e5\u672c\u54c8\u6839. The Han characters used to write the names of cities depend on many factors, including but not limited to source spelling/pronunciation, language/dialect of the rendering party, mapping rules used by the renderer, time period, etc. For example, New York is rendered in Chinese as Mandarin niu3yue4 \u7d10\u7d04, lit. 'button-appointment' (nauyeuk in Cantonese), while in Japanese it was at one time rendered as \u7d10\u80b2, lit. 'button-rearing'. Asking for the "hanzi" (from your wording, I don't think you are just talking about Chinese usage of Han characters) is like asking for a single Latin script rendering. (I think you need to get yourself an English-Chinese dictionary or something, btw...) Thomas Chan [EMAIL PROTECTED]
Re: [OT] o-circumflex
Way OT by now... AAARRRGGHHH I give up! I was hoping that there is SOME system that would give these cities UNIQUE names... postal codes??? Ain't reality a bitch? What you're looking for doesn't exist in the world of natural language names -- it can only exist in artificially constructed global geographic databases, where people may have assigned unique keys to cities. And even there, the geographic experts are going to argue over the exact meaning of terms. Is Los Angeles the incorporated city presided over by the mayor or does it include all the other small cities that Los Angeles surrounds and engulfs, or does it include unincorporated parts of Los Angeles county, or does it refer to Greater Los Angeles, the metropolitan area, or is it related to Los Angeles county? Not such a simple distinction, sometimes. San Francisco is a city *and* a county, and the mayor of the city is also mayor of the county. The mayor of New York is mayor of half a dozen boroughs, the moral equivalent of counties. Is Stonyford, California (population 150), a city? It isn't incorporated as a city, or even a town, but it is an independent geographic location that occurs as a town on maps. Where do you draw the line between named localities and cities? Do you depend on legally incorporated city status? But what if the laws don't match up between different countries? How am I going to know that cities in Burkina Faso match the same criteria I use to designate cities in the United States or Japan? Some cities have multiple postal codes, and some postal codes cover multiple cities. And while postal codes are subject to international treaty, how countries divide their territories up and use the codes is still up to them. --Ken
Re: [OT] o-circumflex/Spanish sorting
David, I also don't know if the other countries have academies, but my understanding is Latin American countries haven't accepted the modern sort. Having said that, there is a lot of software that does not implement the traditional sort, so acceptance is moot. (The reason the Real Academia Española did away with the sorting of ch and ll is that a majority of software wasn't implementing sorts that way.) tex David Gallardo wrote: Hi - I know the Real Academia Española decided to do away with ch and ll in 1994, but do you know if the other Spanish speaking countries' corresponding academies have done the same? - David Gallardo -- Tex Texin, Director, International Business, the Progress Company mailto:[EMAIL PROTECTED] Tel: +1-781-280-4271 Fax: +1-781-280-4655
Re: [OT] o-circumflex
On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote: Asmus, If you are entering Danish city names then enter it as Ålborg. You should only use Aalborg where the font does not support Å. For matching logic you can equate Å to Aa then the issue of compound words goes away. well, the official spelling of the town is Aalborg. Keld
Re: [OT] o-circumflex
- Original Message - From: Keld Jørn Simonsen [EMAIL PROTECTED] To: Carl W. Brown [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: den 9 september 2001 14:21 Subject: Re: [OT] o-circumflex On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote: Asmus, If you are entering Danish city names then enter it as Ålborg. You should only use Aalborg where the font does not support Å. For matching logic you can equate Å to Aa then the issue of compound words goes away. well, the official spelling of the town is Aalborg. In Sweden it has always been written Ålborg. _ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com
Re: [OT] o-circumflex/Spanish sorting
I received a private email stating that ch and ll were abolished by the 10th Congress of the 12 academies of the various Spanish speaking countries in 1994, not just the RAE. (There are, in addition to the obvious, also academies for Puerto Rico, North America and the Philippines.) However, it was also my understanding that the modern sort wasn't accepted outside of Spain, but it's never been clear to me if this is just a matter of popular or academic opinion, or if there has been formal resistance as well. Now I wonder if the various academies have the same authority in their country that the Royal Academy has in Spain, or if there are other national standards bodies with which they compete or cooperate. - David Gallardo - Original Message - From: Tex Texin [EMAIL PROTECTED] To: David Gallardo [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Sunday, September 09, 2001 2:15 AM Subject: Re: [OT] o-circumflex/Spanish sorting David, I also don't know if the other countries have academies, but my understanding is Latin American countries haven't accepted the modern sort. Having said that, there is a lot of software that does not implement the traditional sort, so acceptance is moot. (The reason the Real Academia Española did away with the sorting of ch and ll is that a majority of software wasn't implementing sorts that way.) tex David Gallardo wrote: Hi - I know the Real Academia Española decided to do away with ch and ll in 1994, but do you know if the other Spanish speaking countries' corresponding academies have done the same? - David Gallardo -- Tex Texin, Director, International Business, the Progress Company mailto:[EMAIL PROTECTED] Tel: +1-781-280-4271 Fax: +1-781-280-4655
Re: [OT] o-circumflex
On Sun, Sep 09, 2001 at 06:04:30PM +0200, Stefan Persson wrote: - Original Message - From: Keld Jørn Simonsen [EMAIL PROTECTED] To: Carl W. Brown [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: den 9 september 2001 14:21 Subject: Re: [OT] o-circumflex On Sat, Sep 08, 2001 at 06:38:57PM -0700, Carl W. Brown wrote: Asmus, If you are entering Danish city names then enter it as Ålborg. You should only use Aalborg where the font does not support Å. For matching logic you can equate Å to Aa then the issue of compound words goes away. well, the official spelling of the town is Aalborg. In Sweden it has always been written Ålborg. Yes, foreigners call our cities many strange things:-) København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more. Helsingør is called Elsinore. Well, Ålborg is sometimes spelled Ålborg, but the official spelling, as defined by zip and postal addresses is 9100 Aalborg, and the kommune is called Aalborg kommune, viz www.aalborg.dk . Århus is however almost always spelled Århus in Danish. Kind regards Keld
Re: [OT] o-circumflex
Keld Jørn Simonsen scripsit: Yes, foreigners call our cities many strange things:-) København is called Köpenhamn, Copenhagen, Kobenhagen, Copenhague, and many more. Helsingør is called Elsinore. None of which is as weird as Leghorn for Livorno (Italy). -- John Cowan http://www.ccil.org/~cowan [EMAIL PROTECTED] Please leave your values| Check your assumptions. In fact, at the front desk. | check your assumptions at the door. --sign in Paris hotel |--Miles Vorkosigan
Re: [OT] o-circumflex
In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, [EMAIL PROTECTED] writes: You are quite correct that is why Unicode support differing collation strengths. Some times you only care about the actual letters without diacritics. But even then letters are locale sensitive. For example the Danish alphabet starts with an A and ends it with A ring above. A Dane would look for Alborg near the end of a list of towns. It is like having the Spanish ch follow cz. That would be Ålborg, right? I hasten to add that Carl's Spanish example is for the so-called traditional sort, in contrast to the modern sort in which ch sorts simply as c followed by h. In many Spanish-speaking communities, particularly here in Alta California, the simplified modern sort is by far the more common of the two. -Doug Ewell Fullerton, California
RE: [OT] o-circumflex
Doug, -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED] Sent: Friday, September 07, 2001 10:52 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [OT] o-circumflex In a message dated 2001-09-07 17:19:49 Pacific Daylight Time, [EMAIL PROTECTED] writes: You are quite correct that is why Unicode support differing collation strengths. Some times you only care about the actual letters without diacritics. But even then letters are locale sensitive. For example the Danish alphabet starts with an A and ends it with A ring above. A Dane would look for Alborg near the end of a list of towns. It is like having the Spanish ch follow cz. That would be Ålborg, right? That is right. I am concerned that not everyone can view special characters. I think that having an alphabet that goes from A to Å must be due to the Danish sense of humor. I also did not use the İ in İstanbul. I hasten to add that Carl's Spanish example is for the so-called traditional sort, in contrast to the modern sort in which ch sorts simply as c followed by h. In many Spanish-speaking communities, particularly here in Alta California, the simplified modern sort is by far the more common of the two. Again correct, they also use the modern sort here in Muy Alta California as well as most of the Spanish speaking world. There are also the differences between ASCII and EBCDIC sorting. Talk about people who are worlds apart. ;-} Carl W. Brown Lafayette, CA
Re: [OT] o-circumflex
Hello. For example the Danish alphabet starts with an A and ends it with A ring above. A Dane would look for Alborg near the end of a list of towns. I was in Aalborg fifteen days ago, and I have seen its name written both as Ålborg and as Aalborg. Where does Aalborg appear in a list of towns? -francesco
Re: [OT] o-circumflex
At 09:04 PM 9/7/01 -0700, Mark Davis wrote: I disagree. What you want is a merged database field. See http://www.macchiato.com/slides/icu_collation.ppt Mark Mark, David took the remainder of our discussion off the alias. I won't repeat it here, just to note that we've agreed that merged database fields are the answer to (some) of the scenarios that we've discussed, but that there are cases (like indexing a mixed corpus where both naive and naïve occur) where it might indeed make sense to ignore accent differences altogether - although, as is often the case, dictionary-based pre- or post processing or manual adjustments might give better results yet. Thanks for your pointer to the presentation. A./
Re: [OT] o-circumflex
If you use a Danish tailoring of the UCA that equates Å and AA (at least at a primary and secondary level), then they will sort the same way. A string search that uses the same tailoring will also find Ålborg when given Aalborg (and vice versa). Mark BTW, internationalized string search is one of the features of ICU 2.0 (see http://www-124.ibm.com/icu/develop/tasks.html). There are a number of exceptional cases that have to be handled, due to issues with ignorable characters, Thai Lao boundaries, canonical equivalence and contractions (see http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/searchproposal.html). — Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: Francesco Zappa Nardelli [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, September 08, 2001 10:51 AM Subject: Re: [OT] o-circumflex Hello. For example the Danish alphabet starts with an A and ends it with A ring above. A Dane would look for Alborg near the end of a list of towns. I was in Aalborg fifteen days ago, and I have seen its name written both as Ålborg and as Aalborg. Where does Aalborg appear in a list of towns? -francesco
Re: [OT] o-circumflex
In a message dated 2001-09-08 12:00:43 Pacific Daylight Time, [EMAIL PROTECTED] writes: I know the Real Academia Española decided to do away with ch and ll in 1994, but do you know if the other Spanish speaking countries' corresponding academies done the same? I have no idea. I don't know which, if any, even have a language academy. -Doug Ewell Fullerton, California
Re: [OT] o-circumflex
At 02:45 PM 9/8/01 -0700, Mark Davis wrote: If you use a Danish tailoring of the UCA that equates Å and AA (at least at a primary and secondary level), then they will sort the same way. A string search that uses the same tailoring will also find Ålborg when given Aalborg (and vice versa). But if you do this, all compound words starting with data and continuing with another word starting with a will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) to break the latter, as these accidental pairs occur at legal word break points. In fact, that's the recommended solution, but it requires that the input data are in a specific form. A./
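[Editor's sketch] The disagreement between Mark and Asmus can be reproduced with a toy primary-level key in Python. This is not the UCA or any real Danish tailoring: the function name danish_key, the 29-letter alphabet string, and the treatment of unknown characters are assumptions for illustration. It treats aa as å, places æ ø å after z, and lets a soft hyphen (U+00AD) break an accidental aa, as Asmus describes.

```python
import unicodedata

SHY = "\u00ad"  # soft hyphen
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"  # toy Danish order, å last
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def danish_key(word):
    """Toy primary-level key: 'aa' counts as 'å' unless a SHY separates the two a's."""
    w = unicodedata.normalize("NFC", word).casefold()
    key = []
    i = 0
    while i < len(w):
        if w[i] == SHY:
            i += 1                      # separator carries no weight
        elif w.startswith("aa", i):
            key.append(RANK["å"])       # contraction aa -> å
            i += 2
        else:
            key.append(RANK.get(w[i], len(ALPHABET)))  # unknown chars go last
            i += 1
    return key

towns = sorted(["Viborg", "Aalborg", "Odense"], key=danish_key)
# -> ['Odense', 'Viborg', 'Aalborg']

# The compound data + arkiv, without and with a soft hyphen at the joint.
unmarked = sorted(["datåz", "datz", "dataarkiv", "datben", "data"], key=danish_key)
# -> ['data', 'datben', 'datz', 'dataarkiv', 'datåz']
marked = sorted(["datåz", "datz", "data" + SHY + "arkiv", "datben", "data"], key=danish_key)
# -> ['data', 'data\xadarkiv', 'datben', 'datz', 'datåz']
```

With the separator in place, dataarkiv files right after data; without it, the accidental aa drags the word to the end of the d-a-t group. The key also gives Aalborg and Ålborg identical weights, which is the matching behaviour Mark describes.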
RE: [OT] o-circumflex
Asmus, This discussion reminds me of my ill-fated efforts to produce a manageable set of rules to do automatic title casing starting with French text. It would have required either special dictionaries or entering the text in a special way. If special text was used, one could enter it in the proper title case to begin with. If you are entering Danish city names then enter it as Ålborg. You should only use Aalborg where the font does not support Å. For matching logic you can equate Å to Aa then the issue of compound words goes away. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Asmus Freytag Sent: Saturday, September 08, 2001 5:56 PM To: Mark Davis; [EMAIL PROTECTED]; Francesco Zappa Nardelli Subject: Re: [OT] o-circumflex At 02:45 PM 9/8/01 -0700, Mark Davis wrote: If you use a Danish tailoring of the UCA that equates Å and AA (at least at a primary and secondary level), then they will sort the same way. A string search that uses the same tailoring will also find Ålborg when given Aalborg (and vice versa). But if you do this, all compound words starting with data and continuing with another word starting with a will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) to break the latter, as these accidental pairs occur at legal word break points. In fact, that's the recommended solution, but it requires that the input data are in a specific form. A./
RE: [OT] o-circumflex
This is not always the right thing to do. For example, with personal names the person involved may decide whether he prefers the old (AA) spelling or the new Å. In any case they are equivalent. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Carl W. Brown Sent: Sunday, September 09, 2001 4:39 AM To: [EMAIL PROTECTED] Subject: RE: [OT] o-circumflex Asmus, This discussion reminds me of my ill-fated efforts to produce a manageable set of rules to do automatic title casing starting with French text. It would have required either special dictionaries or entering the text in a special way. If special text was used, one could enter it in the proper title case to begin with. If you are entering Danish city names then enter it as Ålborg. You should only use Aalborg where the font does not support Å. For matching logic you can equate Å to Aa then the issue of compound words goes away. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Asmus Freytag Sent: Saturday, September 08, 2001 5:56 PM To: Mark Davis; [EMAIL PROTECTED]; Francesco Zappa Nardelli Subject: Re: [OT] o-circumflex At 02:45 PM 9/8/01 -0700, Mark Davis wrote: If you use a Danish tailoring of the UCA that equates Å and AA (at least at a primary and secondary level), then they will sort the same way. A string search that uses the same tailoring will also find Ålborg when given Aalborg (and vice versa). But if you do this, all compound words starting with data and continuing with another word starting with a will be sorted incorrectly! To achieve this effect, you would have to mark which AAs are A-Rings and which ones are accidental adjacencies. In Danish one can use the SHY (soft hyphen) to break the latter, as these accidental pairs occur at legal word break points. In fact, that's the recommended solution, but it requires that the input data are in a specific form. A./
Re: [OT] o-circumflex
I would say it is a variant of o we just called it... o with a circumflex accent (o avec un accent circonflex). The difference between o and ô is normally audible (for a French speaker). The relationship is the same as with any other letter which sometimes has accents (e.g. a and à, e and è, etc.). o avec un accent circonflexe, with an e at the end. According to the Petit Robert (a French dictionary), the circumflex marks a long vowel (e.g. île, from isle in Old French) or avoids confusion between two words (e.g. du and dû). The pronunciation of ô is closed (o fermé), as opposed to o without an accent. But Thierry is right: it is a letter with an accent, like à and è, not a distinct grapheme. Bertrand The only little thing to know about French and diacritical marks is that when doing a sort, diacritical marks are evaluated from right to left (e.g. cote côte coté vs the English order cote coté côte). Cheers, Thierry How do Francophones view the o-circumflex ô in relation to the letter o? Is it a distinct grapheme, or is it considered a variant of o? - Peter
RE: [OT] o-circumflex
On Thu, 6 Sep 2001, Ayers, Mike wrote: From: David Starner [mailto:[EMAIL PROTECTED]] Sent: Thursday, September 06, 2001 01:40 PM On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: The only little thing to know about French and diacritical mark is that when doing a sort diacritical mark are evaluated from right to left. (e.g. cote côte coté vs the English order cote coté côte ). I'm not sure there is an established English sort order. It's not a problem that comes up much in English. I believe that there is an established sort order in English, which is to sort without regard to diacritics, or else we'd never find the words! In English (American English more than British English), diacritics are considered optional, and it is common to see naïve written naive, San José written San Jose, etc. Especially amongst Americans, the two are considered equivalent, and I know of no word pair in all of English which is separated only by a diacritic. Friday, September 7, 2001 Librarians have *filing* rules--the American Library Association (ALA) and the Library of Congress (LC) each issued some in, I think, 1980. I believe they both say to ignore diacritics because Americans do not recognize that they have an order. These days filing in vendor software for libraries tends to follow neither one very closely--the phrase "more honored in the breach than the observance" comes to mind. I may be wrong but I do not believe there is an established U.S. standard for sorting/filing. A few years ago a National Information Standards Organization (NISO) committee drafted one but it didn't get the votes needed to become an accepted standard. Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) The above are purely personal opinions, not necessarily the official views of any government or any agency of any. Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Re: [OT] o-circumflex
I believe that there is an established sort order in English, which is to sort without regard to diacritics, or else we'd never find the words! In English (American English more than British English), diacritics are considered optional, and it is common to see naïve written naive, San José written San Jose, etc. Especially amongst Americans, the two are considered equivalent, and I know of no word pair in all of English which is separated only by a diacritic. That depends what you mean by 'established' ;-) The classic example is 'resume' and 'résumé'. These are, by now, two quite distinct words, and the fact that there is no 'established' order is shown by the fact that the New Shorter Oxford English Dictionary (Version: 1.0.4, Data version: 02.10.96s, January 1997, on disk) has them in the order: 'résumé', 'resume' while the New Oxford Dictionary of English (Clarendon Press, 1998) has 'resume', 'resumé'. The Concise Oxford Dictionary (of Current English, Clarendon Press, 1982, edited, as it happens, by a second cousin of mine) also has 'resume', 'résumé'. We see here evidence that the diacritic on the first 'e' has become optional since 1982, though not that on the second, presumably because that 'e' might otherwise be supposed to be silent. Reverting to the question of order, the 'Guide to the New SOED' (a.k.a. Help) reveals that: <quote> Entries are accessed in strict alphabetical order. ... ; a headword with an accent or diacritic over a letter follows one consisting of the same sequence of letters without. ... The order of headwords which are spelled the same way but have different parts of speech is as follows: noun (abbreviated n.) pronoun (abbreviated pron.) adjective (abbreviated a.) verb (abbreviated v.) ... </quote> And scrutiny of the two entries of interest reveals that 'résumé' is both a noun and a verb, whereas 'resume' is only a verb. Perhaps the ordering of 'résumé' before 'resume' is a mistake; perhaps not.
I can't ask my aforesaid second cousin, because he's no longer with us. Who'd be a lexicographer? Mike. *** J M Sykes Email: [EMAIL PROTECTED] 97 Oakdale Drive Heald Green CHEADLE Cheshire SK8 3SN UK Tel: (44) 161 437 5413 ***
Re: [OT] o-circumflex
There is also no word pair separated only by the I/J distinction (in English), right? じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town I know of no word pair in all of English which is separated only by a diacritic.
Re: [OT] o-circumflex
じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town Who'd be a lexicographer? 私？ (Me?) Mike. *** J M Sykes Email: [EMAIL PROTECTED] 97 Oakdale Drive Heald Green CHEADLE Cheshire SK8 3SN UK Tel: (44) 161 437 5413 ***
RE: [OT] o-circumflex
From: J M Sykes [mailto:[EMAIL PROTECTED]] Sent: Friday, September 07, 2001 07:50 AM The classic example is 'resume' and 'résumé'. These are, by now, two quite distinct words, and the fact that there is no 'established' order is shown I spell both "resume" and have never been corrected. Words with the same spelling and different pronunciation are uncommon but exist in English, the classic example being "read" and its own past tense. Since there are no diacritics in English proper, the two resumes tend to fall into this category. The diacritics which often appear on one of them really only serve to mark it as a loan word, since it is very difficult to come up with a sentence in which the two could be confused. /|/|ike
Re: [OT] o-circumflex
As a practical matter, you need to take the diacritics into account when sorting, even in English where they (may or may not) have linguistic significance, otherwise you'll get nondeterministic behaviour. In other words, résumé and resume should fall together, but always in the same order. Someone in another message mentioned ñ. This is a different case in principle, because in Spanish it's not a case of a letter modified by a diacritic--it's an entirely different letter. (It used to be written as two side-by-side ns and then they got stacked.) Again, as a practical matter, in English it's most common to ignore the greater distinction (because we have only 26 letters in our alphabet) and to treat it as a letter + diacritic, for the same considerations as above. - Original Message - From: Ayers, Mike [EMAIL PROTECTED] To: 'David Starner' [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Thursday, September 06, 2001 5:12 PM Subject: RE: [OT] o-circumflex
RE: [OT] o-circumflex
There is also no word pair separated only by the I/J distinction (in English), right? iamb - as in iambic pentameter jamb - as in a door jamb
RE: [OT] o-circumflex
From: David Gallardo [mailto:[EMAIL PROTECTED]] Sent: Friday, September 07, 2001 10:07 AM As a practical matter, you need to take the diacritics into account when sorting, even in English where they (may or may not) have linguistic significance, otherwise you'll get nondeterministic behaviour. In other words, résumé and resume should fall together, but always in the same order. Why? This may be of interest and benefit to programmers, but not necessarily to end-users. The computer should serve the human, not the other way around, and it is not particularly challenging to come up with search and sort algorithms which understand the concept of terminal sets which need to be iterated over to find the final entity as opposed to terminal entities. Recall Mike Sykes' post concerning sort order: <MikeS> Reverting to the question of order, the 'Guide to the New SOED' (a.k.a. Help) reveals that: <quote> Entries are accessed in strict alphabetical order. ... ; a headword with an accent or diacritic over a letter follows one consisting of the same sequence of letters without. ... The order of headwords which are spelled the same way but have different parts of speech is as follows: noun (abbreviated n.) pronoun (abbreviated pron.) adjective (abbreviated a.) verb (abbreviated v.) ... </quote> </MikeS> This explicit ordering will still be insufficient if we choose to include verb tenses in our word list, whence we get the two reads. If someone has a reason why these two words need to be in the same order in everyone's word list, I'll listen... /|/|ike
Re: [OT] o-circumflex
From: David Gallardo [EMAIL PROTECTED] As a practical matter, you need to take the diacritics into account when sorting, even in English where they (may or may not) have linguistic significance, otherwise you'll get nondeterministic behaviour. In other words, résumé and resume should fall together, but always in the same order. Well, sort of. The issue remains that if one chooses, for a particular purpose, to ignore case (for example), then there is literally no difference between Aa and aA. Since the two are considered equivalent in the case-insensitive comparison, you cannot claim that a sorting algorithm has erred if it arbitrarily returns one before the other, or returns them in a different order on another occasion. For a real-world example, this can happen with algorithms where the bottom item and the anchor are always reordered if b a and thus you could see different ordering of items depending on their placement in the list. A similar thing happens with accent-insensitive sorts -- if you literally treat ee and éé as identical due to using an accent-insensitive sort, then the ordering is NOT deterministic, nor is it supposed to be. And there is nothing invalid about non-deterministic ordering of equivalent items: putting ee before éé in one case and after it in another is not an error. MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
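To make the point about equivalent items concrete, here is a small Python sketch using only the standard library. `strip_accents`, `insensitive_key` and `tie_broken_key` are hypothetical helpers written for this illustration, not part of any collation library. Python's `sorted` is documented to be stable, so items that compare equal come out in input order.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def insensitive_key(text):
    # Accent- and case-insensitive: "résumé" and "Resume" get the same key.
    return strip_accents(text).casefold()


def tie_broken_key(text):
    # Same first-level key, with the raw string as a final tie-breaker,
    # so the result no longer depends on the order of the input.
    return (insensitive_key(text), text)


first = sorted(["résumé", "resume", "apple"], key=insensitive_key)
second = sorted(["resume", "résumé", "apple"], key=insensitive_key)
# first and second disagree about the two equivalent items, and both are
# valid results of an accent-insensitive sort.
print(first, second)
print(sorted(["résumé", "resume", "apple"], key=tie_broken_key))
```

Whether the tie-breaker is worth having is exactly what the thread is arguing about: it buys a reproducible order, at the price of reintroducing a distinction the user asked to ignore.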
RE: [OT] o-circumflex
At 11:50 AM 9/7/01 -0500, Ayers, Mike wrote: Words with the same spelling and different pronunciation are uncommon but exist in English, the classic example being read and its own past tense. Actually, this is a bit more common than you think, since the pronunciation of vowels in English depends somewhat systematically on stress, and verb and noun forms of many words are stressed differently. A./
Re: [OT] o-circumflex
At 01:06 PM 9/7/01 -0400, David Gallardo wrote: As a practical matter, you need to take the diacritics into account when sorting, even in English where they (may or may not) have linguistic significance, otherwise you'll get nondeterministic behaviour. In other words, résumé and resume should fall together, but always in the same order. Stated absolutely, this is patent, but oft-repeated nonsense. For example, it does not always make sense for a list of names. An old friend of mine, Jon Proppe, who is an Icelandic art critic, spells his name with an accent grave on the first o and an acute accent on the e. In a campus directory of the US university he attended (assuming it did not strip the accents), it would make no sense to have his name show up after all the Proppes, or all the Jons without an accent (depending on whether it's sorted by first or last name). If I sort a list of single words which contains non-unique entries, a stable sort would sort the non-unique subsets in the order of their appearance in the input. If it's not important to distinguish between naive and naïve (e.g. in a machine-generated index that spans multiple documents with differences in the use of accents), it's hard to see what's gained in splitting the list in two for this case. On the other hand, if San Jose and San José are correctly and consistently distinguished in my input, they should probably sort separately. The two cases of resume are different yet again, as noted, since one could be a verb form. It all depends not on whether a distinction can be made, but whether it is meaningful in the context of the list being sorted. A./
RE: [OT] o-circumflex
Asmus, You are quite correct; that is why Unicode supports differing collation strengths. Sometimes you only care about the actual letters without diacritics. But even then letters are locale sensitive. For example, the Danish alphabet starts with A and ends with A with ring above (Å). A Dane would look for Ålborg near the end of a list of towns. It is like having the Spanish ch follow cz. By providing for different types of collation one can meet the user's expectations. Then of course you have search, display and sort differences. If I am looking for Istanbul it is probably OK even for Turkish locales to match it to the Turkish spelling, which uses a dotted capital I. With languages with multiple diacritics like Vietnamese you have another set of rules and had better have normalized data. In Arabic do you include vowels or not? I remember your discussions of Greek where there are other considerations. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Asmus Freytag Sent: Friday, September 07, 2001 11:51 AM To: David Gallardo; Ayers, Mike; 'David Starner'; [EMAIL PROTECTED] Subject: Re: [OT] o-circumflex
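The idea of collation strengths can be sketched with a toy multi-level key in Python. This is only an illustration under simplifying assumptions: `collation_key` is a hypothetical function, it compares accents by code point rather than by real collation weights, and it knows nothing about locales such as Danish or Turkish.

```python
import unicodedata


def collation_key(text, strength=3):
    """Toy key: strength 1 = letters only, 2 = plus accents, 3 = plus case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]

    # Level 1: the base letters, with case folded away.
    primary = "".join(base).casefold()

    # Level 2: the combining marks attached to each base letter, in order.
    secondary = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += ch
        else:
            secondary.append("")

    # Level 3: lower case sorts before upper case (False < True).
    tertiary = [ch.isupper() for ch in base]

    # Keep only as many levels as the caller asked for.
    return (primary, secondary, tertiary)[:strength]


names = ["resume", "Resume", "résumé"]
for level in (1, 2, 3):
    print(level, sorted(names, key=lambda name: collation_key(name, level)))
```

At strength 1 all three names are equivalent, at strength 2 only the accented one separates, and at strength 3 case breaks the remaining tie. Which level is right depends, as Asmus says, on the list being sorted.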
Re: [OT] o-circumflex
As a percentage of words in English, it is quite small, but there are still plenty of homographs, such as: BASS BOW(S) BUFFET COAX CLOSE COMPOUND(S) CONVERSE DESERT DIVERS DOES DOVE ENTRANCE(S) EXCISE HARE INTIMATE INVALID LAME LEAD LUGER(S) MANES MARE(S) MINUTE OBJECT(S) PATENT POLISH PRESENT PRIMER(S) PROJECT(S) PUSSY PUTTING RAVEN RE REFUSE RESIGN(S) RESUME(S) ROW(S) SEWER(S) SHOWER(S) SLAVER SOW(S) SYNDICATE(S) TAXIS TEAR(S) TIER(S) TOWER(S) VIOLA(S) WIND(S) WOUND ABSENT ABSTRACT ABUSE(S) ADDRESS(ES) ADVOCATE(S) AGGREGATE APPROPRIATE APPROXIMATE ARTICULATE ASSOCIATE(S) ATTRIBUTE(S) COMBAT COMBINE(S) COMPACT(S) COMPLEX CONDUCT CONFINES CONFLICT(S) CONSORT CONSTRUCT(S) CONTENT CONTEST(S) CONTRACT(S) CONSUMMATE CONVERT(S) CONVICT(S) COORDINATE(S) DECREASE(S) DEFECT(S) DEGENERATE(S) DELEGATE(S) DELIBERATE DISCHARGE DOGGED EJACULATE ELABORATE ESCORT(S) EXCUSE(S) ESTIMATE(S) EXTRACT(S) GRADUATE(S) HOUSE(S) IMPLANT(S) IMPORT(S) INCLINE(S) LAMINATE(S) LEARNED LEGITIMATE LIVE(S) [-]LIVED MEDIATE(S) MOBILE (3) MODERATE(S) MOUTH OFFENSE(S) PERFECT PERMIT(S) PREDICATE(S) PRODUCE PROGRESS PROTEST(S) READ (mis-, proof-) RECALL(S) RECORD(S) REDRESS REJECT(S) RETARD(S) RETREAD(S) ROUTE(S) SEPARATE SUBJECT(S) SUSPECT(S) TORMENT(S) UPSET(S) USE(S) — Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: Asmus Freytag [EMAIL PROTECTED] To: Ayers, Mike [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Friday, September 07, 2001 11:52 Subject: RE: [OT] o-circumflex At 11:50 AM 9/7/01 -0500, Ayers, Mike wrote: Words with the same spelling and different pronunciation are uncommon but exist in English, the classic example being read and its own past tense. Actually, this is a bit more common than you think, since the pronunciation of vowels in English depends somewhat systematically on stress, and verb and noun forms of many words are stressed differently. A./
Re: [OT] o-circumflex
I disagree. What you want is a merged database field. See http://www.macchiato.com/slides/icu_collation.ppt Mark — Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Όμήρου Μαργίτῃ [http://www.macchiato.com] - Original Message - From: Asmus Freytag [EMAIL PROTECTED] To: David Gallardo [EMAIL PROTECTED]; Ayers, Mike [EMAIL PROTECTED]; 'David Starner' [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Friday, September 07, 2001 11:50 Subject: Re: [OT] o-circumflex
Re: [OT] o-circumflex
Is it a distinct grapheme, or is it considered a variant of o? I would say it is a variant of o; we just called it... o with a circumflex accent (o avec un accent circonflex). The difference between o and ô is normally audible (for a French speaker). The relationship is the same as with any other letter which sometimes has accents (e.g. a and à, e and è, etc.). The only little thing to know about French and diacritical marks is that when doing a sort, diacritical marks are evaluated from right to left (e.g. cote côte coté vs the English order cote coté côte). I'm just talking as a French Francophone, not a linguist. Maybe someone on this list knows why diacritical marks are sorted in French in such a funky way :). Cheers, Thierry www.i18ngurus.com - Open Internationalization Resources Directory - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, September 06, 2001 3:08 PM Subject: [OT] o-circumflex How do Francophones view the o-circumflex ô in relation to the letter o? Is it a distinct grapheme, or is it considered a variant of o? - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
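The right-to-left rule Thierry describes can be sketched in a few lines of Python. This is a toy model, not a real French collation: `base_letters`, `accent_marks`, `english_key` and `french_key` are hypothetical helpers, and accents are compared by code point. Base letters are compared first; only when they tie are the accents consulted, front to back for the "English" order and back to front for the French one.

```python
import unicodedata


def base_letters(word):
    """The word with combining marks removed and case folded."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


def accent_marks(word):
    """One entry per base letter: the combining marks attached to it."""
    marks = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if marks:
                marks[-1] += ch
        else:
            marks.append("")
    return marks


def english_key(word):
    # Accent differences are weighed from the start of the word.
    return (base_letters(word), accent_marks(word))


def french_key(word):
    # Accent differences are weighed from the end of the word.
    return (base_letters(word), accent_marks(word)[::-1])


words = ["coté", "côte", "cote", "côté"]
print(sorted(words, key=english_key))  # ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=french_key))   # ['cote', 'côte', 'coté', 'côté']
```

The only difference between the two keys is the reversal of the accent list, which is enough to swap côte and coté.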
Re: [OT] o-circumflex
On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: The only little thing to know about French and diacritical mark is that when doing a sort diacritical mark are evaluated from right to left. (e.g. cote côte coté vs the English order cote coté côte ). I'm not sure there is an established English sort order. It's not a problem that comes up much in English. -- David Starner - [EMAIL PROTECTED] Pointless website: http://dvdeug.dhis.org I don't care if Bill personally has my name and reads my email and laughs at me. In fact, I'd be rather honored. - Joseph_Greg
Re: [OT] o-circumflex
My impression is that, at least in U.S. states that are more heavily populated by native Spanish speakers, the one diacritic that English speakers frequently view as non-optional for differentiating two words (specifically proper names) is the tilde, as used for the eñe. There is a college in Redwood City, CA, called Cañada College, which is off Cañada Road. I haven't checked thoroughly, but I believe most road signs there use the eñe. I do know of one highway exit in the area, though, which spells it Canada College. Alex.
Re: [OT] o-circumflex
David Starner wrote: Yes, but I mean for cote, côte, and coté. How would you sort those three in English? I'd probably sort it by some extra-lingual information: i.e. page number, date of birth or the like. Store them as UTF-8, do a DOS sort, and call the results the new World order? Best regards, James Kass.
RE: [OT] o-circumflex
Sorry about the kana. My mailer is Japanese. じゅういっちゃん (Juuitchan) Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town --- Original Message --- Sender: "Ayers, Mike" [EMAIL PROTECTED]; To: 'David Starner' [EMAIL PROTECTED];[EMAIL PROTECTED]; Cc: Date: 01/09/06 21:12 Subject: RE: [OT] o-circumflex From: David Starner [mailto:[EMAIL PROTECTED]] Sent: Thursday, September 06, 2001 01:40 PM On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote: The only little thing to know about French and diacritical mark is that when doing a sort diacritical mark are evaluated from right to left. (e.g. "cote" "côte" "coté" vs the English order "cote" "coté" "côte" ). I'm not sure there is an established English sort order. It's not a problem that comes up much in English. I believe that there is an established sort order in English, which is to sort without regard to diacritics, or else we'd never find the words! In English (American English more than British English), diacritics are considered optional, and it is common to see "naïve" written "naive", "San José" written "San Jose", etc. Especially amongst Americans, the two are considered equivalent, and I know of no word pair in all of English which is separated only by a diacritic. I believe that the origin of the problem is the typewriter / word-processor. The English typewriter / word-processor is only designed to handle 26 letters (52 if you count case). Diacritics are impossible on a typewriter and very difficult on a word processor. In handwriting, the problem is non-existent. Think of Tendou Kasumi getting the medical scholarship she always wanted, and getting to study abroad. She would likely e-mail her old friends / family in romaji, but snail-mail them in kana / kanji. I like the freedom of a pen, so I can write kana and even draw.
As for your word pair: 1. To continue after a pause 2. Curriculum vitae If only technology did not change the way we write like it does. And why should not "o with accent" be considered as different from "o" as either is, say, from "u"? If that is the case: "R" is "P with stroke" (hiragana) "Ho" is "ha with stroke" "Ru" is "Ro with loop" (Thai) "five" is "four with loop" and... my favorite... Latin "G" is "C" with stroke, and history WILL back me on that one! /|/|ike