At 01:06 PM 9/7/01 -0400, David Gallardo wrote:
>As a practical matter, you need to take the diacritics into account when
>sorting, even in English where they (may or may not) have linguistic
>significance, otherwise you'll get nondeterministic behaviour. In other
>words, résumé and resume should fall together, but always in the same order.

Stated absolutely, this is patent but oft-repeated nonsense. For example, 
it does not always make sense for a list of names. An old friend of mine, Jon 
Proppe, who is an Icelandic art critic, spells his name with an accent 
grave on the first o and an acute accent on the e. In a campus directory of 
the US university he attended (assuming it did not strip the accents), it 
would make no sense to have his name show up after all the Proppes, or all 
the Jons without an accent (depending on whether it's sorted by first or 
last name).

If I sort a list of single words which contains non-unique entries, a 
stable sort would sort the non-unique subsets in the order of their 
appearance in the input. If it's not important to distinguish between naive 
and naïve (e.g. in a machine-generated index that spans multiple documents 
with differences in the use of accents), it's hard to see what's gained in 
splitting the list in two for this case.
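
To make the stable-sort case concrete, here is a minimal sketch in Python. 
It assumes that "not distinguishing" means comparing on a key with the 
combining marks removed; the helper name strip_accents is mine, and this 
is an illustration, not a full collation implementation.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper: decompose to NFD, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["naïve", "resume", "naive", "résumé", "naïve"]

# sorted() is stable, so entries whose keys compare equal keep their
# input order; accented and unaccented spellings stay in one group.
merged = sorted(words, key=lambda w: strip_accents(w).casefold())
print(merged)
```

The entries that differ only in accents end up adjacent, in the order 
they appeared in the input, and the sort is still fully deterministic.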

On the other hand, if San Jose and San José are correctly and consistently 
distinguished in my input, they should probably sort separately.
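
For this case, a sketch of one way to get separate runs, again in Python. 
It assumes a two-part key: the accent-stripped form first, then the 
original string as a tie-breaker. The tie-breaker here is plain code point 
order, which only stands in for a proper secondary collation weight, and 
the helper name strip_accents is mine.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper: decompose to NFD, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

places = ["San José", "San Jose", "San Diego", "San José", "San Jose"]

# Primary key ignores accents; the original string breaks ties, so the
# accented and unaccented spellings form separate, consistent runs.
ordered = sorted(places, key=lambda p: (strip_accents(p).casefold(), p))
print(ordered)
```

Whether to include the second key component is exactly the choice that 
depends on the list being sorted.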

The two cases of resume are different yet again, as noted, since one could 
be a verb form.

It all depends not on whether a distinction can be made, but whether it is 
meaningful in the context of the list being sorted.

A./




