2010/1/1 Paul Gilmartin <[email protected]>:
> On Thu, 31 Dec 2009 15:28:16 -0600, McKown, John wrote:
>>
>>I guess the order is aAbBcCdD and so on.
>>
> Actually, no.  Not according to a couple dictionaries I glanced at,
> and OpenSolaris:
>
>    509 $ ls -1
>    castor
>    Castor
>    castor bean
>    510 $
>
> What does Linux do?
>
> The technique appears to be: First sort as if entirely case-insensitive;
> only then resolve any ties by considering the case of the characters.
>
> Which is why I suggested keeping all alphabetic characters in a single
> case, followed by a bitmap identifying the case of the characters.
> Case-insensitive lookup would ignore the bitmap; case sensitive would
> consider it.
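
A minimal sketch of the quoted single-case-plus-bitmap idea, in Python.
The function name encode is hypothetical, and the sketch assumes plain
ASCII names, where lowercasing never changes the length of a string. It
is an illustration, not a description of any real file system.

```python
def encode(name):
    """Split a name into (single-case text, case bitmap).

    One bit per character, first character in the most significant
    position; a set bit means that character was upper case.
    """
    folded = name.lower()
    bitmap = 0
    for ch in name:
        bitmap = (bitmap << 1) | ch.isupper()
    return folded, bitmap


def same_name_ignoring_case(a, b):
    # Case-insensitive lookup compares the folded text and ignores the bitmap.
    return encode(a)[0] == encode(b)[0]


names = ["castor bean", "Castor", "castor"]

# Sorting on (folded text, bitmap) orders case-insensitively first and
# uses case only to break ties, lower case ahead of upper case.
print(sorted(names, key=encode))
```

Sorting on the pair reproduces the order in the ls listing above:
castor, Castor, castor bean.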

You really can't do a proper text sort by ordering individual byte
values. The usual approach, pioneered about 15 years ago by some smart
people at IBM's National Language Centre in Toronto, and one smart guy
in the Quebec government, is to assign several sort keys to each
string, based on the character value, the case, the accent, and
"special" weighting. Then you sort on those keys instead of the
original strings. Of course you can precalculate the sort keys or
compute them on the fly, depending on the performance and storage
tradeoffs.
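
A toy sketch of the multi-key idea, in Python. Everything in it is an
assumption made for illustration: the name sort_key, the LIGATURES
table, the level order (base letters, then accents, then case, then
"special" characters), and the choice to treat every non-alphanumeric
character as "special". It is not the scheme referred to above and not
a standard collation algorithm.

```python
import unicodedata

# Hypothetical expansion table: NFD does not split these letters.
LIGATURES = {"æ": "ae", "œ": "oe"}


def sort_key(text):
    """Return a four-level key: (base letters, accents, case, specials)."""
    text = unicodedata.normalize("NFC", text)
    base, accents, case, special = [], [], [], []
    for pos, ch in enumerate(text):
        if not ch.isalnum():
            # Spaces, hyphens and the like only matter at the last level.
            special.append((pos, ch))
            continue
        parts = unicodedata.normalize("NFD", ch)
        marks = tuple(c for c in parts if unicodedata.combining(c))
        letter = "".join(c for c in parts if not unicodedata.combining(c))
        lowered = letter.lower()
        for c in LIGATURES.get(lowered, lowered):
            base.append(c)
            accents.append(marks)
            case.append(letter.isupper())
    return "".join(base), tuple(accents), tuple(case), tuple(special)


words = ["castor-oil", "castor bean", "Castor", "castor"]
print(sorted(words, key=sort_key))
print(sorted(["Noël", "Noel"], key=sort_key))
```

The keys can be computed once and stored next to each string, or
recomputed on every sort as here. Under these assumed weights Caesar
and Cæsar get identical keys, and Noel sorts ahead of Noël.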

Sorting is a cultural thing (where "culture" can include C programming
as much as French-in-France, French-in-Canada, English, German, etc.),
and each culture may have multiple sort orders appropriate for
different circumstances. For example French dictionaries have a
different order from French phonebooks; a French phonebook user may
expect to find the name duPont under P, not under D. Even in English,
where do you expect to find castor-oil in the list above? Surely the
hyphen should be given lower weighting than even the letters that
follow it, so that it comes out after castor bean. How about Caesar vs
Cæsar or Noel vs Noël? Google search knows that they are the same
thing, but Gmail flunks the latter in its spelling checker. What does
the "ls" command think?

Tony H.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html
