Re: Sorting and combining diacritical marks

Markus Kuhn Wed, 08 Nov 2000 06:51:04 -0800

Keld Simonsen wrote on 2000-11-08 14:20 UTC:
> > btw, you mean that the regular expressions are locale dependent?!
> > i thought they should be handled by the POSIX locale to be portable.
> 
> Yes, they are locale dependent according to the POSIX standards.

That is a very tricky area. The underlying problem with regular
expressions (where you split up everything into substrings) is that, in
contrast to the traditional naive strcmp(), ISO string comparison is no
longer a homomorphism with regard to concatenation. That is very
unfortunate in many places (esp. regular expressions), but unavoidable
if you want to be compatible with the multi-pass sorting traditions of
dictionary publishers.

In other words, if you split up (for simplicity: equally long) strings A
and B into two substrings

  A = concat(A1, A2)
  B = concat(B1, B2)   with length(A1) = length(B1) and length(A2) = length(B2)

then in the C locale, you are guaranteed the nice and important property

  strcmp(A, B) < 0  <=>  strcmp(A1, B1) < 0 || (strcmp(A1, B1) == 0 &&
                                                strcmp(A2, B2) < 0)

That is:

  a) if the prefixes differ, they determine fully the outcome
     of the comparison

  b) if the prefixes are equal, the remainders fully determine
     the outcome of the comparison
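
As a quick sanity check of this property, here is a small Python sketch.
It assumes that Python's built-in string comparison (by code point) is an
acceptable stand-in for strcmp() in the C locale on ASCII input; the
helper names cmp3 and split_rule_holds are made up for this illustration.

```python
from itertools import product


def cmp3(x: str, y: str) -> int:
    """Three-way comparison by code point (stand-in for strcmp)."""
    return (x > y) - (x < y)


def split_rule_holds(a: str, b: str, k: int) -> bool:
    """Check the prefix/remainder rule for a split after k characters."""
    a1, a2 = a[:k], a[k:]
    b1, b2 = b[:k], b[k:]
    whole = cmp3(a, b) < 0
    parts = cmp3(a1, b1) < 0 or (cmp3(a1, b1) == 0 and cmp3(a2, b2) < 0)
    return whole == parts


# Exhaustively test all equally long strings over a tiny alphabet.
words = ["".join(p) for p in product("Aab", repeat=4)]
all_ok = all(
    split_rule_holds(a, b, k)
    for a in words
    for b in words
    for k in range(5)
)
print(all_ok)
```

The exhaustive loop is only feasible because the alphabet and the string
length are tiny; it is a check on examples, not a proof.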

This property is in essence what makes the range notation in regular
expressions useful, because by selecting a range of characters in some
prefix of a string, you are guaranteed to select a consecutive sequence
of strings in a sorted list. With multi-pass comparison algorithms like
ISO, this property no longer holds, and that is why many people
(including myself) prefer to hold on to the old naive "C" strcmp()
sorting order until we get a nicer single-pass sorting algorithm that is
guaranteed to be a homomorphism over string concatenation.
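
To make the failure concrete, here is a toy two-pass comparison in
Python. It is only a stand-in for a real multi-pass collation and rests
on an assumption of mine: pass one compares the strings with case folded
away, and pass two uses the raw code points as a tie-breaker. The
function names are hypothetical.

```python
def two_pass_key(s: str) -> tuple:
    """Toy multi-pass key: letters first (case ignored), raw string as tie-breaker."""
    return (s.lower(), s)


def less(x: str, y: str) -> bool:
    return two_pass_key(x) < two_pass_key(y)


def equal(x: str, y: str) -> bool:
    return two_pass_key(x) == two_pass_key(y)


a1, a2 = "A", "b"
b1, b2 = "a", "a"

# Compare the whole strings "Ab" and "aa" ...
whole = less(a1 + a2, b1 + b2)
# ... and compare them piecewise, prefix first, remainder as tie-breaker.
parts = less(a1, b1) or (equal(a1, b1) and less(a2, b2))

print(whole, parts)
```

The prefixes "A" and "a" differ only at the second pass, so the piecewise
rule lets them decide the outcome, while the whole-string comparison has
already been decided at the first pass by "b" versus "a". The two
results disagree.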

The "cultural correctness" of the ISO is unfortunately restricted to the
community of dictionary and telephone book publishers, and does not
necessarily include experienced Unix users and other regular expression
artists. The ISO does make sense in that it tries to move uncertain
knowledge of a word that I try to look up to the end of the sorting
priority list (accents, case, etc.) such that I am likely to find the
word nearby in a sorted list, but that comes at the cost of breaking the
above property.

All this is for me an important reason for not overusing the ISO (=
International Sorting Order) in POSIX. Fortunately, the LC_COLLATE
environment variable allows you to deactivate the ISO, even if you use a
UTF-8 locale.

My preferred setting:

  export LC_COLLATE=C LANG=en_GB.UTF-8

and then I get my preferred traditional culturally-correct hardcore Unix
sorting order (strict left-to-right UCS order, capitals first, funny
non-ASCII things last), while enjoying the other benefits of a modern
locale setting (UTF-8, lots of lovely redundant u's in "colour", etc.).
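
For illustration, the resulting order can be mimicked in Python, whose
default string comparison is by code point. This is a sketch, not a call
into the C library, and the sample words are made up. It also compares
the UTF-8 encodings byte by byte, which is what strcmp() would see, to
show that both views give the same order.

```python
words = [
    "colour",
    "Zebra",
    "\u00e9migr\u00e9",  # "emigre" spelled with precomposed U+00E9
    "apple",
]

# Python compares str objects code point by code point.
by_code_point = sorted(words)

# Byte-wise comparison of the UTF-8 encodings, as strcmp() would see them.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)
```

Capitals come first and the non-ASCII word comes last, and the byte-wise
order agrees with the code-point order because UTF-8 preserves it.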

> > so you mean that someone should update the glibc data file with recent
> > data?
>
> Yes, possibly. But Ulrich is very reluctant in doing this as
> it will have the results as explained above.

I don't see how holding on to an outdated configuration table can fix
that problem. If you prefer [a-c]* to only select consecutive strings in
a sorted list (such as the output of ls) to avoid surprises, then I think
the proper solution is LC_COLLATE=C.
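
A small Python sketch of that surprise, under stated assumptions:
fnmatch.fnmatchcase() stands in for the shell glob and always reads
[a-c] as a code-point range, and the key (name.lower(), name) is a toy
stand-in for a dictionary-style collation, not the real glibc tables.
The file names and the helper is_consecutive are hypothetical.

```python
from fnmatch import fnmatchcase

names = ["Apple", "apple", "Banana", "banana", "cherry", "Date", "date"]


def is_consecutive(sorted_names, pattern="[a-c]*"):
    """True if the names matching pattern form one unbroken run in the list."""
    hits = [i for i, n in enumerate(sorted_names) if fnmatchcase(n, pattern)]
    return hits == list(range(hits[0], hits[0] + len(hits))) if hits else True


# Code-point order, as with LC_COLLATE=C.
c_order = sorted(names)
# Toy dictionary order: case ignored first, raw string as tie-breaker.
toy_order = sorted(names, key=lambda n: (n.lower(), n))

print(c_order, is_consecutive(c_order))
print(toy_order, is_consecutive(toy_order))
```

In the code-point order the three matches sit next to each other; in the
toy dictionary order "Banana" lands between "apple" and "banana" and
breaks the run.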

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/