Re: bug in join: case comparisons don't work in multibyte locales

Jim Meyering Wed, 11 Mar 2009 01:09:39 -0700

Bruno Haible wrote:
> In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
> case insensitive comparison of the input lines does not work in multibyte
> locales. And indeed, in an UTF-8 locale, I see this:
...
> Find attached a draft patch for the 'join' program, that fixes the bug
> mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It
> is not ready to apply, because there are three big questions:
>
> 1) Which functions to use for case comparison in coreutils?
>
>    The difference between mbmemcasecmp and ulc_casecmp (or between
>    mbmemcasecoll and ulc_casecoll) is:
>    mbmemcasecmp treats only English and a few European languages correctly,
>      - Turkish i / I is halfway correct, but not fully,
>    whereas ulc_casecmp handles all known specialities of languages:
>      - Turkish i / I is fully correct,
>      - German ß is equivalent to ss,
>      - Croatian and Bosnian: Characters with 3 forms, such as DZ dz Dz, are
>        considered equivalent,
>      - Greek final sigma (lowercase) is considered equivalent to uppercase
>        sigma (there is no difference between final and non-final sigma in
>        upper case),
>      - Lithuanian soft-dot,
>      - etc.
>
>    I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
>    correct".
>
>    The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
>    have some assumptions built-in that are not valid in some languages:
>      - It assumes that there is only uppercase and lowercase - not true for
>        DZ dz Dz.
>      - It assumes that uppercasing of 1 character leads to 1 character - not
>        true for German ß.
>      - It assumes that there is 1:1 mapping between uppercase and lowercase
>        forms - not true for Greek sigma.
>      - It assumes that the upper/lowercase mappings are position independent -
>        not true for Greek sigma and Lithuanian i.
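
The assumptions listed above can be seen in a small sketch. This is not
the gnulib code: simple_lower() is a hypothetical stand-in for any API
that maps one character to exactly one character, and Python's
str.casefold() stands in for full Unicode case folding. casefold() is
locale-independent, so it does not cover the Turkish or Lithuanian cases.

```python
def simple_lower(s):
    # Hypothetical per-character mapping that must return exactly one
    # character, mimicking the shape of a towlower()-style API.
    # Characters whose lowercase form is longer are left unchanged.
    return "".join(c.lower() if len(c.lower()) == 1 else c for c in s)


def simple_equal(a, b):
    # Case-insensitive comparison built on the 1:1 mapping only.
    return simple_lower(a) == simple_lower(b)


def full_equal(a, b):
    # str.casefold() applies Unicode full case folding, which may map
    # one character to several (e.g. German sharp s to "ss").
    return a.casefold() == b.casefold()


# German sharp s, then Greek final sigma against capital sigma.
for a, b in [("straße", "STRASSE"), ("ς", "Σ")]:
    print(a, b, "simple:", simple_equal(a, b), "full:", full_equal(a, b))

# Position dependence: lowercasing the whole word can produce a final
# sigma, which a character-by-character mapping cannot do.
print("ΟΔΟΣ".lower() == simple_lower("ΟΔΟΣ"))
```

In both pairs the 1:1 comparison reports a mismatch while the full
folding reports a match, and the last line shows that whole-string and
per-character lowercasing disagree on the Greek word.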


Hi Bruno,

Wow.  Thanks for all that work.
I prefer the "correct" approach, especially since I believe that will
eventually align with POSIX, even if it doesn't match the current intent
(I don't know).

> 2) There is a problem with the case comparison in "sort -f": POSIX specifies
>    how this option should behave, in terms of the old POSIX terms
>    ("all lowercase characters that have uppercase equivalents").
>
>    How to deal with that?
>      a) Use mbmemcasecmp for the option -f, and introduce a long option that
>         works with ulc_casecmp?
>      b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
>         and ulc_casecmp otherwise?

How about a third approach?

  Use ulc_casecmp unconditionally (assuming it's available), and resort
  to adding POSIXLY_CORRECT if enough people complain *and* if somehow
  POSIX cannot be changed to accommodate the correct behavior.
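
If the POSIXLY_CORRECT fallback ever became necessary, the runtime
switch from option b) might look roughly like the sketch below. The
function names are hypothetical, and the two comparators are only
Python stand-ins for mbmemcasecmp and ulc_casecmp, not their actual
behavior.

```python
import os


def simple_equal(a, b):
    # Stand-in for the POSIX-style comparison (mbmemcasecmp in the patch).
    return a.lower() == b.lower()


def full_equal(a, b):
    # Stand-in for the full Unicode comparison (ulc_casecmp in the patch).
    return a.casefold() == b.casefold()


def pick_comparator(environ=None):
    # Select the comparison once at startup: the POSIX-style one only
    # when POSIXLY_CORRECT is set, the full Unicode one otherwise.
    env = os.environ if environ is None else environ
    return simple_equal if "POSIXLY_CORRECT" in env else full_equal


compare = pick_comparator()
print(compare("straße", "STRASSE"))
```

Passing the environment as a parameter keeps the selection testable
without touching the real process environment.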

> 3) There is also a problem with the executable size: the ulc_casecmp (and
>    ulc_casecoll) functions are implemented using a couple of tables. I
>    squeezed them already, while still guaranteeing O(1) time for each
>    access. Most of the tables are about 10 KB large, the largest one ca.
>    45 KB.
>    But it sums up:
>
>             join executable              size (decimal)
>
>        coreutils-7.1 unmodified             35436
>
>        with mbmemcasecmp                    36473
>
>        with ulc_casecmp                    174336
>
>        with ulc_casecmp and mbmemcasecmp   176521
>        (switched at runtime)
>
>    When an executable grows from 35 KB to 175 KB, just for correct string
>    comparisons, some people will certainly complain. Especially embedded
>    developers, like the busybox guys, try to reduce total executable size.
>    And that's not only about 'join', it's ultimately about every coreutils
>    program that has an option to perform case-insensitive comparisons on
>    user's data.
>
>    How to deal with that?
>      a) Add a configure option --disable-extra-i18n, that will refrain from
>         using the ulc_casecmp function?
>      b) Let coreutils build and install a shared library for these large
>         modules?
>      c) Should these Unicode string functions be packaged externally to
>         coreutils, and coreutils can link to it as an external dependency
>         (like it does for libiconv, libintl, libacl, etc.)?

c) would be great.  The size issue is non-negligible, even if
it's just four programs.  Besides, I'd like to keep coreutils
out of the shared-library-creation/installation business.

BTW, your patch looked impeccable.
