Hi Jim,

In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
case-insensitive comparison of the input lines does not work in multibyte
locales. And indeed, in a UTF-8 locale, I see this:

  $ cat > in1 <<EOF
  müsste
  EOF
  $ cat > in2 <<EOF
  MÜSSTE
  EOF
  $ join -i in1 in2
  [empty result]

The expected result is:

  $ join -i in1 in2
  müsste

Similarly, with a German word in lower and upper case:

  $ cat > in1 <<EOF
  Ruß
  EOF
  $ cat > in2 <<EOF
  RUSS
  EOF
  $ join -i in1 in2
  [empty result]

The expected result is:

  $ join -i in1 in2
  Ruß

Before going on, let me summarize the case comparison functions for strings
that we have available with gnulib:

                      | on NUL terminated    | on memory areas or
                      | strings              | strings with embedded NULs
----------------------+----------------------+---------------------------
For ASCII strings     | c_strcasecmp,        |
only                  | STRCASEEQ            |
----------------------+----------------------+---------------------------
For unibyte locales   | strcasecmp           | memcasecmp
only                  |                      |
----------------------+----------------------+---------------------------
Support for multibyte | mbscasecmp           | mbmemcasecmp
locales               |                      |
  --------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecmp
----------------------+----------------------+---------------------------
Support for multibyte |                      | mbmemcasecoll
locales and locale    |                      |
collation order       |                      |
  --------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecoll
----------------------+----------------------+---------------------------

Find attached a draft patch for the 'join' program, that fixes the bug
mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It is
not ready to apply, because there are three big questions:

1) Which functions to use for case comparison in coreutils?

   The difference between mbmemcasecmp and ulc_casecmp (or between
   mbmemcasecoll and ulc_casecoll) is:

   mbmemcasecmp treats only English and a few European languages correctly,
     - Turkish i / I is halfway correct, but not fully,
   whereas ulc_casecmp handles all known specialities of languages:
     - Turkish i / I is fully correct,
     - German ß is equivalent to ss,
     - Croatian and Bosnian: Characters with 3 forms, such as DZ dz Dz,
       are considered equivalent,
     - Greek final sigma (lowercase) is considered equivalent to uppercase
       sigma. (There is no difference between final and non-final sigma in
       the upper case.)
     - Lithuanian soft-dot,
     - etc.

   I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
   correct". The reason is that mbmemcasecmp is based on the POSIX APIs, but
   these APIs have some assumptions built in that are not valid in some
   languages:
     - They assume that there is only uppercase and lowercase - not true
       for DZ dz Dz.
     - They assume that uppercasing 1 character yields 1 character - not
       true for German ß.
     - They assume that there is a 1:1 mapping between uppercase and
       lowercase forms - not true for Greek sigma.
     - They assume that the upper/lowercase mappings are position
       independent - not true for Greek sigma and Lithuanian i.

2) There is a problem with the case comparison in "sort -f":

   POSIX specifies how this option should behave, in terms of the old POSIX
   terms ("all lowercase characters that have uppercase equivalents").

   How to deal with that?
     a) Use mbmemcasecmp for the option -f, and introduce a long option
        that works with ulc_casecmp?
     b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
        and ulc_casecmp otherwise?

3) There is also a problem with the executable size: the ulc_casecmp (and
   ulc_casecoll) functions are implemented using a couple of tables. I
   squeezed them already, while still guaranteeing O(1) time for each
   access. Most of the tables are about 10 KB large, the largest one
   ca. 45 KB.
   But it sums up:

                                          join executable size (decimal)
     coreutils-7.1 unmodified                  35436
     with mbmemcasecmp                         36473
     with ulc_casecmp                         174336
     with ulc_casecmp and mbmemcasecmp        176521
       (switched at runtime)

   When an executable grows from 35 KB to 175 KB, just for correct string
   comparisons, some people will certainly complain. Especially embedded
   developers, like the busybox guys, try to reduce total executable size.
   And that's not only about 'join', it's ultimately about every coreutils
   program that has an option to perform case-insensitive comparisons on
   the user's data.

   How to deal with that?
     a) Add a configure option --disable-extra-i18n that will refrain from
        using the ulc_casecmp function?
     b) Let coreutils build and install a shared library for these large
        modules?
     c) Should these Unicode string functions be packaged externally to
        coreutils, so that coreutils can link to them as an external
        dependency (like it does for libiconv, libintl, libacl, etc.)?
     d) Any other idea?

Bruno
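
The "1 character leads to 1 character" assumption from question 1 can be
made concrete with a small sketch. It is not taken from the patch and is not
how mbmemcasecmp is implemented: simple_wcscasecmp is a hypothetical helper,
written only for illustration, that folds case one wide character at a time
with the ISO C towlower function.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not part of gnulib or coreutils): compare two
   NUL-terminated wide strings, ignoring case, by mapping each character
   through towlower.  Returns a negative value, 0, or a positive value.
   Since towlower maps one character to exactly one character, the
   comparison proceeds position by position and can never treat a single
   character as equivalent to a sequence of two.  */
static int
simple_wcscasecmp (const wchar_t *s1, const wchar_t *s2)
{
  assert (s1 != NULL && s2 != NULL);
  for (;; s1++, s2++)
    {
      wint_t c1 = towlower ((wint_t) *s1);
      wint_t c2 = towlower ((wint_t) *s2);
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
      if (c1 == L'\0')
        return 0;
    }
}
```

With such a per-character comparison, L"russ" and L"RUSS" compare equal, but
L"Ru\u00df" (Ruß) and L"RUSS" cannot: the third position compares the folded
ß against 's', and no single-character mapping makes those match.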

--- coreutils-7.1/src/join.c.bak	2008-11-10 14:17:52.000000000 +0100
+++ coreutils-7.1/src/join.c	2009-03-10 03:48:45.000000000 +0100
@@ -1,5 +1,5 @@
 /* join - join lines of two files on a common field
-   Copyright (C) 91, 1995-2006, 2008 Free Software Foundation, Inc.
+   Copyright (C) 91, 1995-2006, 2008-2009 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -25,6 +25,9 @@
 #include "system.h"
 #include "error.h"
 #include "linebuffer.h"
+#include "unicase.h"
+#include "uninorm.h"
+#include "mbmemcasecmp.h"
 #include "memcasecmp.h"
 #include "quote.h"
 #include "stdio--.h"
@@ -92,6 +95,9 @@
    want to overwrite the previous buffer before we check order. */
 static struct line *spareline[2] = {NULL, NULL};
 
+/* True if the LC_CTYPE locale is hard.  */
+static bool hard_LC_CTYPE;
+
 /* True if the LC_COLLATE locale is hard.  */
 static bool hard_LC_COLLATE;
 
@@ -321,8 +327,23 @@
 
   if (ignore_case)
     {
-      /* FIXME: ignore_case does not work with NLS (in particular,
-         with multibyte chars).  */
+      if (hard_LC_CTYPE)
+        {
+#if EXTRA_I18N
+          /* The ulc_casecmp function handles not only multibyte characters
+             correctly, but also the German sharp s, the Greek final sigma,
+             the Turkish dotless i, etc.  */
+          if (ulc_casecmp (beg1, len1, beg2, len2, uc_locale_language (),
+                           UNINORM_NFD, &diff) >= 0)
+            return diff;
+          if (errno == ENOMEM)
+            xalloc_die ();
+#endif
+          /* If ulc_casecmp failed due to some conversion error, fall back to
+             a comparison that at least handles multibyte characters and the
+             Turkish dotless i correctly.  */
+          return mbmemcasecmp (beg1, len1, beg2, len2);
+        }
       diff = memcasecmp (beg1, beg2, MIN (len1, len2));
     }
   else
@@ -942,6 +963,7 @@
   setlocale (LC_ALL, "");
   bindtextdomain (PACKAGE, LOCALEDIR);
   textdomain (PACKAGE);
+  hard_LC_CTYPE = hard_locale (LC_CTYPE);
   hard_LC_COLLATE = hard_locale (LC_COLLATE);
 
   atexit (close_stdout);
--- coreutils-7.1/bootstrap.conf.bak	2009-02-16 14:35:18.000000000 +0100
+++ coreutils-7.1/bootstrap.conf	2009-03-10 03:52:46.000000000 +0100
@@ -67,6 +67,7 @@
   inttostr inttypes isapipe
   lchmod lchown lib-ignore linebuffer link-follow
   long-options lstat malloc
+  mbmemcasecmp
   mbrtowc
   mbswidth
   memcasecmp mempcpy
@@ -96,7 +97,9 @@
   strdup strftime
   strpbrk strtoimax strtoumax strverscmp sys_stat
   timespec tzset
-  unicodeio unistd-safer unlink-busy unlinkdir unlocked-io
+  unicase/ulc-casecmp unicase/locale-language
+  unicodeio uninorm/nfd
+  unistd-safer unlink-busy unlinkdir unlocked-io
   uptime
   useless-if-before-free
   userspec utimecmp utimens