bug in join: case comparisons don't work in multibyte locales

Bruno Haible Tue, 10 Mar 2009 17:41:05 -0700

Hi Jim,

In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
case insensitive comparison of the input lines does not work in multibyte
locales. Indeed, in a UTF-8 locale, I see this:


  $ cat > in1 <<EOF
  müsste
  EOF
  $ cat > in2 <<EOF
  MÜSSTE
  EOF
  $ join -i in1 in2
  [empty result]

The expected result is:

  $ join -i in1 in2
  müsste

Similarly, with a German word in lower and upper case:

  $ cat > in1 <<EOF
  Ruß
  EOF
  $ cat > in2 <<EOF
  RUSS
  EOF
  $ join -i in1 in2
  [empty result]

The expected result is:

  $ join -i in1 in2
  Ruß
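
To illustrate why a byte-wise comparison misses the first match, here is a
minimal sketch. It is not the coreutils code: bytewise_casecmp is a
hypothetical name, and the function only mimics the general approach of a
unibyte comparison that upcases each byte separately with toupper().

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: compare the first N bytes of S1 and S2, ignoring
   case one byte at a time.  Multibyte characters are not decoded, so
   each byte of a UTF-8 sequence is upcased on its own.  */
int
bytewise_casecmp (const char *s1, const char *s2, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      int c1 = toupper ((unsigned char) s1[i]);
      int c2 = toupper ((unsigned char) s2[i]);
      if (c1 != c2)
        return c1 - c2;
    }
  return 0;
}
```

In UTF-8, ü is the byte pair C3 BC and Ü is C3 9C. toupper() sees the
second bytes as unrelated values, so comparing "m\xc3\xbcsste" with
"M\xc3\x9cSSTE" over 7 bytes gives a nonzero result even though the words
differ only in case.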

Before going on, let me summarize the case comparison functions for strings
that we have available with gnulib:


                      | on NUL terminated    | on memory areas or
                      | strings              | strings with embedded NULs
----------------------+----------------------+---------------------------
For ASCII strings     | c_strcasecmp,        |
only                  | STRCASEEQ            |
----------------------+----------------------+---------------------------
For unibyte locales   | strcasecmp           | memcasecmp
only                  |                      |
----------------------+----------------------+---------------------------
Support for multibyte | mbscasecmp           | mbmemcasecmp
locales               |                      |
    ------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecmp
----------------------+----------------------+---------------------------
Support for multibyte |                      | mbmemcasecoll
locales and locale    |                      |
collation order       |                      |
    ------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecoll
----------------------+----------------------+---------------------------


Attached is a draft patch for the 'join' program that fixes the bug
mentioned above by using the mbmemcasecmp or ulc_casecmp functions. It
is not ready to apply, because three big questions remain open:

1) Which functions to use for case comparison in coreutils?

   The difference between mbmemcasecmp and ulc_casecmp (or between
   mbmemcasecoll and ulc_casecoll) is:
   mbmemcasecmp handles only English and a few other European languages
   correctly:
     - Turkish i / I is only partly correct,
   whereas ulc_casecmp handles all known language-specific special cases:
     - Turkish i / I is fully correct,
     - German ß is equivalent to ss,
     - Croatian and Bosnian: characters with 3 forms, such as DZ dz Dz, are
       considered equivalent,
     - Greek final sigma (lowercase) is considered equivalent to uppercase
       sigma (there is no difference between final and non-final sigma in
       upper case),
     - Lithuanian soft-dot,
     - etc.

   I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
   correct".

   The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
   have some assumptions built-in that are not valid in some languages:
     - It assumes that there is only uppercase and lowercase - not true for
       DZ dz Dz.
     - It assumes that uppercasing of 1 character leads to 1 character - not
       true for German ß.
     - It assumes that there is 1:1 mapping between uppercase and lowercase
       forms - not true for Greek sigma.
     - It assumes that the upper/lowercase mappings are position independent -
       not true for Greek sigma and Lithuanian i.
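
The one-to-many point can be sketched with a toy folding function. This is
an illustration only, assuming a tiny hand-written table; toy_fold,
next_folded and toy_casecmp are hypothetical names, and the code is not how
ulc_casecmp is implemented. It works on already-decoded code points and
ignores normalization and locale-specific rules such as Turkish i.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy case folding: store the folded form of CP in OUT and return its
   length (1 or 2).  Only a few illustrative entries are covered.  */
static size_t
toy_fold (uint32_t cp, uint32_t out[2])
{
  if (cp >= 0x41 && cp <= 0x5A)          /* ASCII A-Z -> a-z */
    {
      out[0] = cp + 0x20;
      return 1;
    }
  switch (cp)
    {
    case 0x00DC: out[0] = 0x00FC; return 1;              /* Ü -> ü */
    case 0x00DF: out[0] = 's'; out[1] = 's'; return 2;   /* ß -> ss */
    case 0x03A3:                                         /* Σ -> σ */
    case 0x03C2: out[0] = 0x03C3; return 1;              /* ς -> σ */
    default:     out[0] = cp; return 1;
    }
}

struct cursor
{
  const uint32_t *s;   /* input code points */
  size_t n;            /* number of input code points */
  size_t i;            /* index of the next unread input code point */
  uint32_t buf[2];     /* folded form of the current code point */
  size_t len, pos;     /* number of entries in buf, read position */
};

/* Fetch the next folded code point into OUT; return 0 at end of input.  */
static int
next_folded (struct cursor *c, uint32_t *out)
{
  if (c->pos == c->len)
    {
      if (c->i == c->n)
        return 0;
      c->len = toy_fold (c->s[c->i++], c->buf);
      c->pos = 0;
    }
  *out = c->buf[c->pos++];
  return 1;
}

/* Compare two code point sequences after folding; 0 means equal.  */
int
toy_casecmp (const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
  struct cursor ca = { a, na, 0, { 0, 0 }, 0, 0 };
  struct cursor cb = { b, nb, 0, { 0, 0 }, 0, 0 };
  for (;;)
    {
      uint32_t x, y;
      int ha = next_folded (&ca, &x);
      int hb = next_folded (&cb, &y);
      if (!ha || !hb)
        return ha - hb;   /* the shorter folded sequence sorts first */
      if (x != y)
        return x < y ? -1 : 1;
    }
}
```

With the code points of Ruß (R, u, U+00DF) and of RUSS, toy_casecmp
returns 0, because ß folds to two code points. A function with the shape
of towupper(), which returns exactly one character per input character,
cannot express that mapping.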

2) There is a problem with the case comparison in "sort -f": POSIX specifies
   how this option should behave, using the old POSIX terminology
   ("all lowercase characters that have uppercase equivalents").

   How to deal with that?
     a) Use mbmemcasecmp for the option -f, and introduce a long option that
        works with ulc_casecmp?
     b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
        and ulc_casecmp otherwise?
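
Option (b) could look roughly like the following sketch. The enum and the
function name are hypothetical and only show the selection logic; the
actual comparison calls are left out.

```c
#include <stddef.h>

/* Hypothetical selector for the comparison family used by -f / -i.  */
enum casecmp_mode
{
  CASECMP_POSIX,   /* 1:1 upper/lower mapping, as with mbmemcasecmp */
  CASECMP_FULL     /* full language-aware rules, as with ulc_casecmp */
};

/* POSIXLY_CORRECT_VALUE is the result of getenv ("POSIXLY_CORRECT"),
   i.e. NULL when the variable is unset.  */
enum casecmp_mode
choose_casecmp_mode (const char *posixly_correct_value)
{
  return posixly_correct_value != NULL ? CASECMP_POSIX : CASECMP_FULL;
}
```

A caller would evaluate choose_casecmp_mode (getenv ("POSIXLY_CORRECT"))
once at startup and branch on the result in the comparison routine.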

3) There is also a problem with the executable size: the ulc_casecmp (and
   ulc_casecoll) functions are implemented using a couple of tables. I
   squeezed them already, while still guaranteeing O(1) time for each
   access. Most of the tables are about 10 KB large, the largest one ca. 45 KB.
   But it sums up:

            join executable              size (decimal)

       coreutils-7.1 unmodified             35436

       with mbmemcasecmp                    36473

       with ulc_casecmp                    174336

       with ulc_casecmp and mbmemcasecmp   176521
       (switched at runtime)

   When an executable grows from 35 KB to 175 KB just for correct string
   comparisons, some people will certainly complain. Embedded developers in
   particular, like the busybox guys, try to reduce total executable size.
   And this is not only about 'join'; it is ultimately about every coreutils
   program that has an option to perform case-insensitive comparisons on
   the user's data.

   How to deal with that?
     a) Add a configure option --disable-extra-i18n, that will refrain from
        using the ulc_casecmp function?
     b) Let coreutils build and install a shared library for these large
        modules?
     c) Should these Unicode string functions be packaged externally to
        coreutils, so that coreutils can link to them as an external
        dependency (as it does for libiconv, libintl, libacl, etc.)?
     d) any other idea?
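
For reference, the size figures from question 3 can be tallied with a short
sketch like this one. The numbers are copied from the table in question 3;
the enum constants and function names are made up for the illustration.

```c
#include <stdio.h>

/* Sizes in bytes of the 'join' executable, from the table in question 3.  */
enum
{
  SIZE_UNMODIFIED = 35436,
  SIZE_MBMEMCASECMP = 36473,
  SIZE_ULC_CASECMP = 174336,
  SIZE_BOTH = 176521
};

/* Bytes added relative to the unmodified coreutils-7.1 binary.  */
long
size_increase (long size)
{
  return size - SIZE_UNMODIFIED;
}

/* Print one row: variant name, size, increase, and growth factor.  */
void
print_row (const char *name, long size)
{
  printf ("%-36s %7ld  +%6ld  x%.2f\n", name, size, size_increase (size),
          (double) size / SIZE_UNMODIFIED);
}
```

mbmemcasecmp alone adds 1037 bytes, ulc_casecmp adds 138900 bytes, and the
runtime-switched variant adds 141085 bytes, so nearly all of the growth
comes from the ulc_casecmp tables.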

Bruno

--- coreutils-7.1/src/join.c.bak	2008-11-10 14:17:52.000000000 +0100
+++ coreutils-7.1/src/join.c	2009-03-10 03:48:45.000000000 +0100
@@ -1,5 +1,5 @@
 /* join - join lines of two files on a common field
-   Copyright (C) 91, 1995-2006, 2008 Free Software Foundation, Inc.
+   Copyright (C) 91, 1995-2006, 2008-2009 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -25,6 +25,9 @@
 #include "system.h"
 #include "error.h"
 #include "linebuffer.h"
+#include "unicase.h"
+#include "uninorm.h"
+#include "mbmemcasecmp.h"
 #include "memcasecmp.h"
 #include "quote.h"
 #include "stdio--.h"
@@ -92,6 +95,9 @@
    want to overwrite the previous buffer before we check order. */
 static struct line *spareline[2] = {NULL, NULL};
 
+/* True if the LC_CTYPE locale is hard.  */
+static bool hard_LC_CTYPE;
+
 /* True if the LC_COLLATE locale is hard.  */
 static bool hard_LC_COLLATE;
 
@@ -321,8 +327,23 @@
 
   if (ignore_case)
     {
-      /* FIXME: ignore_case does not work with NLS (in particular,
-         with multibyte chars).  */
+      if (hard_LC_CTYPE)
+	{
+#if EXTRA_I18N
+	  /* The ulc_casecmp function handles not only multibyte characters
+	     correctly, but also the German sharp s, the Greek final sigma,
+	     the Turkish dotless i, etc.  */
+	  if (ulc_casecmp (beg1, len1, beg2, len2, uc_locale_language (),
+			   UNINORM_NFD, &diff) >= 0)
+	    return diff;
+	  if (errno == ENOMEM)
+	    xalloc_die ();
+#endif
+	  /* If ulc_casecmp failed due to some conversion error, fall back to
+	     a comparison that at least handles multibyte characters and the
+	     Turkish dotless i correctly.  */
+	  return mbmemcasecmp (beg1, len1, beg2, len2);
+	}
       diff = memcasecmp (beg1, beg2, MIN (len1, len2));
     }
   else
@@ -942,6 +963,7 @@
   setlocale (LC_ALL, "");
   bindtextdomain (PACKAGE, LOCALEDIR);
   textdomain (PACKAGE);
+  hard_LC_CTYPE = hard_locale (LC_CTYPE);
   hard_LC_COLLATE = hard_locale (LC_COLLATE);
 
   atexit (close_stdout);
--- coreutils-7.1/bootstrap.conf.bak	2009-02-16 14:35:18.000000000 +0100
+++ coreutils-7.1/bootstrap.conf	2009-03-10 03:52:46.000000000 +0100
@@ -67,6 +67,7 @@
 	inttostr inttypes isapipe
 	lchmod lchown lib-ignore linebuffer link-follow
 	long-options lstat malloc
+	mbmemcasecmp
 	mbrtowc
 	mbswidth
 	memcasecmp mempcpy
@@ -96,7 +97,9 @@
 	strdup
 	strftime
 	strpbrk strtoimax strtoumax strverscmp sys_stat timespec tzset
-	unicodeio unistd-safer unlink-busy unlinkdir unlocked-io
+	unicase/ulc-casecmp unicase/locale-language
+	unicodeio uninorm/nfd
+	unistd-safer unlink-busy unlinkdir unlocked-io
 	uptime
 	useless-if-before-free
 	userspec utimecmp utimens
