[coreutils] Re: [PATCH] join: support multi-byte character encodings

Bruno Haible Sun, 19 Sep 2010 09:32:02 -0700

Hi Pádraig,

> This is my start at applying robust and efficient multi-byte
> processing to coreutils.


Actually, it is the continuation of the discussion and based on the patch
from March 2009
<http://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html>.

What has changed since then?

Ad 1) Which functions to use for case comparison in coreutils?
I see you replaced the calls to ulc_casecmp / mbmemcasecmp with calls to
ulc_casecoll / mbmemcasecoll. This is correct because POSIX specifies that
the sorting should obey the LC_COLLATE environment variable, i.e. use locale
dependent collation rules.

Ad 2) How should the case comparison in "sort -f" behave?
I think you need to decide on this before changing 'join', because 'join'
and 'sort' need to behave the same way, otherwise 'join' errs out when
processing results from 'sort'.

Ad 3) Executable size.
This has now been resolved through the creation of libunistring as a shared
library.


About multibyte processing vs. UTF-8 processing:
> I was wondering about unconditionally converting
> input to UTF-8 before processing. I didn't do this yet as:
>   Currently we support invalid input chars by processing as in the C locale.

Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
in 2001. It is this requirement which forbids converting the input to a wchar_t
stream, doing processing with wchar_t objects, and producing a stream of wchar_t
objects that are finally converted to multibyte representation again.

It is this requirement which also forbids converting the input to UTF-8, doing
processing with Unicode characters, and converting the Unicode character stream
to multibyte representation at the end. This approach is acceptable for a word
processor that can refuse to open a file, or for more general applications.
But for coreutils, where classical behaviour is to get reasonable processing in
the "C" locale of files encoded in UTF-8, EUC-JP, or ISO-8859-2, this approach
cannot be done.

For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile',
which provide a "multibyte character" datatype that accommodates also invalid
byte sequences.

Emacs handles this requirement by extending UTF-8. But this approach is unique
to Emacs: libunistring and other software support plain UTF-8, not extended
UTF-8.

> wanted to transform the input as little as possible.

Yes, this is a wise guideline.

>   I was unsure how to generally correlate portions of a UTF-8 string with its
>   corresponding source string

The u8_conv_from_encoding function
<http://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html>
stores offsets in an array that give you a mapping from the source string
pointers to the UTF-8 string pointers.

> I may revisit the idea of processing using UTF8 internally if we support
> normalization internally to the utils.

I would advise against normalization that forgets the original string. Of
course, you can associate the original string with the UTF-8 equivalent, if
that exists, and keep both in memory; that becomes merely a speed vs. memory
trade-off.

> I do try to special case UTF-8 where beneficial

I think this should be done in an encoding agnostic way. Don't test whether
the locale encoding is "UTF-8". This is bad, because
  - The GB18030 encoding is just as powerful as UTF-8; it's just a different
    encoding.
  - The same handling of the Turkish i should be done in ISO-8859-9 locales
    as in UTF-8 locales. The same handling of the Greek sigma should be done
    in ISO-8859-7 locales as in UTF-8 locales. Users would be very surprised
    if moving from unibyte encodings to UTF-8 would change the semantic of
    processing their text.

> One consequence of linking with libunistring ... printf ...
> incurs a 16% startup overhead

We discussed this in the thread at
<http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00000.html>.

> On a related note, I noticed that bootstrap doesn't update ./configure 
> correctly

See <http://lists.gnu.org/archive/html/bug-gnulib/2010-09/msg00090.html>.


Now, going through your patch.

- Function xfields. I agree with the use of mbiter for the multibyte case and of
  'char' variables for the unibyte case. But the algorithm is duplicated, once
  for MB_CUR_MAX > 1, once for MB_CUR_MAX == 1. In 2001, Paul and Jim said that
  they wanted
    - correct handling of multibyte locales, even with invalid input,
    - no speed decrease for unibyte locales,
    - no code duplication,
    - maintainable code.
  At that time I wrote macros which expand to uses of 'mbiter' or plain 'char'
  variables, depending on macros. See attached diff. At that time, Jim did not
  like the extra file. Two months ago, he told me an extra file would be
  acceptable if we really find no better solution.

- Function xfields. The "FIXME: ... when splitting on non blanks, Surrounding
  chars might be needed for context". I would say in this case, context should
  be ignored. If someone specifies to split at Greek sigmas, he certainly does
  wants to consider all Greek sigmas, not some of them depending on their
  context. So I would turn this "FIXME" into a note.

- Function keycmp. For speed, it should be useful to determine
  uc_locale_language () once only.

- Function main, handling of 't' argument. Let's be clear about what POSIX
  says <http://www.opengroup.org/onlinepubs/9699919799/utilities/join.html>:
  The 't' option takes a single character as argument, and the LC_CTYPE category
  of the current locale determines the boundaries of characters.
  You _can_ allow "combined characters" as 't' arguments, but that is an
  extension to POSIX, so one should make sure that
    1. this extension does not cause regression to the POSIX part of the
       functionality,
    2. it is turned off when the environment variable POSIXLY_CORRECT is set.

+#if HAVE_LIBUNISTRING
+            else
+              {
+                if (!(u8tab = u8_str_from_locale (optarg)))
+                  error (EXIT_FAILURE, errno, _("error converting tab %s"),
+                         quote (optarg));
+              }
+#endif

  IMO, if the conversion cannot be performed or libunistring is not present,
  you should use mbrtowc to test whether optarg contains only one character.
  Or equivalently, test    mbslen (optarg) > 1.

+#if HAVE_LIBUNISTRING
+                    /* We support only single characters to support existing
+                       documentation, and to restrict possible future character
+                       combinations, like the current "\0".  */
+                    if (u8_base_chars (u8tab) > 1)
+#endif
+                      error (EXIT_FAILURE, 0, _("multi-character tab %s"),
+                             quote (optarg));

  Putting an if condition inside #if is confusing. I would avoid that style.
  And, as said above, better consider POSIXLY_CORRECT here.

- Comments in function u8_base_chars: The POSIX notion of "character" is a
  multibyte character. When we convert multibyte characters to Unicode
  characters, it is possible to map one POSIX character to 2 Unicode characters.
  Your function u8_base_chars returns the number of so-called
  "complex characters"
  <http://opengroup.org/onlinepubs/007908775/xcurses/intov.html>.
  It appears correct to use UC_PROPERTY_COMBINING here (as opposed to just the
  combining class).

About the unit test: I would also add some tests with the EUC-JP encoding.
It is so easy to assume that every 'const char *' is UTF-8 encoded, especially
for future generation of programmers who did not grow up with ISO-8859-1, and
especially when libunistring is in the minds of the programmers.

Bruno

*** textutils/src/join.c	2001-12-02 20:51:39.000000000 +0100
--- textutils-i18n/src/join.c	2001-12-11 00:09:40.000000000 +0100
***************
*** 29,38 ****
  #include "error.h"
  #include "hard-locale.h"
  #include "linebuffer.h"
! #include "memcasecmp.h"
  #include "memcoll.h"
  #include "xstrtol.h"
  
  /* The official name of this program (e.g., no `g' prefix).  */
  #define PROGRAM_NAME "join"
  
--- 29,42 ----
  #include "error.h"
  #include "hard-locale.h"
  #include "linebuffer.h"
! #include "memcasecoll.h"
  #include "memcoll.h"
  #include "xstrtol.h"
  
+ #if HAVE_MBRTOWC
+ # include "mbchar.h"
+ #endif
+ 
  /* The official name of this program (e.g., no `g' prefix).  */
  #define PROGRAM_NAME "join"
  
***************
*** 88,94 ****
  /* The name this program was run with.  */
  char *program_name;
  
! #ifdef ENABLE_NLS
  /* Nonzero if the LC_COLLATE locale is hard.  */
  static int hard_LC_COLLATE;
  #endif
--- 92,98 ----
  /* The name this program was run with.  */
  char *program_name;
  
! #if HAVE_SETLOCALE
  /* Nonzero if the LC_COLLATE locale is hard.  */
  static int hard_LC_COLLATE;
  #endif
***************
*** 111,120 ****
  /* Last element in `outlist', where a new element can be added.  */
  static struct outlist *outlist_end = &outlist_head;
  
! /* Tab character separating fields; if this is NUL fields are separated
!    by any nonempty string of white space, otherwise by exactly one
!    tab character.  */
! static unsigned char tab;
  
  /* When using getopt_long_only, no long option can start with
     a character that is a short option.  */
--- 115,127 ----
  /* Last element in `outlist', where a new element can be added.  */
  static struct outlist *outlist_end = &outlist_head;
  
! /* Tab character separating fields; if this is NULL fields are separated
!    by any nonempty string of white space, otherwise by exactly one copy
!    of this tab character.  */
! static char* tab;
! #if HAVE_MBRTOWC
! static mbchar_t tabmb;
! #endif
  
  /* When using getopt_long_only, no long option can start with
     a character that is a short option.  */
***************
*** 184,189 ****
--- 191,242 ----
    exit (status == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
  }
  
+ /* Set tab and tabmb according to the command line argument.  */
+ #if HAVE_MBRTOWC
+ #include "mbiter_multi.h"
+ static inline void
+ set_tab_multi (const char *arg)
+ {
+   MBI_ITERATOR iter;
+ 
+   MBI_INIT (iter, arg, strlen (arg));
+   if (MBI_AVAIL (iter))
+     {
+       size_t len = MB_LEN (MBI_CUR (iter));
+ 
+       tab = (char *) xmalloc (len + 1);
+       memcpy (tab, MBI_CUR_PTR (iter), len);
+       tab[len] = '\0';
+       mb_copy (&tabmb, &MBI_CUR (iter));
+       MBI_ADVANCE (iter);
+       if (MBI_AVAIL (iter))
+ 	error (EXIT_FAILURE, 0,
+ 	       _("the separator must be a single character: %s"), arg);
+     }
+ }
+ #include "mbiter_undef.h"
+ #endif
+ static void
+ set_tab (const char *arg)
+ {
+ #if HAVE_MBRTOWC
+   if (MB_CUR_MAX > 1)
+     set_tab_multi (arg);
+   else
+ #endif
+     {
+       if (*arg != '\0')
+ 	{
+ 	  tab = (char *) xmalloc (2);
+ 	  tab[0] = *arg;
+ 	  tab[1] = '\0';
+ 	  if (arg[1] != '\0')
+ 	    error (EXIT_FAILURE, 0,
+ 		   _("the separator must be a single character: %s"), arg);
+ 	}
+     }
+ }
+ 
  static void
  ADD_FIELD (struct line *line, const unsigned char *field, size_t len)
  {
***************
*** 201,256 ****
  
  /* Fill in the `fields' structure in LINE.  */
  
! static void
! xfields (struct line *line)
! {
!   int i;
!   unsigned char *ptr0 = (unsigned char *) line->buf.buffer;
!   unsigned char *ptr;
!   unsigned char *lim;
! 
!   ptr = ptr0;
!   lim = ptr0 + line->buf.length - 1;
  
!   if (!tab)
!     {
!       /* Skip leading blanks before the first field.  */
!       while (ptr < lim && ISBLANK (*ptr))
! 	++ptr;
!     }
  
!   for (i = 0; ptr < lim; ++i)
!     {
!       if (tab)
  	{
! 	  unsigned char *beg;
! 
! 	  beg = ptr;
! 	  while (ptr < lim && *ptr != tab)
! 	    ++ptr;
! 	  ADD_FIELD (line, beg, ptr - beg);
! 	  if (ptr < lim)
! 	    ++ptr;
! 	}
        else
! 	{
! 	  unsigned char *beg;
! 
! 	  beg = ptr;
! 	  while (ptr < lim && !ISBLANK (*ptr))
! 	    ++ptr;
! 	  ADD_FIELD (line, beg, ptr - beg);
! 	  while (ptr < lim && ISBLANK (*ptr))
! 	    ++ptr;
! 	}
!     }
! 
!   if (ptr != ptr0 && ((!tab && ISBLANK (ptr[-1])) || ptr[-1] == tab))
!     {
!       /* Add one more (empty) field because the last character of the
! 	 line was a delimiter.  */
!       ADD_FIELD (line, NULL, 0);
!     }
  }
  
  /* Read a line from FP into LINE and split it into fields.
--- 254,282 ----
  
  /* Fill in the `fields' structure in LINE.  */
  
! #if HAVE_MBRTOWC
! #define FUNC xfields_multi
! #define IS_TAB(mbc) MB_EQUAL (mbc, tabmb)
! #include "mbiter_multi.h"
! #include "join_fields.h"
! #include "mbiter_undef.h"
! #endif
  
! #define FUNC xfields_8bit
! #define IS_TAB(mbc) ((mbc) == tab[0])
! #include "mbiter_8bit.h"
! #include "join_fields.h"
! #include "mbiter_undef.h"
  
! static void
! xfields (struct line *line)
  {
! #if HAVE_MBRTOWC
!   if (MB_CUR_MAX > 1)
!     return xfields_multi (line);
    else
! #endif
!     return xfields_8bit (line);
  }
  
  /* Read a line from FP into LINE and split it into fields.
***************
*** 365,377 ****
       of memcmp.  */
    if (ignore_case)
      {
!       /* FIXME: ignore_case does not work with NLS (in particular,
!          with multibyte chars).  */
!       diff = memcasecmp (beg1, beg2, min (len1, len2));
      }
    else
      {
! #ifdef ENABLE_NLS
        if (hard_LC_COLLATE)
  	return memcoll ((char *) beg1, len1, (char *) beg2, len2);
  #endif
--- 391,405 ----
       of memcmp.  */
    if (ignore_case)
      {
! #if HAVE_SETLOCALE
!       return memcasecoll (beg1, len1, beg2, len2, hard_LC_COLLATE);
! #else
!       return memcasecoll (beg1, len1, beg2, len2, 0);
! #endif
      }
    else
      {
! #if HAVE_SETLOCALE
        if (hard_LC_COLLATE)
  	return memcoll ((char *) beg1, len1, (char *) beg2, len2);
  #endif
***************
*** 444,450 ****
  	  o = o->next;
  	  if (o == NULL)
  	    break;
! 	  putchar (tab ? tab : ' ');
  	}
        putchar ('\n');
      }
--- 472,478 ----
  	  o = o->next;
  	  if (o == NULL)
  	    break;
! 	  fputs (tab ? tab : " ", stdout);
  	}
        putchar ('\n');
      }
***************
*** 462,484 ****
        prfield (join_field_1, line1);
        for (i = 0; i < join_field_1 && i < line1->nfields; ++i)
  	{
! 	  putchar (tab ? tab : ' ');
  	  prfield (i, line1);
  	}
        for (i = join_field_1 + 1; i < line1->nfields; ++i)
  	{
! 	  putchar (tab ? tab : ' ');
  	  prfield (i, line1);
  	}
  
        for (i = 0; i < join_field_2 && i < line2->nfields; ++i)
  	{
! 	  putchar (tab ? tab : ' ');
  	  prfield (i, line2);
  	}
        for (i = join_field_2 + 1; i < line2->nfields; ++i)
  	{
! 	  putchar (tab ? tab : ' ');
  	  prfield (i, line2);
  	}
        putchar ('\n');
--- 490,512 ----
        prfield (join_field_1, line1);
        for (i = 0; i < join_field_1 && i < line1->nfields; ++i)
  	{
! 	  fputs (tab ? tab : " ", stdout);
  	  prfield (i, line1);
  	}
        for (i = join_field_1 + 1; i < line1->nfields; ++i)
  	{
! 	  fputs (tab ? tab : " ", stdout);
  	  prfield (i, line1);
  	}
  
        for (i = 0; i < join_field_2 && i < line2->nfields; ++i)
  	{
! 	  fputs (tab ? tab : " ", stdout);
  	  prfield (i, line2);
  	}
        for (i = join_field_2 + 1; i < line2->nfields; ++i)
  	{
! 	  fputs (tab ? tab : " ", stdout);
  	  prfield (i, line2);
  	}
        putchar ('\n');
***************
*** 746,752 ****
  
    atexit (close_stdout);
  
! #ifdef ENABLE_NLS
    hard_LC_COLLATE = hard_locale (LC_COLLATE);
  #endif
  
--- 774,780 ----
  
    atexit (close_stdout);
  
! #if HAVE_SETLOCALE
    hard_LC_COLLATE = hard_locale (LC_COLLATE);
  #endif
  
***************
*** 820,826 ****
  	  break;
  
  	case 't':
! 	  tab = *optarg;
  	  break;
  
  	case 1:		/* Non-option argument.  */
--- 848,854 ----
  	  break;
  
  	case 't':
! 	  set_tab (optarg);
  	  break;
  
  	case 1:		/* Non-option argument.  */
*** /dev/null	2009-04-14 12:31:40.000000000 +0200
--- textutils-i18n/src/join_fields.h	2001-12-11 00:09:40.000000000 +0100
***************
*** 0 ****
--- 1,90 ----
+ /* Split a line into fields.
+    Copyright (C) 2001 Free Software Foundation, Inc.
+ 
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2, or (at your option)
+    any later version.
+ 
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+ 
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software Foundation,
+    Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
+ 
+ static inline void
+ FUNC (struct line *line)
+ {
+   MBI_ITERATOR iter;
+ 
+   MBI_INIT (iter, line->buf.buffer, line->buf.length - 1);
+ 
+   if (tab)
+     {
+       /* In this case, a leading tab means that the first field is empty.
+ 	 A trailing tab means that the last field is empty.  */
+       if (MBI_AVAIL (iter))
+ 	for (;;)
+ 	  {
+ 	    const char *beg = MBI_CUR_PTR (iter);
+ 
+ 	    /* Skip a field.  */
+ 	    while (MBI_AVAIL (iter) && !IS_TAB (MBI_CUR (iter)))
+ 	      MBI_ADVANCE (iter);
+ 
+ 	    ADD_FIELD (line, beg, MBI_CUR_PTR (iter) - beg);
+ 
+ 	    if (!MBI_AVAIL (iter))
+ 	      break;
+ 
+ 	    /* Skip a tab.  */
+ 	    MBI_ADVANCE (iter);
+ 
+ 	    if (!MBI_AVAIL (iter))
+ 	      {
+ 		/* Add one more (empty) field because the last character of
+ 		   the line was a delimiter.  */
+ 		ADD_FIELD (line, NULL, 0);
+ 		break;
+ 	      }
+ 	  }
+     }
+   else
+     {
+       /* In this case, leading blanks are ignored.
+ 	 But trailing blanks mean that the last field is empty.  */
+       for (;;)
+ 	{
+ 	  const char *beg;
+ 
+ 	  if (!MBI_AVAIL (iter))
+ 	    break;
+ 
+ 	  /* Skip blanks.  */
+ 	  while (MBI_AVAIL (iter) && MB_ISBLANK (MBI_CUR (iter)))
+ 	    MBI_ADVANCE (iter);
+ 
+ 	  if (!MBI_AVAIL (iter))
+ 	    {
+ 	      /* Add one more (empty) field because the last character of
+ 		 the line was a delimiter.  */
+ 	      ADD_FIELD (line, NULL, 0);
+ 	      break;
+ 	    }
+ 
+ 	  beg = MBI_CUR_PTR (iter);
+ 
+ 	  /* Skip a field.  */
+ 	  while (MBI_AVAIL (iter) && !MB_ISBLANK (MBI_CUR (iter)))
+ 	    MBI_ADVANCE (iter);
+ 
+ 	  ADD_FIELD (line, beg, MBI_CUR_PTR (iter) - beg);
+ 	}
+     }
+ }
+ 
+ #undef FUNC
+ #undef IS_TAB

[coreutils] Re: [PATCH] join: support multi-byte character encodings

Reply via email to