Hi Pádraig, > This is my start at applying robust and efficient multi-byte > processing to coreutils.
Actually, it is the continuation of the discussion and based on the patch from March 2009 <http://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html>. What has changed since then? Ad 1) Which functions to use for case comparison in coreutils? I see you replaced the calls to ulc_casecmp / mbmemcasecmp with calls to ulc_casecoll / mbmemcasecoll. This is correct because POSIX specifies that the sorting should obey the LC_COLLATE environment variable, i.e. use locale dependent collation rules. Ad 2) How should the case comparison in "sort -f" behave? I think you need to decide on this before changing 'join', because 'join' and 'sort' need to behave the same way, otherwise 'join' errs out when processing results from 'sort'. Ad 3) Executable size. This has now been resolved through the creation of libunistring as a shared library. About multibyte processing vs. UTF-8 processing: > I was wondering about unconditionally converting > input to UTF-8 before processing. I didn't do this yet as: > Currently we support invalid input chars by processing as in the C locale. Correct. This is one of the major design decisions Paul, Jim, and I agreed upon in 2001. It is this requirement which forbids converting the input to a wchar_t stream, doing processing with wchar_t objects, and producing a stream of wchar_t objects that are finally converted to multibyte representation again. It is this requirement which also forbids converting the input to UTF-8, doing processing with Unicode characters, and converting the Unicode character stream to multibyte representation at the end. This approach is acceptable for a word processor that can refuse to open a file, or for more general applications. But for coreutils, where classical behaviour is to get reasonable processing in the "C" locale of files encoded in UTF-8, EUC-JP, or ISO-8859-2, this approach cannot be done. For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile', which provide a "multibyte character" datatype that accommodates also invalid byte sequences. Emacs handles this requirement by extending UTF-8. But this approach is unique to Emacs: libunistring and other software support plain UTF-8, not extended UTF-8. > wanted to transform the input as little as possible. Yes, this is a wise guideline. > I was unsure how to generally correlate portions of a UTF-8 string with its > corresponding source string The u8_conv_from_encoding function <http://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html> stores offsets in an array that give you a mapping from the source string pointers to the UTF-8 string pointers. > I may revisit the idea of processing using UTF8 internally if we support > normalization internally to the utils. I would advise against normalization that forgets the original string. Of course, you can associate the original string with the UTF-8 equivalent, if that exists, and keep both in memory; that becomes merely a speed vs. memory trade-off. > I do try to special case UTF-8 where beneficial I think this should be done in an encoding agnostic way. Don't test whether the locale encoding is "UTF-8". This is bad, because - The GB18030 encoding is just as powerful as UTF-8; it's just a different encoding. - The same handling of the Turkish i should be done in ISO-8859-9 locales as in UTF-8 locales. The same handling of the Greek sigma should be done in ISO-8859-7 locales as in UTF-8 locales. Users would be very surprised if moving from unibyte encodings to UTF-8 would change the semantic of processing their text. > One consequence of linking with libunistring ... printf ... > incurs a 16% startup overhead We discussed this in the thread at <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00000.html>. > On a related note, I noticed that bootstrap doesn't update ./configure > correctly See <http://lists.gnu.org/archive/html/bug-gnulib/2010-09/msg00090.html>. Now, going through your patch. - Function xfields. I agree with the use of mbiter for the multibyte case and of 'char' variables for the unibyte case. But the algorithm is duplicated, once for MB_CUR_MAX > 1, once for MB_CUR_MAX == 1. In 2001, Paul and Jim said that they wanted - correct handling of multibyte locales, even with invalid input, - no speed decrease for unibyte locales, - no code duplication, - maintainable code. At that time I wrote macros which expand to uses of 'mbiter' or plain 'char' variables, depending on macros. See attached diff. At that time, Jim did not like the extra file. Two months ago, he told me an extra file would be acceptable if we really find no better solution. - Function xfields. The "FIXME: ... when splitting on non blanks, Surrounding chars might be needed for context". I would say in this case, context should be ignored. If someone specifies to split at Greek sigmas, he certainly does wants to consider all Greek sigmas, not some of them depending on their context. So I would turn this "FIXME" into a note. - Function keycmp. For speed, it should be useful to determine uc_locale_language () once only. - Function main, handling of 't' argument. Let's be clear about what POSIX says <http://www.opengroup.org/onlinepubs/9699919799/utilities/join.html>: The 't' option takes a single character as argument, and the LC_CTYPE category of the current locale determines the boundaries of characters. You _can_ allow "combined characters" as 't' arguments, but that is an extension to POSIX, so one should make sure that 1. this extension does not cause regression to the POSIX part of the functionality, 2. it is turned off when the environment variable POSIXLY_CORRECT is set. +#if HAVE_LIBUNISTRING + else + { + if (!(u8tab = u8_str_from_locale (optarg))) + error (EXIT_FAILURE, errno, _("error converting tab %s"), + quote (optarg)); + } +#endif IMO, if the conversion cannot be performed or libunistring is not present, you should use mbrtowc to test whether optarg contains only one character. Or equivalently, test mbslen (optarg) > 1. +#if HAVE_LIBUNISTRING + /* We support only single characters to support existing + documentation, and to restrict possible future character + combinations, like the current "\0". */ + if (u8_base_chars (u8tab) > 1) +#endif + error (EXIT_FAILURE, 0, _("multi-character tab %s"), + quote (optarg)); Putting an if condition inside #if is confusing. I would avoid that style. And, as said above, better consider POSIXLY_CORRECT here. - Comments in function u8_base_chars: The POSIX notion of "character" is a multibyte character. When we convert multibyte characters to Unicode characters, it is possible to map one POSIX character to 2 Unicode characters. Your function u8_base_chars returns the number of so-called "complex characters" <http://opengroup.org/onlinepubs/007908775/xcurses/intov.html>. It appears correct to use UC_PROPERTY_COMBINING here (as opposed to just the combining class). About the unit test: I would also add some tests with the EUC-JP encoding. It is so easy to assume that every 'const char *' is UTF-8 encoded, especially for future generation of programmers who did not grow up with ISO-8859-1, and especially when libunistring is in the minds of the programmers. Bruno
*** textutils/src/join.c 2001-12-02 20:51:39.000000000 +0100 --- textutils-i18n/src/join.c 2001-12-11 00:09:40.000000000 +0100 *************** *** 29,38 **** #include "error.h" #include "hard-locale.h" #include "linebuffer.h" ! #include "memcasecmp.h" #include "memcoll.h" #include "xstrtol.h" /* The official name of this program (e.g., no `g' prefix). */ #define PROGRAM_NAME "join" --- 29,42 ---- #include "error.h" #include "hard-locale.h" #include "linebuffer.h" ! #include "memcasecoll.h" #include "memcoll.h" #include "xstrtol.h" + #if HAVE_MBRTOWC + # include "mbchar.h" + #endif + /* The official name of this program (e.g., no `g' prefix). */ #define PROGRAM_NAME "join" *************** *** 88,94 **** /* The name this program was run with. */ char *program_name; ! #ifdef ENABLE_NLS /* Nonzero if the LC_COLLATE locale is hard. */ static int hard_LC_COLLATE; #endif --- 92,98 ---- /* The name this program was run with. */ char *program_name; ! #if HAVE_SETLOCALE /* Nonzero if the LC_COLLATE locale is hard. */ static int hard_LC_COLLATE; #endif *************** *** 111,120 **** /* Last element in `outlist', where a new element can be added. */ static struct outlist *outlist_end = &outlist_head; ! /* Tab character separating fields; if this is NUL fields are separated ! by any nonempty string of white space, otherwise by exactly one ! tab character. */ ! static unsigned char tab; /* When using getopt_long_only, no long option can start with a character that is a short option. */ --- 115,127 ---- /* Last element in `outlist', where a new element can be added. */ static struct outlist *outlist_end = &outlist_head; ! /* Tab character separating fields; if this is NULL fields are separated ! by any nonempty string of white space, otherwise by exactly one copy ! of this tab character. */ ! static char* tab; ! #if HAVE_MBRTOWC ! static mbchar_t tabmb; ! #endif /* When using getopt_long_only, no long option can start with a character that is a short option. */ *************** *** 184,189 **** --- 191,242 ---- exit (status == 0 ? EXIT_SUCCESS : EXIT_FAILURE); } + /* Set tab and tabmb according to the command line argument. */ + #if HAVE_MBRTOWC + #include "mbiter_multi.h" + static inline void + set_tab_multi (const char *arg) + { + MBI_ITERATOR iter; + + MBI_INIT (iter, arg, strlen (arg)); + if (MBI_AVAIL (iter)) + { + size_t len = MB_LEN (MBI_CUR (iter)); + + tab = (char *) xmalloc (len + 1); + memcpy (tab, MBI_CUR_PTR (iter), len); + tab[len] = '\0'; + mb_copy (&tabmb, &MBI_CUR (iter)); + MBI_ADVANCE (iter); + if (MBI_AVAIL (iter)) + error (EXIT_FAILURE, 0, + _("the separator must be a single character: %s"), arg); + } + } + #include "mbiter_undef.h" + #endif + static void + set_tab (const char *arg) + { + #if HAVE_MBRTOWC + if (MB_CUR_MAX > 1) + set_tab_multi (arg); + else + #endif + { + if (*arg != '\0') + { + tab = (char *) xmalloc (2); + tab[0] = *arg; + tab[1] = '\0'; + if (arg[1] != '\0') + error (EXIT_FAILURE, 0, + _("the separator must be a single character: %s"), arg); + } + } + } + static void ADD_FIELD (struct line *line, const unsigned char *field, size_t len) { *************** *** 201,256 **** /* Fill in the `fields' structure in LINE. */ ! static void ! xfields (struct line *line) ! { ! int i; ! unsigned char *ptr0 = (unsigned char *) line->buf.buffer; ! unsigned char *ptr; ! unsigned char *lim; ! ! ptr = ptr0; ! lim = ptr0 + line->buf.length - 1; ! if (!tab) ! { ! /* Skip leading blanks before the first field. */ ! while (ptr < lim && ISBLANK (*ptr)) ! ++ptr; ! } ! for (i = 0; ptr < lim; ++i) ! { ! if (tab) { ! unsigned char *beg; ! ! beg = ptr; ! while (ptr < lim && *ptr != tab) ! ++ptr; ! ADD_FIELD (line, beg, ptr - beg); ! if (ptr < lim) ! ++ptr; ! } else ! { ! unsigned char *beg; ! ! beg = ptr; ! while (ptr < lim && !ISBLANK (*ptr)) ! ++ptr; ! ADD_FIELD (line, beg, ptr - beg); ! while (ptr < lim && ISBLANK (*ptr)) ! ++ptr; ! } ! } ! ! if (ptr != ptr0 && ((!tab && ISBLANK (ptr[-1])) || ptr[-1] == tab)) ! { ! /* Add one more (empty) field because the last character of the ! line was a delimiter. */ ! ADD_FIELD (line, NULL, 0); ! } } /* Read a line from FP into LINE and split it into fields. --- 254,282 ---- /* Fill in the `fields' structure in LINE. */ ! #if HAVE_MBRTOWC ! #define FUNC xfields_multi ! #define IS_TAB(mbc) MB_EQUAL (mbc, tabmb) ! #include "mbiter_multi.h" ! #include "join_fields.h" ! #include "mbiter_undef.h" ! #endif ! #define FUNC xfields_8bit ! #define IS_TAB(mbc) ((mbc) == tab[0]) ! #include "mbiter_8bit.h" ! #include "join_fields.h" ! #include "mbiter_undef.h" ! static void ! xfields (struct line *line) { ! #if HAVE_MBRTOWC ! if (MB_CUR_MAX > 1) ! return xfields_multi (line); else ! #endif ! return xfields_8bit (line); } /* Read a line from FP into LINE and split it into fields. *************** *** 365,377 **** of memcmp. */ if (ignore_case) { ! /* FIXME: ignore_case does not work with NLS (in particular, ! with multibyte chars). */ ! diff = memcasecmp (beg1, beg2, min (len1, len2)); } else { ! #ifdef ENABLE_NLS if (hard_LC_COLLATE) return memcoll ((char *) beg1, len1, (char *) beg2, len2); #endif --- 391,405 ---- of memcmp. */ if (ignore_case) { ! #if HAVE_SETLOCALE ! return memcasecoll (beg1, len1, beg2, len2, hard_LC_COLLATE); ! #else ! return memcasecoll (beg1, len1, beg2, len2, 0); ! #endif } else { ! #if HAVE_SETLOCALE if (hard_LC_COLLATE) return memcoll ((char *) beg1, len1, (char *) beg2, len2); #endif *************** *** 444,450 **** o = o->next; if (o == NULL) break; ! putchar (tab ? tab : ' '); } putchar ('\n'); } --- 472,478 ---- o = o->next; if (o == NULL) break; ! fputs (tab ? tab : " ", stdout); } putchar ('\n'); } *************** *** 462,484 **** prfield (join_field_1, line1); for (i = 0; i < join_field_1 && i < line1->nfields; ++i) { ! putchar (tab ? tab : ' '); prfield (i, line1); } for (i = join_field_1 + 1; i < line1->nfields; ++i) { ! putchar (tab ? tab : ' '); prfield (i, line1); } for (i = 0; i < join_field_2 && i < line2->nfields; ++i) { ! putchar (tab ? tab : ' '); prfield (i, line2); } for (i = join_field_2 + 1; i < line2->nfields; ++i) { ! putchar (tab ? tab : ' '); prfield (i, line2); } putchar ('\n'); --- 490,512 ---- prfield (join_field_1, line1); for (i = 0; i < join_field_1 && i < line1->nfields; ++i) { ! fputs (tab ? tab : " ", stdout); prfield (i, line1); } for (i = join_field_1 + 1; i < line1->nfields; ++i) { ! fputs (tab ? tab : " ", stdout); prfield (i, line1); } for (i = 0; i < join_field_2 && i < line2->nfields; ++i) { ! fputs (tab ? tab : " ", stdout); prfield (i, line2); } for (i = join_field_2 + 1; i < line2->nfields; ++i) { ! fputs (tab ? tab : " ", stdout); prfield (i, line2); } putchar ('\n'); *************** *** 746,752 **** atexit (close_stdout); ! #ifdef ENABLE_NLS hard_LC_COLLATE = hard_locale (LC_COLLATE); #endif --- 774,780 ---- atexit (close_stdout); ! #if HAVE_SETLOCALE hard_LC_COLLATE = hard_locale (LC_COLLATE); #endif *************** *** 820,826 **** break; case 't': ! tab = *optarg; break; case 1: /* Non-option argument. */ --- 848,854 ---- break; case 't': ! set_tab (optarg); break; case 1: /* Non-option argument. */ *** /dev/null 2009-04-14 12:31:40.000000000 +0200 --- textutils-i18n/src/join_fields.h 2001-12-11 00:09:40.000000000 +0100 *************** *** 0 **** --- 1,90 ---- + /* Split a line into fields. + Copyright (C) 2001 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software Foundation, + Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ + + static inline void + FUNC (struct line *line) + { + MBI_ITERATOR iter; + + MBI_INIT (iter, line->buf.buffer, line->buf.length - 1); + + if (tab) + { + /* In this case, a leading tab means that the first field is empty. + A trailing tab means that the last field is empty. */ + if (MBI_AVAIL (iter)) + for (;;) + { + const char *beg = MBI_CUR_PTR (iter); + + /* Skip a field. */ + while (MBI_AVAIL (iter) && !IS_TAB (MBI_CUR (iter))) + MBI_ADVANCE (iter); + + ADD_FIELD (line, beg, MBI_CUR_PTR (iter) - beg); + + if (!MBI_AVAIL (iter)) + break; + + /* Skip a tab. */ + MBI_ADVANCE (iter); + + if (!MBI_AVAIL (iter)) + { + /* Add one more (empty) field because the last character of + the line was a delimiter. */ + ADD_FIELD (line, NULL, 0); + break; + } + } + } + else + { + /* In this case, leading blanks are ignored. + But trailing blanks mean that the last field is empty. */ + for (;;) + { + const char *beg; + + if (!MBI_AVAIL (iter)) + break; + + /* Skip blanks. */ + while (MBI_AVAIL (iter) && MB_ISBLANK (MBI_CUR (iter))) + MBI_ADVANCE (iter); + + if (!MBI_AVAIL (iter)) + { + /* Add one more (empty) field because the last character of + the line was a delimiter. */ + ADD_FIELD (line, NULL, 0); + break; + } + + beg = MBI_CUR_PTR (iter); + + /* Skip a field. */ + while (MBI_AVAIL (iter) && !MB_ISBLANK (MBI_CUR (iter))) + MBI_ADVANCE (iter); + + ADD_FIELD (line, beg, MBI_CUR_PTR (iter) - beg); + } + } + } + + #undef FUNC + #undef IS_TAB