Bruno Haible wrote:
> In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
> case insensitive comparison of the input lines does not work in multibyte
> locales. And indeed, in an UTF-8 locale, I see this:
...
> Find attached a draft patch for the 'join' program, that fixes the bug
> mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It
> is not ready to apply, because there are three big questions:
>
> 1) Which functions to use for case comparison in coreutils?
>
>    The difference between mbmemcasecmp and ulc_casecmp (or between
>    mbmemcasecoll and ulc_casecoll) is:
>    mbmemcasecmp treats only English and a few European languages correctly,
>      - Turkish i / I is halfway correct, but not fully,
>    whereas ulc_casecmp handles all known specialities of languages:
>      - Turkish i / I is fully correct,
>      - German ß is equivalent to ss,
>      - Croatian and Bosnian: Characters with 3 forms, such as DZ dz Dz, are
>        considered equivalent,
>      - Greek final sigma (lowercase) is considered equivalent to uppercase
>        sigma, (There is no difference between final and non-final sigma in the
>        upper case.)
>      - Lithuanian soft-dot,
>      - etc.
>
>    I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
>    correct".
>
>    The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
>    have some assumptions built-in that are not valid in some languages:
>      - It assumes that there is only uppercase and lowercase - not true for
>        DZ dz Dz.
>      - It assumes that uppercasing of 1 character leads to 1 character - not
>        true for German ß.
>      - It assumes that there is 1:1 mapping between uppercase and lowercase
>        forms - not true for Greek sigma.
>      - It assumes that the upper/lowercase mappings are position independent -
>        not true for Greek sigma and Lithuanian i.
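[Illustration, not part of the patch.] The two built-in assumptions about 1:1 mappings can be seen in a small sketch. It is written in Python rather than C, and it uses Python's locale-independent str.casefold() as a stand-in for full Unicode case folding; it does not call mbmemcasecmp or ulc_casecmp, and it does not cover the locale-specific Turkish or Lithuanian rules. The helper names simple_lower, simple_case_equal and full_case_equal are hypothetical.

```python
def simple_lower(s):
    """Lowercase one character at a time, keeping only 1:1 mappings.

    This imitates a comparison built on a per-character tolower-style API:
    each character is mapped in isolation, and a mapping that would
    produce more than one character is ignored.
    """
    out = []
    for ch in s:
        low = ch.lower()
        out.append(low if len(low) == 1 else ch)
    return "".join(out)


def simple_case_equal(a, b):
    """Case-insensitive equality under the 1:1, context-free assumption."""
    return simple_lower(a) == simple_lower(b)


def full_case_equal(a, b):
    """Case-insensitive equality using full Unicode case folding."""
    return a.casefold() == b.casefold()


# German sharp s: "Straße" versus "STRASSE".
STRASSE_SHARP = "Stra\u00dfe"
STRASSE_UPPER = "STRASSE"

# Greek sigma: uppercase "ΟΔΟΣ" versus lowercase "οδος" (final sigma U+03C2).
ODOS_UPPER = "\u039f\u0394\u039f\u03a3"
ODOS_LOWER = "\u03bf\u03b4\u03bf\u03c2"

print(simple_case_equal(STRASSE_SHARP, STRASSE_UPPER))  # False
print(full_case_equal(STRASSE_SHARP, STRASSE_UPPER))    # True
print(simple_case_equal(ODOS_UPPER, ODOS_LOWER))        # False
print(full_case_equal(ODOS_UPPER, ODOS_LOWER))          # True
```

The per-character comparison fails on both pairs: ß has no single-character counterpart to match "SS", and an isolated uppercase sigma lowercases to the non-final form, which differs from the final sigma in the lowercase word.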
Hi Bruno,

Wow. Thanks for all that work.
I prefer the "correct" approach, especially since I believe that
will eventually align with POSIX, even if it doesn't match the
current intent (I don't know).

> 2) There is a problem with the case comparison in "sort -f": POSIX specifies
>    how this option should behave, in terms of the old POSIX terms
>    ("all lowercase characters that have uppercase equivalents").
>
>    How to deal with that?
>    a) Use mbmemcasecmp for the option -f, and introduce a long option that
>       works with ulc_casecmp?
>    b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
>       and ulc_casecmp otherwise?

How about a third approach?
Use ulc_casecmp unconditionally (assuming it's available),
and resort to adding POSIXLY_CORRECT if enough people complain
*and* if somehow POSIX cannot be changed to accommodate the
correct behavior.

> 3) There is also a problem with the executable size: the ulc_casecmp (and
>    ulc_casecoll) functions are implemented using a couple of tables. I
>    squeezed them already, while still guaranteeing O(1) time for each
>    access. Most of the tables are about 10 KB large, the largest one ca. 45
>    KB.
>    But it sums up:
>
>                                            join executable size (decimal)
>
>    coreutils-7.1 unmodified                 35436
>
>    with mbmemcasecmp                        36473
>
>    with ulc_casecmp                        174336
>
>    with ulc_casecmp and mbmemcasecmp       176521
>    (switched at runtime)
>
>    When an executable grows from 35 KB to 175 KB, just for correct string
>    comparisons, some people will certainly complain. Especially embedded
>    developers, like the busybox guys, try to reduce total executable size.
>    And that's not only about 'join', it's ultimately about every coreutils
>    program that has an option to perform case-insensitive comparisons on
>    user's data.
>
>    How to deal with that?
>    a) Add a configure option --disable-extra-i18n, that will refrain from
>       using the ulc_casecmp function?
>    b) Let coreutils build and install a shared library for these large
>       modules?
>    c) Should these Unicode string functions be packaged externally to
>       coreutils, and coreutils can link to it as an external dependency
>       (like it does for libiconv, libintl, libacl, etc.)?

c) would be great.
The size issue is non-negligible, even if it's just four programs.
Besides, I'd like to keep coreutils out of the
shared-library-creation/installation business.

BTW, your patch looked impeccable.

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils