Hi Pádraig,
> This is my start at applying robust and efficient multi-byte
> processing to coreutils.
Actually, it is a continuation of the discussion, and is based on the patch,
from March 2009:
http://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html
What has changed since then?
Ad 1) Which functions to use for case comparison in coreutils?
I see you replaced the calls to ulc_casecmp / mbmemcasecmp with calls to
ulc_casecoll / mbmemcasecoll. This is correct because POSIX specifies that
the sorting should obey the LC_COLLATE environment variable, i.e. use
locale-dependent collation rules.
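For reference, roughly the kind of call I mean, written from memory of the
unicase.h API (untested, and the helper name is made up):

#include <string.h>
#include <unicase.h>

/* Compare two keys ignoring case, using the collation rules of the
   current locale (LC_COLLATE), as 'sort -f' wants.  Returns <0, 0, >0
   like strcoll.  */
static int
keys_casecoll (const char *a, const char *b)
{
  int cmp;
  if (ulc_casecoll (a, strlen (a), b, strlen (b),
                    uc_locale_language (), NULL, &cmp) < 0)
    return strcmp (a, b);    /* conversion failed; fall back to bytes */
  return cmp;
}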
Ad 2) How should the case comparison in sort -f behave?
I think you need to decide on this before changing 'join', because 'join'
and 'sort' need to behave the same way, otherwise 'join' errs out when
processing results from 'sort'.
Ad 3) Executable size.
This has now been resolved through the creation of libunistring as a shared
library.
About multibyte processing vs. UTF-8 processing:
> I was wondering about unconditionally converting
> input to UTF-8 before processing. I didn't do this yet as:
> Currently we support invalid input chars by processing as in the C locale.
Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
in 2001. It is this requirement which forbids converting the input to a wchar_t
stream, doing processing with wchar_t objects, and producing a stream of wchar_t
objects that are finally converted to multibyte representation again.
It is this requirement which also forbids converting the input to UTF-8, doing
processing with Unicode characters, and converting the Unicode character stream
to multibyte representation at the end. This approach is acceptable for a word
processor, which can refuse to open a file, or for more general applications.
But for coreutils, where the classical behaviour is to get reasonable
processing, in the C locale, of files encoded in UTF-8, EUC-JP, or ISO-8859-2,
this approach is not an option.
For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile',
which provide a multibyte character datatype that also accommodates invalid
byte sequences.
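To give an idea of the style, a small, untested sketch, as far as I remember
the mbiter interface: one loop handles valid multibyte sequences and stray
bytes alike, so invalid input never has to be rejected.

#include <stddef.h>
#include "mbiter.h"

static size_t
count_blanks (const char *buf, size_t len)
{
  size_t blanks = 0;
  mbi_iterator_t iter;

  for (mbi_init (iter, buf, len); mbi_avail (iter); mbi_advance (iter))
    {
      /* mbi_cur (iter) is an mbchar_t; an invalid byte shows up as a
         one-byte character with wc_valid == false, and mb_isblank ()
         is simply false for it.  */
      if (mb_isblank (mbi_cur (iter)))
        blanks++;
    }
  return blanks;
}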
Emacs handles this requirement by extending UTF-8. But this approach is unique
to Emacs: libunistring and other software support plain UTF-8, not extended
UTF-8.
> I wanted to transform the input as little as possible.
Yes, this is a wise guideline.
> I was unsure how to generally correlate portions of a UTF-8 string with its
> corresponding source string.
The u8_conv_from_encoding function
(http://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html)
can store offsets in an array that gives you a mapping from positions in the
source string to positions in the UTF-8 string.
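Roughly like this (untested; please check uniconv.h for the exact contract of
the 'offsets' argument):

#include <stdlib.h>
#include <uniconv.h>
#include <localcharset.h>

/* Convert SRC (SRCLEN bytes in the locale encoding) to UTF-8 and also
   return, for each source byte offset, the corresponding offset in the
   result.  */
static uint8_t *
to_utf8_with_offsets (const char *src, size_t srclen,
                      size_t **offsets_out, size_t *u8len)
{
  size_t *offsets = malloc (srclen * sizeof *offsets);
  uint8_t *u8;

  if (offsets == NULL)
    return NULL;
  u8 = u8_conv_from_encoding (locale_charset (), iconveh_question_mark,
                              src, srclen, offsets, NULL, u8len);
  if (u8 == NULL)
    {
      free (offsets);
      return NULL;
    }
  /* offsets[i] is the offset into U8 of the character that starts at
     src[i]; positions inside a multibyte sequence are marked with
     (size_t) -1.  */
  *offsets_out = offsets;
  return u8;
}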
> I may revisit the idea of processing using UTF-8 internally if we support
> normalization internally to the utils.
I would advise against normalization that forgets the original string. Of
course, you can associate the original string with the UTF-8 equivalent, if
that exists, and keep both in memory; that becomes merely a speed vs. memory
trade-off.
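The sort of pairing I mean, as a sketch only (untested; struct and field names
are invented, and the conversion uses iconveh_error so that lines with invalid
bytes simply keep no UTF-8 key):

#include <stdlib.h>
#include <uniconv.h>
#include <uninorm.h>
#include <localcharset.h>

struct keyed_line
{
  const char *orig;       /* the bytes as read, never modified */
  size_t orig_len;
  uint8_t *u8_nfc;        /* NFC-normalized UTF-8 key, or NULL */
  size_t u8_nfc_len;
};

static void
attach_nfc_key (struct keyed_line *line)
{
  size_t u8len;
  uint8_t *u8;

  line->u8_nfc = NULL;
  line->u8_nfc_len = 0;
  u8 = u8_conv_from_encoding (locale_charset (), iconveh_error,
                              line->orig, line->orig_len,
                              NULL, NULL, &u8len);
  if (u8 == NULL)
    return;                  /* invalid bytes: byte-wise handling only */
  line->u8_nfc = u8_normalize (UNINORM_NFC, u8, u8len,
                               NULL, &line->u8_nfc_len);
  free (u8);
}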
> I do try to special case UTF-8 where beneficial.
I think this should be done in an encoding-agnostic way. Don't test whether
the locale encoding is UTF-8. This is bad, because
- The GB18030 encoding is just as powerful as UTF-8; it's just a different
encoding.
- The same handling of the Turkish i should be done in ISO-8859-9 locales
as in UTF-8 locales. The same handling of the Greek sigma should be done
in ISO-8859-7 locales as in UTF-8 locales. Users would be very surprised
if moving from unibyte encodings to UTF-8 changed the semantics of
processing their text.
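Concretely, something like this never needs to look at the encoding name at
all; the language, not the encoding, selects the Turkish-i and final-sigma
rules (again only a sketch from memory, with an invented helper name):

#include <stdbool.h>
#include <string.h>
#include <unicase.h>

static bool
keys_caseequal (const char *a, const char *b)
{
  int cmp;
  /* The ulc_* functions convert from whatever the locale encoding is
     (UTF-8, GB18030, ISO-8859-9, ...) internally; uc_locale_language ()
     supplies "tr", "el", etc.  */
  if (ulc_casecmp (a, strlen (a), b, strlen (b),
                   uc_locale_language (), NULL, &cmp) < 0)
    return false;
  return cmp == 0;
}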
> One consequence of linking with libunistring ... printf ...
> incurs a 16% startup overhead
We discussed this in the thread at
http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg0.html.
> On a related note, I noticed that bootstrap doesn't update ./configure
> correctly
See http://lists.gnu.org/archive/html/bug-gnulib/2010-09/msg00090.html.
Now, going through your patch.
- Function xfields. I agree with the use of mbiter for the multibyte case and of
'char' variables for the unibyte case. But the algorithm is duplicated, once
for MB_CUR_MAX > 1, once for MB_CUR_MAX == 1. In 2001, Paul and Jim said that
they wanted
- correct handling of multibyte locales, even with invalid input,
- no speed decrease for unibyte locales,
- no code duplication,
- maintainable code.
At that time I wrote macros which expand either to uses of 'mbiter' or to plain
'char' variables, depending on the locale case; see the attached diff and the
rough sketch after this item. At that time, Jim did not like the extra file.
Two months ago, he told me an extra file would be acceptable if we really find
no better solution.
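Something in this spirit, as a rough sketch typed from memory (this is not the
attached diff, and the SCAN_* names are invented here):

#include <stdbool.h>
#include <stddef.h>

#if HANDLE_MULTIBYTE
# include "mbiter.h"
# define SCAN_DECL(it)        mbi_iterator_t it
# define SCAN_INIT(it, s, n)  mbi_init (it, s, n)
# define SCAN_AVAIL(it)       mbi_avail (it)
# define SCAN_ADVANCE(it)     mbi_advance (it)
# define SCAN_ISBLANK(it)     mb_isblank (mbi_cur (it))
#else
# include <ctype.h>
# define SCAN_DECL(it)        struct { const char *p, *lim; } it
# define SCAN_INIT(it, s, n)  ((it).p = (s), (it).lim = (s) + (n))
# define SCAN_AVAIL(it)       ((it).p < (it).lim)
# define SCAN_ADVANCE(it)     ((it).p++)
# define SCAN_ISBLANK(it)     isblank ((unsigned char) *(it).p)
#endif

/* Count blank-separated fields in LINE of length LEN; the algorithm is
   written only once, and expands to mbiter or plain 'char' handling.  */
static size_t
count_fields (const char *line, size_t len)
{
  size_t fields = 0;
  bool in_field = false;
  SCAN_DECL (it);

  for (SCAN_INIT (it, line, len); SCAN_AVAIL (it); SCAN_ADVANCE (it))
    if (SCAN_ISBLANK (it))
      in_field = false;
    else if (!in_field)
      {
        in_field = true;
        fields++;
      }
  return fields;
}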
- Function xfields. The FIXME: "... when splitting on non blanks, surrounding
chars might be needed for context." I would say that in this case, context
should be ignored. If someone specifies to