Re: [coreutils] [PATCH] tests: fix a printf portability issue

2010-09-19 Thread Jim Meyering
Pádraig Brady wrote:
 not all printf commands support \xhh

 diff --git a/tests/misc/sort-debug-keys b/tests/misc/sort-debug-keys
 index 57a52a6..4e8beff 100755
 --- a/tests/misc/sort-debug-keys
 +++ b/tests/misc/sort-debug-keys
 @@ -275,7 +275,7 @@ printf "2.,,3\n2.4\n" | sort -s -k1n --debug
  printf "2,,3\n2.4\n" | sort -s -k1n --debug

  # -z means we convert \0 to \n
 -printf "1a\x002b\x00" | sort -s -n -z --debug
 +printf "1a\0002b\000" | sort -s -n -z --debug

That would accommodate an inferior printf builtin function,
with the implication that all printf uses should do the same.
It may be hard or at least tedious to audit existing uses --
then it'd be good (but more work) to enforce the no-\xHH policy.

Since this is the coreutils, we have the option of a more sustainable
policy: use env to ensure the test runs the printf binary that we've
just built in this package:

# printf '\xHH' is not portable with some built-ins. FIXME: list an offender
# Use env to force use of the one from coreutils.
env printf '1a\x002b\x00' | sort -s -n -z --debug

That also gives printf more test coverage.

Also, please use single quotes in cases like this,
where nothing is intended to be shell-expanded.
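To make the portability point concrete, here is a small sketch (mine, not part of the original patch) showing that octal escapes, which POSIX does specify for printf, produce the intended bytes, and that `env printf` reaches the external coreutils binary, where `\xHH` is supported:

```shell
# Octal escapes (\0ddd) are specified by POSIX; \xHH is an extension
# that some shells' builtin printf (e.g. dash's) lacks.
# Both pipelines should emit the six bytes 31 61 00 32 62 00.
printf '1a\0002b\000' | od -An -tx1
# env bypasses the shell builtin, so the coreutils printf (which does
# support \xHH) is the one that runs.
env printf '1a\x002b\x00' | od -An -tx1
```

With GNU coreutils' printf on PATH, both commands print the same byte dump.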



Re: [coreutils] testsuite failures on Fedora

2010-09-19 Thread Jim Meyering
Pádraig Brady wrote:

 On 17/09/10 11:06, Pádraig Brady wrote:
 On 17/09/10 10:34, Jim Meyering wrote:
 FAIL: misc/stat-mount (exit: 1)
 ===
 ...
 + stat_mnt=/shared/home
 + test /shared = /shared/home
 + fail=1
 ...
 Here, the failure is that I have /home bind-mounted on /shared/home,
 and since I ran the test under /home, the mount location reported by
 df is different than the one reported by stat -c%m.  This is a new
 test, and I'm not sure what to do about the issue (whether stat needs
 fixing to deal with bind mounts, or whether the test is at fault).

 If there's a reliable mechanism to determine
 that there is a potentially interfering bind mount,
 I'd like to skip this test.

 Oops, invalid test.
 I had forgotten I resolved the bind mount in that case also.
 I'll cook up a fix this evening to skip the test if the inodes match
 or something like that.

 Note `stat -c%m` was outputting the bind mount (/shared/home) in this case,
 while df was just outputting the original mount point for the device.

 One could assume the difference is due to bind mounts if
 the dev,inode were the same and pass in that case.
 Alternatively, one could run `stat -c%m` in a loop to get
 the base device mount point. However one could contrive a
 situation (using bind mounts) where df is outputting
 the original mount point for the device, while stat is
 outputting the current different alias.
 So, in summary, it's usually, but not always, valid
 to compare the mount points from `df` and `stat`.
 I'll therefore adjust the test as follows, to just
 check that stat outputs something.

Thanks.
This certainly won't provoke a false positive ;-)
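A sketch of what the relaxed check could look like (hypothetical; `stat -c%m` needs a coreutils stat new enough to know the %m format, and the actual fix is whatever landed in the test):

```shell
# Don't compare against df's idea of the mount point (bind mounts can
# make the two legitimately differ); just require that stat prints
# a non-empty mount point for the current directory.
mnt=$(stat -c%m . 2>/dev/null) || mnt=
if test -n "$mnt"; then
  echo "stat reported mount point: $mnt"
else
  echo "stat -c%m unsupported or empty; the test would be skipped"
fi
```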



Re: [coreutils] [PATCH] tests: fix a printf portability issue

2010-09-19 Thread Pádraig Brady
On 19/09/10 07:50, Jim Meyering wrote:
 Pádraig Brady wrote:
 not all printf commands support \xhh

 diff --git a/tests/misc/sort-debug-keys b/tests/misc/sort-debug-keys
 index 57a52a6..4e8beff 100755
 --- a/tests/misc/sort-debug-keys
 +++ b/tests/misc/sort-debug-keys
 @@ -275,7 +275,7 @@ printf "2.,,3\n2.4\n" | sort -s -k1n --debug
  printf "2,,3\n2.4\n" | sort -s -k1n --debug

  # -z means we convert \0 to \n
 -printf "1a\x002b\x00" | sort -s -n -z --debug
 +printf "1a\0002b\000" | sort -s -n -z --debug
 
 That would accommodate an inferior printf builtin function,
 with the implication that all printf uses should do the same.
 It may be hard or at least tedious to audit existing uses --
 then it'd be good (but more work) to enforce the no-\xHH policy.
 
 Since this is the coreutils, we have the option of a more sustainable
 policy: use env to ensure the test runs the printf binary that we've
 just built in this package:
 
 # printf '\xHH' is not portable with some built-ins. FIXME: list an offender
 # Use env to force use of the one from coreutils.
 env printf '1a\x002b\x00' | sort -s -n -z --debug
 
 That also gives printf more test coverage.
 
 Also, please use single quotes in cases like this,
 where nothing is intended to be shell-expanded.

I'll change as you suggest (and also in the join-i18n test).

Note I left the double quotes (and octal escapes)
because that's what vim highlights.
vim's shell highlighting is woeful anyway,
especially if you don't change the default like
this in your ~/.vimrc:
  let g:is_posix = 1
I should fix up vim rather than tweaking
scripts to accommodate it.

cheers,
Pádraig.



[coreutils] Re: [PATCH] join: support multi-byte character encodings

2010-09-19 Thread Bruno Haible
Hi Pádraig,

 This is my start at applying robust and efficient multi-byte
 processing to coreutils.

Actually, it is the continuation of the discussion and based on the patch
from March 2009
http://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html.

What has changed since then?

Ad 1) Which functions to use for case comparison in coreutils?
I see you replaced the calls to ulc_casecmp / mbmemcasecmp with calls to
ulc_casecoll / mbmemcasecoll. This is correct because POSIX specifies that
the sorting should obey the LC_COLLATE environment variable, i.e. use locale
dependent collation rules.

Ad 2) How should the case comparison in sort -f behave?
I think you need to decide on this before changing 'join', because 'join'
and 'sort' need to behave the same way, otherwise 'join' errs out when
processing results from 'sort'.
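To illustrate why the two must agree (a minimal sketch, with made-up file names): join requires its inputs to be ordered under the same rules sort applied, and reports an order error otherwise:

```shell
# join assumes both inputs are sorted on the join field with the same
# collation rules that sort used; if sort -f folded case but join's
# comparison did not (or vice versa), join would report
# "input is not in sorted order".
d=$(mktemp -d)
printf 'a 1\nb 2\n' > "$d/f1"
printf 'a x\nb y\n' > "$d/f2"
join "$d/f1" "$d/f2"   # prints the merged lines "a 1 x" and "b 2 y"
rm -rf "$d"
```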

Ad 3) Executable size.
This has now been resolved through the creation of libunistring as a shared
library.


About multibyte processing vs. UTF-8 processing:
 I was wondering about unconditionally converting
 input to UTF-8 before processing. I didn't do this yet as:
   Currently we support invalid input chars by processing as in the C locale.

Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
in 2001. It is this requirement which forbids converting the input to a wchar_t
stream, doing processing with wchar_t objects, and producing a stream of wchar_t
objects that are finally converted to multibyte representation again.

It is this requirement which also forbids converting the input to UTF-8, doing
processing with Unicode characters, and converting the Unicode character stream
to multibyte representation at the end. This approach is acceptable for a word
processor that can refuse to open a file, or for more general applications.
But for coreutils, where the classical behaviour is to process files
encoded in UTF-8, EUC-JP, or ISO-8859-2 reasonably even in the C locale,
this approach is not viable.

For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', 'mbfile',
which provide a multibyte character datatype that accommodates also invalid
byte sequences.
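That requirement can be seen from the shell (a sketch, using sort merely as a representative filter): a byte that is invalid in UTF-8 must still pass through untouched, just as it does in the C locale:

```shell
# 0xFF never occurs in valid UTF-8, yet a coreutils filter must pass
# it through byte-for-byte instead of rejecting or replacing it.
# Expected byte dump: 61 ff 62 0a
printf 'a\377b\n' | LC_ALL=C sort | od -An -tx1
```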

Emacs handles this requirement by extending UTF-8. But this approach is unique
to Emacs: libunistring and other software support plain UTF-8, not extended
UTF-8.

 wanted to transform the input as little as possible.

Yes, this is a wise guideline.

   I was unsure how to generally correlate portions of a UTF-8 string with its
   corresponding source string

The u8_conv_from_encoding function
http://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html
stores offsets in an array that give you a mapping from the source string
pointers to the UTF-8 string pointers.

 I may revisit the idea of processing using UTF8 internally if we support
 normalization internally to the utils.

I would advise against normalization that forgets the original string. Of
course, you can associate the original string with the UTF-8 equivalent, if
that exists, and keep both in memory; that becomes merely a speed vs. memory
trade-off.

 I do try to special case UTF-8 where beneficial

I think this should be done in an encoding agnostic way. Don't test whether
the locale encoding is UTF-8. This is bad, because
  - The GB18030 encoding is just as powerful as UTF-8; it's just a different
encoding.
  - The same handling of the Turkish i should be done in ISO-8859-9 locales
as in UTF-8 locales. The same handling of the Greek sigma should be done
in ISO-8859-7 locales as in UTF-8 locales. Users would be very surprised
if moving from unibyte encodings to UTF-8 changed the semantics of
processing their text.

 One consequence of linking with libunistring ... printf ...
 incurs a 16% startup overhead

We discussed this in the thread at
http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg0.html.

 On a related note, I noticed that bootstrap doesn't update ./configure correctly

See http://lists.gnu.org/archive/html/bug-gnulib/2010-09/msg00090.html.


Now, going through your patch.

- Function xfields. I agree with the use of mbiter for the multibyte case and of
  'char' variables for the unibyte case. But the algorithm is duplicated, once
  for MB_CUR_MAX > 1, once for MB_CUR_MAX == 1. In 2001, Paul and Jim said that
  they wanted
- correct handling of multibyte locales, even with invalid input,
- no speed decrease for unibyte locales,
- no code duplication,
- maintainable code.
  At that time I wrote macros which expand to uses of 'mbiter' or plain 'char'
  variables, depending on macros. See attached diff. At that time, Jim did not
  like the extra file. Two months ago, he told me an extra file would be
  acceptable if we really find no better solution.

- Function xfields. The FIXME: "... when splitting on non blanks, surrounding
  chars might be needed for context." I would say in this case, context should
  be ignored. If someone specifies to