Re: [coreutils] [PATCH] sort: tune and refactor --debug code, and fix minor underlining bug

2010-08-11 Thread Pádraig Brady
On 12/08/10 00:57, Paul Eggert wrote:
> On 08/12/10 01:06, Pádraig Brady wrote:
>> The disadvantage of separating the debugging code from the
>> actual sorting code is that one now has to maintain the
>> extent matching in 2 places, which means we're less sure that
>> the debug output matches what's actually being done.
> 
> True, but there is a performance advantage of hoisting the debugging
> code out of the inner core of comparison.  And if my next few ideas
> work out this advantage will become much greater.

Fair enough.

> Furthermore, separating the debugging code from the actual sorting
> code made the code simpler (at least by the trivial measure of lines
> of code).  This was a bit counterintuitive for me, but it appears that
> the fiddly bits of code to support debugging in the integrated
> version were in some sense more complicated than simply cloning the
> relevant parts of the code and doing it twice, once for debugging
> and once for comparing.

I agree that the separate --debug code isn't too complicated.
Hopefully we'll be able to keep everything in sync.

>> Also doing within compare(), would have allowed to
>> show extents actually used in the comparison of arbitrary lines,
>> which could have been enabled with --debug=verbose in future,
>> or temporarily to aid development.
> 
> That would be a better justification for putting it into the
> comparison code.  (The previously-existing code, which enabled
> debugging only to compare a line against itself, was pretty weird.)
> 
> Perhaps this could be done without adversely affecting performance,
> by using inlining to specialize the comparison code for the case
> where debugging is disabled.

Let's cross that bridge when we come to it.

>> I also thought about outputting the total number of comparisons
>> it debug mode (though that should be easy to add in in any case I think).
> 
> Yes, I think so.
> 
>>> -  /* Disable this combination so that users are less likely
>>> - to inadvertantly update a file with debugging enabled.
>>> - Also it simplifies the code for handling temp files.  */
>>> -  if (debug && outfile)
>>> -error (SORT_FAILURE, 0, _("options -o and --debug are incompatible"));
>>> -
>>>if (debug)
>>>  {
>>> +  if (checkonly || outfile)
>>> +{
>>> +  static char opts[] = "X --debug";
>>> +  opts[0] = (checkonly ? checkonly : 'o');
>>> +  incompatible_options (opts);
>>> +}
>>> +
>>
>> I wouldn't have removed the comment, as Eric was wondering
>> why -o and --debug were incompatible.
> 
> As was I.  I think the comment was wrong in both counts.  I don't see
> how it simplified the code for handling temp files.

I think I thought it would be hard
to distinguish temp fds from the -o fd,
but didn't look too hard TBH.

> And I don't see
> why it's helpful to nanny the user.  I preserved the incompatibility
> betwee -o and --debug only because I didn't want to change too many
> things in one edit.  I'd favor removing that incompatibility.

Well I can't see a need for mixing the 2 options,
and one could seriously nobble a file in doing so.
I nearly did when testing.

>> Also I'm wondering now why --check and --debug are incompatible?
> 
> Because --check outputs to stderr, and --debug outputs to stdout.
> Before the change, --debug silently overrode --check in this respect,
> which was undocumented and confusing (POSIX says that the check option
> "shall not" output to stdout).  I thought it better to be honest and
> say that the two options don't work together.  I expect that the very
> few people who use --check, and who need to debug, can debug without
> --check and then use --check once they understand the problem and do
> not need to debug any more.

A summary of that would be a good comment :)

cheers,
Pádraig.



Re: [coreutils] [PATCH] sort: tune and refactor --debug code, and fix minor underlining bug

2010-08-11 Thread Paul Eggert
On 08/12/10 01:06, Pádraig Brady wrote:
> The disadvantage of separating the debugging code from the
> actual sorting code is that one now has to maintain the
> extent matching in 2 places, which means we're less sure that
> the debug output matches what's actually being done.

True, but there is a performance advantage of hoisting the debugging
code out of the inner core of comparison.  And if my next few ideas
work out this advantage will become much greater.

Furthermore, separating the debugging code from the actual sorting
code made the code simpler (at least by the trivial measure of lines
of code).  This was a bit counterintuitive for me, but it appears that
the fiddly bits of code to support debugging in the integrated
version were in some sense more complicated than simply cloning the
relevant parts of the code and doing it twice, once for debugging
and once for comparing.

> Also doing within compare(), would have allowed to
> show extents actually used in the comparison of arbitrary lines,
> which could have been enabled with --debug=verbose in future,
> or temporarily to aid development.

That would be a better justification for putting it into the
comparison code.  (The previously-existing code, which enabled
debugging only to compare a line against itself, was pretty weird.)

Perhaps this could be done without adversely affecting performance,
by using inlining to specialize the comparison code for the case
where debugging is disabled.

> I also thought about outputting the total number of comparisons
> it debug mode (though that should be easy to add in in any case I think).

Yes, I think so.

>> -  /* Disable this combination so that users are less likely
>> - to inadvertantly update a file with debugging enabled.
>> - Also it simplifies the code for handling temp files.  */
>> -  if (debug && outfile)
>> -error (SORT_FAILURE, 0, _("options -o and --debug are incompatible"));
>> -
>>if (debug)
>>  {
>> +  if (checkonly || outfile)
>> +{
>> +  static char opts[] = "X --debug";
>> +  opts[0] = (checkonly ? checkonly : 'o');
>> +  incompatible_options (opts);
>> +}
>> +
> 
> I wouldn't have removed the comment, as Eric was wondering
> why -o and --debug were incompatible.

As was I.  I think the comment was wrong in both counts.  I don't see
how it simplified the code for handling temp files.  And I don't see
why it's helpful to nanny the user.  I preserved the incompatibility
betwee -o and --debug only because I didn't want to change too many
things in one edit.  I'd favor removing that incompatibility.

> Also I'm wondering now why --check and --debug are incompatible?

Because --check outputs to stderr, and --debug outputs to stdout.
Before the change, --debug silently overrode --check in this respect,
which was undocumented and confusing (POSIX says that the check option
"shall not" output to stdout).  I thought it better to be honest and
say that the two options don't work together.  I expect that the very
few people who use --check, and who need to debug, can debug without
--check and then use --check once they understand the problem and do
not need to debug any more.



Re: [coreutils] [PATCH] sort: -R now uses less memory on long lines with internal NULs

2010-08-11 Thread Paul Eggert
On 08/12/10 00:49, Pádraig Brady wrote:

> Is it uint32_t for alignment (speed)?

>From sort's point of view, it's uint32_t because that's what
the md5 library specifies.  I haven't looked into md5 and don't
know if it could be sped up by assuming 64-bit integers.

> Would it be worth doing a memcmp() first and only doing
> the rest if the bytes differ. Depends on the data I know,
> but given caching the up front memcmp may be worth it?

I had the same idea, but have not bothered to investigate it.
You're right that it depends on the data.  For actual uses
of -R I suspect that the memcmp won't help performance much,
as the items typically won't contain duplicates.

> Also I wonder would caching the xfrm() data in the line buffers
> be worth it, as the number of comparisons increases?

Yes, absolutely, and that's been on my list of things to try
for a couple of months now.  I want to try something else first,
though, that will avoid the need for strxfrm entirely during
almost all calls to compare().  If this works, the benefit
of caching will be so little that I expect it's not worth
the bother.



Re: [coreutils] [PATCH] sort: tune and refactor --debug code, and fix minor underlining bug

2010-08-11 Thread Pádraig Brady
On 05/08/10 21:29, Paul Eggert wrote:
> Formerly, the 'compare' function and some of its subroutines had a
> debugging flag, which caused them to output underlines.  This
> change refactors the code so that debugging output is
> more-separated from the actual sorting.  In the process, the
> change fixes a minor error in the debugging output.  The change
> shortens the source code and executable size a tad, and improves
> CPU performance by 2.4% on my platform with a simple benchmark (C
> locale, line sorting, no debug).

The disadvantage of separating the debugging code from the
actual sorting code is that one now has to maintain the
extent matching in 2 places, which means we're less sure that
the debug output matches what's actually being done.

Also doing within compare(), would have allowed to
show extents actually used in the comparison of arbitrary lines,
which could have been enabled with --debug=verbose in future,
or temporarily to aid development.

I also thought about outputting the total number of comparisons
it debug mode (though that should be easy to add in in any case I think).

> -  /* Disable this combination so that users are less likely
> - to inadvertantly update a file with debugging enabled.
> - Also it simplifies the code for handling temp files.  */
> -  if (debug && outfile)
> -error (SORT_FAILURE, 0, _("options -o and --debug are incompatible"));
> -
>if (debug)
>  {
> +  if (checkonly || outfile)
> +{
> +  static char opts[] = "X --debug";
> +  opts[0] = (checkonly ? checkonly : 'o');
> +  incompatible_options (opts);
> +}
> +

I wouldn't have removed the comment, as Eric was wondering
why -o and --debug were incompatible.
Also I'm wondering now why --check and --debug are incompatible?

cheers,
Pádraig.



Re: [coreutils] [PATCH] sort: -R now uses less memory on long lines with internal NULs

2010-08-11 Thread Pádraig Brady
On 05/08/10 00:12, Paul Eggert wrote:

> @@ -2038,41 +2027,117 @@ static int
>  compare_random (char *restrict texta, size_t lena,
>  char *restrict textb, size_t lenb)
>  {
> -  int diff;
> +  uint32_t dig[2][MD5_DIGEST_SIZE / sizeof (uint32_t)];

I know this is just moved code but it confused me.
Is it uint32_t for alignment (speed)?
Would the following be better on 64 bit while
being immune to hash size?

diff --git a/src/sort.c b/src/sort.c
index a8d0c14..ac913b6 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -2026,7 +2026,8 @@ compare_random (char *restrict texta, size_t lena,
   char *buf = stackbuf;
   size_t bufsize = sizeof stackbuf;
   void *allocated = NULL;
-  uint32_t dig[2][MD5_DIGEST_SIZE / sizeof (uint32_t)];
+  /* align the digests on a size_t boundary for speed.  */
+  size_t dig[2][(MD5_DIGEST_SIZE + sizeof (size_t) - 1) / sizeof (size_t)];
   struct md5_ctx s[2];
   s[0] = s[1] = random_md5_state;

@@ -2121,7 +2122,7 @@ compare_random (char *restrict texta, size_t lena,
   /* Compute and compare the checksums.  */
   md5_process_bytes (texta, lena, &s[0]); md5_finish_ctx (&s[0], dig[0]);
   md5_process_bytes (textb, lenb, &s[1]); md5_finish_ctx (&s[1], dig[1]);
-  int diff = memcmp (dig[0], dig[1], sizeof dig[0]);
+  int diff = memcmp (dig[0], dig[1], MD5_DIGEST_SIZE);

   /* Fall back on the tiebreaker if the checksums collide.  */
   if (! diff)

> +  struct md5_ctx s[2];
> +  s[0] = s[1] = random_md5_state;
>  
> -  if (! hard_LC_COLLATE)
> -diff = cmp_hashes (texta, lena, textb, lenb);
> -  else
> +  if (hard_LC_COLLATE)
>  {
> -  /* Transform the text into the basis of comparison, so that byte
> - strings that would otherwise considered to be equal are
> - considered equal here even if their bytes differ.  */
> -
> -  char *buf = NULL;
> -  char stackbuf[4000];
> -  size_t tlena = xmemxfrm (stackbuf, sizeof stackbuf, texta, lena);

Just off hand ideas...
There are a lot of expensive operations done here.
Would it be worth doing a memcmp() first and only doing
the rest if the bytes differ. Depends on the data I know,
but given caching the up front memcmp may be worth it?
Also I wonder would caching the xfrm() data in the line buffers
be worth it, as the number of comparisons increases?

cheers,
Pádraig.



bug#6844: AIX 6.1: compile errors

2010-08-11 Thread Gottfried Mueller
Hello,

I tried to compile the coreutils the first time under the AIX operating 
system and I've go the the following errors:


  CC   printf-frexp.o
  CC   printf-frexpl.o
  CC   priv-set.o
"priv-set.h", line 22.11: 1506-296 (S) #include file  not found.
"priv-set.h", line 30.27: 1506-045 (S) Undeclared identifier 
PRIV_SYS_LINKDIR.
"priv-set.h", line 35.28: 1506-045 (S) Undeclared identifier 
PRIV_SYS_LINKDIR.
"priv-set.c", line 27.11: 1506-296 (S) #include file  not found.
"priv-set.c", line 30.8: 1506-166 (S) Definition of function priv_set_t 
requires
 parentheses.
"priv-set.c", line 30.19: 1506-276 (S) Syntax error: possible missing '{'?
"priv-set.c", line 40.9: 1506-045 (S) Undeclared identifier initialized.
"priv-set.c", line 42.7: 1506-045 (S) Undeclared identifier eff_set.
"priv-set.c", line 47.7: 1506-045 (S) Undeclared identifier rem_set.
"priv-set.c", line 53.21: 1506-045 (S) Undeclared identifier 
PRIV_EFFECTIVE.
"priv-set.c", line 73.9: 1506-045 (S) Undeclared identifier initialized.
"priv-set.c", line 76.25: 1506-045 (S) Undeclared identifier eff_set.
"priv-set.c", line 86.9: 1506-045 (S) Undeclared identifier initialized.
"priv-set.c", line 89.22: 1506-045 (S) Undeclared identifier eff_set.
"priv-set.c", line 94.21: 1506-045 (S) Undeclared identifier PRIV_SET.
"priv-set.c", line 94.31: 1506-045 (S) Undeclared identifier 
PRIV_EFFECTIVE.
"priv-set.c", line 99.20: 1506-045 (S) Undeclared identifier rem_set.
"priv-set.c", line 119.9: 1506-045 (S) Undeclared identifier initialized.
"priv-set.c", line 122.22: 1506-045 (S) Undeclared identifier rem_set.
"priv-set.c", line 126.20: 1506-045 (S) Undeclared identifier eff_set.
"priv-set.c", line 127.21: 1506-045 (S) Undeclared identifier PRIV_SET.
"priv-set.c", line 127.31: 1506-045 (S) Undeclared identifier 
PRIV_EFFECTIVE.
make: The error code from the last command is 1.


It seems this problem is comparably with the error compiling the gnu tar 
(1.23) command using AIX.

http://lists.gnu.org/archive/html/bug-tar/2010-06/msg00019.html



Gottfried Müller
Magirus International GmbH


Re: [coreutils] A more forgiving rm.

2010-08-11 Thread Jason
I just rsync everything once an hour to another machine with a bigger hard
drive, and then do an rsync --delete when the drive gets full. Version
control can be helpful too. What happens if you blacklist a file thru a
symlink, or rm a file thru a symlink? I didn't see anything to deal with
that. The chances of having added the file that you didn't want to delete to
this list is quite small.

Jason

On Mon, Aug 9, 2010 at 11:35 AM, Dan Hipschman  wrote:

> On Mon, Aug 09, 2010 at 02:04:38PM +0100, P?draig Brady wrote:
> > I'm not sure an option like this would help much.
> > It might give a false sense of security more than anything.
>
> On the contrary, I have been in precisely the situation where it would
> help, which is why I wrote this patch. ;-)
>
> I'm not sure what false sense of security this would give.  The option
> clearly does not prevent removal of files system-wide.  It prevents
> removal of those files in the blacklist during a particular invocation
> of `rm'.  What's confusing about that?  It's not a replacement for
> backups; it's purely there, in my mind, as a convenient safeguard
> against accidental removal of files.  After all, do you really want to
> hunt down that file and restore it?  An option that saves you anywhere
> from 10 minutes to hours figuring out what you deleted and restoring
> it seems like a big win.
>
> > Between rm -I, immutable flags, ACLS and backups
> > one is suitably covered.
>
> These all have problems that I explained in the original email and
> above.
>
> I'd like to mention with respect to the usefulness of the patch that
> the perl wrapper, safe-rm, seems popular enough to prove its
> usefulness.  It's been in Ubuntu's universe repository as far back as
> Intrepid.  Integrating the functionality with coreutils allows us to
> improve on it in a number of ways.
>
> Thanks for your consideration,
> Dan
>
>


Re: [coreutils] ignore-value.h considered harmful

2010-08-11 Thread Eric Blake
On 08/10/2010 09:56 PM, Paul Eggert wrote:
> Re that recent change to sort.c to insert a call to ignore_value.
> 
> This ignore_value business is ugly, and
> runs against the spirit of the GNU coding standards:
> 
>   "Don't make the program ugly to placate lint.
>Please don't insert any casts to void."
> 
>   
> 
> How about if we suppress those GCC warnings instead, using
> -Wno-unused-result or whatever-it-is?  The cost of this
> particular warning seems to be greater than its benefits.

I've seen several benefits, such as catching missed error checking on
system calls.  I'd rather keep -Wunused-result active, for this reason.
 But since it is debatable which system library headers should be thus
marked, I see no option but to have a way to conveniently ignore the
warning on the few places where ignoring the result is intentional.
Ultimately, ignore_value(x) looks a lot better than (void)x, so we're
still complying with the letter of the "no casts to void" if not the spirit.

-- 
Eric Blake   ebl...@redhat.com+1-801-349-2682
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [coreutils] [PATCH] tests: silence 'make syntax-check'

2010-08-11 Thread Eric Blake
On 08/11/2010 03:55 AM, Pádraig Brady wrote:
>> Committing as obvious.  The set_program_name hack is borrowed
>> liberally from test-ino-map.c.
>>
> 
> It would be more consistent to update .x-sc_program_name
> In fact we can do better and exclude all programs in this directory:
> 
> 
> commit 41fe1906319d008af5c5f1d8fbeeadf45cde5333
> Author: Pádraig Brady 
> Date:   Wed Aug 11 10:49:22 2010 +0100
> 
> maint: exclude all tests from the set_program_name syntax-check
> 
> * .x-sc_program_name: Exclude all current and future
> c files in gl/tests from this check
> * gl/tests/test-di-set.c: Remove the hack to work around
> the set_program_name syntax-check
> * gl/tests/test-ino-map.c: Likewise
> * gl/tests/test-rand-isaac.c: Likewise

Looks good to me ;)

-- 
Eric Blake   ebl...@redhat.com+1-801-349-2682
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [coreutils] [PATCH] tests: silence 'make syntax-check'

2010-08-11 Thread Pádraig Brady
On 09/08/10 21:59, Eric Blake wrote:
> * gl/tests/test-rand-isaac.c (main): Avoid warnings from
> syntax-check.
> ---
> 
> Committing as obvious.  The set_program_name hack is borrowed
> liberally from test-ino-map.c.
> 
>  gl/tests/test-rand-isaac.c |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/gl/tests/test-rand-isaac.c b/gl/tests/test-rand-isaac.c
> index 330b18f..c1ad01a 100644
> --- a/gl/tests/test-rand-isaac.c
> +++ b/gl/tests/test-rand-isaac.c
> @@ -576,6 +576,7 @@ static isaac_word const expected[2][ISAAC_WORDS] =
>  int
>  main (int argc, char **argv)
>  {
> +  /* set_program_name (argv[0]); placate overzealous "syntax-check" test.  */
>unsigned int i;
>isaac_word r[ISAAC_WORDS];
>int iterations;

It would be more consistent to update .x-sc_program_name
In fact we can do better and exclude all programs in this directory:


commit 41fe1906319d008af5c5f1d8fbeeadf45cde5333
Author: Pádraig Brady 
Date:   Wed Aug 11 10:49:22 2010 +0100

maint: exclude all tests from the set_program_name syntax-check

* .x-sc_program_name: Exclude all current and future
c files in gl/tests from this check
* gl/tests/test-di-set.c: Remove the hack to work around
the set_program_name syntax-check
* gl/tests/test-ino-map.c: Likewise
* gl/tests/test-rand-isaac.c: Likewise

diff --git a/.x-sc_program_name b/.x-sc_program_name
index 0667044..86cc5c1 100644
--- a/.x-sc_program_name
+++ b/.x-sc_program_name
@@ -1,4 +1,3 @@
 gl/lib/randint.c
 lib/euidaccess-stat.c
-gl/tests/test-mbsalign.c
-gl/tests/test-fadvise.c
+gl/tests/.*\.c
diff --git a/gl/tests/test-di-set.c b/gl/tests/test-di-set.c
index e5fb6cb..7f02e66 100644
--- a/gl/tests/test-di-set.c
+++ b/gl/tests/test-di-set.c
@@ -39,7 +39,6 @@
 int
 main (void)
 {
-  /* set_program_name (argv[0]); placate overzealous "syntax-check" test.  */
   struct di_set *dis = di_set_alloc ();
   ASSERT (dis);

diff --git a/gl/tests/test-ino-map.c b/gl/tests/test-ino-map.c
index 2b44602..aa7334a 100644
--- a/gl/tests/test-ino-map.c
+++ b/gl/tests/test-ino-map.c
@@ -39,7 +39,6 @@
 int
 main ()
 {
-  /* set_program_name (argv[0]); placate overzealous "syntax-check" test.  */
   enum { INO_MAP_INIT = 123 };
   struct ino_map *ino_map = ino_map_alloc (INO_MAP_INIT);
   ASSERT (ino_map != NULL);
diff --git a/gl/tests/test-rand-isaac.c b/gl/tests/test-rand-isaac.c
index c1ad01a..03b004c 100644
--- a/gl/tests/test-rand-isaac.c
+++ b/gl/tests/test-rand-isaac.c
@@ -576,7 +576,6 @@ static isaac_word const expected[2][ISAAC_WORDS] =
 int
 main (int argc, char **argv)
 {
-  /* set_program_name (argv[0]); placate overzealous "syntax-check" test.  */
   unsigned int i;
   isaac_word r[ISAAC_WORDS];
   int iterations;

cheers,
Pádraig.