Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Bauke Jan Douma

Dan Hipschman wrote on 22-01-07 05:55:


sort can now compresses


Small typo here.


bjd





___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Dan Hipschman

I have a good feeling about this one :-)  I think I've addressed
everything except adding libz.  I just don't think I have the time to do
that right now, and I think it can wait.  I can look into writing some
test cases in a bit.

2007-01-21  Jim Meyering  <[EMAIL PROTECTED]>

* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
In pipe_fork callers, use these named constants, not "2" and "8".
(proctab, nprocs): Declare to be "static".
(pipe_fork) [lint]: Initialize local, pid,
to avoid unwarranted may-be-used-uninitialized warning.
(create_temp): Use the active voice.  Describe parameters, too.

2007-01-21  James Youngman  <[EMAIL PROTECTED]>

Centralize all the uses of sigprocmask().  Don't restore an invalid
saved mask.
* src/sort.c (enter_cs, leave_cs): New functions for protecting
code sequences against signal delivery.
* (exit_cleanup): Use enter_cs and leave_cs instead of
calling sigprocmask directly.
(create_temp_file, pipe_fork, zaptemp): Likewise

2007-01-21  Dan Hipschman  <[EMAIL PROTECTED]>

Add compression of temp files to sort.
* NEWS: Mention this.
* bootstrap.conf: Import findprog.
* configure.ac: Add AC_FUNC_FORK.
* doc/coreutils.texi: Document GNUSORT_COMPRESSOR environment
variable.
* src/sort.c (compress_program): New global, holds the name of the
external compression program.
(struct sortfile): New type used by mergepfs and friends instead
of filenames to hold PIDs of compressor processes.
(proctab): New global, holds compressor PIDs on which to wait.
(enum procstate, struct procnode): New types used by proctab.
(proctab_hasher, proctab_comparator): New functions for proctab.
(nprocs): New global, number of forked but unreaped children.
(reap, reap_some): New function, wait for/cleanup forked processes.
(register_proc, update_proc, wait_proc): New functions for adding,
modifying and removing proctab entries.
(create_temp_file): Change parameter type to pointer to file
descriptor, and return type to pointer to struct tempnode.
(dup2_or_die): New function used in create_temp and open_temp.
(pipe_fork): New function, creates a pipe and child process.
(create_temp): Creates a temp file and possibly a compression
program to which we filter output.
(open_temp): Opens a compressed temp file and creates a
decompression process through which to filter the input.
(mergefps): Change FILES parameter type to struct sortfile array
and update access accordingly.  Use open_temp and reap_some.
(avoid_trashing_input, merge): Change FILES parameter like
mergefps and call create_temp instead of create_temp_file.
(sort): Call create_temp instead of create_temp_file.
Use reap_some.
(avoid_trashing_input, merge, sort, main): Adapt to mergefps.

Index: NEWS
===
RCS file: /sources/coreutils/coreutils/NEWS,v
retrieving revision 1.467
diff -p -u -r1.467 NEWS
--- NEWS18 Jan 2007 09:18:21 -  1.467
+++ NEWS22 Jan 2007 03:44:29 -
@@ -29,6 +29,11 @@ GNU coreutils NEWS  
 
   "rm --interactive=never F" no longer prompts for an unwritable F
 
+** New features
+
+  sort can now compresses temporary files to improve performance of
+  very large sorts.
+
 
 * Noteworthy changes in release 6.7 (2006-12-08) [stable]
 
Index: bootstrap.conf
===
RCS file: /sources/coreutils/coreutils/bootstrap.conf,v
retrieving revision 1.15
diff -p -u -r1.15 bootstrap.conf
--- bootstrap.conf  15 Jan 2007 10:33:52 -  1.15
+++ bootstrap.conf  22 Jan 2007 03:44:29 -
@@ -43,7 +43,8 @@ gnulib_modules="
config-h configmake
closeout cycle-check d-ino d-type diacrit dirfd dirname dup2
error euidaccess exclude exitfail fchdir fcntl fcntl-safer fdl
-   file-type fileblocks filemode filenamecat fnmatch-gnu fopen-safer
+   file-type fileblocks filemode filenamecat findprog fnmatch-gnu
+   fopen-safer
fprintftime fsusage ftruncate fts getdate getgroups gethrxtime
getline getloadavg getndelim2 getopt getpagesize getpass-gnu
gettext gettime gettimeofday getugroups getusershell gnupload
Index: configure.ac
===
RCS file: /sources/coreutils/coreutils/configure.ac,v
retrieving revision 1.102
diff -p -u -r1.102 configure.ac
--- configure.ac28 Dec 2006 08:18:17 -  1.102
+++ configure.ac22 Jan 2007 03:44:29 -
@@ -39,6 +39,8 @@ gl_EARLY
 gl_INIT
 coreutils_MACROS
 
+AC_FUNC_FORK
+
 AC_CHECK_FUNCS(uname,
OPTIONAL_BIN_PROGS="$OPTIONAL_BIN_PROGS

POSIX most likely will require a new -C option for 'sort'

2007-01-21 Thread Paul Eggert

suggests that the next edition of POSIX will require 'sort' to support
a new -C option.  There's no guarantee of this new requirement, but at
this point I think we should probably just put in -C.  Here is a
proposed patch.

2007-01-21  Paul Eggert  <[EMAIL PROTECTED]>

* NEWS: New option sort -C, proposed by XCU ERN 127, which looks
like it will be approved.  Also add --check=quiet, --check=silent
as long aliases, and --check=diagnose-first as an alias for -c.
* doc/coreutils.texi (sort invocation): Document this.
Also, mention that sort -c can take at most one file.
* src/sort.c: Implement this.
Include argmatch.h.
(usage): Document the change.
(CHECK_OPTION): New constant.
(long_options): --check now takes an optional argument, and is now
treated differently from 'c'.
(check_args, check_types): New constant arrays.
(check): New arg CHECKONLY, which suppresses diagnostic if -C.
(main): Parse the new options.
* tests/sort/Test.pm (02d, 02d, incompat5, incompat6):
New tests for -C.

diff --git a/NEWS b/NEWS
index a30b17e..99f6b93 100644
--- a/NEWS
+++ b/NEWS
@@ -29,6 +29,13 @@ GNU coreutils NEWS-*- 
outline -*-

   "rm --interactive=never F" no longer prompts for an unwritable F

+** New features
+
+  sort accepts the new option -C, which acts like -c except no diagnostic
+  is printed.  Its --check option now accepts an optional argument, and
+  --check=quiet and --check=silent are now aliases for -C, while
+  --check=diagnose-first is an alias for -c or plain --check.
+

 * Noteworthy changes in release 6.7 (2006-12-08) [stable]

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 89e97d8..c534223 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3342,12 +3342,26 @@ mode:

 @item -c
 @itemx --check
[EMAIL PROTECTED] --check=diagnose-first
 @opindex -c
 @opindex --check
 @cindex checking for sortedness
-Check whether the given files are already sorted: if they are not all
-sorted, print an error message and exit with a status of 1.
+Check whether the given file is already sorted: if it is not all
+sorted, print a diagnostic containing the first out-of-order line and
+exit with a status of 1.
 Otherwise, exit successfully.
+At most one input file can be given.
+
[EMAIL PROTECTED] -C
[EMAIL PROTECTED] --check=quiet
[EMAIL PROTECTED] --check=silent
[EMAIL PROTECTED] -c
[EMAIL PROTECTED] --check
[EMAIL PROTECTED] checking for sortedness
+Exit successfully if the given file is already sorted, and
+exit with status 1 otherwise.
+At most one input file can be given.
+This is like @option{-c}, except it does not print a diagnostic.

 @item -m
 @itemx --merge
@@ -3401,7 +3415,7 @@ Exit status:

 @display
 0 if no error occurred
-1 if invoked with @option{-c} and the input is not properly sorted
+1 if invoked with @option{-c} or @option{-C} and the input is not sorted
 2 if an error occurred
 @end display

@@ -3703,7 +3717,7 @@ disks and controllers.
 @cindex uniquifying output

 Normally, output only the first of a sequence of lines that compare
-equal.  For the @option{--check} (@option{-c}) option,
+equal.  For the @option{--check} (@option{-c} or @option{-C}) option,
 check that no pair of consecutive lines compares equal.

 This option also disables the default last-resort comparison.
diff --git a/src/sort.c b/src/sort.c
index 8a22796..e54db00 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include "system.h"
+#include "argmatch.h"
 #include "error.h"
 #include "hard-locale.h"
 #include "inttostr.h"
@@ -315,7 +316,8 @@ Ordering options:\n\
   fputs (_("\
 Other options:\n\
 \n\
-  -c, --check   check whether input is sorted; do not sort\n\
+  -c, --check, --check=diagnose-first  check for sorted input; do not sort\n\
+  -C, --check=quiet, --check=silent  like -c, but do not report first bad 
line\n\
   -k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)\n\
   -m, --merge   merge already sorted files; do not sort\n\
   -o, --output=FILE write result to FILE instead of standard output\n\
@@ -364,15 +366,16 @@ native byte values.\n\
non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
 enum
 {
-  RANDOM_SOURCE_OPTION = CHAR_MAX + 1
+  CHECK_OPTION = CHAR_MAX + 1,
+  RANDOM_SOURCE_OPTION
 };

-static char const short_options[] = "-bcdfgik:mMno:rRsS:t:T:uy:z";
+static char const short_options[] = "-bcCdfgik:mMno:rRsS:t:T:uy:z";

 static struct option const long_options[] =
 {
   {"ignore-leading-blanks", no_argument, NULL, 'b'},
-  {"check", no_argument, NULL, 'c'},
+  {"check", optional_argument, NULL, CHECK_OPTION},
   {"dictionary-order", no_argument, NULL, 'd'},
   {"ignore-case", no_argument, NULL, 'f'},
   {"general-numeric-sort", no_argument, NULL, 'g'},

Re: RFC: removing old-style "signal" handling code

2007-01-21 Thread Paul Eggert
Jim Meyering <[EMAIL PROTECTED]> writes:

> Maybe we can finally excise the crufty code in those #else blocks.
> Do any of you know of a "reasonable porting target" :-) that does
> not define SA_NOCLDSTOP?

Neither Debian stable nor Solaris 10 defines SA_NOCLDSTOP if you
compile with gcc -ansi.  This is because the C Standard doesn't allow
the implementation to declare SA_NOCLDSTOP in , due to name
space considerations.  One could easily argue that this is simply a
botch in the standard, but it's a common enough botch that I expect
we'll run into it.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Dan Hipschman
On Sun, Jan 21, 2007 at 10:41:11PM +0100, Jim Meyering wrote:
> This is a good argument for using libz by default, not a separate
> gzip program.  Why incur the overhead of an exec when we don't need to?
> Now, I'm convinced that sort should provide built-in support for both
> gzip and bzip2.  How to select built-in vs. actually exec the program
> is something to think about...
> Of course it should still be possible to specify some other program.

Well, was there anything wrong with my original LZO patch?  It shouldn't
be too hard to replace LZO with libz.  I don't disagree with you about
using libz in addition to an external program, but if the current patch
gives performance gains, why not check it in (once the wrinkles are
ironed out) and add more optimizations later?  At least that way I can
take a break (aha, the real motivation comes out), and pick up where I
left off later without worrying about the code changing under my feet.



___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Jim Meyering
Dan Hipschman <[EMAIL PROTECTED]> wrote:

> On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
>> Not to look the gift horse in the mouth, but it'd be nice
>> if you wrote ChangeLog entries, too.  And even (gasp! :-)
>> a test case or two.  Of course, we'd expect such a test case
>> (probably named tests/misc/sort-compress, and based on
>> tests/sample-test) to have this line in it:
>>
>>   . $srcdir/../very-expensive
>>
>> If you don't have time for that, I'll take care of it, eventually.
>
> I'm not going to stop you :-)  I just haven't had the time to look into
> it yet, so I've just been running the coreutils tests and then my own
> tests.  I was planning on adding the tests, as you say, "eventually".
>
>> Default to just "gzip", not /bin/gzip.  The latter may not exist;
>> your patch already handles that, but /bin/gzip may not be the first
>> gzip in PATH.  Also, don't bother with the access-XOK check.
>> There's no point in incurring even such a small overhead in the
>> general case, when no temporary is used.
>
> The reason I put the access check in there is that if we default to gzip
> and it doesn't exist, then of course the exec will fail, the child will

This is a good argument for using libz by default, not a separate
gzip program.  Why incur the overhead of an exec when we don't need to?
Now, I'm convinced that sort should provide built-in support for both
gzip and bzip2.  How to select built-in vs. actually exec the program
is something to think about...
Of course it should still be possible to specify some other program.

> fail, and this will cause sort to fail when it really should just not do
> the compression, or try another default if something suitable exists
> (what about compress?).  How about we just delay the determination of

compress?  no thank you! :-)

> the compress program until it's actually needed (e.g., in create_temp
> right before "if (compress_program)" we have "if (compress_program_known)"
> and inside the body we check the environment variable and/or do access
> checks on possible defaults)?
>
>> But please address the FIXME I've added.
>
> If we can't fork a compression process, it's not the end of the world.
> We just don't compress that file and sort will still work.  If we can't
> fork a decompression process, we can't continue the sort.  So I figure
> we'll just try twice to fork compression processes, and if we can't do
> it after 1 sec, we're probably wasting more time waiting to fork than we
> would doing disk access.  However, we really need to be able to fork
> decompression processes, so we can afford to wait a really long time for
> it.  I was considering making the number of tries for decompression
> processes even larger (now, it'll wait about 2 min before giving up).
>
>> Have you considered using the gnulib hash module rather than
>> rolling your own?  There are examples in many of coreutils/src/*.c.
>
> I'm not familiar with gnulib, so I didn't know a hash module existed or
> think to look for one.  Looking at it now, though, it seems it will be
> slower because of its abstraction, unless the table fills up to the

Performance isn't the issue here, but code-reuse.
I would be very surprised if changing hash table implementations
has any measurable effect on sort's performance.

> point where it would be faster to access if it grew.  I'd prefer (since
> it seems to be my time we're talking about), to leave it the way it is
> because it's simple, and see if the gnulib module is faster later.


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Dan Hipschman
On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
> Not to look the gift horse in the mouth, but it'd be nice
> if you wrote ChangeLog entries, too.  And even (gasp! :-)
> a test case or two.  Of course, we'd expect such a test case
> (probably named tests/misc/sort-compress, and based on
> tests/sample-test) to have this line in it:
> 
>   . $srcdir/../very-expensive
> 
> If you don't have time for that, I'll take care of it, eventually.

I'm not going to stop you :-)  I just haven't had the time to look into
it yet, so I've just been running the coreutils tests and then my own
tests.  I was planning on adding the tests, as you say, "eventually".

> Default to just "gzip", not /bin/gzip.  The latter may not exist;
> your patch already handles that, but /bin/gzip may not be the first
> gzip in PATH.  Also, don't bother with the access-XOK check.
> There's no point in incurring even such a small overhead in the
> general case, when no temporary is used.

The reason I put the access check in there is that if we default to gzip
and it doesn't exist, then of course the exec will fail, the child will
fail, and this will cause sort to fail when it really should just not do
the compression, or try another default if something suitable exists
(what about compress?).  How about we just delay the determination of
the compress program until it's actually needed (e.g., in create_temp
right before "if (compress_program)" we have "if (compress_program_known)"
and inside the body we check the environment variable and/or do access
checks on possible defaults)?

> But please address the FIXME I've added.

If we can't fork a compression process, it's not the end of the world.
We just don't compress that file and sort will still work.  If we can't
fork a decompression process, we can't continue the sort.  So I figure
we'll just try twice to fork compression processes, and if we can't do
it after 1 sec, we're probably wasting more time waiting to fork than we
would doing disk access.  However, we really need to be able to fork
decompression processes, so we can afford to wait a really long time for
it.  I was considering making the number of tries for decompression
processes even larger (now, it'll wait about 2 min before giving up).

> Have you considered using the gnulib hash module rather than
> rolling your own?  There are examples in many of coreutils/src/*.c.

I'm not familiar with gnulib, so I didn't know a hash module existed or
think to look for one.  Looking at it now, though, it seems it will be
slower because of its abstraction, unless the table fills up to the
point where it would be faster to access if it grew.  I'd prefer (since
it seems to be my time we're talking about), to leave it the way it is
because it's simple, and see if the gnulib module is faster later.
We've already seen tests (yours :-) that show the patch improves large
sorts (granted, they used the last patch, not this one), so it's still a
win to put it in, and optimizations later are just gravy.

Thanks for the advice.  I've got some other work I've got to get caught
up with first (trust me, I'd rather do this), but hopefully I can get to
work on it tomorrow.



___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Jim Meyering
"James Youngman" <[EMAIL PROTECTED]> wrote:
> It might be worth ensuring that we don't pass an invalid signal mask
> to sigprocmask(SET_MASK,...) if the previous call to
> sigprocmask(SIG_BLOCK,...) had failed.  Offhand I can't think of a way
> for sigprocmask() to fail unless the first argument is invalid, but
> looking that the unchecked return code makes me mildly nervous.  How
> about something resembling this (untested since my version of Automake
> is still only 1.9.6)?
>
> 2007-01-21  James Youngman  <[EMAIL PROTECTED]>
>
>   Centralize all the uses of sigprocmask().  Don't restore an invalid
>   saved mask.
>   * src/sort.c (enter_cs, leave_cs): New functions for protecting
>code sequences against signal delivery.
>   * (exit_cleanup): Use enter_cs and leave_cs instead of
>calling sigprocmask directly.
>   (create_temp_file, pipe_fork, zaptemp): Likewise

Hi James,

Good idea.
I've applied it on a branch with some minor changes:
- leave_cs should be void
- remove now-unused declarations of oldset
- rename new struct and functions to start with cs_

BTW, your patch was mangled: as if something filtered it through sed 's/^ //'.

I've applied this on top of a modified version of Dan's patch.
I'm fiddling with git now, with an eye to publishing this working branch.
In the mean time, I've included my working version of sort.c below.

Here's the adjusted patch:

diff --git a/ChangeLog b/ChangeLog
index fe8cd7a..764aef8 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,13 @@
+2007-01-21  James Youngman  <[EMAIL PROTECTED]>
+
+   Centralize all the uses of sigprocmask.  Don't restore an invalid
+   saved mask.
+   * src/sort.c (cs_enter, cs_leave): New functions for protecting
+   code sequences against signal delivery.
+   (exit_cleanup): Use cs_enter and cs_leave instead of calling
+   sigprocmask directly.
+   (create_temp_file, pipe_fork, zaptemp): Likewise
+
 2007-01-21  Jim Meyering  <[EMAIL PROTECTED]>

* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
diff --git a/src/sort.c b/src/sort.c
index da10759..57003ea 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -64,7 +64,8 @@ struct rlimit { size_t rlim_cur; };
present.  */
 #ifndef SA_NOCLDSTOP
 # define SA_NOCLDSTOP 0
-# define sigprocmask(How, Set, Oset) /* empty */
+/* No sigprocmask.  Always 'return' zero. */
+# define sigprocmask(How, Set, Oset) (0)
 # define sigset_t int
 # if ! HAVE_SIGINTERRUPT
 #  define siginterrupt(sig, flag) /* empty */
@@ -412,6 +413,33 @@ static struct option const long_options[] =
 /* The set of signals that are caught.  */
 static sigset_t caught_signals;

+/* Critical section status.  */
+struct cs_status
+{
+  bool valid;
+  sigset_t sigs;
+};
+
+/* Enter a critical section.  */
+static struct cs_status
+cs_enter (void)
+{
+  struct cs_status status;
+  status.valid = (sigprocmask (SIG_BLOCK, &caught_signals, &status.sigs) == 0);
+  return status;
+}
+
+/* Leave a critical section.  */
+static void
+cs_leave (struct cs_status status)
+{
+  if (status.valid)
+{
+  /* Ignore failure when restoring the signal mask. */
+  sigprocmask (SIG_SETMASK, &status.sigs, NULL);
+}
+}
+
 /* The list of temporary files. */
 struct tempnode
 {
@@ -577,10 +605,9 @@ exit_cleanup (void)
 {
   /* Clean up any remaining temporary files in a critical section so
 that a signal handler does not try to clean them too.  */
-  sigset_t oldset;
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  struct cs_status cs = cs_enter ();
   cleanup ();
-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  cs_leave (cs);
 }

   close_stdout ();
@@ -594,7 +621,6 @@ create_temp_file (int *pfd)
 {
   static char const slashbase[] = "/sortXX";
   static size_t temp_dir_index;
-  sigset_t oldset;
   int fd;
   int saved_errno;
   char const *temp_dir = temp_dirs[temp_dir_index];
@@ -602,6 +628,7 @@ create_temp_file (int *pfd)
   struct tempnode *node =
 xmalloc (offsetof (struct tempnode, name) + len + sizeof slashbase);
   char *file = node->name;
+  struct cs_status cs;

   memcpy (file, temp_dir, len);
   memcpy (file + len, slashbase, sizeof slashbase);
@@ -611,7 +638,7 @@ create_temp_file (int *pfd)
 temp_dir_index = 0;

   /* Create the temporary file in a critical section, to avoid races.  */
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  cs = cs_enter ();
   fd = mkstemp (file);
   if (0 <= fd)
 {
@@ -619,7 +646,7 @@ create_temp_file (int *pfd)
   temptail = &node->next;
 }
   saved_errno = errno;
-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  cs_leave (cs);
   errno = saved_errno;

   if (fd < 0)
@@ -699,10 +726,10 @@ pipe_fork (int pipefds[2], size_t tries)
 {
 #if HAVE_WORKING_FORK
   struct tempnode *saved_temphead;
-  sigset_t oldset;
   int saved_errno;
   unsigned int wait_retry = 1;
   pid_t pid IF_LINT (= 0);
+  struct cs_status cs;

   if (pipe (pipefds) < 0

Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Jim Meyering
Dan Hipschman <[EMAIL PROTECTED]> wrote:
> I think this patch addresses everything Paul mentioned in his critique
> of my last attempt.  I did look at gnulib pipe module, but there were
> some problems with using it "out of the box".  First, it takes a

Hi Dan,

Thanks for doing all that.
Not to look the gift horse in the mouth, but it'd be nice
if you wrote ChangeLog entries, too.  And even (gasp! :-)
a test case or two.  Of course, we'd expect such a test case
(probably named tests/misc/sort-compress, and based on
tests/sample-test) to have this line in it:

  . $srcdir/../very-expensive

If you don't have time for that, I'll take care of it, eventually.

Here are a few more comments.

Default to just "gzip", not /bin/gzip.  The latter may not exist;
your patch already handles that, but /bin/gzip may not be the first
gzip in PATH.  Also, don't bother with the access-XOK check.
There's no point in incurring even such a small overhead in the
general case, when no temporary is used.

may-be-used-uninit:
  sort.c: In function 'pipe_fork':
  sort.c:696: warning: 'pid' may be used uninitialized in this function

Coreutils policy -- via "make distcheck" -- dictates that file-global
variables be declared static.

In pipe_fork callers, use named constants, not "2" and "8".
i.e.
enum {
  /* Explain why this number is smaller than... */
  MAX_FORK_RETRIES_COMPRESS = 2,
  /* ...this one.  */
  MAX_FORK_RETRIES_DECOMPRESS = 8
};
--
I've done the above in the patch below.
But please address the FIXME I've added.
The rest I've left for you.

Have you considered using the gnulib hash module rather than
rolling your own?  There are examples in many of coreutils/src/*.c.

For this,
> +#define PROC_THRESH 2
Use an "enum", as above.  It's more debugger-friendly.
Use a more descriptive name.

reap and register_proc need brief comments describing what
they do and the meaning/function of their parameters.

===
Here's a start.  The patch is relative to your latest posted patch:

* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
In pipe_fork callers, use these named constants, not "2" and "8".
(proctab, nprocs): Declare to be "static".
(pipe_fork) [lint]: Initialize local, pid,
to avoid unwarranted may-be-used-uninitialized warning.
(create_temp): Use the active voice.  Describe parameters, too.
(main): Default to "gzip", not "/bin/gzip".
Don't bother with the access-X_OK check.

diff --git a/src/sort.c b/src/sort.c
index 9604faa..da10759 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -93,6 +93,15 @@ enum
 SORT_FAILURE = 2
   };

+enum
+  {
+/* FIXME: explain why this number is smaller than... */
+MAX_FORK_RETRIES_COMPRESS = 2,
+
+/* ...this one.  */
+MAX_FORK_RETRIES_DECOMPRESS = 8
+  };
+
 /* The representation of the decimal point in the current locale.  */
 static int decimal_point;

@@ -440,11 +449,11 @@ struct procnode
processes in a timely manner so as not to exhaust system resources,
so we store the info on whether the process is still running, or has
been reaped here.  */
-struct procnode *proctab[PROCTABSZ];
+static struct procnode *proctab[PROCTABSZ];

 /* The total number of forked processes (compressors and decompressors)
that have not been reaped yet. */
-size_t nprocs;
+static size_t nprocs;

 /* The number of child processes we'll allow before we try to reap them. */
 #define PROC_THRESH 2
@@ -693,7 +702,7 @@ pipe_fork (int pipefds[2], size_t tries)
   sigset_t oldset;
   int saved_errno;
   unsigned int wait_retry = 1;
-  pid_t pid;
+  pid_t pid IF_LINT (= 0);

   if (pipe (pipefds) < 0)
 return -1;
@@ -744,7 +753,9 @@ pipe_fork (int pipefds[2], size_t tries)
 #endif
 }

-/* Creates a temp file and compression program to filter output to it.  */
+/* Create a temporary file and start a compression program to filter output
+   to that file.  Set *PFP to the file handle and if *PPID is non-NULL,
+   set it to the PID of the newly-created process.  */

 static char *
 create_temp (FILE **pfp, pid_t *ppid)
@@ -757,7 +768,7 @@ create_temp (FILE **pfp, pid_t *ppid)
 {
   int pipefds[2];

-  node->pid = pipe_fork (pipefds, 2);
+  node->pid = pipe_fork (pipefds, MAX_FORK_RETRIES_COMPRESS);
   if (0 < node->pid)
{
  close (tempfd);
@@ -810,7 +821,7 @@ open_temp (const char *name, pid_t pid)
   if (compress_program)
 {
   int pipefds[2];
-  pid_t child_pid = pipe_fork (pipefds, 8);
+  pid_t child_pid = pipe_fork (pipefds, MAX_FORK_RETRIES_DECOMPRESS);

   if (0 < child_pid)
{
@@ -3030,20 +3041,9 @@ main (int argc, char **argv)
 }

   compress_program = getenv ("GNUSORT_COMPRESSOR");
-  if (! compress_program)
-{
-  static const char *compressors[] = { "/bin/gzip" };
-  enum { ncompressors = sizeof compressors / sizeof compressors[0] };
-  size_t i;
-
-  for (i = 0; i < ncompressors; ++i)
-

Re: feature request: gzip/bzip support for sort

2007-01-21 Thread James Youngman

It might be worth ensuring that we don't pass an invalid signal mask
to sigprocmask(SET_MASK,...) if the previous call to
sigprocmask(SIG_BLOCK,...) had failed.  Offhand I can't think of a way
for sigprocmask() to fail unless the first argument is invalid, but
looking that the unchecked return code makes me mildly nervous.  How
about something resembling this (untested since my version of Automake
is still only 1.9.6)?

2007-01-21  James Youngman  <[EMAIL PROTECTED]>

Centralize all the uses of sigprocmask().  Don't restore an invalid
saved mask.
* src/sort.c (enter_cs, leave_cs): New functions for protecting
   code sequences against signal delivery.
* (exit_cleanup): Use enter_cs and leave_cs instead of
   calling sigprocmask directly.
(create_temp_file, pipe_fork, zaptemp): Likewise

--- sort.c  2007-01-21 14:37:20.0 +
+++ sort_prime.c2007-01-21 14:37:44.0 +
@@ -64,7 +64,8 @@ struct rlimit { size_t rlim_cur; };
   present.  */
#ifndef SA_NOCLDSTOP
# define SA_NOCLDSTOP 0
-# define sigprocmask(How, Set, Oset) /* empty */
+/* No sigprocmask.  Always 'return' zero. */
+# define sigprocmask(How, Set, Oset) (0)
# define sigset_t int
# if ! HAVE_SIGINTERRUPT
#  define siginterrupt(sig, flag) /* empty */
@@ -403,6 +404,35 @@ static struct option const long_options[
/* The set of signals that are caught.  */
static sigset_t caught_signals;

+
+struct crit_sec_status
+{
+  int  valid;
+  sigset_t sigs;
+};
+
+
+/* Enter a critical section */
+static struct crit_sec_status
+enter_cs(void)
+{
+  struct crit_sec_status status;
+  status.valid = (0 == sigprocmask (SIG_BLOCK, &caught_signals, &status.sigs));
+  return status;
+}
+
+/* Leave a critical section */
+static int
+leave_cs(struct crit_sec_status status)
+{
+  if (status.valid)
+{
+  /* Ignore failure when restoring the signal mask. */
+  (void) sigprocmask (SIG_SETMASK, &status.sigs, NULL);
+}
+}
+
+
/* The list of temporary files. */
struct tempnode
{
@@ -568,10 +598,9 @@ exit_cleanup (void)
{
  /* Clean up any remaining temporary files in a critical section so
 that a signal handler does not try to clean them too.  */
-  sigset_t oldset;
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  struct crit_sec_status cs = enter_cs();
  cleanup ();
-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  leave_cs(cs);
}

  close_stdout ();
@@ -593,6 +622,7 @@ create_temp_file (int *pfd)
  struct tempnode *node =
xmalloc (offsetof (struct tempnode, name) + len + sizeof slashbase);
  char *file = node->name;
+  struct crit_sec_status cs;

  memcpy (file, temp_dir, len);
  memcpy (file + len, slashbase, sizeof slashbase);
@@ -602,7 +632,7 @@ create_temp_file (int *pfd)
temp_dir_index = 0;

  /* Create the temporary file in a critical section, to avoid races.  */
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  cs = enter_cs();
  fd = mkstemp (file);
  if (0 <= fd)
{
@@ -610,7 +640,7 @@ create_temp_file (int *pfd)
  temptail = &node->next;
}
  saved_errno = errno;
-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  leave_cs(cs);
  errno = saved_errno;

  if (fd < 0)
@@ -694,7 +724,8 @@ pipe_fork (int pipefds[2], size_t tries)
  int saved_errno;
  unsigned int wait_retry = 1;
  pid_t pid;
-
+  struct crit_sec_status cs;
+
  if (pipe (pipefds) < 0)
return -1;

@@ -702,7 +733,7 @@ pipe_fork (int pipefds[2], size_t tries)
{
  /* This is so the child process won't delete our temp files
 if it receives a signal before exec-ing.  */
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  cs = enter_cs();
  saved_temphead = temphead;
  temphead = NULL;

@@ -711,7 +742,7 @@ pipe_fork (int pipefds[2], size_t tries)
  if (pid)
temphead = saved_temphead;

-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  leave_cs(cs);
  errno = saved_errno;

  if (0 <= pid || errno != EAGAIN)
@@ -870,17 +901,18 @@ zaptemp (const char *name)
  sigset_t oldset;
  int unlink_status;
  int unlink_errno = 0;
+  struct crit_sec_status cs;

  for (pnode = &temphead; (node = *pnode)->name != name; pnode = &node->next)
continue;

  /* Unlink the temporary file in a critical section to avoid races.  */
  next = node->next;
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  cs = enter_cs();
  unlink_status = unlink (name);
  unlink_errno = errno;
  *pnode = next;
-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  leave_cs(cs);

  if (unlink_status != 0)
error (0, unlink_errno, _("warning: cannot remove: %s"), name);


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils