Re: feature request: gzip/bzip support for sort

2007-01-25 Thread Jim Meyering
Paul Eggert [EMAIL PROTECTED] wrote:
 Jim Meyering [EMAIL PROTECTED] writes:
 I'm probably going to change the documentation so that
 people will be less likely to depend on being able to run
 a separate program.  To be precise, I'd like to document
 that the only valid values of GNUSORT_COMPRESSOR are the
 empty string, gzip and bzip2[*].

 This sounds extreme, particularly since gzip and bzip2 are
 not the best algorithms for 'sort' compression, where you
 want a fast compressor.  Better choices right now would
 include lzop <http://www.lzop.org/> and maybe
 QuickLZ <http://www.quicklz.com/>.

I see that streaming support in QuickLZ is available only in the very
latest beta.  That makes me think it's not quite ready for use in
GNU sort.  Maybe lzop is more mature.  I haven't looked yet.

 The fast-compressor field is moving fairly rapidly.
 (I've heard some rumors from some of my commercial friends.)
 QuickLZ, a new algorithm, is at the top of the
 maximumcompression list right now for fast compressors; see
 http://www.maximumcompression.com/data/summary_mf3.php.
 I would not be surprised to see a new champ next year.

 Then we will have the liberty to remove the exec calls and use library
 code instead, thus making the code a little more efficient -- but mainly,
 more robust.

 It's not clear to me that it'll be more efficient for the
 soon-to-be common case of multicore chips, since 'sort' and
 the compressor can run in parallel.

No.  I proposed to remove only the *exec*, not the _fork_.  The idea
is to have each child run library code to (de)compress, rather than to
exec an external program.  Well, I do want it to revert to sequential
whenever fork fails, but otherwise it would still take advantage of
_some_ parallelism: 2-way when compressing, and up to 17-way (NMERGE + 1)
when merging.  Of course, this parallelism currently kicks in only
with a data set large enough to require temporary files.  Now that
dual-core systems are common, a nice project would be to make GNU sort
take advantage of that when performing day-to-day (in-memory) sorts.
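Something like this conveys the shape of the idea (using zlib purely
for illustration -- a sketch, not the eventual patch):

#include <sys/types.h>
#include <unistd.h>
#include <zlib.h>

/* Fork a child that gzip-compresses everything written to *WRITE_FD
   and sends it to TEMP_FD -- library code in the child, no exec.  */
static pid_t
fork_compressor (int *write_fd, int temp_fd)
{
  int pipefds[2];
  pid_t pid;
  if (pipe (pipefds) != 0)
    return -1;
  pid = fork ();
  if (pid == 0)
    {
      char buf[8192];
      ssize_t n;
      gzFile out;
      close (pipefds[1]);
      out = gzdopen (temp_fd, "wb1");   /* level 1: favor speed */
      while ((n = read (pipefds[0], buf, sizeof buf)) > 0)
        gzwrite (out, buf, (unsigned) n);
      gzclose (out);
      _exit (0);
    }
  close (pipefds[0]);
  *write_fd = pipefds[1];   /* the parent writes its sorted run here */
  return pid;
}

If the fork itself fails, the same gzdopen/gzwrite calls can run
sequentially in the parent -- that is the fallback described above.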

 We'll have to measure.
 I agree about the robustness but that should be up to the user.

I want to keep the default code paths as simple and reliable as possible.
That means no exec, and a fall-back to running library code sequentially
upon fork failure -- both for compression and decompression.  While this
new feature is a very nice addition, we have to be careful to ensure
that it does not compromise sort's reliability, even under duress.

Here's my reasoning: when compressing with an arbitrary program, sort
may do a *lot* of work up front (there is no decompression at all in the
beginning), yet fail only near the end when it discovers that prog
doesn't accept -d or can't fork.  The result: the time spent sorting
that large data set is wasted.  With judicious support for at least one
built-in compression method, this error path is completely eliminated.
This is yet another way in which sort differs from tar.  If an invalid
compression program is specified to tar, tar fails right away -- it
doesn't first waste hours of compute time.  In fact, given an invalid
compressor program, sort may well work repeatedly, potentially over a span
of days or longer, before encountering an input large enough to trigger
the need to use temporary files, where it could hit a decompression
failure.  Avoiding this sort of delayed failure is my main motivation
for defining away the problem.  And as I'm sure you agree, I'd rather
not have sort test (fork/exec) the compression program to ensure that
it works with -d.

Another reason to avoid the exec is memory.  When merging, there are
usually 16 separate gzip processes running in parallel.  Sure, each is
pretty small, with a RSS of under 500K on my system, but with 16 of
them, it does add up.

So we need support for at least one built-in compression
library.  Here are my selection criteria, in decreasing order
of importance:

robust
portable
reasonable run-time efficiency
reasonable compression performance

If some other compression library is a better fit, then we
should add support for it, and consider making it the default.

Similarly, if there are people interested in sorting huge data sets
for which general purpose compression isn't good enough, they can
provide profiling data showing how their domain-specific compressor
is essential.  If this ever happens, it would be a good reason to
allow sort to exec an arbitrary program again.

 Perhaps we could put in something that says, "If the
 compressor is named 'gzip' we may optimize that." and
 similarly for 'lzop' and/or a few other compressor names.
 Or, more generally, we could have the convention that if the
 compressor name starts with "-" we will strip the "-" and
 then try to optimize the result if we can.  Something like
 that, anyway.

 [*] If gzip and bzip2 are good enough for tar, why should sort make any
 compromise (exec'ing some other program) in order to be more flexible?

Re: feature request: gzip/bzip support for sort

2007-01-25 Thread Jim Meyering
Dan Hipschman [EMAIL PROTECTED] wrote:
 On Wed, Jan 24, 2007 at 08:08:18AM +0100, Jim Meyering wrote:
 I've checked in your changes, then changed NEWS a little:

 Great!  Thanks :-)

 Additionally, I'm probably going to change the documentation so that
 people will be less likely to depend on being able to run a separate
 program.  To be precise, I'd like to document that the only valid values
 of GNUSORT_COMPRESSOR are the empty string, gzip and bzip2[*].
 Then we will have the liberty to remove the exec calls and use library
 code instead, thus making the code a little more efficient -- but mainly,
 more robust.

 Why not add a special value 'libz' and document it as follows:

We'll see.  I'm really inclined to disallow the exec option,
unless someone provides a good use case to justify it.  It's fun to add
features, but so far that one is not justified, given its downsides.

 By the way, I've got a little amendment to the patch.  I took a look at
 gnulib's findprog module, and it turns out find_in_path does an
 access-X_OK check itself, so sort doesn't need to do it again.


 2007-01-24  Dan Hipschman  [EMAIL PROTECTED]

   * src/sort.c (create_temp): Remove superfluous access-X_OK
   check.  find_in_path does this for us.

Thanks.  Applied.



Re: feature request: gzip/bzip support for sort

2007-01-24 Thread Eric Blake

According to Jim Meyering on 1/24/2007 12:08 AM:
 Additionally, I'm probably going to change the documentation so that
 people will be less likely to depend on being able to run a separate
 program.  To be precise, I'd like to document that the only valid values
 of GNUSORT_COMPRESSOR are the empty string, gzip and bzip2[*].
 Then we will have the liberty to remove the exec calls and use library
 code instead, thus making the code a little more efficient -- but mainly,
 more robust.

Fair enough for now, as long as we leave ourselves an opening for future
expansion.  For example, the 7zip algorithm tends to produce smaller
compressed files than even bzip2 on typical input, and is unencumbered
by patents, but my impression of 7zip is that it still does not behave
as a filter compressor like gzip or bzip2, so it is not ready for
prime-time support yet.

 
 If someone makes a good case for allowing an arbitrary compressor, we can
 allow that later.  But if we were to add (and document) this feature now,
 we might well be stuck with it for a long time.
 
 [*] If gzip and bzip2 are good enough for tar, why should sort make any
 compromise (exec'ing some other program) in order to be more flexible?

New enough tar (1.16.1, for example) supports:
  --use-compress-program=PROG
 filter through PROG (must accept -d)
along with the builtin recognition of gzip and bzip2.  However, rather
than linking in libraries, it always exec's, even for the two known formats.

--
Don't work too hard, make some time for fun as well!

Eric Blake [EMAIL PROTECTED]



Re: feature request: gzip/bzip support for sort

2007-01-24 Thread Paul Eggert
Jim Meyering [EMAIL PROTECTED] writes:

 I'm probably going to change the documentation so that
 people will be less likely to depend on being able to run
 a separate program.  To be precise, I'd like to document
 that the only valid values of GNUSORT_COMPRESSOR are the
 empty string, gzip and bzip2[*].

This sounds extreme, particularly since gzip and bzip2 are
not the best algorithms for 'sort' compression, where you
want a fast compressor.  Better choices right now would
include lzop <http://www.lzop.org/> and maybe
QuickLZ <http://www.quicklz.com/>.

The fast-compressor field is moving fairly rapidly.
(I've heard some rumors from some of my commercial friends.)
QuickLZ, a new algorithm, is at the top of the
maximumcompression list right now for fast compressors; see
http://www.maximumcompression.com/data/summary_mf3.php.
I would not be surprised to see a new champ next year.

 Then we will have the liberty to remove the exec calls and use library
 code instead, thus making the code a little more efficient -- but mainly,
 more robust.

It's not clear to me that it'll be more efficient for the
soon-to-be common case of multicore chips, since 'sort' and
the compressor can run in parallel.  We'll have to measure.
I agree about the robustness but that should be up to the user.

Perhaps we could put in something that says, "If the
compressor is named 'gzip' we may optimize that." and
similarly for 'lzop' and/or a few other compressor names.
Or, more generally, we could have the convention that if the
compressor name starts with "-" we will strip the "-" and
then try to optimize the result if we can.  Something like
that, anyway.
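Concretely, the convention could be as simple as this (a sketch; what
"optimize" means -- libz, an lzop library, etc. -- is left open):

#include <stdbool.h>

/* Return the compressor name, stripping a leading "-"; set
   *MAY_OPTIMIZE if the "-" invited us to substitute built-in code.  */
static char const *
compressor_name (char const *arg, bool *may_optimize)
{
  *may_optimize = (arg[0] == '-');
  return arg + (*may_optimize ? 1 : 0);
}

So "-gzip" would mean "gzip semantics; use library code if you have
it", while a plain "gzip" would always exec the program by that name.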

 [*] If gzip and bzip2 are good enough for tar, why should sort make any
 compromise (exec'ing some other program) in order to be more flexible?

For 'sort' the tradeoff is different than for 'tar'.  We
don't particularly care if the format is stable, since it's
throwaway.  And we want fast compression, whereas people
generating tarballs often are willing to have way slower
compression for a slightly higher compression ratio.  (Plus,
new versions of 'tar' allow arbitrary compressors anyway.)


I do have a suggestion: we shouldn't use an environment
variable to select a compressor.  It should just be an
option.  Environment variables are funny beasts and it's
better to avoid them if we can.  I'll construct a patch
along those lines if you like.



Re: feature request: gzip/bzip support for sort

2007-01-24 Thread Craig Macdonald
Firstly, I wanted to say that I am excited by the extremely fast progress
that has been made in sort for compressed temporary files.

Many thanks to Dan and others for the implementation.
(I've failed to accomplish the bootstrap of the CVS sources - are there
bootstrapped snapshots available anywhere?)

For 'sort' the tradeoff is different than for 'tar'.  We
don't particularly care if the format is stable, since it's
throwaway.  And we want fast compression, whereas people
generating tarballs often are willing to have way slower
compression for a slightly higher compression ratio.  (Plus,
new versions of 'tar' allow arbitrary compressors anyway.)

  

Now that we have the ability to fork decompression processes, are we
likely to see sort gain the ability to open gzipped (or bzip2ed) files?
For sorting a stream of compressed data this is obviously not required,
but for merging it would remove a substantial mess of zcats writing to
fifos, etc.  However, I'd understand if it was decided not to, because
unlike the temporary files, there is an existing workable solution.

Many thanks

Craig Macdonald



Re: feature request: gzip/bzip support for sort

2007-01-23 Thread Jim Meyering
Dan Hipschman [EMAIL PROTECTED] wrote:
 On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
 Not to look the gift horse in the mouth, but it'd be nice
 if you wrote ChangeLog entries, too.  And even (gasp! :-)
 a test case or two.  Of course, we'd expect such a test case
 (probably named tests/misc/sort-compress, and based on
 tests/sample-test) to have this line in it:

   . $srcdir/../very-expensive

 If you don't have time for that, I'll take care of it, eventually.

 Here are some tests.  They're actually not very expensive.  Of course,
 you need to chmod +x sort-compress.


 2007-01-22  Dan Hipschman  [EMAIL PROTECTED]

   Test sort compression.
   * tests/misc/Makefile.am: Add the test.
   * tests/misc/sort-compress: New file containing the tests.

Thanks for all the work!
I've checked in your changes, then changed NEWS a little:

** New features

  By default, sort now compresses any temporary file it writes.
  When sorting very large inputs, this usually results in sort using
  far less temporary disk space and in improved performance.

Additionally, I'm probably going to change the documentation so that
people will be less likely to depend on being able to run a separate
program.  To be precise, I'd like to document that the only valid values
of GNUSORT_COMPRESSOR are the empty string, gzip and bzip2[*].
Then we will have the liberty to remove the exec calls and use library
code instead, thus making the code a little more efficient -- but mainly,
more robust.

If someone makes a good case for allowing an arbitrary compressor, we can
allow that later.  But if we were to add (and document) this feature now,
we might well be stuck with it for a long time.

[*] If gzip and bzip2 are good enough for tar, why should sort make any
compromise (exec'ing some other program) in order to be more flexible?



Re: feature request: gzip/bzip support for sort

2007-01-22 Thread Dan Hipschman
On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
 Not to look the gift horse in the mouth, but it'd be nice
 if you wrote ChangeLog entries, too.  And even (gasp! :-)
 a test case or two.  Of course, we'd expect such a test case
 (probably named tests/misc/sort-compress, and based on
 tests/sample-test) to have this line in it:
 
   . $srcdir/../very-expensive
 
 If you don't have time for that, I'll take care of it, eventually.

Here are some tests.  They're actually not very expensive.  Of course,
you need to chmod +x sort-compress.


2007-01-22  Dan Hipschman  [EMAIL PROTECTED]

Test sort compression.
* tests/misc/Makefile.am: Add the test.
* tests/misc/sort-compress: New file containing the tests.

Index: Makefile.am
===
RCS file: /sources/coreutils/coreutils/tests/misc/Makefile.am,v
retrieving revision 1.46
diff -p -u -r1.46 Makefile.am
--- tests/misc/Makefile.am  15 Jan 2007 10:33:49 -  1.46
+++ tests/misc/Makefile.am  23 Jan 2007 05:12:58 -
@@ -69,6 +69,7 @@ TESTS = \
   sha384sum \
   sha512sum \
   shuf \
+  sort-compress \
   sort-merge \
   sort-rand \
   split-a \
Index: sort-compress
===
diff -Nru a/sort-compress b/sort-compress
--- tests/misc/sort-compress    1969-12-31 16:00:00.0 -0800
+++ tests/misc/sort-compress    2007-01-22 21:00:19.0 -0800
@@ -0,0 +1,92 @@
+#!/bin/sh
+# Test use of compression by sort
+
+# Copyright (C) 2007 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+# 02110-1301, USA.
+
if test "$VERBOSE" = yes; then
+  set -x
+  sort --version
+fi
+
+pwd=`pwd`
+t0=`echo $0|sed 's,.*/,,'`.tmp; tmp=$t0/$$
trap 'status=$?; cd $pwd && chmod -R u+rwx $t0 && rm -rf $t0 && exit $status' 0
+trap '(exit $?); exit $?' 1 2 13 15
+
+framework_failure=0
+mkdir -p $tmp || framework_failure=1
+cd $tmp || framework_failure=1
seq -w 2000 > exp || framework_failure=1
tac exp > in || framework_failure=1
+SORT=`which sort` || framework_failure=1
+
+if test $framework_failure = 1; then
  echo "$0: failure in testing framework" 1>&2
+  (exit 1); exit 1
+fi
+
+fail=0
+
+# This should force the use of temp files compressed with the default gzip
sort -S 1k in > out || fail=1
cmp exp out || fail=1
test $fail = 1 && diff out exp 2> /dev/null
+
+# Create our own gzip program that will be used as the default
cat <<\EOF > gzip || fail=1
+#!/bin/sh
+tr 41 14
+touch ok
+EOF
+
+chmod +x gzip
+
+# This will find our new gzip in PATH
PATH=.:$PATH sort -S 1k in > out || fail=1
cmp exp out || fail=1
test $fail = 1 && diff out exp 2> /dev/null
+test -f ok || fail=1
+rm -f ok
+
+# This is to make sure we can disable compression
PATH=.:$PATH GNUSORT_COMPRESSOR= sort -S 1k in > out || fail=1
cmp exp out || fail=1
test $fail = 1 && diff out exp 2> /dev/null
test -f ok && fail=1
+
+# This is to make sure we can use something other than gzip
+mv gzip dzip || fail=1
GNUSORT_COMPRESSOR=./dzip sort -S 1k in > out || fail=1
cmp exp out || fail=1
test $fail = 1 && diff out exp 2> /dev/null
+test -f ok || fail=1
+rm -f ok
+
+# Make sure it can find other programs in PATH correctly
PATH=.:$PATH GNUSORT_COMPRESSOR=dzip sort -S 1k in > out || fail=1
cmp exp out || fail=1
test $fail = 1 && diff out exp 2> /dev/null
+test -f ok || fail=1
+rm -f dzip ok
+
+# This is to make sure sort functions if it can't find the default gzip
PATH=. $SORT -S 1k in > out || fail=1
cmp exp out || fail=1
test $fail = 1 && diff out exp 2> /dev/null
+
+(exit $fail); exit $fail



Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Jim Meyering
Dan Hipschman [EMAIL PROTECTED] wrote:
 I think this patch addresses everything Paul mentioned in his critique
 of my last attempt.  I did look at gnulib pipe module, but there were
 some problems with using it out of the box.  First, it takes a

Hi Dan,

Thanks for doing all that.
Not to look the gift horse in the mouth, but it'd be nice
if you wrote ChangeLog entries, too.  And even (gasp! :-)
a test case or two.  Of course, we'd expect such a test case
(probably named tests/misc/sort-compress, and based on
tests/sample-test) to have this line in it:

  . $srcdir/../very-expensive

If you don't have time for that, I'll take care of it, eventually.

Here are a few more comments.

Default to just "gzip", not "/bin/gzip".  The latter may not exist;
your patch already handles that, but /bin/gzip may not be the first
gzip in PATH.  Also, don't bother with the access-X_OK check.
There's no point in incurring even such a small overhead in the
general case, when no temporary is used.

may-be-used-uninit:
  sort.c: In function 'pipe_fork':
  sort.c:696: warning: 'pid' may be used uninitialized in this function

Coreutils policy -- via make distcheck -- dictates that file-global
variables be declared static.

In pipe_fork callers, use named constants, not 2 and 8.
i.e.
enum {
  /* Explain why this number is smaller than... */
  MAX_FORK_RETRIES_COMPRESS = 2,
  /* ...this one.  */
  MAX_FORK_RETRIES_DECOMPRESS = 8
};
--
I've done the above in the patch below.
But please address the FIXME I've added.
The rest I've left for you.

Have you considered using the gnulib hash module rather than
rolling your own?  There are examples in many of coreutils/src/*.c.

For this,
 +#define PROC_THRESH 2
Use an enum, as above.  It's more debugger-friendly.
Use a more descriptive name.

reap and register_proc need brief comments describing what
they do and the meaning/function of their parameters.

===
Here's a start.  The patch is relative to your latest posted patch:

* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
In pipe_fork callers, use these named constants, not 2 and 8.
(proctab, nprocs): Declare to be static.
(pipe_fork) [lint]: Initialize local, pid,
to avoid unwarranted may-be-used-uninitialized warning.
(create_temp): Use the active voice.  Describe parameters, too.
(main): Default to gzip, not /bin/gzip.
Don't bother with the access-X_OK check.

diff --git a/src/sort.c b/src/sort.c
index 9604faa..da10759 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -93,6 +93,15 @@ enum
 SORT_FAILURE = 2
   };

+enum
+  {
+/* FIXME: explain why this number is smaller than... */
+MAX_FORK_RETRIES_COMPRESS = 2,
+
+/* ...this one.  */
+MAX_FORK_RETRIES_DECOMPRESS = 8
+  };
+
 /* The representation of the decimal point in the current locale.  */
 static int decimal_point;

@@ -440,11 +449,11 @@ struct procnode
processes in a timely manner so as not to exhaust system resources,
so we store the info on whether the process is still running, or has
been reaped here.  */
-struct procnode *proctab[PROCTABSZ];
+static struct procnode *proctab[PROCTABSZ];

 /* The total number of forked processes (compressors and decompressors)
that have not been reaped yet. */
-size_t nprocs;
+static size_t nprocs;

 /* The number of child processes we'll allow before we try to reap them. */
 #define PROC_THRESH 2
@@ -693,7 +702,7 @@ pipe_fork (int pipefds[2], size_t tries)
   sigset_t oldset;
   int saved_errno;
   unsigned int wait_retry = 1;
-  pid_t pid;
+  pid_t pid IF_LINT (= 0);

   if (pipe (pipefds) < 0)
 return -1;
@@ -744,7 +753,9 @@ pipe_fork (int pipefds[2], size_t tries)
 #endif
 }

-/* Creates a temp file and compression program to filter output to it.  */
+/* Create a temporary file and start a compression program to filter output
+   to that file.  Set *PFP to the file handle and, if PPID is non-NULL,
+   set *PPID to the PID of the newly-created process.  */

 static char *
 create_temp (FILE **pfp, pid_t *ppid)
@@ -757,7 +768,7 @@ create_temp (FILE **pfp, pid_t *ppid)
 {
   int pipefds[2];

-  node->pid = pipe_fork (pipefds, 2);
+  node->pid = pipe_fork (pipefds, MAX_FORK_RETRIES_COMPRESS);
   if (0 < node->pid)
{
  close (tempfd);
@@ -810,7 +821,7 @@ open_temp (const char *name, pid_t pid)
   if (compress_program)
 {
   int pipefds[2];
-  pid_t child_pid = pipe_fork (pipefds, 8);
+  pid_t child_pid = pipe_fork (pipefds, MAX_FORK_RETRIES_DECOMPRESS);

   if (0 < child_pid)
{
@@ -3030,20 +3041,9 @@ main (int argc, char **argv)
 }

   compress_program = getenv (GNUSORT_COMPRESSOR);
-  if (! compress_program)
-{
-  static const char *compressors[] = { "/bin/gzip" };
-  enum { ncompressors = sizeof compressors / sizeof compressors[0] };
-  size_t i;
-
-  for (i = 0; i < ncompressors; ++i)
-   if (access (compressors[i], 

Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Jim Meyering
James Youngman [EMAIL PROTECTED] wrote:
 It might be worth ensuring that we don't pass an invalid signal mask
 to sigprocmask(SIG_SETMASK,...) if the previous call to
 sigprocmask(SIG_BLOCK,...) had failed.  Offhand I can't think of a way
 for sigprocmask() to fail unless the first argument is invalid, but
 looking at the unchecked return code makes me mildly nervous.  How
 about something resembling this (untested since my version of Automake
 is still only 1.9.6)?

 2007-01-21  James Youngman  [EMAIL PROTECTED]

   Centralize all the uses of sigprocmask().  Don't restore an invalid
   saved mask.
   * src/sort.c (enter_cs, leave_cs): New functions for protecting
code sequences against signal delivery.
   * (exit_cleanup): Use enter_cs and leave_cs instead of
calling sigprocmask directly.
   (create_temp_file, pipe_fork, zaptemp): Likewise

Hi James,

Good idea.
I've applied it on a branch with some minor changes:
- leave_cs should be void
- remove now-unused declarations of oldset
- rename new struct and functions to start with cs_

BTW, your patch was mangled: as if something filtered it through sed 's/^ //'.

I've applied this on top of a modified version of Dan's patch.
I'm fiddling with git now, with an eye to publishing this working branch.
In the mean time, I've included my working version of sort.c below.

Here's the adjusted patch:

diff --git a/ChangeLog b/ChangeLog
index fe8cd7a..764aef8 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,13 @@
+2007-01-21  James Youngman  [EMAIL PROTECTED]
+
+   Centralize all the uses of sigprocmask.  Don't restore an invalid
+   saved mask.
+   * src/sort.c (cs_enter, cs_leave): New functions for protecting
+   code sequences against signal delivery.
+   (exit_cleanup): Use cs_enter and cs_leave instead of calling
+   sigprocmask directly.
+   (create_temp_file, pipe_fork, zaptemp): Likewise
+
 2007-01-21  Jim Meyering  [EMAIL PROTECTED]

* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
diff --git a/src/sort.c b/src/sort.c
index da10759..57003ea 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -64,7 +64,8 @@ struct rlimit { size_t rlim_cur; };
present.  */
 #ifndef SA_NOCLDSTOP
 # define SA_NOCLDSTOP 0
-# define sigprocmask(How, Set, Oset) /* empty */
+/* No sigprocmask.  Always 'return' zero. */
+# define sigprocmask(How, Set, Oset) (0)
 # define sigset_t int
 # if ! HAVE_SIGINTERRUPT
 #  define siginterrupt(sig, flag) /* empty */
@@ -412,6 +413,33 @@ static struct option const long_options[] =
 /* The set of signals that are caught.  */
 static sigset_t caught_signals;

+/* Critical section status.  */
+struct cs_status
+{
+  bool valid;
+  sigset_t sigs;
+};
+
+/* Enter a critical section.  */
+static struct cs_status
+cs_enter (void)
+{
+  struct cs_status status;
+  status.valid = (sigprocmask (SIG_BLOCK, &caught_signals, &status.sigs) == 0);
+  return status;
+}
+
+/* Leave a critical section.  */
+static void
+cs_leave (struct cs_status status)
+{
+  if (status.valid)
+{
+  /* Ignore failure when restoring the signal mask. */
+  sigprocmask (SIG_SETMASK, &status.sigs, NULL);
+}
+}
+
 /* The list of temporary files. */
 struct tempnode
 {
@@ -577,10 +605,9 @@ exit_cleanup (void)
 {
   /* Clean up any remaining temporary files in a critical section so
 that a signal handler does not try to clean them too.  */
-  sigset_t oldset;
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  struct cs_status cs = cs_enter ();
   cleanup ();
-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  cs_leave (cs);
 }

   close_stdout ();
@@ -594,7 +621,6 @@ create_temp_file (int *pfd)
 {
   static char const slashbase[] = "/sortXXXXXX";
   static size_t temp_dir_index;
-  sigset_t oldset;
   int fd;
   int saved_errno;
   char const *temp_dir = temp_dirs[temp_dir_index];
@@ -602,6 +628,7 @@ create_temp_file (int *pfd)
   struct tempnode *node =
 xmalloc (offsetof (struct tempnode, name) + len + sizeof slashbase);
   char *file = node->name;
+  struct cs_status cs;

   memcpy (file, temp_dir, len);
   memcpy (file + len, slashbase, sizeof slashbase);
@@ -611,7 +638,7 @@ create_temp_file (int *pfd)
 temp_dir_index = 0;

   /* Create the temporary file in a critical section, to avoid races.  */
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  cs = cs_enter ();
   fd = mkstemp (file);
   if (0 <= fd)
     {
       *temptail = node;
       temptail = &node->next;
     }
 }
   saved_errno = errno;
-  sigprocmask (SIG_SETMASK, oldset, NULL);
+  cs_leave (cs);
   errno = saved_errno;

   if (fd  0)
@@ -699,10 +726,10 @@ pipe_fork (int pipefds[2], size_t tries)
 {
 #if HAVE_WORKING_FORK
   struct tempnode *saved_temphead;
-  sigset_t oldset;
   int saved_errno;
   unsigned int wait_retry = 1;
   pid_t pid IF_LINT (= 0);
+  struct cs_status cs;

   if (pipe (pipefds)  0)
 return -1;
@@ -711,7 +738,7 @@ 

Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Dan Hipschman
On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
 Not to look the gift horse in the mouth, but it'd be nice
 if you wrote ChangeLog entries, too.  And even (gasp! :-)
 a test case or two.  Of course, we'd expect such a test case
 (probably named tests/misc/sort-compress, and based on
 tests/sample-test) to have this line in it:
 
   . $srcdir/../very-expensive
 
 If you don't have time for that, I'll take care of it, eventually.

I'm not going to stop you :-)  I just haven't had the time to look into
it yet, so I've just been running the coreutils tests and then my own
tests.  I was planning on adding the tests, as you say, eventually.

 Default to just "gzip", not "/bin/gzip".  The latter may not exist;
 your patch already handles that, but /bin/gzip may not be the first
 gzip in PATH.  Also, don't bother with the access-X_OK check.
 There's no point in incurring even such a small overhead in the
 general case, when no temporary is used.

The reason I put the access check in there is that if we default to gzip
and it doesn't exist, then of course the exec will fail, the child will
fail, and this will cause sort to fail when it really should just not do
the compression, or try another default if something suitable exists
(what about compress?).  How about we just delay the determination of
the compress program until it's actually needed (e.g., in create_temp,
right before "if (compress_program)", we have "if (compress_program_known)",
and inside the body we check the environment variable and/or do access
checks on possible defaults)?
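In outline (a sketch of that delayed check; get_compress_program is a
made-up helper name, and /bin/gzip is the default from the patch):

#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

static bool compress_program_known;
static char const *compress_program;

/* Decide on a compressor only when the first temp file is created, so
   a missing gzip costs nothing in the common no-temp-file case.  */
static char const *
get_compress_program (void)
{
  if (!compress_program_known)
    {
      compress_program = getenv ("GNUSORT_COMPRESSOR");
      if (!compress_program && access ("/bin/gzip", X_OK) == 0)
        compress_program = "/bin/gzip";
      compress_program_known = true;
    }
  return compress_program;   /* NULL or "": don't compress */
}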

 But please address the FIXME I've added.

If we can't fork a compression process, it's not the end of the world.
We just don't compress that file and sort will still work.  If we can't
fork a decompression process, we can't continue the sort.  So I figure
we'll just try twice to fork compression processes, and if we can't do
it after 1 sec, we're probably wasting more time waiting to fork than we
would doing disk access.  However, we really need to be able to fork
decompression processes, so we can afford to wait a really long time for
it.  I was considering making the number of tries for decompression
processes even larger (now, it'll wait about 2 min before giving up).
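The retry loop then amounts to this (a sketch; the try counts map onto
total wait times as described above):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Try fork() up to TRIES times, doubling the sleep between attempts:
   2 tries gives up after ~1s, 8 tries after ~2min (1+2+...+64s).  */
static pid_t
fork_with_retries (size_t tries)
{
  unsigned int wait_retry = 1;
  pid_t pid;
  while ((pid = fork ()) < 0 && errno == EAGAIN && --tries != 0)
    {
      sleep (wait_retry);
      wait_retry *= 2;
      /* The real code would also reap zombies here, to free slots.  */
    }
  return pid;   /* negative if every attempt failed */
}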

 Have you considered using the gnulib hash module rather than
 rolling your own?  There are examples in many of coreutils/src/*.c.

I'm not familiar with gnulib, so I didn't know a hash module existed or
think to look for one.  Looking at it now, though, it seems it will be
slower because of its abstraction, unless the table fills up to the
point where it would be faster to access if it grew.  I'd prefer (since
it seems to be my time we're talking about) to leave it the way it is
because it's simple, and see if the gnulib module is faster later.
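(For reference, the gnulib version would be roughly this -- an untested
sketch, reusing the patch's proctab_hasher/proctab_comparator names:

#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include "hash.h"   /* gnulib */

struct procnode { pid_t pid; /* ...process state...  */ };

static size_t
proctab_hasher (void const *entry, size_t tabsize)
{
  struct procnode const *node = entry;
  return node->pid % tabsize;
}

static bool
proctab_comparator (void const *e1, void const *e2)
{
  struct procnode const *n1 = e1;
  struct procnode const *n2 = e2;
  return n1->pid == n2->pid;
}

static Hash_table *proctab;

static void
init_proctab (void)
{
  proctab = hash_initialize (47, NULL, proctab_hasher,
                             proctab_comparator, free);
}
)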
We've already seen tests (yours :-) that show the patch improves large
sorts (granted, they used the last patch, not this one), so it's still a
win to put it in, and optimizations later are just gravy.

Thanks for the advice.  I've got some other work I've got to get caught
up with first (trust me, I'd rather do this), but hopefully I can get to
work on it tomorrow.




Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Jim Meyering
Dan Hipschman [EMAIL PROTECTED] wrote:

 On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
 Not to look the gift horse in the mouth, but it'd be nice
 if you wrote ChangeLog entries, too.  And even (gasp! :-)
 a test case or two.  Of course, we'd expect such a test case
 (probably named tests/misc/sort-compress, and based on
 tests/sample-test) to have this line in it:

   . $srcdir/../very-expensive

 If you don't have time for that, I'll take care of it, eventually.

 I'm not going to stop you :-)  I just haven't had the time to look into
 it yet, so I've just been running the coreutils tests and then my own
 tests.  I was planning on adding the tests, as you say, eventually.

 Default to just "gzip", not "/bin/gzip".  The latter may not exist;
 your patch already handles that, but /bin/gzip may not be the first
 gzip in PATH.  Also, don't bother with the access-X_OK check.
 There's no point in incurring even such a small overhead in the
 general case, when no temporary is used.

 The reason I put the access check in there is that if we default to gzip
 and it doesn't exist, then of course the exec will fail, the child will

This is a good argument for using libz by default, not a separate
gzip program.  Why incur the overhead of an exec when we don't need to?
Now, I'm convinced that sort should provide built-in support for both
gzip and bzip2.  How to select built-in vs. actually exec the program
is something to think about...
Of course it should still be possible to specify some other program.
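The selection might boil down to something like this (a sketch; these
names are illustrative, not from the patch):

#include <string.h>

enum compress_impl { COMPRESS_NONE, COMPRESS_LIBZ, COMPRESS_LIBBZ2,
                     COMPRESS_EXEC };

/* Known names get linked-in library code; anything else (if we keep
   allowing it) falls back to fork/exec of the named program.  */
static enum compress_impl
choose_impl (char const *name)
{
  if (!name || !*name)
    return COMPRESS_NONE;
  if (strcmp (name, "gzip") == 0)
    return COMPRESS_LIBZ;
  if (strcmp (name, "bzip2") == 0)
    return COMPRESS_LIBBZ2;
  return COMPRESS_EXEC;
}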

 fail, and this will cause sort to fail when it really should just not do
 the compression, or try another default if something suitable exists
 (what about compress?).  How about we just delay the determination of

compress?  no thank you! :-)

 the compress program until it's actually needed (e.g., in create_temp
 right before if (compress_program) we have if (compress_program_known)
 and inside the body we check the environment variable and/or do access
 checks on possible defaults)?

 But please address the FIXME I've added.

 If we can't fork a compression process, it's not the end of the world.
 We just don't compress that file and sort will still work.  If we can't
 fork a decompression process, we can't continue the sort.  So I figure
 we'll just try twice to fork compression processes, and if we can't do
 it after 1 sec, we're probably wasting more time waiting to fork than we
 would doing disk access.  However, we really need to be able to fork
 decompression processes, so we can afford to wait a really long time for
 it.  I was considering making the number of tries for decompression
 processes even larger (now, it'll wait about 2 min before giving up).

 Have you considered using the gnulib hash module rather than
 rolling your own?  There are examples in many of coreutils/src/*.c.

 I'm not familiar with gnulib, so I didn't know a hash module existed or
 think to look for one.  Looking at it now, though, it seems it will be
 slower because of its abstraction, unless the table fills up to the

Performance isn't the issue here, but code-reuse.
I would be very surprised if changing hash table implementations
has any measurable effect on sort's performance.

 point where it would be faster to access if it grew.  I'd prefer (since
 it seems to be my time we're talking about) to leave it the way it is
 because it's simple, and see if the gnulib module is faster later.



Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Dan Hipschman
On Sun, Jan 21, 2007 at 10:41:11PM +0100, Jim Meyering wrote:
 This is a good argument for using libz by default, not a separate
 gzip program.  Why incur the overhead of an exec when we don't need to?
 Now, I'm convinced that sort should provide built-in support for both
 gzip and bzip2.  How to select built-in vs. actually exec the program
 is something to think about...
 Of course it should still be possible to specify some other program.

Well, was there anything wrong with my original LZO patch?  It shouldn't
be too hard to replace LZO with libz.  I don't disagree with you about
using libz in addition to an external program, but if the current patch
gives performance gains, why not check it in (once the wrinkles are
ironed out) and add more optimizations later?  At least that way I can
take a break (aha, the real motivation comes out), and pick up where I
left off later without worrying about the code changing under my feet.




Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Dan Hipschman

I have a good feeling about this one :-)  I think I've addressed
everything except adding libz.  I just don't think I have the time to do
that right now, and I think it can wait.  I can look into writing some
test cases in a bit.

2007-01-21  Jim Meyering  [EMAIL PROTECTED]

* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
In pipe_fork callers, use these named constants, not 2 and 8.
(proctab, nprocs): Declare to be static.
(pipe_fork) [lint]: Initialize local, pid,
to avoid unwarranted may-be-used-uninitialized warning.
(create_temp): Use the active voice.  Describe parameters, too.

2007-01-21  James Youngman  [EMAIL PROTECTED]

Centralize all the uses of sigprocmask().  Don't restore an invalid
saved mask.
* src/sort.c (enter_cs, leave_cs): New functions for protecting
code sequences against signal delivery.
* (exit_cleanup): Use enter_cs and leave_cs instead of
calling sigprocmask directly.
(create_temp_file, pipe_fork, zaptemp): Likewise

2007-01-21  Dan Hipschman  [EMAIL PROTECTED]

Add compression of temp files to sort.
* NEWS: Mention this.
* bootstrap.conf: Import findprog.
* configure.ac: Add AC_FUNC_FORK.
* doc/coreutils.texi: Document GNUSORT_COMPRESSOR environment
variable.
* src/sort.c (compress_program): New global, holds the name of the
external compression program.
(struct sortfile): New type used by mergefps and friends instead
of filenames to hold PIDs of compressor processes.
(proctab): New global, holds compressor PIDs on which to wait.
(enum procstate, struct procnode): New types used by proctab.
(proctab_hasher, proctab_comparator): New functions for proctab.
(nprocs): New global, number of forked but unreaped children.
(reap, reap_some): New function, wait for/cleanup forked processes.
(register_proc, update_proc, wait_proc): New functions for adding,
modifying and removing proctab entries.
(create_temp_file): Change parameter type to pointer to file
descriptor, and return type to pointer to struct tempnode.
(dup2_or_die): New function used in create_temp and open_temp.
(pipe_fork): New function, creates a pipe and child process.
(create_temp): Creates a temp file and possibly a compression
program to which we filter output.
(open_temp): Opens a compressed temp file and creates a
decompression process through which to filter the input.
(mergefps): Change FILES parameter type to struct sortfile array
and update access accordingly.  Use open_temp and reap_some.
(avoid_trashing_input, merge): Change FILES parameter like
mergefps and call create_temp instead of create_temp_file.
(sort): Call create_temp instead of create_temp_file.
Use reap_some.
(avoid_trashing_input, merge, sort, main): Adapt to mergefps.

Index: NEWS
===
RCS file: /sources/coreutils/coreutils/NEWS,v
retrieving revision 1.467
diff -p -u -r1.467 NEWS
--- NEWS    18 Jan 2007 09:18:21 -0000    1.467
+++ NEWS    22 Jan 2007 03:44:29 -0000
@@ -29,6 +29,11 @@ GNU coreutils NEWS  
 
   rm --interactive=never F no longer prompts for an unwritable F
 
+** New features
+
+  sort can now compresses temporary files to improve performance of
+  very large sorts.
+
 
 * Noteworthy changes in release 6.7 (2006-12-08) [stable]
 
Index: bootstrap.conf
===
RCS file: /sources/coreutils/coreutils/bootstrap.conf,v
retrieving revision 1.15
diff -p -u -r1.15 bootstrap.conf
--- bootstrap.conf  15 Jan 2007 10:33:52 -  1.15
+++ bootstrap.conf  22 Jan 2007 03:44:29 -
@@ -43,7 +43,8 @@ gnulib_modules=
config-h configmake
closeout cycle-check d-ino d-type diacrit dirfd dirname dup2
error euidaccess exclude exitfail fchdir fcntl fcntl-safer fdl
-   file-type fileblocks filemode filenamecat fnmatch-gnu fopen-safer
+   file-type fileblocks filemode filenamecat findprog fnmatch-gnu
+   fopen-safer
fprintftime fsusage ftruncate fts getdate getgroups gethrxtime
getline getloadavg getndelim2 getopt getpagesize getpass-gnu
gettext gettime gettimeofday getugroups getusershell gnupload
Index: configure.ac
===
RCS file: /sources/coreutils/coreutils/configure.ac,v
retrieving revision 1.102
diff -p -u -r1.102 configure.ac
--- configure.ac    28 Dec 2006 08:18:17 -0000    1.102
+++ configure.ac    22 Jan 2007 03:44:29 -0000
@@ -39,6 +39,8 @@ gl_EARLY
 gl_INIT
 coreutils_MACROS
 
+AC_FUNC_FORK
+
 AC_CHECK_FUNCS(uname,
OPTIONAL_BIN_PROGS=$OPTIONAL_BIN_PROGS 

Re: feature request: gzip/bzip support for sort

2007-01-21 Thread Bauke Jan Douma

Dan Hipschman wrote on 22-01-07 05:55:


sort can now compresses


Small typo here.


bjd






Re: feature request: gzip/bzip support for sort

2007-01-19 Thread Dan Hipschman

Hi,

I think this patch addresses everything Paul mentioned in his critique
of my last attempt.  I did look at gnulib pipe module, but there were
some problems with using it out of the box.  First, it takes a
filename as the stdout of the child process, when for temp files it's
better to pass a file descriptor.  That's a rather small issue, though,
and it would be pretty easy to add an additional function to fix that.
Another problem is that since it does the fork/exec in one call, the
race condition where the child and parent both receive signals between
the fork and child's exec can't be solved the way I solve it in this
patch.  Rather than change the way sort handles signals, I pleaded with
Paul (well, I asked him nicely :-) to be able to put off the portability
issues of fork until I had more time.  Hence, I've just added a check
for fork() in configure.ac and any system that doesn't have it will have
to wait a while for this feature.

I also tried to address Jim's concern about fork() failing.  I clean up
zombie children every so often in the main loops of sort() and
mergefps() so as not to bleed the system dry.  This seems to keep the
number of defunct and running child processes close to a minimum.  If we
can't fork, I retry some number of times, sleeping a bit in between and
killing off zombies (sounds like a good time :-).  I was able to test
the case where fork() fails with a macro like this:

unsigned int foo;
#define fork() (foo++ % 3 ? fork() : (errno = EAGAIN, -1))

I'm not sure how to test the case where we fork a compression process,
it exits, we reap it, and we create another process with the same PID
before we run the decompression process on the file it created.  I did
address this issue, and of course I *think* my solution is correct, but
like I said, I'm not sure how to test it.

Anyway, here's the patch.


Index: NEWS
===
RCS file: /sources/coreutils/coreutils/NEWS,v
retrieving revision 1.467
diff -p -u -r1.467 NEWS
--- NEWS    18 Jan 2007 09:18:21 -0000    1.467
+++ NEWS    20 Jan 2007 05:49:27 -0000
@@ -29,6 +29,11 @@ GNU coreutils NEWS  
 
   rm --interactive=never F no longer prompts for an unwritable F
 
+** New features
+
+  sort can now compresses temporary files to improve performance of
+  very large sorts.
+
 
 * Noteworthy changes in release 6.7 (2006-12-08) [stable]
 
Index: configure.ac
===
RCS file: /sources/coreutils/coreutils/configure.ac,v
retrieving revision 1.102
diff -p -u -r1.102 configure.ac
--- configure.ac    28 Dec 2006 08:18:17 -0000    1.102
+++ configure.ac    20 Jan 2007 05:49:27 -0000
@@ -39,6 +39,8 @@ gl_EARLY
 gl_INIT
 coreutils_MACROS
 
+AC_FUNC_FORK
+
 AC_CHECK_FUNCS(uname,
OPTIONAL_BIN_PROGS=$OPTIONAL_BIN_PROGS uname\$(EXEEXT)
MAN=$MAN uname.1)
Index: doc/coreutils.texi
===
RCS file: /sources/coreutils/coreutils/doc/coreutils.texi,v
retrieving revision 1.363
diff -p -u -r1.363 coreutils.texi
--- doc/coreutils.texi  4 Jan 2007 11:04:46 -   1.363
+++ doc/coreutils.texi  20 Jan 2007 05:49:27 -
@@ -3411,6 +3411,19 @@ value as the directory for temporary fil
 @option{--temporary-directory} (@option{-T}) option in turn overrides
 the environment variable.
 
+@vindex GNUSORT_COMPRESSOR
+To improve performance when sorting very large files, GNU sort will,
+by default, try to compress temporary files with the program
+@command{/bin/gzip}.  The environment variable @env{GNUSORT_COMPRESSOR}
+can be set to the name of another program to be used.  The program
+specified must compress standard input to standard output when no
+arguments are given to it, and it must decompress standard input to
+standard output when the @option{-d} argument is given to it.  To
+disable compression of temporary files, set the variable to the empty
+string.  Whitespace characters and the backslash character are
+reserved for future use in the program name.  If the program exits
+with nonzero status, sort will terminate with an error.
+
 
 The following options affect the ordering of output lines.  They may be
 specified globally or as part of a specific key field.  If no key
Index: src/sort.c
===
RCS file: /sources/coreutils/coreutils/src/sort.c,v
retrieving revision 1.345
diff -p -u -r1.345 sort.c
--- src/sort.c  19 Jan 2007 22:28:31 -  1.345
+++ src/sort.c  20 Jan 2007 05:49:29 -
@@ -1,5 +1,5 @@
 /* sort - sort lines of text (with all kinds of options).
-   Copyright (C) 1988, 1991-2006 Free Software Foundation, Inc.
+   Copyright (C) 1988, 1991-2007 Free Software Foundation, Inc.
 
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published 

Re: feature request: gzip/bzip support for sort

2007-01-18 Thread Jim Meyering
Jim Meyering [EMAIL PROTECTED] wrote:

 Dan Hipschman [EMAIL PROTECTED] wrote:
 On Sat, Jan 13, 2007 at 10:07:59PM -0800, Paul Eggert wrote:
 3.  I can see where the user might be able to specify a better
 algorithm, for a particular data set.  For that, how about if we have
 a --compress-program=PROGRAM option, which lets the user plug in any
 program that works as a pipeline?  E.g., --compress-program=gzip would
 use gzip.  The default would be to use PROGRAM -d to decompress; we
 could have another option if that doesn't suffice.

 This is pretty close.  There are just a few more problems that I see:

 One more thing to consider:

 In your proposed patch, a failed fork evokes an immediate exit.
 In that case, it may be worthwhile to retry, or even to revert
 to using gzip directly.

Upon rereading, I see the latter part is not clear.  I mean resorting
to the use of a direct (non-forking) implementation, like libz.
This, of course, assumes we're using gzip.



Re: feature request: gzip/bzip support for sort

2007-01-18 Thread Philip Rowlands

On Thu, 18 Jan 2007, Jim Meyering wrote:


I've done some more timings, but with two more sizes of input.
Here's the summary, comparing straight sort with sort --comp=gzip:

 2.7GB:   6.6% speed-up
 10.0GB: 17.8% speed-up


It would be interesting to see the individual stats returned by wait4(2) 
from the child, to separate CPU seconds spent in sort itself, and in the 
compression/decompression forks.
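Something along these lines would expose those numbers (a sketch;
wait4 is the BSD/Linux interface, not POSIX):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Report the CPU time a compressor child used, separately from
   sort's own, once it has exited.  */
static void
report_child_cpu (pid_t child)
{
  int status;
  struct rusage ru;
  if (wait4 (child, &status, 0, &ru) == child)
    fprintf (stderr, "pid %ld: %ld.%06lds user, %ld.%06lds system\n",
             (long) child,
             (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec,
             (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
}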


I think allowing an environment variable to define the compressor is a 
good idea, so long as there's a corresponding --nocompress override 
available from the command line.



 $ seq 999 > k
 $ cat k k k k k k k k k > j
 $ cat j j j j > sort-in
 $ wc -c sort-in
 283968 sort-in


I had to use seq -f %.0f to get this filesize.


With --compress=gzip:
 $ /usr/bin/time ./sort -T. --compress=gzip < sort-in > out
 814.07user 29.97system 14:50.16elapsed 94%CPU (0avgtext+0avgdata 
0maxresident)k  0inputs+0outputs (4major+2821589minor)pagefaults 0swaps


There's a big difference in the time spent on gzip compression depending 
on the -1/-9 option (default -6). For a similar seq-generated data set 
above, I get

gzip -1: User time (seconds): 48.63, output size is 6% of input
gzip -9: User time (seconds): 952.97, output size is 3% of input

Decompression time for both tests shows less variation (25s vs 21s).

This would suggest the elapsed time to sort can be improved by trading 
compression ratio for less CPU time. Obviously a critical factor is the 
disk latency.



Cheers,
Phil



Re: feature request: gzip/bzip support for sort

2007-01-18 Thread Jim Meyering
Philip Rowlands [EMAIL PROTECTED] wrote:

 On Thu, 18 Jan 2007, Jim Meyering wrote:

 I've done some more timings, but with two more sizes of input.
 Here's the summary, comparing straight sort with sort --comp=gzip:

  2.7GB:   6.6% speed-up
  10.0GB: 17.8% speed-up

 It would be interesting to see the individual stats returned by wait4(2)
 from the child, to separate CPU seconds spent in sort itself, and in the
 compression/decompression forks.

 I think allowing an environment variable to define the compressor is a
 good idea, so long as there's a corresponding --nocompress override
 available from the command line.

  $ seq 999 > k
  $ cat k k k k k k k k k > j
  $ cat j j j j > sort-in
  $ wc -c sort-in
  283968 sort-in

 I had to use seq -f %.0f to get this filesize.

Odd.
Here's what those generate for me:

  $ seq 999 > k
  $ wc -c < k
  7888

  $ tail -1 k
  999

The remaining cat commands merely write 36 copies of that data to sort-in:

  $ (wc -c < k|tr -d '\n'; echo '* 36')|bc
  283968

What happens differently for you?



Re: feature request: gzip/bzip support for sort

2007-01-18 Thread Philip Rowlands

On Thu, 18 Jan 2007, Jim Meyering wrote:


I had to use seq -f %.0f to get this filesize.


Odd.
Here's what those generate for me:

 $ seq 999 > k
 $ wc -c < k
 7888

 $ tail -1 k
 999

What happens differently for you?


$ seq 990 999
9.9e+06
9.9e+06
9.9e+06
9.9e+06
9.9e+06
1e+07
1e+07
1e+07
1e+07
1e+07
[EMAIL PROTECTED]:~$ seq --version
seq (GNU coreutils) 5.93

I'm using an old version of seq with a different default format (I
didn't realise it had changed recently.)



Cheers,
Phil



Re: feature request: gzip/bzip support for sort

2007-01-18 Thread Dan Hipschman
On Thu, Jan 18, 2007 at 05:47:53PM -0800, Dan Hipschman wrote:
 That's a thought, although libz only works with gzip (as you said), and
 it will add more complexity (like my original patch using LZO and this
 new one combined).  I don't think we'll have 40 instances of gzip -d
 running.  We should only need at most one compression process, and
 NMERGE (16) decompression processes running at one time.  I think
 retrying fork if it fails is a good idea, and I've already added that
 since I read your mail.

Apologies, I spoke too soon.  We shouldn't have 40 *running* processes,
but we generate a lot of defunct ones.  I'm working on cleaning them up
sooner and then hopefully we can bound the number of child processes at
NMERGE + 1.
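The cleanup boils down to an opportunistic, non-blocking waitpid loop
(a sketch; nprocs and update_proc are the patch's names):

#include <sys/types.h>
#include <sys/wait.h>

extern size_t nprocs;                   /* unreaped children (patch) */
extern void update_proc (pid_t pid);    /* mark PID reaped (patch) */

/* Reap any children that have already exited, without blocking, so
   defunct processes don't pile up past the NMERGE + 1 live ones.  */
static void
reap_some (void)
{
  int status;
  pid_t pid;
  while (0 < nprocs && (pid = waitpid (-1, &status, WNOHANG)) > 0)
    {
      update_proc (pid);
      nprocs--;
    }
}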



Re: feature request: gzip/bzip support for sort

2007-01-16 Thread Jim Meyering
Dan Hipschman [EMAIL PROTECTED] wrote:
 Here's the patch for comments.  Thanks,

I tried it and did some timings.
Bottom line: with a 4+GB file, dual-processor, I see a 19% speed-up,
but I think most of the savings is in reduced I/O.


Virtually no difference (~5%) for a file of size 324M, created like
this, running on a uniprocessor amd-64 3400:

  $ seq 9 > k
  $ cat k k k k k k k k k k k k k k k k k k k k k k k k > j
  $ mv j k
  $ cat k k k k k k k k k k k k k k k k k k k k k k k k > j
  $ shuf < j > sort-in

  $ /usr/bin/time ./sort --compress=gzip < sort-in > out
  100.11user 4.69system 1:48.67elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (1major+761524minor)pagefaults 0swaps
  $ /usr/bin/time ./sort < sort-in > out
  93.16user 3.35system 1:40.35elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+137435minor)pagefaults 0swaps

-
Trying similar, but with a 4.2GB file created like this,
running on a dual-processor with 2GB of RAM.

  $ seq 999 > k                    # 10M lines / 7888 bytes
  $ cat k k k k k k k k k > j      # 90M lines
  $ cat j j j j j j j > k          # 630M lines / 4.62 GB
  $ mv k sort-in

What does /tmp look like, after a few minutes?

  $ du -sh /tmp/sort*
  11M /tmp/sort0Sdnkk
  11M /tmp/sort6cTaAE
  11M /tmp/sortABogDY
  ...
--- contrast with sizes during the run w/no compression:
  216M/tmp/sort5NTqQu
  216M/tmp/sortAjx50R
  216M/tmp/sortKvyGIT
  ...

$ /usr/bin/time ./sort -T /tmp --compress=gzip < sort-in > out
1535.33user 71.76system 27:15.72elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+4985207minor)pagefaults 0swaps
$ /usr/bin/time ./sort -T /tmp < sort-in > out

$ /usr/bin/time ./sort -T /tmp < sort-in > out
./sort: write failed: /tmp/sortieA1nv: No space left on device
Command exited with non-zero status 2
588.79user 17.76system 17:20.70elapsed 58%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+225191minor)pagefaults 0swaps
[Exit 2]
$ df -hT /tmp
FilesystemTypeSize  Used Avail Use% Mounted on
/dev/sda3 reiserfs 12G  6.4G  5.4G  55% /

$ /usr/bin/time ./sort -T . < sort-in > out
754.03user 38.99system 33:42.35elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+210437minor)pagefaults 0swaps

So, with just one trial each, I see a 19% speed-up.



Re: feature request: gzip/bzip support for sort

2007-01-16 Thread Paul Eggert
Jim Meyering [EMAIL PROTECTED] writes:

 So, with just one trial each, I see a 19% speed-up.

Yaayyy!  That's good news.  Thanks for timing it.  I read your email
just after talking with Dan (in person) about how we'd time it.  I
just bought 1 TB worth of disk for my home computer and hadn't hooked
it up yet, so was going to volunteer that, but you beat me to it.

At this point to my mind the only question is how to put this change
in, and whether to make it the default (with gzip, say).  Clearly it
can lead to a big performance improvement with large sorts on modern
machines.



Re: feature request: gzip/bzip support for sort

2007-01-16 Thread Bauke Jan Douma

Paul Eggert wrote on 16-01-07 18:35:

Jim Meyering [EMAIL PROTECTED] writes:


So, with just one trial each, I see a 19% speed-up.


Yaayyy!  That's good news.  Thanks for timing it.  I read your email
just after talking with Dan (in person) about how we'd time it.  I
just bought 1 TB worth of disk for my home computer and hadn't hooked
it up yet, so was going to volunteer that, but you beat me to it.

At this point to my mind the only question is how to put this change
in, and whether to make it the default (with gzip, say).  Clearly it
can lead to a big performance improvement with large sorts on modern
machines.


How about files of 'average' size?

Someone said:

  sort: Compress temporary files when doing large external sort/merges.
 This improves performance when you can compress/uncompress faster than
 you can read/write, which is common in these days of fast CPUs.

If this statement is true, you get the best of both worlds.
In that case I'd just make it the default (which is just that,
the default, and implies an option to choose the 'old' behavior).

bjd





Re: feature request: gzip/bzip support for sort

2007-01-16 Thread Dan Hipschman
On Tue, Jan 16, 2007 at 01:20:16PM +0100, Jim Meyering wrote:
 I tried it and did some timings.
 Bottom line: with a 4+GB file, dual-processor, I see a 19% speed-up,
 but I think most of the savings is in reduced I/O.

Thanks very much for doing this.  The performance gain is good news
indeed.




Re: feature request: gzip/bzip support for sort

2007-01-16 Thread James Youngman

On 1/16/07, Dan Hipschman [EMAIL PROTECTED] wrote:

On Tue, Jan 16, 2007 at 01:20:16PM +0100, Jim Meyering wrote:
 I tried it and did some timings.
 Bottom line: with a 4+GB file, dual-processor, I see a 19% speed-up,
 but I think most of the savings is in reduced I/O.

Thanks very much for doing this.  The performance gain is good news
indeed.


How does this compare with using 'cat' as the compressor?

James.




Re: feature request: gzip/bzip support for sort

2007-01-16 Thread Dan Hipschman
On Tue, Jan 16, 2007 at 09:35:52AM -0800, Paul Eggert wrote:
 At this point to my mind the only question is how to put this change
 in, and whether to make it the default (with gzip, say).  Clearly it
 can lead to a big performance improvement with large sorts on modern
 machines.

I think it should be triggered with an environment variable, say
GNUSORT_COMPRESS=gzip (or whatever).  That way, it's got the convenience
of a default option, but it's easily configurable.  Setting the variable
to the empty string (or perhaps just not setting the variable) could
disable the behaviour.
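
In code, the hook could be as small as this sketch (untested; the
variable name is just the one I'm proposing above, and
compress_program is a hypothetical global):

  #include <stdlib.h>

  static const char *compress_program;

  static void
  init_compress_program (void)
  {
    /* Empty or unset disables compression of temp files.  */
    const char *prog = getenv ("GNUSORT_COMPRESS");
    compress_program = (prog && *prog) ? prog : NULL;
  }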





Re: feature request: gzip/bzip support for sort

2007-01-15 Thread Dan Hipschman
On Sat, Jan 13, 2007 at 10:07:59PM -0800, Paul Eggert wrote:
 3.  I can see where the user might be able to specify a better
 algorithm, for a particular data set.  For that, how about if we have
 a --compress-program=PROGRAM option, which lets the user plug in any
 program that works as a pipeline?  E.g., --compress-program=gzip would
 use gzip.  The default would be to use PROGRAM -d to decompress; we
 could have another option if that doesn't suffice.
 
 An advantage of (3) is that it should work well on two-processor
 hosts, since compression can be done in one CPU while sorting is done
 on another.  (Hmm, perhaps we should consider forking even if we use a
 built-in default compressor, for the same reason.)

I've started working on this, and have made good progress so far.  There
are a lot of subtleties, though, like making sure the forked child
doesn't receive SIGINT and unlink all our temp files before it execs
(I've solved that problem), and making sure the compress process
finishes compressing the temp file before the corresponding decompress
process starts processing it (I've got a plan for that).  Anyway, my
point is, I've gotten off to a good start, but it's going to take a lot
of testing to make sure I've done it right due to all these race
conditions.

The actual compression is obviously a lot better (using gzip / bzip2),
and it shouldn't be hard to extend the code so sort can read and write
externally compressed files, which is what the OP wanted.  It's not
faster (not even close) on my machine, though.  Of course, I've only got
one CPU, and a slow one at that :-)

Dan





Re: feature request: gzip/bzip support for sort

2007-01-15 Thread Dan Hipschman
On Sat, Jan 13, 2007 at 10:07:59PM -0800, Paul Eggert wrote:
 3.  I can see where the user might be able to specify a better
 algorithm, for a particular data set.  For that, how about if we have
 a --compress-program=PROGRAM option, which lets the user plug in any
 program that works as a pipeline?  E.g., --compress-program=gzip would
 use gzip.  The default would be to use PROGRAM -d to decompress; we
 could have another option if that doesn't suffice.

This is pretty close.  There are just a few more problems that I see:

* It doesn't respond well to SIGPIPE.  It terminates without
success, so the user at least knows something went wrong if they
check the status, but it doesn't print an error message.

* It checks the exit status of decompression processes at the
very end, so the user is notified (with an error message and
non-successful exit code) if a decompression process fails, but
only after the output has been generated.  I don't think there's
anything we can do about this in general, though.

* If we can't fork/exec a decompression process, we're hosed,
but if we can't fork/exec a compression process, a possible
improvement might be to just proceed without compression (and
maybe print a warning); see the sketch after this list.  On the
other hand, the user might want us to terminate (e.g., if they
just misspelled the name of the program).

* Like Paul said, we might want a separate decompress program,
and the user might want to pass in options (e.g.,
--compress-program="gzip -1").
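
On the fork/exec fallback, here's a rough sketch of what I mean
(untested; the helper name and its contract are invented for
illustration):

  #include <sys/types.h>
  #include <unistd.h>

  /* Try to interpose PROG between us and OUT_FD.  On failure, point
     *WRITE_FD at OUT_FD so the caller writes uncompressed data and
     return -1; otherwise return the child's pid and point *WRITE_FD
     at a pipe feeding the compressor.  */
  static pid_t
  maybe_start_compressor (const char *prog, int out_fd, int *write_fd)
  {
    int fds[2];
    pid_t pid;

    if (pipe (fds) != 0)
      {
        *write_fd = out_fd;          /* no pipe: proceed uncompressed */
        return -1;
      }
    pid = fork ();
    if (pid < 0)
      {
        close (fds[0]);
        close (fds[1]);
        *write_fd = out_fd;          /* fork failed: proceed uncompressed */
        return -1;
      }
    if (pid == 0)
      {
        dup2 (fds[0], STDIN_FILENO);   /* child reads plain text...   */
        dup2 (out_fd, STDOUT_FILENO);  /* ...and writes compressed    */
        close (fds[0]);
        close (fds[1]);
        execlp (prog, prog, (char *) NULL);
        _exit (127);                   /* exec failed */
      }
    close (fds[0]);
    *write_fd = fds[1];
    return pid;
  }

Note that a failed exec (e.g., a misspelled program name) is still
only detected when we reap the child, which is exactly the hard case
above.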

I'm still testing, but so far it's looking pretty good.  If you think
this patch has a good chance of making it into coreutils, then I'll keep
working on it.  It does a good job of compression, so I think it will
solve (most of) the OP's problem, but it's not a speed booster on my
machine.  If someone with a multi-processor computer wants to test it
maybe we can get some good news on that front.

Here's the patch for comments.  Thanks,
Dan


Index: sort.c
===================================================================
RCS file: /sources/coreutils/coreutils/src/sort.c,v
retrieving revision 1.344
diff -p -u -r1.344 sort.c
--- sort.c  13 Dec 2006 21:27:05 -0000  1.344
+++ sort.c  16 Jan 2007 01:34:50 -0000
@@ -1,5 +1,5 @@
 /* sort - sort lines of text (with all kinds of options).
-   Copyright (C) 1988, 1991-2006 Free Software Foundation, Inc.
+   Copyright (C) 1988, 1991-2007 Free Software Foundation, Inc.
 
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -25,6 +25,7 @@
 
 #include <getopt.h>
 #include <sys/types.h>
+#include <sys/wait.h>
 #include <signal.h>
 #include "system.h"
 #include "error.h"
@@ -261,6 +262,9 @@ static bool have_read_stdin;
 /* List of key field comparisons to be tried.  */
 static struct keyfield *keylist;
 
+/* Program used to (de)compress temp files.  Must accept -d.  */
+static const char *compress_program;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -328,6 +332,9 @@ Other options:\n\
   multiple options specify multiple directories\n\
   -u, --unique              with -c, check for strict ordering;\n\
                             without -c, output only the first of an equal run\n\
+  --compress-program=PROG   (de)compress temporaries with PROG, which must\n\
+                            (de)compress stdin to stdout, and accept the -d\n\
+                            command line option for decompressing\n\
 ), DEFAULT_TMPDIR);
   fputs (_(\
   -z, --zero-terminated end lines with 0 byte, not newline\n\
@@ -364,7 +371,8 @@ native byte values.\n\
non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
 enum
 {
-  RANDOM_SOURCE_OPTION = CHAR_MAX + 1
+  RANDOM_SOURCE_OPTION = CHAR_MAX + 1,
+  COMPRESS_PROGRAM
 };
 
 static char const short_options[] = "-bcdfgik:mMno:rRsS:t:T:uy:z";
@@ -373,6 +381,7 @@ static struct option const long_options[
 {
   {"ignore-leading-blanks", no_argument, NULL, 'b'},
   {"check", no_argument, NULL, 'c'},
+  {"compress-program", required_argument, NULL, COMPRESS_PROGRAM},
   {"dictionary-order", no_argument, NULL, 'd'},
   {"ignore-case", no_argument, NULL, 'f'},
   {"general-numeric-sort", no_argument, NULL, 'g'},
@@ -403,6 +412,7 @@ static sigset_t caught_signals;
 struct tempnode
 {
   struct tempnode *volatile next;
+  pid_t pid; /* If compressed, the pid of compressor, else zero */
   char name[1];  /* Actual size is 1 + file name length.  */
 };
 static struct tempnode *volatile temphead;
@@ -422,8 +432,8 @@ cleanup (void)
 /* Create a new temporary file, returning its newly allocated name.
Store into *PFP a stream open for writing.  */
 
-static char *
-create_temp_file (FILE **pfp)
+static struct tempnode *

Re: feature request: gzip/bzip support for sort

2007-01-15 Thread Paul Eggert
Thanks very much for looking into that.  Some comments:

Dan Hipschman [EMAIL PROTECTED] writes:

 +  /* This is so the child process won't delete our temp files
 +     if it receives a signal before exec-ing.  */
 +  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
 +  saved_temphead = temphead;
 +  temphead = NULL;
 +
 +  if ((pid = fork ()) < 0)
 +    error (SORT_FAILURE, errno, _("couldn't fork"));
 +  else if (0 < pid)
 +    {
 +      temphead = saved_temphead;
 +      sigprocmask (SIG_SETMASK, &oldset, NULL);
 +    }
 +  else
 +    {
 +      sigprocmask (SIG_SETMASK, &oldset, NULL);

There is a path out of that code that does not restore the signal
mask, which sounds dangerous.  The usual pattern looks
more like this:

  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
  unlink_status = unlink (name);
  unlink_errno = errno;
  *pnode = next;
  sigprocmask (SIG_SETMASK, &oldset, NULL);

The critical section is as small as possible, and it always restores
the signal mask.  Any tests for whether the critical section succeeded
or not are deferred until after the critical section is over.
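
Applied to the fork above, the same shape might look like this
(sketch, untested, using the names from your patch):

  sigset_t oldset;
  struct tempnode *saved_temphead;
  pid_t pid;

  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
  saved_temphead = temphead;
  temphead = NULL;               /* child must not unlink our temp files */
  pid = fork ();
  if (pid != 0)                  /* parent, or fork failure */
    temphead = saved_temphead;
  sigprocmask (SIG_SETMASK, &oldset, NULL);   /* restored on every path */

  if (pid < 0)
    error (SORT_FAILURE, errno, _("couldn't fork"));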


 +  close_or_die (STDIN_FILENO, _("standard input"));
 +  close_or_die (STDOUT_FILENO, _("standard output"));

I don't see why we care whether these 'close' calls fail.
The file descriptors are junk at this point.  In general, we
shouldn't test for 'close' failures unless we actually care
about whether the file's I/O went well.  There are other
places where this is an issue.

 +  if ((node->pid = pipe_fork (pipefds)))

These days the usual style for 'if' is:

 node->pid = pipe_fork (pipefds);
 if (node->pid)

 +  pid_t *pids = xnmalloc (nfiles, sizeof *pids);
 +  memset (pids, 0, nfiles * sizeof *pids);

This looks like a case for xcalloc.
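That is, something like (gnulib's xcalloc allocates zero-filled
memory in one call):

  pid_t *pids = xcalloc (nfiles, sizeof *pids);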

 +  while (reap (-1))
 +    continue;

This one has me worried a bit, but perhaps I'm worrying too much.

More thoughts:

I'm not that worried about signal handling being propagated to the
child compress/decompress processes.  Let them worry about their own
signal handling and cleanup.  They shouldn't run for long if we exit.
However, we shouldn't block or mask out signals for them; they should
inherit the same signal mask etc. that 'sort' does.

coreutils doesn't invoke pipe() anywhere else, and it rarely invokes
fork(), so we have to be prepared for the usual portability gotchas in
this area.  Could you please look at the gnulib pipe module and see
whether it solves any problems here?  Perhaps coreutils can use it
directly, or add options to it to make it usable.  It does reflect a
lot of implementation wisdom.  Even if we don't use the pipe module we
should at least be aware of the problems it fixes.

vfork is more efficient than fork on many platforms, particularly when
the parent process is large (which can easily be the case here).  It
has some porting issues, though (as you can see if you look at diff3
and/or sdiff).

We'll need documentation, of course (coreutils.texi and NEWS).  The
documentation should reserve space and backslash in the compress
program name, for future use.




Re: feature request: gzip/bzip support for sort

2007-01-14 Thread Dan Hipschman

OK, here's my current patch and theory on why it's still not fast
enough.  Here's profiling info on the cvs version of sort, up to
write_bytes:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 38.09  7.85 7.85 memcoll
 22.42 12.47 4.62  3827913 0.00 0.00  compare
 13.73 15.30 2.83   133249 0.00 0.00  mergelines
  6.94 16.73 1.43 xmemcoll
  5.92 17.95 1.22 1 1.22  2.84  mergefps
  4.44 18.87 0.92   88 0.01 0.01  fillbuf
  1.89 19.26 0.39 1 0.39 10.50  sort
  1.70 19.61 0.35 xalloc_die
  1.31 19.88 0.27 md5_process_block
  1.02 20.09 0.21 keycompare
  0.85 20.26 0.1858368 0.00 0.00  sortlines_temp
  0.66 20.40 0.1458368 0.00 0.00  sortlines
  0.63 20.53 0.13   50 0.00 0.00  write_bytes

Now here's the version compressing temp files, with a few irrelevant
things removed:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 42.09  8.70 8.70 memcoll
 20.20 12.88 4.18  3827913 0.00 0.00  compare
 12.19 15.40 2.52   133249 0.00 0.00  mergelines
  6.39 16.72 1.32 xmemcoll
  4.57 17.66 0.95 1 0.95  2.63  mergefps
  4.40 18.57 0.91   88 0.01 0.01  fillbuf
  1.79 18.94 0.37   50 0.00 0.00  compress_and_write_bytes
  1.74 19.30 0.36 md5_process_block
  1.60 19.63 0.33 1 0.33  9.87  sort
  1.40 19.92 0.2958368 0.00 0.00  sortlines_temp
  1.40 20.21 0.29 xalloc_die
  1.04 20.43 0.2258368 0.00 0.00  sortlines
  0.65 20.56 0.14 keycompare
  0.15 20.59 0.03   251662 0.00 0.00  write_bytes
  0.15 20.62 0.03  569 0.00 0.00  load_compression_buffer
  0.05 20.66 0.01 6688 0.00 0.00  read_and_decompress_bytes
  0.05 20.67 0.01   88 0.00 0.00  get_compression_buffer
  0.00 20.67 0.00 3954 0.00 0.00  read_bytes
  0.00 20.67 0.00 1123 0.00 0.00  compress_out
  0.00 20.67 0.00  554 0.00 0.00  compress_wrkmem
  0.00 20.67 0.00  554 0.00 0.00  flush_compression
  0.00 20.67 0.00   16 0.00 0.00  flush_compression_and_xfclose

compress_and_write_bytes takes the place of write_bytes, and as you can
see, it's slower.  In the second version, we spend less time in
write_bytes, so the time sending data to the kernel to write to disk has
decreased.  The problem is, we're more than losing that time in
compress_and_write_bytes.  The part that actually does compression is in
flush_compression, and that's fast.  The time seems to be spent just
copying bytes into the buffer where it will later be compressed.

It may be possible to eliminate this time if we had a stream compression
algorithm instead of a block compression algorithm.  My patch still just
slows sort down (although less than before).  If anybody can find any
obvious improvements to it that I've missed, then maybe there's still
hope for it, but I don't think so at this point.  I'll look at some
different compression algorithms later, or if someone else wants to take
up the cause, that's fine by me, because I'm a little busy with school
right now.  Still, I'll see what I can come up with in my spare time.
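
For what it's worth, a streaming compressor would let write_bytes feed
the compressor directly instead of staging bytes in a buffer first.
Here's a sketch using zlib's deflate, purely to show the shape of the
interface (zlib itself is likely too slow for this job, as discussed):

  #include <stdio.h>
  #include <zlib.h>

  /* Compress BUF[0..N) straight into FP through an already
     deflateInit'd stream -- no intermediate copy of the input.  */
  static void
  compress_and_write (z_stream *strm, char const *buf, size_t n, FILE *fp)
  {
    unsigned char out[65536];

    strm->next_in = (Bytef *) buf;
    strm->avail_in = (uInt) n;
    while (strm->avail_in != 0)
      {
        strm->next_out = out;
        strm->avail_out = sizeof out;
        deflate (strm, Z_NO_FLUSH);    /* consumes input as it goes */
        fwrite (out, 1, sizeof out - strm->avail_out, fp);
      }
  }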

Thanks,
Dan


Index: sort.c
===================================================================
RCS file: /sources/coreutils/coreutils/src/sort.c,v
retrieving revision 1.344
diff -p -u -r1.344 sort.c
--- sort.c  13 Dec 2006 21:27:05 -0000  1.344
+++ sort.c  14 Jan 2007 23:59:50 -0000
@@ -50,6 +50,10 @@ struct rlimit { size_t rlim_cur; };
 # define getrlimit(Resource, Rlp) (-1)
 #endif
 
+#if HAVE_LZO1X_H
+# include <lzo1x.h>
+#endif
+
 /* The official name of this program (e.g., no `g' prefix).  */
 #define PROGRAM_NAME "sort"
 
@@ -261,6 +265,12 @@ static bool have_read_stdin;
 /* List of key field comparisons to be tried.  */
 static struct keyfield *keylist;
 
+/* Flag to compress temp files. */
+static bool compress_temps = HAVE_LZO1X_H;
+
+/* The number of chars to allocate for compression buffers. */
+static size_t compress_buf_size = 65536;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -399,15 +409,87 @@ static struct option const long_options[
 /* The set of signals that are caught.  */
 static sigset_t caught_signals;
 
+/* 

Re: feature request: gzip/bzip support for sort

2007-01-14 Thread Paul Eggert
Dan Hipschman [EMAIL PROTECTED] writes:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  47.91  4.92 4.92 9269 0.00 0.00  find_temp

Yes, thanks, that explains the results all right.

 without the -S flag, even the largest file on my hard drive (~80M)
 didn't use any temp files so I had to resort to this artificial
 approach.

For a more realistic test, we need to sort something larger, so that
'sort' becomes disk-bound.  Maybe a terabyte of data or so, for a
typical small machine nowadays that has a GB or two of RAM?

It will be a bit of a pain to arrange for a decent set of test cases
here, but without it I'm afraid we're just guessing about what will
make things run faster overall.




feature request: gzip/bzip support for sort

2007-01-13 Thread Craig Macdonald
On some occasions, I have the need to sort extremely large files
which compress well using programs such as gzip or bzip.


I can emulate the sorting of a gzipped file while keeping the input
compressed using shell pipes, e.g.

   zcat in.gz | sort | gzip > out.gz

However, if there is not enough temporary space available for sort to
store to (i.e., usually /tmp), then sort will fail.


In a similar vein, I can sort multiple large gzipped files, where each
file is small enough to sort in available temporary space but the
complete data set is not, by making use of fifos:

for i in *.gz;
do
   zcat $i | sort | gzip > $i.out
   mkfifo $i.out.fifo
   zcat $i.out > $i.out.fifo &
done
sort -m *.out.fifo | gzip > out.gz

However, things would be made easier if sort, like tar, could support 
the use of compressors and decompressors, (a) for input and (b) for the 
temporary files. Are there any difficulties in adding such features to 
sort that I haven't envisaged?


Many thanks

Craig




Re: feature request: gzip/bzip support for sort

2007-01-13 Thread Jim Meyering
Craig Macdonald [EMAIL PROTECTED] wrote:
 On some occasions, I have the need to sort extremely large files
 which compress well using programs such as gzip or bzip.
...

This task has been on the TODO list for some time:

  sort: Compress temporary files when doing large external sort/merges.
This improves performance when you can compress/uncompress faster than
you can read/write, which is common in these days of fast CPUs.
suggestion from Charles Randall on 2001-08-10

Just waiting for someone to work on it.

If you're willing to do the job (and can assign copyright) that'd be great.
Let me know, and I'll send you the papers to get that process started.




Re: feature request: gzip/bzip support for sort

2007-01-13 Thread Dan Hipschman
On Sat, Jan 13, 2007 at 10:36:05PM +0100, Jim Meyering wrote:
 Craig Macdonald [EMAIL PROTECTED] wrote:
  On some occasions, I have the need to sort extremely large files
  which compress well using programs such as gzip or bzip.
 ...
 
 This task has been on the TODO list for some time:
 
   sort: Compress temporary files when doing large external sort/merges.
 This improves performance when you can compress/uncompress faster than
 you can read/write, which is common in these days of fast CPUs.
 suggestion from Charles Randall on 2001-08-10
 
 Just waiting for someone to work on it.

I already wrote a patch to sort quite a while ago that added compression
of the temporary files.  Unfortunately, with my code, I saw no
performance increase, so I discarded the patch.  However, if it can fix
the problem of running out of space, there may still be some use for it
after all.  The patch is old and needs some conflicts to be worked out,
but I'll get on it and submit ASAP.  In the meantime, I've posted the
original C code http://linux.ucla.edu/~dsh/patches/compressed-sort.c
in case you want to check it out and send comments.  It uses LZO
compression because it was supposed to be fast.  I also had patches to
the autoconf related stuff, but they seem to be hiding at the moment.
That part shouldn't be too hard to rewrite.
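
For reference, the core of the LZO approach is one block-compression
call per buffer, roughly like this (sketch; lzo_init() must have been
called once at startup, and DST needs the documented worst case of
SRC_LEN + SRC_LEN/16 + 64 + 3 bytes of room):

  #include <lzo1x.h>

  static lzo_uint
  compress_block (const unsigned char *src, lzo_uint src_len,
                  unsigned char *dst)
  {
    static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
    lzo_uint dst_len = 0;

    lzo1x_1_compress (src, src_len, dst, &dst_len, wrkmem);
    return dst_len;
  }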





Re: feature request: gzip/bzip support for sort

2007-01-13 Thread Paul Eggert
Thanks.  I like the idea of compression, but before we get into the
details of your patch, what do you mean by there not being a
performance improvement with this patch?  What's the holdup on
performance?  It seems to me that compression ought to be a real win.

If it's not a win, we shouldn't bother with LZO; instead, we should
use an algorithm that will typically be a clear win -- or, if there
isn't any such algorithm, we shouldn't hardwire any algorithm at all.

Some other thoughts.

1.  Have you looked into other compression algorithms?  QuickLZ
http://www.quicklz.com/ should compress fairly quickly, if
compression speed is the bottleneck.  I mention QuickLZ because it's
currently the compression-speed champ at
http://www.maximumcompression.com/data/summary_mf3.php.

2.  For the default, let's not bother with a user-visible option.
Let's just use compression, if a good algorithm is available.

3.  I can see where the user might be able to specify a better
algorithm, for a particular data set.  For that, how about if we have
a --compress-program=PROGRAM option, which lets the user plug in any
program that works as a pipeline?  E.g., --compress-program=gzip would
use gzip.  The default would be to use PROGRAM -d to decompress; we
could have another option if that doesn't suffice.

An advantage of (3) is that it should work well on two-processor
hosts, since compression can be done in one CPU while sorting is done
on another.  (Hmm, perhaps we should consider forking even if we use a
built-in default compressor, for the same reason.)
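
To make (3) concrete, the decompression side might look roughly like
this (sketch, untested; the helper name is invented, and it ignores
the signal-handling care discussed elsewhere in this thread):

  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Run "PROG -d" with TEMP_NAME on its stdin and hand back a stream
     from which the caller reads the decompressed data.  */
  static FILE *
  open_decompressed (char const *prog, char const *temp_name)
  {
    int fds[2];
    pid_t pid;

    if (pipe (fds) != 0)
      return NULL;
    pid = fork ();
    if (pid < 0)
      {
        close (fds[0]);
        close (fds[1]);
        return NULL;
      }
    if (pid == 0)
      {
        if (! freopen (temp_name, "r", stdin))   /* temp file -> stdin */
          _exit (127);
        dup2 (fds[1], STDOUT_FILENO);            /* decompressed -> pipe */
        close (fds[0]);
        close (fds[1]);
        execlp (prog, prog, "-d", (char *) NULL);
        _exit (127);                             /* exec failed */
      }
    close (fds[1]);
    return fdopen (fds[0], "r");                 /* parent reads here */
  }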

