Re: feature request: gzip/bzip support for sort
Paul Eggert [EMAIL PROTECTED] wrote:
>> Jim Meyering [EMAIL PROTECTED] writes:
>> I'm probably going to change the documentation so that people will be
>> less likely to depend on being able to run a separate program.  To be
>> precise, I'd like to document that the only valid values of
>> GNUSORT_COMPRESSOR are the empty string, gzip, and bzip2[*].
>
> This sounds extreme, particularly since gzip and bzip2 are not the best
> algorithms for 'sort' compression, where you want a fast compressor.
> Better choices right now would include lzop <http://www.lzop.org/> and
> maybe QuickLZ <http://www.quicklz.com/>.

I see that streaming support in QuickLZ is available only in the very
latest beta.  That makes me think it's not quite ready for use in GNU
sort.  Maybe lzop is more mature; I haven't looked yet.

> The fast-compressor field is moving fairly rapidly.  (I've heard some
> rumors from some of my commercial friends.)  QuickLZ, a new algorithm,
> is at the top of the maximumcompression list right now for fast
> compressors; see <http://www.maximumcompression.com/data/summary_mf3.php>.
> I would not be surprised to see a new champ next year.
>
>> Then we will have the liberty to remove the exec calls and use library
>> code instead, thus making the code a little more efficient -- but
>> mainly, more robust.
>
> It's not clear to me that it'll be more efficient for the soon-to-be
> common case of multicore chips, since 'sort' and the compressor can run
> in parallel.

No.  I proposed to remove only the *exec*, not the _fork_.  The idea is
to have each child run library code to (de)compress, rather than to exec
an external program.  Well, I do want it to revert to sequential
operation whenever fork fails, but otherwise it would still take
advantage of _some_ parallelism: 2-way when compressing, and up to
17-way (NMERGE + 1) when merging.  Of course, this parallelism currently
kicks in only with a data set large enough to require temporary files.
Now that dual-core systems are common, a nice project would be to make
GNU sort take advantage of that when performing day-to-day (in-memory)
sorts.

> We'll have to measure.  I agree about the robustness but that should be
> up to the user.

I want to keep the default code paths as simple and reliable as
possible.  That means no exec, and a fall-back to running library code
sequentially upon fork failure -- both for compression and
decompression.  While this new feature is a very nice addition, we have
to be careful to ensure that it does not compromise sort's reliability,
even under duress.

Here's my reasoning: when compressing with an arbitrary program, sort
may do a *lot* of work up front (there is no decompression at all in the
beginning), yet fail only near the end, when it discovers that the
program doesn't accept -d or can't be forked.  The result: the time
spent sorting that large data set is wasted.  With judicious support for
at least one built-in compression method, this error path is completely
eliminated.

This is yet another way in which sort differs from tar.  If an invalid
compression program is specified to tar, tar fails right away -- it
doesn't first waste hours of compute time.  In fact, given an invalid
compressor program, sort may well work repeatedly, potentially over a
span of days or longer, before encountering an input large enough to
trigger the need for temporary files, where it could hit a decompression
failure.  Avoiding this sort of delayed failure is my main motivation
for defining away the problem.  And as I'm sure you agree, I'd rather
not have sort test (fork/exec) the compression program to ensure that it
works with -d.

Another reason to avoid the exec is memory.  When merging, there are
usually 16 separate gzip processes running in parallel.  Sure, each is
pretty small, with an RSS of under 500K on my system, but with 16 of
them, it does add up.  So we need support for at least one built-in
compression library.
Here are my selection criteria, in decreasing order of importance:

  - robust
  - portable
  - reasonable run-time efficiency
  - reasonable compression performance

If some other compression library is a better fit, then we should add
support for it, and consider making it the default.  Similarly, if there
are people interested in sorting huge data sets for which general-purpose
compression isn't good enough, they can provide profiling data showing
how their domain-specific compressor is essential.  If this ever
happens, it would be a good reason to allow sort to exec an arbitrary
program again.

> Perhaps we could put in something that says, "If the compressor is
> named 'gzip' we may optimize that," and similarly for 'lzop' and/or a
> few other compressor names.  Or, more generally, we could have the
> convention that if the compressor name starts with "-" we will strip
> the "-" and then try to optimize the result if we can.  Something like
> that, anyway.
>
>> [*] If gzip and bzip2 are good enough for tar, why should sort make
>> any compromise (exec'ing some other program) in order
Re: feature request: gzip/bzip support for sort
Dan Hipschman [EMAIL PROTECTED] wrote:
> On Wed, Jan 24, 2007 at 08:08:18AM +0100, Jim Meyering wrote:
>> I've checked in your changes, then changed NEWS a little:
>
> Great!  Thanks :-)
>
>> Additionally, I'm probably going to change the documentation so that
>> people will be less likely to depend on being able to run a separate
>> program.  To be precise, I'd like to document that the only valid
>> values of GNUSORT_COMPRESSOR are the empty string, gzip, and bzip2[*].
>> Then we will have the liberty to remove the exec calls and use library
>> code instead, thus making the code a little more efficient -- but
>> mainly, more robust.
>
> Why not add a special value 'libz' and document it as follows:

We'll see.  I'm really inclined to disallow the exec option, unless
someone provides a good use case to justify it.  It's fun to add
features, but so far that one is not justified, given its downsides.

> By the way, I've got a little amendment to the patch.  I took a look at
> gnulib's findprog module, and it turns out find_in_path does an
> access-X_OK check itself, so sort doesn't need to do it again.
>
> 2007-01-24  Dan Hipschman  [EMAIL PROTECTED]
>
> 	* src/sort.c (create_temp): Remove superfluous access-X_OK check.
> 	find_in_path does this for us.

Thanks.  Applied.

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
Re: feature request: gzip/bzip support for sort
According to Jim Meyering on 1/24/2007 12:08 AM:
> Additionally, I'm probably going to change the documentation so that
> people will be less likely to depend on being able to run a separate
> program.  To be precise, I'd like to document that the only valid
> values of GNUSORT_COMPRESSOR are the empty string, gzip, and bzip2[*].
> Then we will have the liberty to remove the exec calls and use library
> code instead, thus making the code a little more efficient -- but
> mainly, more robust.

Fair enough for now, as long as we leave ourselves an opening for future
expansion.  For example, the 7zip algorithm tends to produce smaller
compressed files than even bzip2 on typical input, and is
patent-unencumbered, but my impression is that 7zip still does not
behave as a filter compressor like gzip or bzip2, so it is not ready for
prime-time support yet.

> If someone makes a good case for allowing an arbitrary compressor, we
> can allow that later.  But if we were to add (and document) this
> feature now, we might well be stuck with it for a long time.
>
> [*] If gzip and bzip2 are good enough for tar, why should sort make
> any compromise (exec'ing some other program) in order to be more
> flexible?

New enough tar (1.16.1, for example) supports:

  --use-compress-program=PROG   filter through PROG (must accept -d)

along with the built-in recognition of gzip and bzip2.  However, rather
than linking in libraries, it always exec's, even for the two known
formats.

-- 
Don't work too hard, make some time for fun as well!
Eric Blake [EMAIL PROTECTED]
Re: feature request: gzip/bzip support for sort
Jim Meyering [EMAIL PROTECTED] writes:
> I'm probably going to change the documentation so that people will be
> less likely to depend on being able to run a separate program.  To be
> precise, I'd like to document that the only valid values of
> GNUSORT_COMPRESSOR are the empty string, gzip, and bzip2[*].

This sounds extreme, particularly since gzip and bzip2 are not the best
algorithms for 'sort' compression, where you want a fast compressor.
Better choices right now would include lzop <http://www.lzop.org/> and
maybe QuickLZ <http://www.quicklz.com/>.

The fast-compressor field is moving fairly rapidly.  (I've heard some
rumors from some of my commercial friends.)  QuickLZ, a new algorithm,
is at the top of the maximumcompression list right now for fast
compressors; see <http://www.maximumcompression.com/data/summary_mf3.php>.
I would not be surprised to see a new champ next year.

> Then we will have the liberty to remove the exec calls and use library
> code instead, thus making the code a little more efficient -- but
> mainly, more robust.

It's not clear to me that it'll be more efficient for the soon-to-be
common case of multicore chips, since 'sort' and the compressor can run
in parallel.  We'll have to measure.  I agree about the robustness, but
that should be up to the user.

Perhaps we could put in something that says, "If the compressor is named
'gzip' we may optimize that," and similarly for 'lzop' and/or a few
other compressor names.  Or, more generally, we could have the
convention that if the compressor name starts with "-" we will strip the
"-" and then try to optimize the result if we can.  Something like that,
anyway.

> [*] If gzip and bzip2 are good enough for tar, why should sort make any
> compromise (exec'ing some other program) in order to be more flexible?

For 'sort' the tradeoff is different than for 'tar'.  We don't
particularly care whether the format is stable, since it's throwaway.
And we want fast compression, whereas people generating tarballs are
often willing to accept much slower compression for a slightly higher
compression ratio.  (Plus, new versions of 'tar' allow arbitrary
compressors anyway.)

I do have a suggestion: we shouldn't use an environment variable to
select a compressor.  It should just be an option.  Environment
variables are funny beasts, and it's better to avoid them if we can.
I'll construct a patch along those lines if you like.
Re: feature request: gzip/bzip support for sort
Firstly, I wanted to say that I am excited by the extremely fast
progress that has been made in sort for compressed temporary files.
Many thanks to Dan and others for the implementation.  (I've failed to
accomplish the bootstrap of the CVS sources -- are there bootstrapped
snapshots available anywhere?)

> For 'sort' the tradeoff is different than for 'tar'.  We don't
> particularly care if the format is stable, since it's throwaway.  And
> we want fast compression, whereas people generating tarballs often are
> willing to have way slower compression for a slightly higher
> compression ratio.  (Plus, new versions of 'tar' allow arbitrary
> compressors anyway.)

Now that we have the ability to fork decompression processes, are we
likely to see sort gain the ability to open gzipped (or bzip2ed) input
files?  For sorting a stream of compressed data this is obviously not
required, but for merging it would eliminate a substantial mess of zcats
writing to fifos, etc.  However, I'd understand if it was decided not
to, because unlike the temporary files, there is an existing workable
solution.

Many thanks,
Craig Macdonald
Re: feature request: gzip/bzip support for sort
Dan Hipschman [EMAIL PROTECTED] wrote:
> On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
>> Not to look the gift horse in the mouth, but it'd be nice if you wrote
>> ChangeLog entries, too.  And even (gasp! :-) a test case or two.  Of
>> course, we'd expect such a test case (probably named
>> tests/misc/sort-compress, and based on tests/sample-test) to have this
>> line in it:  . $srcdir/../very-expensive
>> If you don't have time for that, I'll take care of it, eventually.
>
> Here are some tests.  They're actually not very expensive.  Of course,
> you need to chmod +x sort-compress.
>
> 2007-01-22  Dan Hipschman  [EMAIL PROTECTED]
>
> 	Test sort compression.
> 	* tests/misc/Makefile.am: Add the test.
> 	* tests/misc/sort-compress: New file containing the tests.

Thanks for all the work!
I've checked in your changes, then changed NEWS a little:

  ** New features

    By default, sort now compresses any temporary file it writes.  When
    sorting very large inputs, this usually results in sort using far
    less temporary disk space and in improved performance.

Additionally, I'm probably going to change the documentation so that
people will be less likely to depend on being able to run a separate
program.  To be precise, I'd like to document that the only valid values
of GNUSORT_COMPRESSOR are the empty string, gzip, and bzip2[*].  Then we
will have the liberty to remove the exec calls and use library code
instead, thus making the code a little more efficient -- but mainly,
more robust.

If someone makes a good case for allowing an arbitrary compressor, we
can allow that later.  But if we were to add (and document) this feature
now, we might well be stuck with it for a long time.

[*] If gzip and bzip2 are good enough for tar, why should sort make any
compromise (exec'ing some other program) in order to be more flexible?
Re: feature request: gzip/bzip support for sort
On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
> Not to look the gift horse in the mouth, but it'd be nice if you wrote
> ChangeLog entries, too.  And even (gasp! :-) a test case or two.  Of
> course, we'd expect such a test case (probably named
> tests/misc/sort-compress, and based on tests/sample-test) to have this
> line in it:  . $srcdir/../very-expensive
> If you don't have time for that, I'll take care of it, eventually.

Here are some tests.  They're actually not very expensive.  Of course,
you need to chmod +x sort-compress.

2007-01-22  Dan Hipschman  [EMAIL PROTECTED]

	Test sort compression.
	* tests/misc/Makefile.am: Add the test.
	* tests/misc/sort-compress: New file containing the tests.

Index: Makefile.am
===================================================================
RCS file: /sources/coreutils/coreutils/tests/misc/Makefile.am,v
retrieving revision 1.46
diff -p -u -r1.46 Makefile.am
--- tests/misc/Makefile.am	15 Jan 2007 10:33:49 -0000	1.46
+++ tests/misc/Makefile.am	23 Jan 2007 05:12:58 -0000
@@ -69,6 +69,7 @@ TESTS = \
   sha384sum \
   sha512sum \
   shuf \
+  sort-compress \
  sort-merge \
  sort-rand \
  split-a \
Index: sort-compress
===================================================================
diff -Nru a/sort-compress b/sort-compress
--- tests/misc/sort-compress	1969-12-31 16:00:00.000000000 -0800
+++ tests/misc/sort-compress	2007-01-22 21:00:19.000000000 -0800
@@ -0,0 +1,92 @@
+#!/bin/sh
+# Test use of compression by sort
+
+# Copyright (C) 2007 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+# 02110-1301, USA.
+
+if test "$VERBOSE" = yes; then
+  set -x
+  sort --version
+fi
+
+pwd=`pwd`
+t0=`echo "$0"|sed 's,.*/,,'`.tmp; tmp=$t0/$$
+trap 'status=$?; cd $pwd && chmod -R u+rwx $t0 && rm -rf $t0 && exit $status' 0
+trap '(exit $?); exit $?' 1 2 13 15
+
+framework_failure=0
+mkdir -p $tmp || framework_failure=1
+cd $tmp || framework_failure=1
+seq -w 2000 > exp || framework_failure=1
+tac exp > in || framework_failure=1
+SORT=`which sort` || framework_failure=1
+
+if test $framework_failure = 1; then
+  echo "$0: failure in testing framework" 1>&2
+  (exit 1); exit 1
+fi
+
+fail=0
+
+# This should force the use of temp files compressed with the default gzip
+sort -S 1k in > out || fail=1
+cmp exp out || fail=1
+test $fail = 1 && diff out exp 2> /dev/null
+
+# Create our own gzip program that will be used as the default
+cat <<\EOF > gzip || fail=1
+#!/bin/sh
+tr 41 14
+touch ok
+EOF
+
+chmod +x gzip
+
+# This will find our new gzip in PATH
+PATH=.:$PATH sort -S 1k in > out || fail=1
+cmp exp out || fail=1
+test $fail = 1 && diff out exp 2> /dev/null
+test -f ok || fail=1
+rm -f ok
+
+# This is to make sure we can disable compression
+PATH=.:$PATH GNUSORT_COMPRESSOR= sort -S 1k in > out || fail=1
+cmp exp out || fail=1
+test $fail = 1 && diff out exp 2> /dev/null
+test -f ok && fail=1
+
+# This is to make sure we can use something other than gzip
+mv gzip dzip || fail=1
+GNUSORT_COMPRESSOR=./dzip sort -S 1k in > out || fail=1
+cmp exp out || fail=1
+test $fail = 1 && diff out exp 2> /dev/null
+test -f ok || fail=1
+rm -f ok
+
+# Make sure it can find other programs in PATH correctly
+PATH=.:$PATH GNUSORT_COMPRESSOR=dzip sort -S 1k in > out || fail=1
+cmp exp out || fail=1
+test $fail = 1 && diff out exp 2> /dev/null
+test -f ok || fail=1
+rm -f dzip ok
+
+# This is to make sure sort functions if it can't find the default gzip
+PATH=. $SORT -S 1k in > out || fail=1
+cmp exp out || fail=1
+test $fail = 1 && diff out exp 2> /dev/null
+
+(exit $fail); exit $fail
Re: feature request: gzip/bzip support for sort
Dan Hipschman [EMAIL PROTECTED] wrote:
> I think this patch addresses everything Paul mentioned in his critique
> of my last attempt.  I did look at gnulib's pipe module, but there were
> some problems with using it out of the box.  First, it takes a

Hi Dan,

Thanks for doing all that.  Not to look the gift horse in the mouth, but
it'd be nice if you wrote ChangeLog entries, too.  And even (gasp! :-) a
test case or two.  Of course, we'd expect such a test case (probably
named tests/misc/sort-compress, and based on tests/sample-test) to have
this line in it:

  . $srcdir/../very-expensive

If you don't have time for that, I'll take care of it, eventually.
Here are a few more comments:

  - Default to just "gzip", not /bin/gzip.  The latter may not exist;
    your patch already handles that, but /bin/gzip may not be the first
    gzip in PATH.  Also, don't bother with the access-X_OK check.
    There's no point in incurring even such a small overhead in the
    general case, when no temporary file is used.

  - may-be-used-uninitialized:

      sort.c: In function 'pipe_fork':
      sort.c:696: warning: 'pid' may be used uninitialized in this function

  - Coreutils policy -- via "make distcheck" -- dictates that
    file-global variables be declared static.

  - In pipe_fork callers, use named constants, not 2 and 8, i.e.,

      enum
      {
        /* Explain why this number is smaller than... */
        MAX_FORK_RETRIES_COMPRESS = 2,
        /* ...this one. */
        MAX_FORK_RETRIES_DECOMPRESS = 8
      };

    I've done the above in the patch below.  But please address the
    FIXME I've added.  The rest I've left for you.

  - Have you considered using the gnulib hash module rather than rolling
    your own?  There are examples in many of coreutils/src/*.c.

  - For this,

      #define PROC_THRESH 2

    use an enum, as above.  It's more debugger-friendly.  And use a more
    descriptive name.

  - reap and register_proc need brief comments describing what they do
    and the meaning/function of their parameters.

Here's a start.
The patch is relative to your latest posted patch:

	* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
	In pipe_fork callers, use these named constants, not 2 and 8.
	(proctab, nprocs): Declare to be static.
	(pipe_fork) [lint]: Initialize local, pid, to avoid unwarranted
	may-be-used-uninitialized warning.
	(create_temp): Use the active voice.  Describe parameters, too.
	(main): Default to "gzip", not "/bin/gzip".  Don't bother with the
	access-X_OK check.

diff --git a/src/sort.c b/src/sort.c
index 9604faa..da10759 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -93,6 +93,15 @@ enum
     SORT_FAILURE = 2
   };
 
+enum
+  {
+    /* FIXME: explain why this number is smaller than... */
+    MAX_FORK_RETRIES_COMPRESS = 2,
+
+    /* ...this one. */
+    MAX_FORK_RETRIES_DECOMPRESS = 8
+  };
+
 /* The representation of the decimal point in the current locale.  */
 static int decimal_point;
 
@@ -440,11 +449,11 @@ struct procnode
    processes in a timely manner so as not to exhaust system resources,
    so we store the info on whether the process is still running, or has
    been reaped here.  */
-struct procnode *proctab[PROCTABSZ];
+static struct procnode *proctab[PROCTABSZ];
 
 /* The total number of forked processes (compressors and decompressors)
    that have not been reaped yet.  */
-size_t nprocs;
+static size_t nprocs;
 
 /* The number of child processes we'll allow before we try to reap them.  */
 #define PROC_THRESH 2
@@ -693,7 +702,7 @@ pipe_fork (int pipefds[2], size_t tries)
   sigset_t oldset;
   int saved_errno;
   unsigned int wait_retry = 1;
-  pid_t pid;
+  pid_t pid IF_LINT (= 0);
 
   if (pipe (pipefds) < 0)
     return -1;
@@ -744,7 +753,9 @@ pipe_fork (int pipefds[2], size_t tries)
 #endif
 }
 
-/* Creates a temp file and compression program to filter output to it.  */
+/* Create a temporary file and start a compression program to filter output
+   to that file.  Set *PFP to the file handle and if *PPID is non-NULL,
+   set it to the PID of the newly-created process.  */
 
 static char *
 create_temp (FILE **pfp, pid_t *ppid)
@@ -757,7 +768,7 @@ create_temp (FILE **pfp, pid_t *ppid)
     {
       int pipefds[2];
 
-      node->pid = pipe_fork (pipefds, 2);
+      node->pid = pipe_fork (pipefds, MAX_FORK_RETRIES_COMPRESS);
       if (0 < node->pid)
 	{
 	  close (tempfd);
@@ -810,7 +821,7 @@ open_temp (const char *name, pid_t pid)
   if (compress_program)
     {
       int pipefds[2];
-      pid_t child_pid = pipe_fork (pipefds, 8);
+      pid_t child_pid = pipe_fork (pipefds, MAX_FORK_RETRIES_DECOMPRESS);
 
       if (0 < child_pid)
 	{
@@ -3030,20 +3041,9 @@ main (int argc, char **argv)
     }
 
   compress_program = getenv ("GNUSORT_COMPRESSOR");
-  if (! compress_program)
-    {
-      static const char *compressors[] = { "/bin/gzip" };
-      enum { ncompressors = sizeof compressors / sizeof compressors[0] };
-      size_t i;
-
-      for (i = 0; i < ncompressors; ++i)
-	if (access (compressors[i],
Re: feature request: gzip/bzip support for sort
James Youngman [EMAIL PROTECTED] wrote:
> It might be worth ensuring that we don't pass an invalid signal mask to
> sigprocmask(SET_MASK,...) if the previous call to
> sigprocmask(SIG_BLOCK,...) had failed.  Offhand I can't think of a way
> for sigprocmask() to fail unless the first argument is invalid, but
> looking at the unchecked return code makes me mildly nervous.  How
> about something resembling this (untested, since my version of Automake
> is still only 1.9.6)?
>
> 2007-01-21  James Youngman  [EMAIL PROTECTED]
>
> 	Centralize all the uses of sigprocmask().  Don't restore an
> 	invalid saved mask.
> 	* src/sort.c (enter_cs, leave_cs): New functions for protecting
> 	code sequences against signal delivery.
> 	(exit_cleanup): Use enter_cs and leave_cs instead of calling
> 	sigprocmask directly.
> 	(create_temp_file, pipe_fork, zaptemp): Likewise.

Hi James,

Good idea.  I've applied it on a branch with some minor changes:

  - leave_cs should be void
  - remove now-unused declarations of oldset
  - rename the new struct and functions to start with cs_

BTW, your patch was mangled: as if something filtered it through
sed 's/^ //'.

I've applied this on top of a modified version of Dan's patch.  I'm
fiddling with git now, with an eye to publishing this working branch.
In the mean time, I've included my working version of sort.c below.
Here's the adjusted patch:

diff --git a/ChangeLog b/ChangeLog
index fe8cd7a..764aef8 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,13 @@
+2007-01-21  James Youngman  [EMAIL PROTECTED]
+
+	Centralize all the uses of sigprocmask.  Don't restore an invalid
+	saved mask.
+	* src/sort.c (cs_enter, cs_leave): New functions for protecting
+	code sequences against signal delivery.
+	(exit_cleanup): Use cs_enter and cs_leave instead of calling
+	sigprocmask directly.
+	(create_temp_file, pipe_fork, zaptemp): Likewise.
+
 2007-01-21  Jim Meyering  [EMAIL PROTECTED]
 
 	* src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS):
diff --git a/src/sort.c b/src/sort.c
index da10759..57003ea 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -64,7 +64,8 @@ struct rlimit { size_t rlim_cur; };
    present.  */
 #ifndef SA_NOCLDSTOP
 # define SA_NOCLDSTOP 0
-# define sigprocmask(How, Set, Oset) /* empty */
+/* No sigprocmask.  Always 'return' zero.  */
+# define sigprocmask(How, Set, Oset) (0)
 # define sigset_t int
 # if ! HAVE_SIGINTERRUPT
 #  define siginterrupt(sig, flag) /* empty */
@@ -412,6 +413,33 @@ static struct option const long_options[] =
 /* The set of signals that are caught.  */
 static sigset_t caught_signals;
 
+/* Critical section status.  */
+struct cs_status
+{
+  bool valid;
+  sigset_t sigs;
+};
+
+/* Enter a critical section.  */
+static struct cs_status
+cs_enter (void)
+{
+  struct cs_status status;
+  status.valid = (sigprocmask (SIG_BLOCK, &caught_signals, &status.sigs) == 0);
+  return status;
+}
+
+/* Leave a critical section.  */
+static void
+cs_leave (struct cs_status status)
+{
+  if (status.valid)
+    {
+      /* Ignore failure when restoring the signal mask.  */
+      sigprocmask (SIG_SETMASK, &status.sigs, NULL);
+    }
+}
+
 /* The list of temporary files.  */
 struct tempnode
 {
@@ -577,10 +605,9 @@ exit_cleanup (void)
     {
       /* Clean up any remaining temporary files in a critical section so
 	 that a signal handler does not try to clean them too.  */
-      sigset_t oldset;
-      sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+      struct cs_status cs = cs_enter ();
       cleanup ();
-      sigprocmask (SIG_SETMASK, &oldset, NULL);
+      cs_leave (cs);
     }
 
   close_stdout ();
@@ -594,7 +621,6 @@ create_temp_file (int *pfd)
 {
   static char const slashbase[] = "/sortXXXXXX";
   static size_t temp_dir_index;
-  sigset_t oldset;
   int fd;
   int saved_errno;
   char const *temp_dir = temp_dirs[temp_dir_index];
@@ -602,6 +628,7 @@ create_temp_file (int *pfd)
   struct tempnode *node =
     xmalloc (offsetof (struct tempnode, name) + len + sizeof slashbase);
   char *file = node->name;
+  struct cs_status cs;
 
   memcpy (file, temp_dir, len);
   memcpy (file + len, slashbase, sizeof slashbase);
@@ -611,7 +638,7 @@ create_temp_file (int *pfd)
     temp_dir_index = 0;
 
   /* Create the temporary file in a critical section, to avoid races.  */
-  sigprocmask (SIG_BLOCK, &caught_signals, &oldset);
+  cs = cs_enter ();
   fd = mkstemp (file);
   if (0 <= fd)
     {
@@ -619,7 +646,7 @@ create_temp_file (int *pfd)
       temptail = &node->next;
     }
   saved_errno = errno;
-  sigprocmask (SIG_SETMASK, &oldset, NULL);
+  cs_leave (cs);
   errno = saved_errno;
 
   if (fd < 0)
@@ -699,10 +726,10 @@ pipe_fork (int pipefds[2], size_t tries)
 {
 #if HAVE_WORKING_FORK
   struct tempnode *saved_temphead;
-  sigset_t oldset;
   int saved_errno;
   unsigned int wait_retry = 1;
   pid_t pid IF_LINT (= 0);
+  struct cs_status cs;
 
   if (pipe (pipefds) < 0)
     return -1;
 
@@ -711,7 +738,7 @@
Re: feature request: gzip/bzip support for sort
On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
> Not to look the gift horse in the mouth, but it'd be nice if you wrote
> ChangeLog entries, too.  And even (gasp! :-) a test case or two.  Of
> course, we'd expect such a test case (probably named
> tests/misc/sort-compress, and based on tests/sample-test) to have this
> line in it:  . $srcdir/../very-expensive
> If you don't have time for that, I'll take care of it, eventually.

I'm not going to stop you :-)  I just haven't had the time to look into
it yet, so I've just been running the coreutils tests and then my own
tests.  I was planning on adding the tests, as you say, eventually.

> Default to just "gzip", not /bin/gzip.  The latter may not exist; your
> patch already handles that, but /bin/gzip may not be the first gzip in
> PATH.  Also, don't bother with the access-X_OK check.  There's no point
> in incurring even such a small overhead in the general case, when no
> temporary file is used.

The reason I put the access check in there is that if we default to gzip
and it doesn't exist, then of course the exec will fail, the child will
fail, and this will cause sort to fail when it really should just not do
the compression, or try another default if something suitable exists
(what about compress?).  How about we just delay the determination of
the compress program until it's actually needed?  (E.g., in create_temp,
right before "if (compress_program)" we have
"if (compress_program_known)", and inside the body we check the
environment variable and/or do access checks on possible defaults.)

> But please address the FIXME I've added.

If we can't fork a compression process, it's not the end of the world.
We just don't compress that file, and sort will still work.  If we can't
fork a decompression process, we can't continue the sort.  So I figure
we'll just try twice to fork compression processes, and if we can't do
it after 1 sec, we're probably wasting more time waiting to fork than we
would doing disk access.
However, we really need to be able to fork decompression processes, so
we can afford to wait a really long time for them.  I was considering
making the number of tries for decompression processes even larger
(now, it'll wait about 2 min before giving up).

> Have you considered using the gnulib hash module rather than rolling
> your own?  There are examples in many of coreutils/src/*.c.

I'm not familiar with gnulib, so I didn't know a hash module existed or
think to look for one.  Looking at it now, though, it seems it will be
slower because of its abstraction, unless the table fills up to the
point where it would be faster to access if it grew.  I'd prefer (since
it seems to be my time we're talking about) to leave it the way it is
because it's simple, and see if the gnulib module is faster later.
We've already seen tests (yours :-) showing that the patch improves
large sorts (granted, they used the last patch, not this one), so it's
still a win to put it in, and optimizations later are just gravy.

Thanks for the advice.  I've got some other work I've got to get caught
up with first (trust me, I'd rather do this), but hopefully I can get to
work on it tomorrow.
Re: feature request: gzip/bzip support for sort
Dan Hipschman [EMAIL PROTECTED] wrote:
> On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
>> Not to look the gift horse in the mouth, but it'd be nice if you wrote
>> ChangeLog entries, too.  And even (gasp! :-) a test case or two.  Of
>> course, we'd expect such a test case (probably named
>> tests/misc/sort-compress, and based on tests/sample-test) to have this
>> line in it:  . $srcdir/../very-expensive
>> If you don't have time for that, I'll take care of it, eventually.
>
> I'm not going to stop you :-)  I just haven't had the time to look into
> it yet, so I've just been running the coreutils tests and then my own
> tests.  I was planning on adding the tests, as you say, eventually.
>
>> Default to just "gzip", not /bin/gzip.  The latter may not exist; your
>> patch already handles that, but /bin/gzip may not be the first gzip in
>> PATH.  Also, don't bother with the access-X_OK check.  There's no
>> point in incurring even such a small overhead in the general case,
>> when no temporary file is used.
>
> The reason I put the access check in there is that if we default to
> gzip and it doesn't exist, then of course the exec will fail, the child
> will fail, and this will cause sort to fail when it really should just
> not do the compression,

This is a good argument for using libz by default, not a separate gzip
program.  Why incur the overhead of an exec when we don't need to?  Now,
I'm convinced that sort should provide built-in support for both gzip
and bzip2.  How to select built-in vs. actually exec'ing the program is
something to think about...  Of course it should still be possible to
specify some other program.

> or try another default if something suitable exists (what about
> compress?),

compress?  No, thank you!  :-)

> How about we just delay the determination of the compress program until
> it's actually needed?  (E.g., in create_temp, right before
> "if (compress_program)" we have "if (compress_program_known)", and
> inside the body we check the environment variable and/or do access
> checks on possible defaults.)
But please address the FIXME I've added. If we can't fork a compression process, it's not the end of the world. We just don't compress that file and sort will still work. If we can't fork a decompression process, we can't continue the sort. So I figure we'll just try twice to fork compression processes, and if we can't do it after 1 sec, we're probably wasting more time waiting to fork than we would doing disk access. However, we really need to be able to fork decompression processes, so we can afford to wait a really long time for it. I was considering making the number of tries for decompression processes even larger (now, it'll wait about 2 min before giving up). Have you considered using the gnulib hash module rather than rolling your own? There are examples in many of coreutils/src/*.c. I'm not familiar with gnulib, so I didn't know a hash module existed or think to look for one. Looking at it now, though, it seems it will be slower because of its abstraction, unless the table fills up to the Performance isn't the issue here, but code-reuse. I would be very surprised if changing hash table implementations has any measurable effect on sort's performance. point where it would be faster to access if it grew. I'd prefer (since it seems to be my time we're talking about), to leave it the way it is because it's simple, and see if the gnulib module is faster later. ___ Bug-coreutils mailing list Bug-coreutils@gnu.org http://lists.gnu.org/mailman/listinfo/bug-coreutils
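For context on the "rolling your own" point: the table at issue can be as small as a fixed-size, open-addressed array keyed on pid. The sketch below is illustrative (names and sizes are invented, and it supports no deletion); the gnulib hash module would replace it with hash_initialize/hash_insert/hash_lookup at the cost of some indirection, which is the code-reuse-vs-simplicity tradeoff debated above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

enum { PROCTAB_SIZE = 61 };       /* small prime; sort needs ~NMERGE+1 live entries */

struct procnode { pid_t pid; bool reaped; };
static struct procnode proctab[PROCTAB_SIZE];

/* Find PID's slot by linear probing; if INSERT, claim an empty slot
   for it.  Returns NULL if absent (or, when inserting, table full). */
static struct procnode *
proctab_find (pid_t pid, bool insert)
{
  size_t h = (size_t) pid % PROCTAB_SIZE;
  for (size_t i = 0; i < PROCTAB_SIZE; i++)
    {
      struct procnode *p = &proctab[(h + i) % PROCTAB_SIZE];
      if (p->pid == pid)
        return p;
      if (p->pid == 0)            /* empty slot ends the probe chain */
        {
          if (!insert)
            return NULL;
          p->pid = pid;
          return p;
        }
    }
  return NULL;                    /* table full */
}
```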
Re: feature request: gzip/bzip support for sort
On Sun, Jan 21, 2007 at 10:41:11PM +0100, Jim Meyering wrote: This is a good argument for using libz by default, not a separate gzip program. Why incur the overhead of an exec when we don't need to? Now, I'm convinced that sort should provide built-in support for both gzip and bzip2. How to select built-in vs. actually exec the program is something to think about... Of course it should still be possible to specify some other program. Well, was there anything wrong with my original LZO patch? It shouldn't be too hard to replace LZO with libz. I don't disagree with you about using libz in addition to an external program, but if the current patch gives performance gains, why not check it in (once the wrinkles are ironed out) and add more optimizations later? At least that way I can take a break (aha, the real motivation comes out), and pick up where I left off later without worrying about the code changing under my feet.
Re: feature request: gzip/bzip support for sort
I have a good feeling about this one :-) I think I've addressed everything except adding libz. I just don't think I have the time to do that right now, and I think it can wait. I can look into writing some test cases in a bit. 2007-01-21 Jim Meyering [EMAIL PROTECTED] * src/sort.c (MAX_FORK_RETRIES_COMPRESS, MAX_FORK_RETRIES_DECOMPRESS): In pipe_fork callers, use these named constants, not 2 and 8. (proctab, nprocs): Declare to be static. (pipe_fork) [lint]: Initialize local, pid, to avoid unwarranted may-be-used-uninitialized warning. (create_temp): Use the active voice. Describe parameters, too. 2007-01-21 James Youngman [EMAIL PROTECTED] Centralize all the uses of sigprocmask(). Don't restore an invalid saved mask. * src/sort.c (enter_cs, leave_cs): New functions for protecting code sequences against signal delivery. * (exit_cleanup): Use enter_cs and leave_cs instead of calling sigprocmask directly. (create_temp_file, pipe_fork, zaptemp): Likewise 2007-01-21 Dan Hipschman [EMAIL PROTECTED] Add compression of temp files to sort. * NEWS: Mention this. * bootstrap.conf: Import findprog. * configure.ac: Add AC_FUNC_FORK. * doc/coreutils.texi: Document GNUSORT_COMPRESSOR environment variable. * src/sort.c (compress_program): New global, holds the name of the external compression program. (struct sortfile): New type used by mergepfs and friends instead of filenames to hold PIDs of compressor processes. (proctab): New global, holds compressor PIDs on which to wait. (enum procstate, struct procnode): New types used by proctab. (proctab_hasher, proctab_comparator): New functions for proctab. (nprocs): New global, number of forked but unreaped children. (reap, reap_some): New function, wait for/cleanup forked processes. (register_proc, update_proc, wait_proc): New functions for adding, modifying and removing proctab entries. (create_temp_file): Change parameter type to pointer to file descriptor, and return type to pointer to struct tempnode. 
(dup2_or_die): New function used in create_temp and open_temp. (pipe_fork): New function, creates a pipe and child process. (create_temp): Creates a temp file and possibly a compression program to which we filter output. (open_temp): Opens a compressed temp file and creates a decompression process through which to filter the input. (mergefps): Change FILES parameter type to struct sortfile array and update access accordingly. Use open_temp and reap_some. (avoid_trashing_input, merge): Change FILES parameter like mergefps and call create_temp instead of create_temp_file. (sort): Call create_temp instead of create_temp_file. Use reap_some. (avoid_trashing_input, merge, sort, main): Adapt to mergefps. Index: NEWS === RCS file: /sources/coreutils/coreutils/NEWS,v retrieving revision 1.467 diff -p -u -r1.467 NEWS --- NEWS18 Jan 2007 09:18:21 - 1.467 +++ NEWS22 Jan 2007 03:44:29 - @@ -29,6 +29,11 @@ GNU coreutils NEWS rm --interactive=never F no longer prompts for an unwritable F +** New features + + sort can now compresses temporary files to improve performance of + very large sorts. 
+ * Noteworthy changes in release 6.7 (2006-12-08) [stable] Index: bootstrap.conf === RCS file: /sources/coreutils/coreutils/bootstrap.conf,v retrieving revision 1.15 diff -p -u -r1.15 bootstrap.conf --- bootstrap.conf 15 Jan 2007 10:33:52 - 1.15 +++ bootstrap.conf 22 Jan 2007 03:44:29 - @@ -43,7 +43,8 @@ gnulib_modules= config-h configmake closeout cycle-check d-ino d-type diacrit dirfd dirname dup2 error euidaccess exclude exitfail fchdir fcntl fcntl-safer fdl - file-type fileblocks filemode filenamecat fnmatch-gnu fopen-safer + file-type fileblocks filemode filenamecat findprog fnmatch-gnu + fopen-safer fprintftime fsusage ftruncate fts getdate getgroups gethrxtime getline getloadavg getndelim2 getopt getpagesize getpass-gnu gettext gettime gettimeofday getugroups getusershell gnupload Index: configure.ac === RCS file: /sources/coreutils/coreutils/configure.ac,v retrieving revision 1.102 diff -p -u -r1.102 configure.ac --- configure.ac28 Dec 2006 08:18:17 - 1.102 +++ configure.ac22 Jan 2007 03:44:29 - @@ -39,6 +39,8 @@ gl_EARLY gl_INIT coreutils_MACROS +AC_FUNC_FORK + AC_CHECK_FUNCS(uname, OPTIONAL_BIN_PROGS=$OPTIONAL_BIN_PROGS
Re: feature request: gzip/bzip support for sort
Dan Hipschman wrote on 22-01-07 05:55: sort can now compresses Small typo here. bjd
Re: feature request: gzip/bzip support for sort
Hi, I think this patch addresses everything Paul mentioned in his critique of my last attempt. I did look at gnulib pipe module, but there were some problems with using it out of the box. First, it takes a filename as the stdout of the child process, when for temp files it's better to pass a file descriptor. That's a rather small issue, though, and would be pretty easy to add an additional function that fixed that. Another problem is that since it does the fork/exec in one call, the race condition where the child and parent both receive signals between the fork and child's exec can't be solved the way I solve it in this patch. Rather than change the way sort handles signals, I pleaded with Paul (well, I asked him nicely :-) to be able to put off the portability issues of fork until I had more time. Hence, I've just added a check for fork() in configure.ac and any system that doesn't have it will have to wait a while for this feature. I also tried to address Jim's concern about fork() failing. I clean up zombie children every so often in the main loops of sort() and mergefps() so as not to bleed the system dry. This seems to keep the number of defunct and running child processes close to a minimum. If we can't fork, I retry some number of times, sleeping a bit in between and killing off zombies (sounds like a good time :-). I was able to test the case where fork() fails with a macro like this: unsigned int foo; #define fork() (foo++ % 3 ? fork() : (errno = EAGAIN, -1)) I'm not sure how to test the case where we fork a compression process, it exits, we reap it, and we create another process with the same PID before we run the decompression process on the file it created. I did address this issue, and of course I *think* my solution is correct, but like I said, I'm not sure how to test it. Anyway, here's the patch. 
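The fault-injection macro quoted above turns into a small self-contained test of the retry logic. This is a sketch, not the patch's code: flaky_fork mimics the quoted macro (failing with EAGAIN on a schedule), and fork_retry shows the retry-with-backoff shape described in the message; the names and the 10ms sleep are invented.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static unsigned int fork_calls;

/* Like the quoted `#define fork()` test macro: fail the first of
   every three attempts with EAGAIN, otherwise really fork. */
static pid_t
flaky_fork (void)
{
  if (fork_calls++ % 3 == 0)
    {
      errno = EAGAIN;
      return -1;
    }
  return fork ();
}

/* Retry transient fork failures, sleeping briefly in between
   (the real patch also reaps zombies between attempts). */
static pid_t
fork_retry (unsigned int max_tries)
{
  for (unsigned int i = 0; i < max_tries; i++)
    {
      pid_t pid = flaky_fork ();
      if (0 <= pid)
        return pid;
      if (errno != EAGAIN && errno != ENOMEM)
        break;                    /* not a transient failure */
      usleep (10000);             /* back off before retrying */
    }
  return -1;
}
```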
Index: NEWS === RCS file: /sources/coreutils/coreutils/NEWS,v retrieving revision 1.467 diff -p -u -r1.467 NEWS --- NEWS18 Jan 2007 09:18:21 - 1.467 +++ NEWS20 Jan 2007 05:49:27 - @@ -29,6 +29,11 @@ GNU coreutils NEWS rm --interactive=never F no longer prompts for an unwritable F +** New features + + sort can now compresses temporary files to improve performance of + very large sorts. + * Noteworthy changes in release 6.7 (2006-12-08) [stable] Index: configure.ac === RCS file: /sources/coreutils/coreutils/configure.ac,v retrieving revision 1.102 diff -p -u -r1.102 configure.ac --- configure.ac28 Dec 2006 08:18:17 - 1.102 +++ configure.ac20 Jan 2007 05:49:27 - @@ -39,6 +39,8 @@ gl_EARLY gl_INIT coreutils_MACROS +AC_FUNC_FORK + AC_CHECK_FUNCS(uname, OPTIONAL_BIN_PROGS=$OPTIONAL_BIN_PROGS uname\$(EXEEXT) MAN=$MAN uname.1) Index: doc/coreutils.texi === RCS file: /sources/coreutils/coreutils/doc/coreutils.texi,v retrieving revision 1.363 diff -p -u -r1.363 coreutils.texi --- doc/coreutils.texi 4 Jan 2007 11:04:46 - 1.363 +++ doc/coreutils.texi 20 Jan 2007 05:49:27 - @@ -3411,6 +3411,19 @@ value as the directory for temporary fil @option{--temporary-directory} (@option{-T}) option in turn overrides the environment variable. [EMAIL PROTECTED] GNUSORT_COMPRESSOR +To improve performance when sorting very large files, GNU sort will, +by default, try to compress temporary files with the program [EMAIL PROTECTED]/bin/gzip}. The environment variable @env{GNUSORT_COMPRESSOR} +can be set to the name of another program to be used. The program +specified must compress standard input to standard output when no +arguments are given to it, and it must decompress standard input to +standard output when the @option{-d} argument is given to it. To +disable compression of temporary files, set the variable to the empty +string. Whitespace characters, and the backslash character are +reserved for future use in the program name. 
If the program exits +with nonzero status, sort will terminate with an error. + The following options affect the ordering of output lines. They may be specified globally or as part of a specific key field. If no key Index: src/sort.c === RCS file: /sources/coreutils/coreutils/src/sort.c,v retrieving revision 1.345 diff -p -u -r1.345 sort.c --- src/sort.c 19 Jan 2007 22:28:31 - 1.345 +++ src/sort.c 20 Jan 2007 05:49:29 - @@ -1,5 +1,5 @@ /* sort - sort lines of text (with all kinds of options). - Copyright (C) 1988, 1991-2006 Free Software Foundation, Inc. + Copyright (C) 1988, 1991-2007 Free Software Foundation, Inc. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published
Re: feature request: gzip/bzip support for sort
Jim Meyering [EMAIL PROTECTED] wrote: Dan Hipschman [EMAIL PROTECTED] wrote: On Sat, Jan 13, 2007 at 10:07:59PM -0800, Paul Eggert wrote: 3. I can see where the user might be able to specify a better algorithm, for a particular data set. For that, how about if we have a --compress-program=PROGRAM option, which lets the user plug in any program that works as a pipeline? E.g., --compress-program=gzip would use gzip. The default would be to use PROGRAM -d to decompress; we could have another option if that doesn't suffice. This is pretty close. There are just a few more problems that I see: One more thing to consider: In your proposed patch, a failed fork evokes an immediate exit. In that case, it may be worthwhile to retry, or even to revert to using gzip directly. Upon rereading, I see the latter part is not clear. I mean resorting to the use of a direct (non-forking) implementation, like libz. This, of course, assumes we're using gzip.
Re: feature request: gzip/bzip support for sort
On Thu, 18 Jan 2007, Jim Meyering wrote: I've done some more timings, but with two more sizes of input. Here's the summary, comparing straight sort with sort --comp=gzip: 2.7GB: 6.6% speed-up 10.0GB: 17.8% speed-up It would be interesting to see the individual stats returned by wait4(2) from the child, to separate CPU seconds spent in sort itself, and in the compression/decompression forks. I think allowing an environment variable to define the compressor is a good idea, so long as there's a corresponding --nocompress override available from the command line. $ seq 999 > k $ cat k k k k k k k k k > j $ cat j j j j > sort-in $ wc -c sort-in 283968 sort-in I had to use seq -f %.0f to get this filesize. With --compress=gzip: $ /usr/bin/time ./sort -T. --compress=gzip sort-in > out 814.07user 29.97system 14:50.16elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (4major+2821589minor)pagefaults 0swaps There's a big difference in the time spent on gzip compression depending on the -1/-9 option (default -6). For a similar seq-generated data set above, I get gzip -1: User time (seconds): 48.63, output size is 6% of input gzip -9: User time (seconds): 952.97, output size is 3% of input Decompression time for both tests shows less variation (25s vs 21s). This would suggest the elapsed time to sort can be improved by trading compression ratio for less CPU time. Obviously a critical factor is the disk latency. Cheers, Phil
Re: feature request: gzip/bzip support for sort
Philip Rowlands [EMAIL PROTECTED] wrote: On Thu, 18 Jan 2007, Jim Meyering wrote: I've done some more timings, but with two more sizes of input. Here's the summary, comparing straight sort with sort --comp=gzip: 2.7GB: 6.6% speed-up 10.0GB: 17.8% speed-up It would be interesting to see the individual stats returned by wait4(2) from the child, to separate CPU seconds spent in sort itself, and in the compression/decompression forks. I think allowing an environment variable to define the compressor is a good idea, so long as there's a corresponding --nocompress override available from the command line. $ seq 999 > k $ cat k k k k k k k k k > j $ cat j j j j > sort-in $ wc -c sort-in 283968 sort-in I had to use seq -f %.0f to get this filesize. Odd. Here's what those generate for me: $ seq 999 > k $ wc -c k 7888 $ tail -1 k 999 The remaining cat commands merely write 36 copies of that data to sort-in: $ (wc -c < k|tr -d '\n'; echo '* 36')|bc 283968 What happens differently for you?
Re: feature request: gzip/bzip support for sort
On Thu, 18 Jan 2007, Jim Meyering wrote: I had to use seq -f %.0f to get this filesize. Odd. Here's what those generate for me: $ seq 999 > k $ wc -c k 7888 $ tail -1 k 999 What happens differently for you? $ seq 990 999 9.9e+06 9.9e+06 9.9e+06 9.9e+06 9.9e+06 1e+07 1e+07 1e+07 1e+07 1e+07 [EMAIL PROTECTED]:~$ seq --version seq (GNU coreutils) 5.93 I'm using an old version of seq with a different default format (I didn't realise it had changed recently.) Cheers, Phil
Re: feature request: gzip/bzip support for sort
On Thu, Jan 18, 2007 at 05:47:53PM -0800, Dan Hipschman wrote: That's a thought, although libz only works with gzip (as you said), and it will add more complexity (like my original patch using LZO and this new one combined). I don't think we'll have 40 instances of gzip -d running. We should only need at most one compression process, and NMERGE (16) decompression processes running at one time. I think retrying fork if it fails is a good idea, and I've already added that since I read your mail. Apologies, I spoke too soon. We shouldn't have 40 *running* processes, but we generate a lot of defunct ones. I'm working on cleaning them up sooner and then hopefully we can bound the number of child processes at NMERGE + 1.
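The "clean them up sooner" plan amounts to a periodic non-blocking waitpid sweep from sort's main loops. A minimal sketch of that shape (not sort.c's actual reap/reap_some; the name here is invented):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Collect any children that have already exited, without blocking.
   Called periodically, this keeps defunct compressors from piling up,
   bounding live+zombie children near NMERGE + 1. */
static unsigned int
reap_exited (void)
{
  unsigned int reaped = 0;
  int status;
  /* WNOHANG: return immediately if no child has exited yet. */
  while (0 < waitpid (-1, &status, WNOHANG))
    reaped++;
  return reaped;
}
```

The real patch also records each reaped pid in its process table, since a compressor's exit status must still be checked before its temp file is trusted.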
Re: feature request: gzip/bzip support for sort
Dan Hipschman [EMAIL PROTECTED] wrote: Here's the patch for comments. Thanks, I tried it and did some timings. Bottom line: with a 4+GB file, dual-processor, I see a 19% speed-up, but I think most of the savings is in reduced I/O. virtually no difference (~5%) for a file of size 324M, created like this: running on a uniprocessor amd-64 3400: $ seq 9 > k $ cat k k k k k k k k k k k k k k k k k k k k k k k k > j $ mv j k $ cat k k k k k k k k k k k k k k k k k k k k k k k k > j $ shuf j > sort-in $ /usr/bin/time ./sort --compress=gzip sort-in > out 100.11user 4.69system 1:48.67elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (1major+761524minor)pagefaults 0swaps $ /usr/bin/time ./sort sort-in > out 93.16user 3.35system 1:40.35elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+137435minor)pagefaults 0swaps - Trying similar, but with a 4.2GB file created like this, running on a dual-processor with 2GB of RAM. $ seq 999 > k 10M lines / 7888 bytes $ cat k k k k k k k k k > j 90M lines $ cat j j j j j j j > k 630M lines / 4.62 GB $ mv k sort-in What does /tmp look like, after a few minutes? $ du -sh /tmp/sort* 11M /tmp/sort0Sdnkk 11M /tmp/sort6cTaAE 11M /tmp/sortABogDY ... --- contrast with sizes during the run w/no compression: 216M /tmp/sort5NTqQu 216M /tmp/sortAjx50R 216M /tmp/sortKvyGIT ... $ /usr/bin/time ./sort -T /tmp --compress=gzip sort-in > out 1535.33user 71.76system 27:15.72elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (1major+4985207minor)pagefaults 0swaps $ /usr/bin/time ./sort -T /tmp sort-in > out ./sort: write failed: /tmp/sortieA1nv: No space left on device Command exited with non-zero status 2 588.79user 17.76system 17:20.70elapsed 58%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (2major+225191minor)pagefaults 0swaps [Exit 2] $ df -hT /tmp Filesystem Type Size Used Avail Use% Mounted on /dev/sda3 reiserfs 12G 6.4G 5.4G 55% / $ /usr/bin/time ./sort -T .
sort-in > out 754.03user 38.99system 33:42.35elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (2major+210437minor)pagefaults 0swaps So, with just one trial each, I see a 19% speed-up.
Re: feature request: gzip/bzip support for sort
Jim Meyering [EMAIL PROTECTED] writes: So, with just one trial each, I see a 19% speed-up. Yaayyy! That's good news. Thanks for timing it. I read your email just after talking with Dan (in person) about how we'd time it. I just bought 1 TB worth of disk for my home computer and hadn't hooked it up yet, so was going to volunteer that, but you beat me to it. At this point to my mind the only question is how to put this change in, and whether to make it the default (with gzip, say). Clearly it can lead to a big performance improvement with large sorts on modern machines.
Re: feature request: gzip/bzip support for sort
Paul Eggert wrote on 16-01-07 18:35: Jim Meyering [EMAIL PROTECTED] writes: So, with just one trial each, I see a 19% speed-up. Yaayyy! That's good news. Thanks for timing it. I read your email just after talking with Dan (in person) about how we'd time it. I just bought 1 TB worth of disk for my home computer and hadn't hooked it up yet, so was going to volunteer that, but you beat me to it. At this point to my mind the only question is how to put this change in, and whether to make it the default (with gzip, say). Clearly it can lead to a big performance improvement with large sorts on modern machines. How about files of 'average' size? Someone said: sort: Compress temporary files when doing large external sort/merges. This improves performance when you can compress/uncompress faster than you can read/write, which is common in these days of fast CPUs. If this statement is true, you get the best of both worlds. In that case I'd just make it the default (which is just that, the default, and implies an option to choose the 'old' behavior). bjd
Re: feature request: gzip/bzip support for sort
On Tue, Jan 16, 2007 at 01:20:16PM +0100, Jim Meyering wrote: I tried it and did some timings. Bottom line: with a 4+GB file, dual-processor, I see a 19% speed-up, but I think most of the savings is in reduced I/O. Thanks very much for doing this. The performance gain is good news indeed.
Re: feature request: gzip/bzip support for sort
On 1/16/07, Dan Hipschman [EMAIL PROTECTED] wrote: On Tue, Jan 16, 2007 at 01:20:16PM +0100, Jim Meyering wrote: I tried it and did some timings. Bottom line: with a 4+GB file, dual-processor, I see a 19% speed-up, but I think most of the savings is in reduced I/O. Thanks very much for doing this. The performance gain is good news indeed. How does this compare with using 'cat' as the compressor? James.
Re: feature request: gzip/bzip support for sort
On Tue, Jan 16, 2007 at 09:35:52AM -0800, Paul Eggert wrote: At this point to my mind the only question is how to put this change in, and whether to make it the default (with gzip, say). Clearly it can lead to a big performance improvement with large sorts on modern machines. I think it should be triggered with an environment variable, say GNUSORT_COMPRESS=gzip (or whatever). That way, it's got the convenience of a default option, but it's easily configurable. Setting the variable to the empty string (or perhaps just not setting the variable) could disable the behaviour.
Re: feature request: gzip/bzip support for sort
On Sat, Jan 13, 2007 at 10:07:59PM -0800, Paul Eggert wrote: 3. I can see where the user might be able to specify a better algorithm, for a particular data set. For that, how about if we have a --compress-program=PROGRAM option, which lets the user plug in any program that works as a pipeline? E.g., --compress-program=gzip would use gzip. The default would be to use PROGRAM -d to decompress; we could have another option if that doesn't suffice. An advantage of (3) is that it should work well on two-processor hosts, since compression can be done in one CPU while sorting is done on another. (Hmm, perhaps we should consider forking even if we use a built-in default compressor, for the same reason.) I've started working on this, and have made good progress so far. There are a lot of subtleties, though, like making sure the forked child doesn't receive SIGINT and unlink all our temp files before it execs (I've solved that problem), and making sure the compress process finishes compressing the temp file before the corresponding decompress process starts processing it (I've got a plan for that). Anyway, my point is, I've gotten off to a good start, but it's going to take a lot of testing to make sure I've done it right due to all these race conditions. The actual compression is obviously a lot better (using gzip / bzip2), and it shouldn't be hard to extend the code so sort can read and write externally compressed files, which is what the OP wanted. It's not faster (not even close) on my machine, though. Of course, I've only got one CPU, and a slow one at that :-) Dan
Re: feature request: gzip/bzip support for sort
On Sat, Jan 13, 2007 at 10:07:59PM -0800, Paul Eggert wrote: 3. I can see where the user might be able to specify a better algorithm, for a particular data set. For that, how about if we have a --compress-program=PROGRAM option, which lets the user plug in any program that works as a pipeline? E.g., --compress-program=gzip would use gzip. The default would be to use PROGRAM -d to decompress; we could have another option if that doesn't suffice. This is pretty close. There are just a few more problems that I see: * It doesn't respond well to SIGPIPE. It terminates without success, so at least the user knows something went wrong, if they check the status, but it doesn't print an error message. * It checks the exit status of decompression processes at the very end, so the user is notified (with an error message and non-successful exit code) if a decompression process fails, but after the output has been generated. I don't think there's anything we can do about this in general, though. * If we can't fork/exec a decompression process, we're hosed, but if we can't fork/exec a compression process, a possible improvement might be to just proceed without compression (and maybe print a warning). On the other hand, the user might want us to terminate (e.g., if they just spelled the name of the program wrong). * Like Paul said, we might want a separate decompress program, and the user might want to pass in options (e.g., --compress-program='gzip -1'). I'm still testing, but so far it's looking pretty good. If you think this patch has a good chance of making it into coreutils, then I'll keep working on it. It does a good job of compression, so I think it will solve (most of) the OP's problem, but it's not a speed booster on my machine. If someone with a multi-processor computer wants to test it maybe we can get some good news on that front. Here's the patch for comments.
Thanks, Dan Index: sort.c === RCS file: /sources/coreutils/coreutils/src/sort.c,v retrieving revision 1.344 diff -p -u -r1.344 sort.c --- sort.c 13 Dec 2006 21:27:05 - 1.344 +++ sort.c 16 Jan 2007 01:34:50 - @@ -1,5 +1,5 @@ /* sort - sort lines of text (with all kinds of options). - Copyright (C) 1988, 1991-2006 Free Software Foundation, Inc. + Copyright (C) 1988, 1991-2007 Free Software Foundation, Inc. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by @@ -25,6 +25,7 @@ #include <getopt.h> #include <sys/types.h> +#include <sys/wait.h> #include <signal.h> #include "system.h" #include "error.h" @@ -261,6 +262,9 @@ static bool have_read_stdin; /* List of key field comparisons to be tried. */ static struct keyfield *keylist; +/* Program used to (de)compress temp files. Must accept -d. */ +static const char *compress_program; + static void sortlines_temp (struct line *, size_t, struct line *); /* Report MESSAGE for FILE, then clean up and exit. @@ -328,6 +332,9 @@ Other options:\n\ multiple options specify multiple directories\n\ -u, --unique with -c, check for strict ordering;\n\ without -c, output only the first of an equal run\n\ + --compress-program=PROG (de)compress temporaries with PROG, which must\n\ + (de)compress stdin to stdout, and accept the -d\n\ + command line option for decompressing\n\ "), DEFAULT_TMPDIR); fputs (_("\ -z, --zero-terminated end lines with 0 byte, not newline\n\ "), ... @@ -364,7 +371,8 @@ native byte values.\n\ non-character as a pseudo short option, starting with CHAR_MAX + 1.
*/ enum { - RANDOM_SOURCE_OPTION = CHAR_MAX + 1 + RANDOM_SOURCE_OPTION = CHAR_MAX + 1, + COMPRESS_PROGRAM }; static char const short_options[] = "-bcdfgik:mMno:rRsS:t:T:uy:z"; @@ -373,6 +381,7 @@ static struct option const long_options[ { {"ignore-leading-blanks", no_argument, NULL, 'b'}, {"check", no_argument, NULL, 'c'}, + {"compress-program", required_argument, NULL, COMPRESS_PROGRAM}, {"dictionary-order", no_argument, NULL, 'd'}, {"ignore-case", no_argument, NULL, 'f'}, {"general-numeric-sort", no_argument, NULL, 'g'}, @@ -403,6 +412,7 @@ static sigset_t caught_signals; struct tempnode { struct tempnode *volatile next; + pid_t pid; /* If compressed, the pid of compressor, else zero */ char name[1]; /* Actual size is 1 + file name length. */ }; static struct tempnode *volatile temphead; @@ -422,8 +432,8 @@ cleanup (void) /* Create a new temporary file, returning its newly allocated name. Store into *PFP a stream open for writing. */ -static char * -create_temp_file (FILE **pfp) +static struct tempnode *
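The --compress-program contract described above (filter stdin to stdout; accept -d to decompress) boils down to a pipe-plus-fork-plus-exec helper. The sketch below is illustrative, not sort.c's pipe_fork: error handling is pared down and the function name is invented. Run with "cat" as the program, it acts as the identity compressor James asks about elsewhere in the thread.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork PROG with its stdin on a fresh pipe and its stdout on OUTFD.
   Return the pipe's write end (the parent writes uncompressed data
   there), or -1 on failure; store the child's pid in *PID. */
static int
filter_to_fd (const char *prog, int outfd, pid_t *pid)
{
  int fds[2];
  if (pipe (fds) != 0)
    return -1;
  *pid = fork ();
  if (*pid < 0)
    {
      close (fds[0]);
      close (fds[1]);
      return -1;
    }
  if (*pid == 0)
    {
      /* Child: pipe read end becomes stdin, OUTFD becomes stdout. */
      close (fds[1]);
      if (dup2 (fds[0], STDIN_FILENO) < 0 || dup2 (outfd, STDOUT_FILENO) < 0)
        _exit (127);
      close (fds[0]);
      execlp (prog, prog, (char *) NULL);
      _exit (127);                /* exec failed */
    }
  close (fds[0]);                 /* parent keeps only the write end */
  return fds[1];
}
```

Decompression is the mirror image: the child gets the temp file on stdin, a pipe on stdout, and the extra -d argument.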
Re: feature request: gzip/bzip support for sort
Thanks very much for looking into that. Some comments: Dan Hipschman [EMAIL PROTECTED] writes: + /* This is so the child process won't delete our temp files + if it receives a signal before exec-ing. */ + sigprocmask (SIG_BLOCK, &caught_signals, &oldset); + saved_temphead = temphead; + temphead = NULL; + + if ((pid = fork ()) < 0) +error (SORT_FAILURE, errno, _("couldn't fork")); + else if (0 < pid) +{ + temphead = saved_temphead; + sigprocmask (SIG_SETMASK, &oldset, NULL); +} + else +{ + sigprocmask (SIG_SETMASK, &oldset, NULL); There is a path out of code that does not restore the signal mask, which sounds dangerous. The usual pattern looks more like this: sigprocmask (SIG_BLOCK, &caught_signals, &oldset); unlink_status = unlink (name); unlink_errno = errno; *pnode = next; sigprocmask (SIG_SETMASK, &oldset, NULL); The critical section is as small as possible, and it always restores the signal mask. Any tests for whether the critical section succeeded or not are deferred until after the critical section is over. + close_or_die (STDIN_FILENO, _("standard input")); + close_or_die (STDOUT_FILENO, _("standard output")); I don't see why we care whether these 'close' calls fail. The file descriptors are junk at this point. In general, we shouldn't test for 'close' failures unless we actually care about whether the file's I/O went well. There are other places where this is an issue. + if ((node->pid = pipe_fork (pipefds))) These days the usual style for 'if' is: node->pid = pipe_fork (pipefds); if (node->pid) + pid_t *pids = xnmalloc (nfiles, sizeof *pids); + memset (pids, 0, nfiles * sizeof *pids); This looks like a case for xcalloc. + while (reap (-1)) +continue; This one has me worried a bit, but perhaps I'm worrying too much. More thoughts: I'm not that worried about signal handling being propagated to the child compress/decompress processes. Let them worry about their own signal handling and cleanup. They shouldn't run for long if we exit.
However, we shouldn't block or mask out signals for them; they should inherit the same signal mask etc. that 'sort' does. coreutils doesn't invoke pipe() anywhere else, and it rarely invokes fork() so we have to be prepared for the usual portability gotchas in this area. Could you please look at the gnulib pipe module and see whether it solves any problems here? Perhaps coreutils can use it directly, or add options to it to make it usable. It does reflect a lot of implementation wisdom. Even if we don't use the pipe module we should at least be aware of the problems it fixes. vfork is more efficient than fork on many platforms, particularly when the parent process is large (which can easily be the case here). It has some porting issues, though (as you can see if you look at diff3 and/or sdiff). We'll need documentation, of course (coreutils.texi and NEWS). The documentation should reserve space and backslash in the compress program name, for future use.
Re: feature request: gzip/bzip support for sort
OK, here's my current patch and theory on why it's still not fast enough. Here's profiling info on the cvs version of sort, up to write_bytes:

   %   cumulative   self              self    total
  time   seconds   seconds    calls   s/call  s/call  name
  38.09     7.85     7.85                             memcoll
  22.42    12.47     4.62  3827913     0.00    0.00   compare
  13.73    15.30     2.83   133249     0.00    0.00   mergelines
   6.94    16.73     1.43                             xmemcoll
   5.92    17.95     1.22        1     1.22    2.84   mergefps
   4.44    18.87     0.92       88     0.01    0.01   fillbuf
   1.89    19.26     0.39        1     0.39   10.50   sort
   1.70    19.61     0.35                             xalloc_die
   1.31    19.88     0.27                             md5_process_block
   1.02    20.09     0.21                             keycompare
   0.85    20.26     0.18    58368     0.00    0.00   sortlines_temp
   0.66    20.40     0.14    58368     0.00    0.00   sortlines
   0.63    20.53     0.13       50     0.00    0.00   write_bytes

Now here's the version compressing temp files, with a few irrelevant things removed:

   %   cumulative   self              self    total
  time   seconds   seconds    calls   s/call  s/call  name
  42.09     8.70     8.70                             memcoll
  20.20    12.88     4.18  3827913     0.00    0.00   compare
  12.19    15.40     2.52   133249     0.00    0.00   mergelines
   6.39    16.72     1.32                             xmemcoll
   4.57    17.66     0.95        1     0.95    2.63   mergefps
   4.40    18.57     0.91       88     0.01    0.01   fillbuf
   1.79    18.94     0.37       50     0.00    0.00   compress_and_write_bytes
   1.74    19.30     0.36                             md5_process_block
   1.60    19.63     0.33        1     0.33    9.87   sort
   1.40    19.92     0.29    58368     0.00    0.00   sortlines_temp
   1.40    20.21     0.29                             xalloc_die
   1.04    20.43     0.22    58368     0.00    0.00   sortlines
   0.65    20.56     0.14                             keycompare
   0.15    20.59     0.03   251662     0.00    0.00   write_bytes
   0.15    20.62     0.03      569     0.00    0.00   load_compression_buffer
   0.05    20.66     0.01     6688     0.00    0.00   read_and_decompress_bytes
   0.05    20.67     0.01       88     0.00    0.00   get_compression_buffer
   0.00    20.67     0.00     3954     0.00    0.00   read_bytes
   0.00    20.67     0.00     1123     0.00    0.00   compress_out
   0.00    20.67     0.00      554     0.00    0.00   compress_wrkmem
   0.00    20.67     0.00      554     0.00    0.00   flush_compression
   0.00    20.67     0.00       16     0.00    0.00   flush_compression_and_xfclose

compress_and_write_bytes takes the place of write_bytes, and as you can see, it's slower. In the second version, we spend less time in write_bytes, so the time sending data to the kernel to write to disk has decreased.
The problem is, we're more than losing that time in compress_and_write_bytes. The part that actually does compression is in flush_compression, and that's fast. The time seems to be spent just copying bytes into the buffer where they will later be compressed. It may be possible to eliminate this overhead if we had a stream compression algorithm instead of a block compression algorithm.

My patch still just slows sort down (although less than before). If anybody can find any obvious improvements to it that I've missed, then maybe there's still hope for it, but I don't think so at this point. I'll look at some different compression algorithms later, or if someone else wants to take up the cause, that's fine by me, because I'm a little busy with school right now. Still, I'll see what I can come up with in my spare time.

Thanks,
Dan

Index: sort.c
===
RCS file: /sources/coreutils/coreutils/src/sort.c,v
retrieving revision 1.344
diff -p -u -r1.344 sort.c
--- sort.c  13 Dec 2006 21:27:05 -  1.344
+++ sort.c  14 Jan 2007 23:59:50 -
@@ -50,6 +50,10 @@ struct rlimit { size_t rlim_cur; };
 # define getrlimit(Resource, Rlp) (-1)
 #endif

+#if HAVE_LZO1X_H
+# include <lzo1x.h>
+#endif
+
 /* The official name of this program (e.g., no `g' prefix). */
 #define PROGRAM_NAME "sort"

@@ -261,6 +265,12 @@ static bool have_read_stdin;
 /* List of key field comparisons to be tried. */
 static struct keyfield *keylist;

+/* Flag to compress temp files. */
+static bool compress_temps = HAVE_LZO1X_H;
+
+/* The number of chars to allocate for compression buffers. */
+static size_t compress_buf_size = 65536;
+
 static void sortlines_temp (struct line *, size_t, struct line *);

 /* Report MESSAGE for FILE, then clean up and exit. */

@@ -399,15 +409,87 @@ static struct option const long_options[

 /* The set of signals that are caught. */
 static sigset_t caught_signals;

+/*
Re: feature request: gzip/bzip support for sort
Dan Hipschman [EMAIL PROTECTED] writes:

   %   cumulative   self             self    total
  time   seconds   seconds   calls   s/call  s/call  name
  47.91     4.92     4.92     9269    0.00    0.00   find_temp

Yes, thanks, that explains the results all right. Without the -S flag, even the largest file on my hard drive (~80M) didn't use any temp files, so I had to resort to this artificial approach.

For a more realistic test, we need to sort something larger, so that 'sort' becomes disk-bound. Maybe a terabyte of data or so, for a typically small machine nowadays that has a GB or two of RAM? It will be a bit of a pain to arrange for a decent set of test cases here, but without them I'm afraid we're just guessing about what will make things run faster overall.
feature request: gzip/bzip support for sort
On some occasions, I have the need to sort extremely large files which compress well using programs such as gzip or bzip. I can emulate the sorting of a gzipped file while keeping the input compressed using shell pipes, e.g.

  zcat in.gz | sort | gzip > out.gz

However, if there is not enough temporary space available for sort to store to (i.e., usually /tmp), then sort will fail. In a similar vein, I can sort multiple large gzipped files, where each file is small enough to sort in available temporary space but the complete data set is not, by making use of fifos:

  for i in *.gz; do
      zcat $i | sort | gzip > $i.out
      mkfifo $i.out.fifo
      zcat $i.out > $i.out.fifo &
  done
  sort -m *.out.fifo | gzip > out.gz

However, things would be made easier if sort, like tar, could support the use of compressors and decompressors, (a) for input and (b) for the temporary files. Are there any difficulties in adding such features to sort that I haven't envisaged?

Many thanks
Craig
Re: feature request: gzip/bzip support for sort
Craig Macdonald [EMAIL PROTECTED] wrote:

  On some occasions, I have the need to sort extremely large files,
  but which compress well using programs such as gzip or bzip. ...

This task has been on the TODO list for some time:

  sort: Compress temporary files when doing large external sort/merges.
  This improves performance when you can compress/uncompress faster
  than you can read/write, which is common in these days of fast CPUs.
  suggestion from Charles Randall on 2001-08-10

Just waiting for someone to work on it. If you're willing to do the job (and can assign copyright) that'd be great. Let me know, and I'll send you the papers to get that process started.
Re: feature request: gzip/bzip support for sort
On Sat, Jan 13, 2007 at 10:36:05PM +0100, Jim Meyering wrote:

  Craig Macdonald [EMAIL PROTECTED] wrote:
    On some occasions, I have the need to sort extremely large files,
    but which compress well using programs such as gzip or bzip. ...
  This task has been on the TODO list for some time:
    sort: Compress temporary files when doing large external sort/merges.
    This improves performance when you can compress/uncompress faster
    than you can read/write, which is common in these days of fast CPUs.
    suggestion from Charles Randall on 2001-08-10
  Just waiting for someone to work on it.

I already wrote a patch to sort quite a while ago that added compression of the temporary files. Unfortunately, with my code, I saw no performance increase, so I discarded the patch. However, if it can fix the problem of running out of space, there may still be some use for it after all. The patch is old and needs some conflicts to be worked out, but I'll get on it and submit ASAP. In the meantime, I've posted the original C code at http://linux.ucla.edu/~dsh/patches/compressed-sort.c in case you want to check it out and send comments. It uses LZO compression because it was supposed to be fast. I also had patches to the autoconf-related stuff, but they seem to be hiding at the moment. That part shouldn't be too hard to rewrite.
Re: feature request: gzip/bzip support for sort
Thanks. I like the idea of compression, but before we get into the details of your patch, what do you mean by there not being a performance improvement with this patch? What's the holdup on performance? It seems to me that compression ought to be a real win. If it's not a win, we shouldn't bother with LZO; instead, we should use an algorithm that will typically be a clear win -- or, if there isn't any such algorithm, we shouldn't hardwire any algorithm at all.

Some other thoughts.

1. Have you looked into other compression algorithms? QuickLZ http://www.quicklz.com/ should compress fairly quickly, if compression speed is the bottleneck. I mention QuickLZ because it's currently the compression-speed champ at http://www.maximumcompression.com/data/summary_mf3.php.

2. For the default, let's not bother with a user-visible option. Let's just use compression, if a good algorithm is available.

3. I can see where the user might be able to specify a better algorithm for a particular data set. For that, how about if we have a --compress-program=PROGRAM option, which lets the user plug in any program that works as a pipeline? E.g., --compress-program=gzip would use gzip. The default would be to use "PROGRAM -d" to decompress; we could have another option if that doesn't suffice.

An advantage of (3) is that it should work well on two-processor hosts, since compression can be done on one CPU while sorting is done on another. (Hmm, perhaps we should consider forking even if we use a built-in default compressor, for the same reason.)