On Sun, Jan 21, 2007 at 07:14:03PM +0100, Jim Meyering wrote:
> Not to look the gift horse in the mouth, but it'd be nice
> if you wrote ChangeLog entries, too. And even (gasp! :-)
> a test case or two. Of course, we'd expect such a test case
> (probably named tests/misc/sort-compress, and based on
> tests/sample-test) to have this line in it:
>
>   . $srcdir/../very-expensive
>
> If you don't have time for that, I'll take care of it, eventually.
I'm not going to stop you :-) I just haven't had the time to look into
it yet, so I've just been running the coreutils tests and then my own
tests. I was planning on adding the tests, as you say, "eventually".

> Default to just "gzip", not /bin/gzip. The latter may not exist;
> your patch already handles that, but /bin/gzip may not be the first
> gzip in PATH. Also, don't bother with the access-X_OK check.
> There's no point in incurring even such a small overhead in the
> general case, when no temporary is used.

The reason I put the access check in there is that if we default to
gzip and it doesn't exist, then of course the exec will fail, the
child will fail, and this will cause sort to fail when it really
should just skip the compression, or try another default if something
suitable exists (what about compress?). How about we just delay
determining the compression program until it's actually needed? For
example, in create_temp, right before "if (compress_program)", we
could have "if (compress_program_known)", and inside that body check
the environment variable and/or do access checks on possible defaults.

> But please address the FIXME I've added.

If we can't fork a compression process, it's not the end of the world:
we just don't compress that file, and sort still works. If we can't
fork a decompression process, though, we can't continue the sort. So I
figure we'll just try twice to fork compression processes; if we can't
do it within about a second, we're probably wasting more time waiting
to fork than we would save over plain disk access. However, we really
do need to be able to fork decompression processes, so we can afford
to wait a very long time for those. I was considering making the
number of tries for decompression processes even larger (right now, it
waits about two minutes before giving up).

> Have you considered using the gnulib hash module rather than
> rolling your own? There are examples in many of coreutils/src/*.c.
I'm not familiar with gnulib, so I didn't know a hash module existed
or think to look for one. Looking at it now, though, it seems it would
be slower because of its abstraction, unless the table fills up to the
point where growing it would make accesses faster. Since it seems to
be my time we're talking about, I'd prefer to leave the table the way
it is, because it's simple, and see later whether the gnulib module is
faster. We've already seen tests (yours :-) showing that the patch
improves large sorts (granted, they used the previous patch, not this
one), so it's still a win to put it in, and optimizations later are
just gravy.

Thanks for the advice. I've got some other work I need to catch up on
first (trust me, I'd rather do this), but hopefully I can get to work
on it tomorrow.

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils