Re: SMP Version of tar

Tim Kientzle Mon, 01 Oct 2012 22:18:33 -0700

On Oct 1, 2012, at 9:51 AM, Brandon Falk wrote:

> I would be willing to work on a SMP version of tar (initially just gzip or 
> something).
> 
> I don't have the best experience in compression, and how to multi-thread it, 
> but I think I would be able to learn and help out.
> 
> Note: I would like to make this for *BSD under the BSD license. I am aware 
> that there are already tools to do this (under GPL), but I would really like 
> to see this existent in the FreeBSD base.
> 
> Anyone interested?


Great!

First rule:  be skeptical.  In particular, tar is so entirely disk-bound that 
many performance optimizations have no impact whatsoever.  If you don't do a 
lot of testing, you can end up wasting a lot of time.

There are a few different parallel command-line compressors and decompressors 
in ports; experiment a lot (with large files being read from and/or written to 
disk) and see what the real effect is.  In particular, some decompression 
algorithms are actually faster than memcpy() when run on a single processor.  
Parallelizing such algorithms is not likely to help much in the real world.
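
A rough way to start that kind of experiment is to measure single-threaded throughput for each codec against a plain memory copy. The sketch below is only an illustration: it uses Python's standard zlib, bz2, and lzma bindings, the helper names (best_mb_per_s, compare_codecs) are made up for this example, and it times in-memory buffers only, so it says nothing about disk speed. Real testing should read and write large files on the disks you care about.

```python
import bz2
import lzma
import time
import zlib


def best_mb_per_s(func, arg, nbytes, repeats=3):
    """Best-of-N throughput of func(arg), in MB/s of uncompressed data."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero interval on very small inputs.
    return nbytes / (1024 * 1024) / max(best, 1e-9)


def compare_codecs(data):
    """Return {label: MB/s} for a memory copy and three stdlib codecs."""
    results = {"copy": best_mb_per_s(bytearray, data, len(data))}
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        results[name + " compress"] = best_mb_per_s(compress, data, len(data))
        # Decompression is also reported per byte of *uncompressed* output.
        results[name + " decompress"] = best_mb_per_s(
            decompress, packed, len(data)
        )
    return results
```

Calling compare_codecs() on the contents of a large, representative file and comparing each figure with the "copy" line (and with what your disk can sustain) gives a first hint about which algorithms could plausibly gain from extra cores.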

The two popular algorithms I would expect to benefit most are bzip2 compression 
and lzma compression (targeting xz or lzip format).  For decompression, bzip2 
is block-oriented so fits SMP pretty naturally.  Other popular algorithms are 
stream-oriented and less amenable to parallelization.

Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2 
implementation that's already under a BSD license.  You should be able to get a 
lot of ideas about how to implement a parallel compression algorithm.  Better 
yet, you might be able to reuse a lot of the existing pbzip2 code.

Mark Adler's pigz is also worth studying.  It's also license-friendly, and is 
built on top of regular zlib, which is a nice technique when it's feasible.

There are three fundamentally different implementation approaches with 
different complexity/performance issues:

  * Implement as a stand-alone executable similar to pbzip2.  This makes your 
code a lot simpler and makes it reasonably easy for people to reuse your work.  
This could work with tar, though it could be slightly slower than the 
in-process version due to the additional data-copying and process-switch 
overhead.

  * Implement within libarchive directly.  This would benefit tar and a handful 
of other programs that use libarchive, but may not be worth the complexity.

  * Implement as a standalone library with an interface similar to zlib or 
libbz2 or liblzma.

The last would be my personal preference, though it's probably the most complex 
of all.  A library like that would easily support libarchive, and you could 
create a simple stand-alone wrapper around it as well, giving you the best of 
all worlds.

If you could extend the pigz technique, you might be able to build a 
multi-threaded compression library where the actual compression was handled by 
an existing single-threaded library.  Since zlib, libbz2, and liblzma already 
have similar interfaces, your layer might require only a thin adapter to handle 
any of those three.  *That* would be very interesting, indeed.
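
To make the "thin adapter" idea concrete, here is a minimal sketch in Python, assuming the standard gzip, bz2, and lzma modules as stand-ins for the C libraries. It is a simplification and not how pigz itself works: pigz produces a single gzip stream, whereas this sketch compresses each chunk as an independent stream and concatenates the results, which costs some compression ratio. The names CODECS, parallel_compress, and decompress are hypothetical. As far as I know, CPython's compression modules release the interpreter lock while compressing, so threads can run in parallel here, but verify that on your build.

```python
import bz2
import gzip
import lzma
from concurrent.futures import ThreadPoolExecutor

# The whole "adapter": one (compress, decompress) pair per format.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def parallel_compress(data, codec="gzip", chunk_size=1 << 20, workers=4):
    """Compress fixed-size chunks in worker threads and concatenate them.

    Each chunk becomes a complete, independent stream in the chosen format.
    """
    compress, _ = CODECS[codec]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid (empty) stream
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the output is deterministic.
        return b"".join(pool.map(compress, chunks))


def decompress(blob, codec="gzip"):
    """Single-threaded decode; these functions accept concatenated streams."""
    _, decompress_func = CODECS[codec]
    return decompress_func(blob)
```

Because the output is just a series of ordinary streams, the regular single-threaded decoders can read it. A real library would also need a streaming interface, bounded memory use, and error handling, none of which this sketch attempts.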

Sounds like a fun project.  I wish I had time to work on something like this.

Cheers,

Tim

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
