Well done on taking the initiative. When you do this, it would also help those interested in reviewing the code if you based your work on the upstream project before committing your changes. That way it is clear what has changed, and a link to the diff can be shared.
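For example, roughly like this (a sketch only; the path to the upstream tree is a placeholder):

  # Commit a pristine copy of the upstream sources first, so your own
  # changes land as separate commits on top of it.
  git init lz4 && cd lz4
  cp -r /path/to/upstream-lz4/. .     # placeholder: unmodified upstream tree
  git add -A
  git commit -m "Import upstream lz4"

  # Now make your changes and commit them separately. Reviewers can then
  # see exactly what changed with:
  git log -p

With the import as its own commit, the diff between it and HEAD is exactly your patch, and that is what you can link to.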
Cheers,
Alex

On 8 July 2011 22:36, Don Bindner <[email protected]> wrote:
> Oh, and you shouldn't use 'std' for stdin and stdout; you should use '-'.
> That's what many programs do (gzip, for example), and the hyphen will be
> more familiar to experienced users since it's already an established
> interface convention.
>
> Don
>
> On Fri, Jul 8, 2011 at 4:31 PM, Don Bindner <[email protected]> wrote:
>>
>> Did you remember to run your tests repeatedly, in different orders, to
>> minimize the effects that caching might have on your results?
>>
>> Don
>>
>> On Fri, Jul 8, 2011 at 4:19 PM, Huan Truong <[email protected]> wrote:
>>>
>>> I recently heard a complaint about gzip on another mailing list. The
>>> poster was backing up tens of GB of data every day, and tar-gzipping it
>>> (tar czvf) was unacceptably slow.
>>>
>>> I once faced the same problem when I needed to create hard drive
>>> snapshots for computers. Naturally I wanted to save bandwidth, so that
>>> I wouldn't have to transfer a lot of data over a 100 Mbps line.
>>>
>>> Suppose compression saves 5 GB on a 15 GB file. Transferring 15 GB
>>> takes 15,000 MB / (100/8) MB/s = 1,200 s = 20 minutes on a perfect
>>> network. On the Truman network (cross-building) it usually takes three
>>> times that, so realistically we need 60 minutes to transfer a 15 GB
>>> snapshot image. The compressed 10 GB file would take only 40 minutes
>>> to transfer. Good deal? No.
>>>
>>> It *didn't help*. Compressing the file takes more than an hour, so the
>>> upload as a whole takes even longer. The clients (Pentium 4 2.8 GHz HT)
>>> also struggle to decompress the file, so the result comes out even. So
>>> why the hassle? My conclusion: it's better *not* to compress the image
>>> with gzip at all. This is even clearer on a fast connection: whatever
>>> you gain in I/O you lose to CPU time, and the result comes out worse.
>>>
>>> It turns out that gzip, and likewise bzip2 and zip, are terrible in
>>> CPU usage: they take a lot of time to compress and decompress. There
>>> are other algorithms that compress a little worse than gzip but are
>>> much easier on the CPU (most of them based on the Lempel-Ziv
>>> algorithm): LZO, Google's Snappy, LZF, and LZ4. LZ4 is crazily fast.
>>>
>>> I did some quick benchmarking with the Linux source tree:
>>>
>>> 1634!ht:~/src/lz4-read-only$ time ./tar-none.sh ../linux-3.0-rc6 linux-s
>>> real 0m4.390s
>>> user 0m0.620s
>>> sys 0m0.870s
>>>
>>> 1635!ht:~/src/lz4-read-only$ time ./tar-gzip.sh ../linux-3.0-rc6 linux-s
>>> real 0m43.683s
>>> user 0m40.901s
>>> sys 0m0.319s
>>>
>>> 1636!ht:~/src/lz4-read-only$ time ./tar-lz4.sh ../linux-3.0-rc6 linux-s
>>> real 0m5.568s
>>> user 0m4.831s
>>> sys 0m0.272s
>>>
>>> A clear win for lz4! (I used a pipe, so in theory it could do even
>>> better.)
>>>
>>> I have patched the lz4 utility so that it happily accepts 'std' as the
>>> infile argument for stdin and as the outfile argument for stdout, so
>>> you can pipe from whatever program you like.
>>>
>>> git clone [email protected]:htruong/lz4.git for the utility.
>>>
>>>
>>> Cheers, nice weekend,
>>> - Huan.
>>> --
>>> Huan Truong
>>> 600-988-9066
>>> http://tnhh.net/
>>>
>>
>
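For anyone who wants to reproduce the benchmark: the tar-*.sh wrappers above weren't posted. Presumably they are thin pipelines along these lines; the script contents, and the infile/outfile calling convention of the patched lz4, are my guesses:

  #!/bin/sh
  # tar-lz4.sh <srcdir> <prefix>: tar the directory and compress the stream.
  # 'std' asks the patched lz4 to read the archive stream from stdin.
  tar cf - "$1" | ./lz4 std "$2.tar.lz4"

tar-gzip.sh would swap the compressor, roughly "tar cf - $1 | gzip -c > $2.tar.gz", and tar-none.sh would be a plain "tar cf $2.tar $1" with no compressor in the pipeline.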
