Thibaut VARENE wrote:
On Fri, Aug 8, 2008 at 8:42 AM, Tim Kientzle <[EMAIL PROTECTED]> wrote:
Thibaut,
John Goerzen forwarded your idea to me.
You can actually implement this on top of the current libarchive
code quite efficiently. Use the low-level archive_write_open()
call and provide your own callbacks that just count the write
requests. Then go through and write the archive as usual,
except skip the write_data() part (for tar and cpio formats,
libarchive will automatically pad the entry with NUL bytes).
Hum, I'm not quite sure I get this right... By "count the write
requests and skip the write_data() part", you mean "count the number
of bytes that should have been written, without writing them"?
Yes.
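For illustration, a minimal sketch of what such counting callbacks could look like (the struct size_counter and the count_* names are just made up for this example; the signatures are the ones archive_write_open() expects per archive_write(3)):

    #include <archive.h>
    #include <sys/types.h>   /* ssize_t */
    #include <stdint.h>

    /* Hypothetical accumulator passed to the callbacks as client_data. */
    struct size_counter { uint64_t total; };

    static int count_open(struct archive *a, void *client_data) {
        (void)a; (void)client_data;
        return ARCHIVE_OK;
    }

    /* Claim the whole buffer was "written", but only add up its length. */
    static ssize_t count_write(struct archive *a, void *client_data,
                               const void *buffer, size_t length) {
        (void)a; (void)buffer;
        ((struct size_counter *)client_data)->total += length;
        return (ssize_t)length;
    }

    static int count_close(struct archive *a, void *client_data) {
        (void)a; (void)client_data;
        return ARCHIVE_OK;
    }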
This may sound slow, but it's really not. One of the libarchive
unit tests uses this approach to "write" 1TB archives in just a couple
of seconds. (This test checks libarchive's handling of very large
archives with very large entries.) Look at test_tar_large.c
for the details of how this particular test works. (test_tar_large.c
actually does more than just count the data, but it should
give you the general idea.)
I will have to look into that code indeed. If I get this right tho,
you're basically suggesting that I read the input files twice: once
without writing the data, and the second time writing the data?
No. I'm suggesting you use three passes:
1) Get the information for all of the files, create archive_entry
objects.
2) Create a "fake archive" using the technique above. You don't need
to read the file data here! After you call archive_write_close(),
you'll know the size of the complete archive. (This is really just your
original idea.)
3) Write the real archive as usual, including reading the actual file
data and writing it to the archive.
Arguably the second read would come from the VFS cache, but that's
only assuming the server isn't too busy serving hundreds of other
files, which is why I'm a bit concerned about optimality... My limited
understanding of the tar format made me believe that it was possible
to know the space taken by a given file in a tar archive just by
looking at its size and adding the necessary padding bytes. Was I
wrong?
You could make this work. If you're using plain ustar (no tar
extensions!), then each file has its data padded to a multiple of 512
bytes and there is a 512-byte header for each file. Then you need to
round the total result up to a multiple of the block size. (The default is
10240 bytes; you should probably set the block size to 512 bytes.)
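As a rough sketch of that arithmetic (assuming plain ustar headers, the usual end-of-archive marker of two 512-byte zero blocks, and no pax/GNU extensions; the function names are just illustrative):

    #include <stdint.h>
    #include <stddef.h>

    #define TAR_BLOCK 512

    /* Round n up to the next multiple of TAR_BLOCK. */
    static uint64_t tar_round_up(uint64_t n) {
        return (n + TAR_BLOCK - 1) & ~(uint64_t)(TAR_BLOCK - 1);
    }

    /* Estimated size of a plain ustar archive for the given file sizes:
     * a 512-byte header per file, data padded to 512 bytes, plus the
     * end-of-archive marker (two zero blocks), all rounded up to the
     * blocking factor (512 if you set bytes_per_block to 512). */
    static uint64_t ustar_size(const uint64_t *sizes, size_t n, uint64_t block) {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += TAR_BLOCK + tar_round_up(sizes[i]);
        total += 2 * TAR_BLOCK;               /* end-of-archive marker */
        return ((total + block - 1) / block) * block;
    }

This only holds as long as no entry needs extended headers (long names, large sizes, etc.), which is exactly where letting libarchive do the counting is safer.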
For reference, here's the (relatively short) code I use:
http://www.parisc-linux.org/~varenet/musicindex/doc/html/output-tarball_8c-source.html
This will work very well with all of the tar and cpio formats.
It won't work well with some other formats where the length
does actually depend on the data.
Yep, that was quite clear indeed ;)
Thanks for your input!