GNU tar fans and developers, I would like to alert you to an issue that has been lurking in tar since the dinosaurs died off and first started making tar.
The problem has to do with the "skipping to next header" behavior, which does not work well when the archive contains files that are themselves tar archives, or that contain tar archives within them. Tar has no way of telling the difference between a tar header that is part of the archive and one that belongs to a member of the archive.

Here is a way to eliminate this misbehavior once and for all:

a. Add a --guid-header option to tar, which really only affects the create command. At the start of the tar "create" session, it picks a random GUID, and then includes this GUID (as an extension datum) in every header record generated.

b. When reading the first header record, save off any GUID datum that is encoded there.

c. When reading successive header records, if a GUID datum is "known" from step b, require that the present header record contain the same GUID datum. If not, consider this header invalid and skip forward looking for another suitable header record (one that contains the proper GUID datum).

The unlikelihood of a member picking the same GUID (if it even has GUIDs in its headers) eliminates the problem. (A minimal sketch of this check appears below.)

For those interested, here is the sad tale of how this became a big problem. For many years we had been using tar-1.11.8 to do our nightly backups. To prevent inconsistencies and incompatibilities in our backup procedures, we chose to patch the several problems we ran into (long names, buffering on --multi-volume, etc.) rather than risk upgrading to newer versions of tar. Recently it became necessary to recover one of our servers, and the administrators were repeatedly unsuccessful in extracting the archive. The only clue we had was a message to the effect that tar didn't see a header record at the point it expected one, and was skipping forward to find one. Very soon after this it extracted a few files that seemed "out of place" and then stopped, declaring victory. Unfortunately, only about 75 GB of the roughly 250 GB archive had been extracted!

I tracked the problem down to the following:

1. This old tar has an 8 GB limit on the file size that can be properly represented in the archive.

2. It does not warn about files that exceed this 8 GB limit.

3. The volume we were backing up contained a file 14 GB in size. Tar wrote all 14 GB into the archive. The header for this file, however, recorded the size reduced modulo 8 GB, i.e. 6 GB. (The arithmetic is illustrated below.)

4. While extracting this archive, tar was expecting (from the header) to see a file only 6 GB long. This left it 6 GB into a 14 GB file, where it did not see a header record. So it started skipping forward to find another header record.

5. This 14 GB file happened to be a "virtual disk" file for a VMware virtual machine. The virtual machine had a version of Linux loaded into it, and so it contained quite a number of (uncompressed) .tar files lying around.

6. The first of these that tar slammed into contained 6 or 8 files, which it dutifully extracted (their pathnames had nothing to do with those that came before the "skipping to next header" message).

7. The big gotcha, however, was that when tar read the "EOF" record of this little embedded .tar file, it decided that it was done -- leaving some 175 GB of the real archive unextracted.

Yes, I was able to pull the 14 GB file out of the archive manually, remove that 14 GB chunk from the 250 GB archive, and then correctly extract the edited archive.
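To make the modulo arithmetic in item 3 concrete: the classic ustar header stores the member size as 11 octal digits, so the largest representable size is 8^11 - 1 bytes, just under 8 GiB. The snippet below is a plausible reconstruction of the failure, not tar-1.11.8's actual code; it assumes the oversized value was simply reduced to its low 33 bits before being formatted into the field.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* The ustar size field holds 11 octal digits (12 bytes with the
     * terminator), so anything >= 8^11 = 2^33 bytes cannot fit.     */
    uint64_t limit     = 1ULL << 33;        /* 8 GiB                 */
    uint64_t real_size = 14ULL << 30;       /* the 14 GiB disk file  */
    uint64_t stored    = real_size % limit; /* what the header says  */

    char field[13];
    snprintf(field, sizeof field, "%011llo", (unsigned long long) stored);
    printf("real %llu GiB -> header says %llu GiB (field \"%s\")\n",
           (unsigned long long) (real_size >> 30),
           (unsigned long long) (stored >> 30), field);
    /* prints: real 14 GiB -> header says 6 GiB (field "60000000000") */
    return 0;
}

On extraction, tar seeks forward by the stored size -- 6 GB -- lands in the middle of the 14 GB file, finds no header there, and begins the fatal "skipping to next header" search.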
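And here, for concreteness, is a minimal sketch of the GUID check proposed in steps (a) through (c). None of it is GNU tar's actual code: the helpers, the 16-byte GUID format, and the use of /dev/urandom are all hypothetical, and a real implementation would presumably carry the value as a pax extended-header keyword (a hypothetical GNU.tar.guid, say).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUID_LEN 16                 /* 128 random bits */

static unsigned char session_guid[GUID_LEN];
static int have_guid = 0;

/* Step (a): at the start of a create session, pick one random GUID
 * to be stamped into every header record written.                   */
static void pick_session_guid(void)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL || fread(session_guid, 1, GUID_LEN, f) != GUID_LEN) {
        fprintf(stderr, "cannot read /dev/urandom\n");
        exit(EXIT_FAILURE);
    }
    fclose(f);
    have_guid = 1;
}

/* Steps (b) and (c): on read, remember the GUID seen in the first
 * header, then reject any later header whose GUID differs.  A
 * mismatch means "keep skipping forward" -- an embedded member
 * archive's headers can never satisfy this test.                    */
static int header_guid_ok(const unsigned char hdr_guid[GUID_LEN])
{
    if (!have_guid) {
        memcpy(session_guid, hdr_guid, GUID_LEN);
        have_guid = 1;
        return 1;
    }
    return memcmp(session_guid, hdr_guid, GUID_LEN) == 0;
}

int main(void)
{
    unsigned char foreign[GUID_LEN] = { 0 };  /* a member's header   */
    pick_session_guid();
    printf("own header accepted:     %d\n", header_guid_ok(session_guid));
    printf("foreign header accepted: %d\n", header_guid_ok(foreign));
    return 0;
}

The point of 128 random bits is the entropy: the chance of an embedded member archive carrying the same value in any given header is 2^-128, whereas process IDs (see the note below) span at most a few million values.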
As disks get larger, virtual machines become more prevalent, and Linux / *nix operating systems appear in more of these virtual disks, problems of this sort are going to become more severe. A simple GUID in every header record can go a long way toward eliminating problems caused by archive members that contain tar files within them.

We have now upgraded to tar-1.26, which gets rid of the file size limitations, but this skipping-headers problem apparently remains -- undisturbed, as it has been, for eons.

Note -- there seems to be some data stored in extension records now that contains process IDs. The space of process IDs is too small, and has too little randomness, to guarantee disambiguation of headers -- even if they were checked in this fashion, which they apparently are not. A GUID fits the bill precisely.

David

David M. Warme, Ph.D.
Principal Computer Scientist
Group W, Inc.
8315 Lee Highway, Suite 400
Fairfax, VA 22031 USA
