GNU tar fans and developers, I would like to alert you to an issue that has been lurking in tar since the dinosaurs died off and first started making tar.
The problem has to do with the "skipping to next header" behavior, which does not work well when the archive contains files that are themselves tar archives, or that contain tar archives within them. Tar has no way of telling the difference between a tar header that is part of the archive and one that belongs to a member of the archive.

Here is a way to eliminate this misbehavior once and for all:

a. Add a --guid-header option to tar, which really only affects the create command. At the start of the tar "create" session, it picks a random GUID, and then includes this GUID (as an extension datum) in every header record generated.

b. When reading the first header record, save off any GUID datum that is encoded there.

c. When reading successive header records, if a GUID datum is "known" from step b, require that the present header record contain the same GUID datum. If not, consider this header invalid and skip forward looking for another suitable header record (one that contains the proper GUID datum).

The unlikelihood of a member picking the same GUID (if it even has GUIDs in its headers) eliminates the problem. (A minimal sketch of this check appears below.)

For those interested, here is the sad tale of how this became a big problem. For many years we had been using tar-1.11.8 to do our nightly backups. To prevent inconsistencies and incompatibilities in our backup procedures, we chose to patch the several problems we ran into (long names, buffering on --multi-volume, etc.) rather than risk upgrading to newer versions of tar. Recently it became necessary to recover one of our servers, and the administrators were repeatedly unsuccessful in extracting the archive. The only clue we had was a message to the effect that tar didn't see a header record at the point it expected one, and was skipping forward to find one. Very soon after this it extracted a few files that seemed "out of place" and then stopped, declaring victory. Unfortunately, only about 75 GB of the roughly 250 GB archive had been extracted!

I tracked the problem down to the following:

1. This old tar has an 8 GB limit on the file size that can be properly represented in the archive.

2. It does not warn about files that exceed this 8 GB limit.

3. The volume we were backing up contained a file 14 GB in size. Tar wrote all 14 GB into the archive. The header for this file, however, recorded the size reduced modulo 8 GB, i.e. 6 GB. (The arithmetic is illustrated below.)

4. While extracting this archive, tar was expecting (from the header) to see a file only 6 GB long. This left it 6 GB into a 14 GB file, where it did not see a header record. So it started skipping forward to find another header record.

5. This 14 GB file happened to be a "virtual disk" file for a VMware virtual machine. The virtual machine had a version of Linux loaded into it, and so it contained quite a number of (uncompressed) .tar files lying around.

6. The first of these that tar slammed into contained 6 or 8 files, which it dutifully extracted (their pathnames had nothing to do with those that came before the "skipping to next header" message).

7. The big gotcha, however, was that when tar read the "EOF" record of this little embedded .tar file, it decided that it was done -- leaving some 175 GB of the real archive unextracted.

Yes, I was able to pull the 14 GB file out of the archive manually, remove that 14 GB chunk from the 250 GB archive, and then correctly extract the edited archive.
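To make the modulo arithmetic in item 3 concrete: the classic ustar header stores the member size as 11 octal digits, so the largest representable size is 8^11 - 1 bytes, just under 8 GiB. The snippet below is a plausible reconstruction of the failure, not tar-1.11.8's actual code; it assumes the oversized value was simply reduced to its low 33 bits before being formatted into the field.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* The ustar size field holds 11 octal digits (12 bytes with the
     * terminator), so anything >= 8^11 = 2^33 bytes cannot fit.     */
    uint64_t limit     = 1ULL << 33;        /* 8 GiB                 */
    uint64_t real_size = 14ULL << 30;       /* the 14 GiB disk file  */
    uint64_t stored    = real_size % limit; /* what the header says  */

    char field[13];
    snprintf(field, sizeof field, "%011llo", (unsigned long long) stored);
    printf("real %llu GiB -> header says %llu GiB (field \"%s\")\n",
           (unsigned long long) (real_size >> 30),
           (unsigned long long) (stored >> 30), field);
    /* prints: real 14 GiB -> header says 6 GiB (field "60000000000") */
    return 0;
}

On extraction, tar seeks forward by the stored size -- 6 GB -- lands in the middle of the 14 GB file, finds no header there, and begins the fatal "skipping to next header" search.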
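And here, for concreteness, is a minimal sketch of the GUID check proposed in steps (a) through (c). None of it is GNU tar's actual code: the helpers, the 16-byte GUID format, and the use of /dev/urandom are all hypothetical, and a real implementation would presumably carry the value as a pax extended-header keyword (a hypothetical GNU.tar.guid, say).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUID_LEN 16                 /* 128 random bits */

static unsigned char session_guid[GUID_LEN];
static int have_guid = 0;

/* Step (a): at the start of a create session, pick one random GUID
 * to be stamped into every header record written.                   */
static void pick_session_guid(void)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL || fread(session_guid, 1, GUID_LEN, f) != GUID_LEN) {
        fprintf(stderr, "cannot read /dev/urandom\n");
        exit(EXIT_FAILURE);
    }
    fclose(f);
    have_guid = 1;
}

/* Steps (b) and (c): on read, remember the GUID seen in the first
 * header, then reject any later header whose GUID differs.  A
 * mismatch means "keep skipping forward" -- an embedded member
 * archive's headers can never satisfy this test.                    */
static int header_guid_ok(const unsigned char hdr_guid[GUID_LEN])
{
    if (!have_guid) {
        memcpy(session_guid, hdr_guid, GUID_LEN);
        have_guid = 1;
        return 1;
    }
    return memcmp(session_guid, hdr_guid, GUID_LEN) == 0;
}

int main(void)
{
    unsigned char foreign[GUID_LEN] = { 0 };  /* a member's header   */
    pick_session_guid();
    printf("own header accepted:     %d\n", header_guid_ok(session_guid));
    printf("foreign header accepted: %d\n", header_guid_ok(foreign));
    return 0;
}

The point of 128 random bits is the entropy: the chance of an embedded member archive carrying the same value in any given header is 2^-128, whereas process IDs (see the note below) span at most a few million values.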
As disks get larger, virtual machines become more prevalent, and Linux / *nix operating systems appear in more of these virtual disks, problems of this sort are going to become more severe. A simple GUID in every header record can go a long way toward eliminating problems caused by archive members that contain tar files within them.

We have now upgraded to tar-1.26, which gets rid of the file size limitations, but this skipping-headers problem apparently remains -- undisturbed, as it has been, for eons.

Note -- there seems to be some data stored in extension records now that contains process IDs. The space of process IDs is too small, and has too little randomness, to guarantee disambiguation of headers -- even if they were checked in this fashion, which they apparently are not. A GUID fits the bill precisely.

David

David M. Warme, Ph.D.
Principal Computer Scientist
Group W, Inc.
8315 Lee Highway, Suite 400
Fairfax, VA 22031 USA
