Re: [Bug-tar] Incorrect listing of sparse files with more than 8G of real data

Niessen, Chris Wed, 29 Oct 2014 14:16:49 -0700

Hi Pavel-
Thanks for getting back to me on this.

Let me see if I can explain my reasoning on how I was assigning to stat.st_size 
and archive_file_size.

When the extended header starts being parsed, stat.st_size and 
archive_file_size have the same value, always.  They have the value from the 
ustar header (which may be zero if the real value didn't fit in the header).

If the file is sparse, then there will be a GNU.sparse.realsize extended header 
field.  If it is not sparse, then that field will not be there.  
There may be a "size" extended header field if the amount of data in the file 
exceeds 8G (for sparse or non-sparse files).

The objective is that for all files, the amount of real data goes in 
archive_file_size, and the overall size of the file goes in stat.st_size.  For 
sparse files, those two may be different.  For non-sparse files, those two 
values will be the same.

So when I encounter a "GNU.sparse.realsize" value, that value should always go 
in archive_file_size.  The file is sparse, and the apparent size of the file 
will either come from a "size" extended attribute (which may come before or 
after the GNU.sparse.realsize) or from the ustar header.

When I encounter a "size" extended header, that value should always go in 
stat.st_size.  For sparse files, that's the only place it goes; the value in 
archive_file_size will always come from GNU.sparse.realsize, which will always 
be there.  For non-sparse files, however, archive_file_size and stat.st_size 
should always be set to the same value (since the apparent size and the size on 
disk are the same).    Therefore, in the "size" handler, I need to figure out 
whether or not I should write archive_file_size with the value from the "size" 
attribute.  (I could probably just look to see if the file is sparse, and react 
to that, but that wasn't what I did.)

When I'm handling a "size" extended header attribute, if the file is 
non-sparse, then archive_file_size and stat.st_size will always have the same 
value.  They get set to the same value before the extended attribute 
processing, and only get updated by the "size" attribute handler.  So if they 
have the same value, then I keep them the same by setting archive_file_size as 
well.  (I always set stat.st_size, unconditionally.)  The only way those two 
values (archive_file_size and stat.st_size) could have different values is if I 
found a GNU.sparse.realsize header before I parsed the "size" header.  So if 
the "size" handler looks at the two values, and they are different, then that 
can only be because the file is sparse, and archive_file_size was already set 
by the GNU.sparse.realsize handler.  In which case I should not overwrite that 
value.  If the file is sparse, and those two values are the same, then it's 
because I just haven't parsed the GNU.sparse.realsize extended header yet 
(which must be present), so it's no big deal if I write the archive_file_size 
value; it's going to get overwritten anyway when I get to the 
GNU.sparse.realsize header.  So it's always safe for me to write 
archive_file_size if archive_file_size == stat.st_size on entry to the "size" 
handler.
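
The guard described above can be sketched in a few lines.  This is only a 
model of the reasoning in this message, not code from tar: the TarStat class, 
the handler names, and parse() are made up for illustration, and the 
assignment of each keyword to a field follows the description above exactly 
as written.

```python
from dataclasses import dataclass


@dataclass
class TarStat:
    # Hypothetical stand-ins for stat.st_size and archive_file_size.
    st_size: int
    archive_file_size: int


def realsize_handler(st, value):
    # GNU.sparse.realsize: always stored, never touches st_size.
    st.archive_file_size = value


def size_handler(st, value):
    # Check the guard before st_size changes.
    untouched = st.archive_file_size == st.st_size
    st.st_size = value  # unconditional
    if untouched:
        # Either the file is non-sparse, or realsize has not been seen
        # yet and will overwrite this value later.
        st.archive_file_size = value


HANDLERS = {"GNU.sparse.realsize": realsize_handler, "size": size_handler}


def parse(ustar_size, records):
    # Assumption 2: both fields start out equal.
    st = TarStat(ustar_size, ustar_size)
    for keyword, value in records:
        HANDLERS[keyword](st, value)
    return st


# Both orderings of the two keywords give the same result.
records = [("GNU.sparse.realsize", 9 * 2**30), ("size", 20 * 2**30)]
assert parse(0, records) == parse(0, list(reversed(records)))
```

Note that the guard cannot tell "realsize not parsed yet" apart from 
"realsize was parsed and happened to equal the ustar value"; in the second 
case the "size" handler would overwrite it.  The guard itself is symmetric, 
so it works the same way if the two fields are swapped.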

Now my logic above is contingent on the following assumptions:
1. The only extended attributes that touch stat.st_size and archive_file_size 
are GNU.sparse.realsize and "size".
2. Prior to the start of parsing the extended header options, archive_file_size 
and stat.st_size have the same value (it should be the one from the ustar 
header, but I don't care where it came from as long as they are the same).
3.  A sparse file will always have a GNU.sparse.realsize extended header 
attribute.

I hadn't really looked at the source for tar prior to last week, so if my 
assumptions above are not correct, then my logic above is flawed.

Thanks a lot-
-chris

So when I start parsing the extended header fields, if I find a "size" header

-----Original Message-----
From: Pavel Raiskup [mailto:[email protected]] 
Sent: Monday, October 27, 2014 2:38 PM
To: [email protected]
Cc: Niessen, Chris
Subject: Re: [Bug-tar] Incorrect listing of sparse files with more than 8G of 
real data

Hello Chris,

On Saturday 25 of October 2014 21:08:38 Niessen, Chris wrote:
> If a sparse file with more than 8G of real data is stored in a POSIX
> format archive (which is done correctly in 1.28), listing the contents
> of the archive will fail.

[SNIP]

> A patch to address this was submitted against 1.27
> http://www.mail-archive.com/bug-tar%40gnu.org/msg03905.html
> but it doesn't seem to have made it in to 1.28.

correct thread, but better link is:
http://www.mail-archive.com/bug-tar%40gnu.org/msg03910.html
.. which iterated to:
http://www.mail-archive.com/bug-tar%40gnu.org/msg03917.html

The last patch deals with realsize/size once *all* extended headers are
decoded - thus the order of 'GNU.sparse.realsize' vs. 'size' extended
headers does not matter.
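
As a rough illustration of why deferring the decision removes the ordering 
problem (this is not the code from the linked patch; resolve_sizes() and its 
dictionary argument are hypothetical, and the keyword-to-field assignment 
follows the description at the top of this message):

```python
def resolve_sizes(ustar_size, xhdr):
    """Pick both sizes once every extended header keyword is decoded.

    xhdr is a hypothetical dict mapping keyword -> integer value.
    Returns (st_size, archive_file_size).
    """
    # "size" overrides the ustar header value when present.
    st_size = xhdr.get("size", ustar_size)
    # Without GNU.sparse.realsize the file is non-sparse and both
    # sizes are the same.
    archive_file_size = xhdr.get("GNU.sparse.realsize", st_size)
    return st_size, archive_file_size
```

Because the lookup happens after all keywords have been collected, the order 
in which they appeared in the extended header has no effect on the result.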

> Before finding that patch, I generated my own that modifies size_decoder
> to put the value of the "size" extended header value into
> archive_file_size

I understand so far, however ..

> , and if archive_file_size and stat.st_size have the
> same value (meaning stat.st_size hasn't been updated by a previously
> parsed extended header), then the "size" attribute will also get put
> into stat.st_size.  That way, stat.st_size will be updated properly for
> non-sparse files, but will not be clobbered for sparse ones.

... I'm getting lost here because you *now* assigned a value to the
archive_file_size.  In this case, the 'stat.st_size' may already be set
(a) by the parsed ustar header and (b) also re-assigned once more by
extended header 'GNU.sparse.realsize' (if it was decoded before 'size'
ext. header) but why the 'stat.st_size' and 'archive_file_size' should
have the same value if you changed one of those?

Pavel
