Hi Pavel-

Thanks for getting back to me on this. Let me see if I can explain my reasoning on how I was assigning to stat.st_size and archive_file_size.

When the extended header starts being parsed, stat.st_size and archive_file_size have the same value, always. They have the value from the ustar header (which may be zero if the real value didn't fit in the header). If the file is sparse, then there will be a GNU.sparse.realsize extended header field. If it is not sparse, then that field will not be there. There may be a "size" extended header field if the amount of data in the file exceeds 8G (for sparse or non-sparse files).

The objective is that for all files, the amount of real data goes in archive_file_size, and the overall size of the file goes in stat.st_size. For sparse files, those two may be different. For non-sparse files, those two values will be the same.

So when I encounter a "GNU.sparse.realsize" value, that value should always go in archive_file_size. The file is sparse, and the apparent size of the file will either come from a "size" extended attribute (which may come before or after the GNU.sparse.realsize) or from the ustar header.

When I encounter a "size" extended header, that value should always go in stat.st_size. For sparse files, that's the only place it goes; the value in archive_file_size will always come from GNU.sparse.realsize, which will always be there. For non-sparse files, however, archive_file_size and stat.st_size should always be set to the same value (since the apparent size and the size on disk are the same). Therefore, in the "size" handler, I need to figure out whether or not I should write archive_file_size with the value from the "size" attribute. (I could probably just look to see if the file is sparse, and react to that, but that wasn't what I did.)

When I'm handling a "size" extended header attribute, if the file is non-sparse, then archive_file_size and stat.st_size will always have the same value. They get set to the same value before the extended attribute processing, and only get updated by the "size" attribute handler.
So if they have the same value, then I keep them the same by setting archive_file_size as well. (I always set stat.st_size, unconditionally.)

The only way those two values (archive_file_size and stat.st_size) could differ is if I found a GNU.sparse.realsize header before I parsed the "size" header. So if the "size" handler looks at the two values, and they are different, then that can only be because the file is sparse, and archive_file_size was already set by the GNU.sparse.realsize handler. In that case I should not overwrite that value.

If the file is sparse, and those two values are the same, then it's because I just haven't parsed the GNU.sparse.realsize extended header yet (which must be present), so it's no big deal if I write the archive_file_size value; it's going to get overwritten anyway when I get to the GNU.sparse.realsize header. So it's always safe for me to write archive_file_size if archive_file_size == stat.st_size on entry to the "size" handler.

Now my logic above is contingent on the following assumptions:

1. The only extended attributes that touch stat.st_size and archive_file_size are GNU.sparse.realsize and "size".
2. Prior to the start of parsing the extended header options, archive_file_size and stat.st_size have the same value (it should be the one from the ustar header, but I don't care where it came from as long as they are the same).
3. A sparse file will always have a GNU.sparse.realsize extended header attribute.

I hadn't really looked at the source for tar prior to last week, so if my assumptions above are not correct, then my logic above is flawed.
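To make the handler logic above concrete, here is a small C model of it. This is only a sketch of the reasoning in this message, not code from GNU tar: struct entry_sizes, struct xattr, handle_realsize, handle_size and decode are hypothetical names made up for illustration, and the field assignments follow the wording above. It relies on the three assumptions just listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two fields discussed above;
   this is not GNU tar's real data structure. */
struct entry_sizes {
    int64_t st_size;
    int64_t archive_file_size;
};

/* Hypothetical representation of one extended header record. */
struct xattr {
    const char *keyword;
    int64_t value;
};

/* "GNU.sparse.realsize" handler as described above:
   the value always goes in archive_file_size. */
static void handle_realsize(struct entry_sizes *e, int64_t value)
{
    e->archive_file_size = value;
}

/* "size" handler as described above: always set st_size, and set
   archive_file_size too only if the two fields were still equal on
   entry (meaning no realsize record has been seen yet). */
static void handle_size(struct entry_sizes *e, int64_t value)
{
    int in_sync = (e->archive_file_size == e->st_size);

    e->st_size = value;
    if (in_sync)
        e->archive_file_size = value;
}

/* Start both fields from the ustar header value (assumption 2),
   then apply the extended header records in the order given. */
static struct entry_sizes decode(int64_t ustar_size,
                                 const struct xattr *x, size_t n)
{
    struct entry_sizes e = { ustar_size, ustar_size };

    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(x[i].keyword, "GNU.sparse.realsize") == 0)
            handle_realsize(&e, x[i].value);
        else if (strcmp(x[i].keyword, "size") == 0)
            handle_size(&e, x[i].value);
    }
    return e;
}
```

For example, starting from a ustar size of 0, decoding GNU.sparse.realsize=100 and size=900 in either order leaves archive_file_size at 100 and st_size at 900, while a lone size=900 sets both fields to 900.
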
Thanks a lot-
-chris

So when I start parsing the extended header fields, if I find a "size" header

-----Original Message-----
From: Pavel Raiskup [mailto:[email protected]]
Sent: Monday, October 27, 2014 2:38 PM
To: [email protected]
Cc: Niessen, Chris
Subject: Re: [Bug-tar] Incorrect listing of sparse files with more than 8G of real data

Hello Chris,

On Saturday 25 of October 2014 21:08:38 Niessen, Chris wrote:
> If a sparse file with more than 8G of real data is stored in a POSIX
> format archive (which is done correctly in 1.28), listing the contents
> of the archive will fail.

[SNIP]

> A patch to address this was submitted against 1.27
> http://www.mail-archive.com/bug-tar%40gnu.org/msg03905.html
> but it doesn't seem to have made it in to 1.28.

correct thread, but better link is:

http://www.mail-archive.com/bug-tar%40gnu.org/msg03910.html

.. which iterated to:

http://www.mail-archive.com/bug-tar%40gnu.org/msg03917.html

The last patch deals with realsize/size once *all* extended headers are decoded - thus the order of 'GNU.sparse.realsize' vs. 'size' extended headers does not matter.

> Before finding that patch, I generated my own that modifies size_decoder
> to put the value of the "size" extended header value into
> archive_file_size

I understand so far, however ..

> , and if archive_file_size and stat.st_size have the
> same value (meaning stat.st_size hasn't been updated by a previously
> parsed extended header), then the "size" attribute will also get put
> into stat.st_size. That way, stat.st_size will be updated properly for
> non-sparse files, but will not be clobbered for sparse ones.

... I'm getting lost here because you *now* assigned a value to the archive_file_size. In this case, the 'stat.st_size' may already be set (a) by the parsed ustar header and (b) also re-assigned once more by extended header 'GNU.sparse.realsize' (if it was decoded before 'size' ext. header) but why the 'stat.st_size' and 'archive_file_size' should have the same value if you changed one of those?

Pavel
