Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-23 Thread Jeff King
On Wed, Jun 22, 2016 at 07:46:13AM +0200, René Scharfe wrote:

> Yes, it's only useful as a debug flag, but the fact that it breaks
> highlights an (admittedly mostly theoretical) shortcoming: the code doesn't
> handle extended headers that are over the size limit as nicely as it could.
> So the test was already worth it, even if it won't land in master. :)

Kind of. It was impossible to trigger in the original (and we still
don't actually handle it in the revised version; we just die in
xsnprintf).
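
For context, xsnprintf is git's checked snprintf: it dies instead of
truncating. A minimal sketch of that contract (paraphrased, not the
exact wrapper.c source):

    /* assumes <stdio.h>, <stdarg.h> and git's die() from usage.c */
    int xsnprintf(char *dest, size_t max, const char *fmt, ...)
    {
            va_list ap;
            int len;

            va_start(ap, fmt);
            len = vsnprintf(dest, max, fmt, ap);
            va_end(ap);

            /* refuse to hand back a truncated result */
            if (len < 0 || (size_t)len >= max)
                    die("BUG: attempt to snprintf into too-small buffer");
            return len;
    }

So an extended header that itself exceeds the limit aborts the whole
archive operation rather than producing corrupt output.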

But still, I'll go with the simpler thing we've discussed here. The
symmetry with ustar_mtime isn't worth it, and doubly so if we just write
a single pax header.

-Peff


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread René Scharfe
On 21.06.2016 at 23:04, Jeff King wrote:
> On Tue, Jun 21, 2016 at 10:57:43PM +0200, René Scharfe wrote:
> 
>> On 21.06.2016 at 22:42, René Scharfe wrote:
>>> The value 120 is magic; we need it to pass the tests.  That's
>>> because prepare_header() is used for building extended header
>>> records as well and we don't create extended headers for extended
>>> headers (not sure if that would work anyway), so they simply
>>> vanish when they're over the limit as their size field is set to
>>> zero.
>>
>> So how about something like this to make sure extended headers are
>> only written for regular files and not for other extended headers?
> 
> This is quite similar to what I wrote originally, but I moved to the
> ustar_size() format to better match the mtime code (which needs
> something like that, because we pass around args->time).
> 
> I think you could drop ustar_size() completely here and just put the
> "if" into write_tar_entry().

Which would look like this:

---
 archive-tar.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index cb99df2..274bdfa 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -137,6 +137,20 @@ static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
strbuf_addch(sb, '\n');
 }
 
+/*
+ * Like strbuf_append_ext_header, but for numeric values.
+ */
+static void strbuf_append_ext_header_uint(struct strbuf *sb,
+ const char *keyword,
+ uintmax_t value)
+{
+   char buf[40]; /* big enough for 2^128 in decimal, plus NUL */
+   int len;
+
+   len = xsnprintf(buf, sizeof(buf), "%"PRIuMAX, value);
+   strbuf_append_ext_header(sb, keyword, buf, len);
+}
+
 static unsigned int ustar_header_chksum(const struct ustar_header *header)
 {
const unsigned char *p = (const unsigned char *)header;
@@ -208,7 +222,7 @@ static int write_tar_entry(struct archiver_args *args,
struct ustar_header header;
struct strbuf ext_header = STRBUF_INIT;
unsigned int old_mode = mode;
-   unsigned long size;
+   unsigned long size, size_in_header;
void *buffer;
int err = 0;
 
@@ -267,7 +281,13 @@ static int write_tar_entry(struct archiver_args *args,
memcpy(header.linkname, buffer, size);
}
 
-   prepare_header(args, &header, mode, size);
+   size_in_header = size;
+   if (S_ISREG(mode) && size > 077777777777UL) {
+   size_in_header = 0;
+   strbuf_append_ext_header_uint(&ext_header, "size", size);
+   }
+
+   prepare_header(args, &header, mode, size_in_header);
 
if (ext_header.len > 0) {
err = write_extended_header(args, sha1, ext_header.buf,
-- 
2.9.0


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread René Scharfe

On 21.06.2016 at 23:02, Jeff King wrote:

On Tue, Jun 21, 2016 at 10:42:52PM +0200, René Scharfe wrote:


If we could set the limit to a lower value than 8GB for testing then we
could at least check if the extended header is written, e.g. if ustar_size()
could be convinced to return 0 every time using a hidden command line
parameter or an environment variable or something better.


Yes, we could do that, though I think it loses most of the value of the
test. We can check that if we hit an arbitrary value we generate the pax
header, but I think what we _really_ care about is: did we generate an
output that somebody else's tar implementation can handle.


I agree with the last point, but don't see how that diminishes the
value of such a test.  If we provide file sizes only through extended
headers (the normal header field being set to 0) and we can extract
files with correct sizes then tar must have interpreted those headers
as intended, right?
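
In shell terms, the idea would be something like this (a sketch only;
it assumes the tar.ustarsizemax knob from the patch quoted later in
this thread, and the a/bin/sh test file from t5000):

    git -c tar.ustarsizemax=120 archive HEAD >pax-sizes.tar &&
    mkdir pax-sizes &&
    tar xf pax-sizes.tar -C pax-sizes &&
    # all regular-file size fields in the ustar headers are 0 here, so
    # matching content proves tar took the sizes from the pax records
    cmp a/bin/sh pax-sizes/a/bin/sh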


The diminished value is:

   1. This is a situation that doesn't actually happen in real life.

   2. Now we're carrying extra code inside git only for the sake of
  testing (which can have its own bugs, etc).

Still, it may be better than nothing.


-- >8 --
Subject: archive-tar: test creation of pax extended size headers

---
The value 120 is magic; we need it to pass the tests.  That's
because prepare_header() is used for building extended header
records as well and we don't create extended headers for extended
headers (not sure if that would work anyway), so they simply
vanish when they're over the limit as their size field is set to
zero.


Right, so this is sort of what I meant in (2). Now we have a
tar.ustarsizemax setting shipped in git that is totally broken if you
set it to "1".

I can live with it as a tradeoff, but it is definitely a negative IMHO.


Yes, it's only useful as a debug flag, but the fact that it breaks
highlights an (admittedly mostly theoretical) shortcoming: the code
doesn't handle extended headers that are over the size limit as nicely
as it could.  So the test was already worth it, even if it won't land
in master. :)
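
For reference, pax records are tiny. The format is "LEN keyword=value\n"
with LEN counting the whole record, so the size record for the 64GB+1
test file from the commit message is exactly 20 bytes:

    20 size=68719476737\n

That is comfortably below 120, which is why the magic limit keeps the
extended headers themselves intact, while a limit of 1 zeroes their
size fields and makes them vanish.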


René



Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread Jeff King
On Tue, Jun 21, 2016 at 10:57:43PM +0200, René Scharfe wrote:

> On 21.06.2016 at 22:42, René Scharfe wrote:
> > The value 120 is magic; we need it to pass the tests.  That's
> > because prepare_header() is used for building extended header
> > records as well and we don't create extended headers for extended
> > headers (not sure if that would work anyway), so they simply
> > vanish when they're over the limit as their size field is set to
> > zero.
> 
> So how about something like this to make sure extended headers are
> only written for regular files and not for other extended headers?

This is quite similar to what I wrote originally, but I moved to the
ustar_size() format to better match the mtime code (which needs
something like that, because we pass around args->time).
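
For reference, the mtime counterpart would look like this in spirit (a
hedged sketch, since the exact hunk isn't quoted in this message; the
ustar mtime field is 12 octal digits wide, just like the size field):

    static inline time_t ustar_mtime(time_t mtime)
    {
            if (mtime <= 077777777777UL)
                    return mtime;
            else
                    return 0;
    }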

I think you could drop ustar_size() completely here and just put the
"if" into write_tar_entry().

-Peff


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread Jeff King
On Tue, Jun 21, 2016 at 10:42:52PM +0200, René Scharfe wrote:

> >> If we could set the limit to a lower value than 8GB for testing then we
> >> could at least check if the extended header is written, e.g. if 
> >> ustar_size()
> >> could be convinced to return 0 every time using a hidden command line
> >> parameter or an environment variable or something better.
> > 
> > Yes, we could do that, though I think it loses most of the value of the
> > test. We can check that if we hit an arbitrary value we generate the pax
> > header, but I think what we _really_ care about is: did we generate an
> > output that somebody else's tar implementation can handle.
> 
> I agree with the last point, but don't see how that diminishes the
> value of such a test.  If we provide file sizes only through extended
> headers (the normal header field being set to 0) and we can extract
> files with correct sizes then tar must have interpreted those headers
> as intended, right?

The diminished value is:

  1. This is a situation that doesn't actually happen in real life.

  2. Now we're carrying extra code inside git only for the sake of
 testing (which can have its own bugs, etc).

Still, it may be better than nothing.

> -- >8 --
> Subject: archive-tar: test creation of pax extended size headers
> 
> ---
> The value 120 is magic; we need it to pass the tests.  That's
> because prepare_header() is used for building extended header
> records as well and we don't create extended headers for extended
> headers (not sure if that would work anyway), so they simply
> vanish when they're over the limit as their size field is set to
> zero.

Right, so this is sort of what I meant in (2). Now we have a
tar.ustarsizemax setting shipped in git that is totally broken if you
set it to "1".

I can live with it as a tradeoff, but it is definitely a negative IMHO.

-Peff


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread René Scharfe
On 21.06.2016 at 17:59, Jeff King wrote:
> On Tue, Jun 21, 2016 at 12:54:11AM +0200, René Scharfe wrote:
>
>>> Unfortunately, it's quite an expensive test to run. For one
>>> thing, unless your filesystem supports files with holes, it
>>> takes 64GB of disk space (you might think piping straight to
>>> `hash-object --stdin` would be better, but it's not; that
>>> tries to buffer all 64GB in RAM!). Furthermore, hashing and
>>> compressing the object takes several minutes of CPU time.
>>>
>>> We could ship just the resulting compressed object data as a
>>> loose object, but even that takes 64MB. So sadly, this code
>>> path remains untested in the test suite.
>>
>> If we could set the limit to a lower value than 8GB for testing then we
>> could at least check if the extended header is written, e.g. if ustar_size()
>> could be convinced to return 0 every time using a hidden command line
>> parameter or an environment variable or something better.
>
> Yes, we could do that, though I think it loses most of the value of the
> test. We can check that if we hit an arbitrary value we generate the pax
> header, but I think what we _really_ care about is: did we generate an
> output that somebody else's tar implementation can handle.

I agree with the last point, but don't see how that diminishes the
value of such a test.  If we provide file sizes only through extended
headers (the normal header field being set to 0) and we can extract
files with correct sizes then tar must have interpreted those headers
as intended, right?

(Or it just guessed the sizes by searching for the next header magic,
but such a fallback won't be accurate for files ending with NUL
characters due to NUL-padding, so we just have to add such a file.)

René


-- >8 --
Subject: archive-tar: test creation of pax extended size headers

---
The value 120 is magic; we need it to pass the tests.  That's
because prepare_header() is used for building extended header
records as well and we don't create extended headers for extended
headers (not sure if that would work anyway), so they simply
vanish when they're over the limit as their size field is set to
zero.

 archive-tar.c   | 7 ++++++-
 t/t5000-tar-tree.sh | 7 +++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/archive-tar.c b/archive-tar.c
index f53e61c..fbbc4cc 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -14,6 +14,7 @@ static char block[BLOCKSIZE];
 static unsigned long offset;
 
 static int tar_umask = 002;
+static unsigned long ustar_size_max = 077777777777UL;
 
 static int write_tar_filter_archive(const struct archiver *ar,
struct archiver_args *args);
@@ -179,7 +180,7 @@ static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
 
 static inline unsigned long ustar_size(uintmax_t size)
 {
-   if (size <= 077777777777UL)
+   if (size <= ustar_size_max)
return size;
else
return 0;
@@ -412,6 +413,10 @@ static int git_tar_config(const char *var, const char *value, void *cb)
}
return 0;
}
+   if (!strcmp(var, "tar.ustarsizemax")) {
+   ustar_size_max = git_config_ulong(var, value);
+   return 0;
+   }
 
return tar_filter_config(var, value, cb);
 }
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 4b68bba..03bb4c7 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -102,6 +102,7 @@ test_expect_success \
  echo long filename >a/four$hundred &&
  mkdir a/bin &&
  test-genrandom "frotz" 50 >a/bin/sh &&
+ printf "\0\0\0" >>a/bin/sh &&
  printf "A\$Format:%s\$O" "$SUBSTFORMAT" >a/substfile1 &&
  printf "A not substituted O" >a/substfile2 &&
  if test_have_prereq SYMLINKS; then
@@ -157,6 +158,12 @@ test_expect_success 'git-archive --prefix=olde-' '
 
 check_tar with_olde-prefix olde-
 
+test_expect_success !TAR_NEEDS_PAX_FALLBACK 'pax extended size headers' '
+   git -c tar.ustarsizemax=120 archive HEAD >extended_size_header.tar
+'
+
+check_tar extended_size_header
+
 test_expect_success 'git archive on large files' '
 test_config core.bigfilethreshold 1 &&
 git archive HEAD >b3.tar &&
-- 
2.9.0



Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread Jeff King
On Tue, Jun 21, 2016 at 07:44:55PM +0000, Robin H. Johnson wrote:

> On Thu, Jun 16, 2016 at 12:37:33AM -0400, Jeff King wrote:
> > We could ship just the resulting compressed object data as a
> > loose object, but even that takes 64MB. So sadly, this code
> > path remains untested in the test suite.
> Additionally compress the object data, and insert it for the purpose of
> testing? It's still an expensive test time-wise, but repeated
> gzip compression on zeros does still help; to that end, here's the
> pieces to add a testcase while only being <9KiB.

Interesting idea. With bzip2 it actually drops to about 600 bytes (I
suspect we could get it even smaller by writing a custom program to
generate it, but it's diminishing returns).
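
The comparison is easy to reproduce (a sketch; barerepo.tar here is
the uncompressed bare-repo tarball from Robin's recipe further down):

    gzip -9 --no-name <barerepo.tar | wc -c             # single gzip pass
    gzip -9 --no-name <barerepo.tar | gzip -9f | wc -c  # double gzip, <9KiB
    bzip2 -9 <barerepo.tar | wc -c                      # ~600 bytes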

I think there are some other portability issues, though (like does the
receiving tar actually support 64GB sizes).

-Peff


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread René Scharfe
On 21.06.2016 at 22:42, René Scharfe wrote:
> The value 120 is magic; we need it to pass the tests.  That's
> because prepare_header() is used for building extended header
> records as well and we don't create extended headers for extended
> headers (not sure if that would work anyway), so they simply
> vanish when they're over the limit as their size field is set to
> zero.

So how about something like this to make sure extended headers are
only written for regular files and not for other extended headers?
---
 archive-tar.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index ed562d4..f53e61c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -199,7 +199,7 @@ static void prepare_header(struct archiver_args *args,
 {
xsnprintf(header->mode, sizeof(header->mode), "%07o", mode & 07777);
xsnprintf(header->size, sizeof(header->size), "%011lo",
- S_ISREG(mode) ? ustar_size(size) : 0);
+ S_ISREG(mode) ? size : 0);
xsnprintf(header->mtime, sizeof(header->mtime), "%011lo",
  ustar_mtime(args->time));
 
@@ -240,7 +240,7 @@ static int write_tar_entry(struct archiver_args *args,
struct ustar_header header;
struct strbuf ext_header = STRBUF_INIT;
unsigned int old_mode = mode;
-   unsigned long size;
+   unsigned long size, size_in_header;
void *buffer;
int err = 0;
 
@@ -299,12 +299,14 @@ static int write_tar_entry(struct archiver_args *args,
memcpy(header.linkname, buffer, size);
}
 
-   if (S_ISREG(mode) && ustar_size(size) != size)
+   size_in_header = S_ISREG(mode) ? ustar_size(size) : size;
+   if (size_in_header != size)
 strbuf_append_ext_header_uint(&ext_header, "size", size);
+
 if (ustar_mtime(args->time) != args->time)
 strbuf_append_ext_header_uint(&ext_header, "mtime", args->time);
 
-   prepare_header(args, &header, mode, size);
+   prepare_header(args, &header, mode, size_in_header);
 
if (ext_header.len > 0) {
err = write_extended_header(args, sha1, ext_header.buf,
-- 
2.9.0



Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread Robin H. Johnson
On Thu, Jun 16, 2016 at 12:37:33AM -0400, Jeff King wrote:
> We could ship just the resulting compressed object data as a
> loose object, but even that takes 64MB. So sadly, this code
> path remains untested in the test suite.
Additionally compress the object data, and insert it for the purpose of
testing? It's still an expensive test time-wise, but repeated
gzip compression on zeros does still help; to that end, here's the
pieces to add a testcase while only being <9KiB.

t5005-archive-hugefile.sh:
...
mkdir t
zcat t5005-barerepo-64gb-obj.tar.gz.gz | tar xzf - -C t
GIT_DIR=t git archive HEAD | head -c 1 | tar tvf - 2>/dev/null
...

Test repo attached, it's only 8KiB.

Test repo can be recreated with:
truncate -s 64g bigfile
git add bigfile # takes 10 mins
git commit -q -m foo # takes another 10 minutes
tar cf - -C .git . |gzip -9 --no-name |gzip -9f --no-name >barerepo.tar.gz.gz


-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread Jeff King
On Tue, Jun 21, 2016 at 12:54:11AM +0200, René Scharfe wrote:

> > Unfortunately, it's quite an expensive test to run. For one
> > thing, unless your filesystem supports files with holes, it
> > takes 64GB of disk space (you might think piping straight to
> > `hash-object --stdin` would be better, but it's not; that
> > tries to buffer all 64GB in RAM!). Furthermore, hashing and
> > compressing the object takes several minutes of CPU time.
> > 
> > We could ship just the resulting compressed object data as a
> > loose object, but even that takes 64MB. So sadly, this code
> > path remains untested in the test suite.
> 
> If we could set the limit to a lower value than 8GB for testing then we
> could at least check if the extended header is written, e.g. if ustar_size()
> could be convinced to return 0 every time using a hidden command line
> parameter or an environment variable or something better.

Yes, we could do that, though I think it loses most of the value of the
test. We can check that if we hit an arbitrary value we generate the pax
header, but I think what we _really_ care about is: did we generate an
output that somebody else's tar implementation can handle.

And for the smaller-than-64GB case, GNU tar happily handles our existing
output (though I suspect other tars might fail at "only" 8GB).

> > +static inline unsigned long ustar_size(uintmax_t size)
> > +{
> > +   if (size < 077777777777UL)
> 
> Shouldn't that be less-or-equal?

Yeah, you're right (and for the one in the next patch, too).

> > +   if (ustar_size(size) != size)
> > +   strbuf_append_ext_header_uint(&ext_header, "size", size);
> 
> It needs "S_ISREG(mode) && " as well, no?  In practice it probably doesn't
> matter (until someone stores a 8GB long symlink target), but the size field
> should only be set for regular files.

Thanks for noticing that. I remembered wondering that when I was early
in debugging/diagnosing, but forgot to follow up on it. I agree it's
unlikely in practice, but we should have consistent checks (I think it
would actually make sense to move the ISREG check inside ustar_size, and
then we can apply it consistently here and when generating the header;
my goal with ustar_size() was to avoid having the same logic in multiple
places).

-Peff


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-21 Thread Jeff King
On Tue, Jun 21, 2016 at 11:59:20AM -0400, Jeff King wrote:

> > > + if (ustar_size(size) != size)
> > > + strbuf_append_ext_header_uint(&ext_header, "size", size);
> > 
> > It needs "S_ISREG(mode) && " as well, no?  In practice it probably doesn't
> > matter (until someone stores a 8GB long symlink target), but the size field
> > should only be set for regular files.
> 
> Thanks for noticing that. I remembered wondering that when I was early
> in debugging/diagnosing, but forgot to follow up on it. I agree it's
> unlikely in practice, but we should have consistent checks (I think it
> would actually make sense to move the ISREG check inside ustar_size, and
> then we can apply it consistently here and when generating the header;
> my goal with ustar_size() was to avoid having the same logic in multiple
> places).

Actually, scratch that. It makes things awkward because it would be hard
to tell when ustar_size() returned zero because it's a file with a big
size, and thus needs a pax header, versus that it was not a file, and
therefore must _not_ have a pax header.
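
A sketch of the awkwardness (hypothetical code, not from the patch):
with the check folded in,

    static unsigned long ustar_size(unsigned int mode, uintmax_t size)
    {
            if (!S_ISREG(mode))
                    return 0; /* not a regular file: must NOT get a pax header */
            if (size > 077777777777UL)
                    return 0; /* over the limit: MUST get a pax header */
            return size;
    }

a caller that sees 0 cannot tell the two cases apart without repeating
the S_ISREG() test anyway.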

-Peff


Re: [PATCH 1/2] archive-tar: write extended headers for file sizes >= 8GB

2016-06-20 Thread René Scharfe

On 16.06.2016 at 06:37, Jeff King wrote:

The ustar format has a fixed-length field for the size of
each file entry which is supposed to contain up to 11 bytes
of octal-formatted data plus a NUL or space terminator.

This means that the largest size we can represent is
077777777777, or 1 byte short of 8GB. The correct solution
for a larger file, according to POSIX.1-2001, is to add an
extended pax header, similar to how we handle long
filenames. This patch does that, and writes zero for the
size field in the ustar header (the last bit is not
mentioned by POSIX, but it matches how GNU tar behaves with
--format=pax).
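
(As a quick sanity check of that limit, using shell arithmetic:

    $ printf '%d\n' 077777777777
    8589934591

i.e. 2^33 - 1, one byte short of 8GB.)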

This should be a strict improvement over the current
behavior, which is to die in xsnprintf with a "BUG".
However, there's some interesting history here.

Prior to f2f0267 (archive-tar: use xsnprintf for trivial
formatting, 2015-09-24), we silently overflowed the "size"
field. The extra bytes ended up in the "mtime" field of the
header, which was then immediately written itself,
overwriting our extra bytes. What that means depends on how
many bytes we wrote.

If the size was 64GB or greater, then we actually overflowed
digits into the mtime field, meaning our value was
effectively right-shifted by those lost octal digits. And
this patch is again a strict improvement over that.

But if the size was between 8GB and 64GB, then our 12-byte
field held all of the actual digits, and only our NUL
terminator overflowed. According to POSIX, there should be a
NUL or space at the end of the field. However, GNU tar seems
to be lenient here, and will correctly parse a size up to 64GB
(minus one) from the field. So sizes in this range might
have just worked, depending on the implementation reading
the tarfile.
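
(The same arithmetic for the 12-digit reading:

    $ printf '%d\n' 0777777777777
    68719476735

i.e. 2^36 - 1, one byte short of 64GB, matching the range above.)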

This patch is mostly still an improvement there, as the 8GB
limit is specifically mentioned in POSIX as the correct
limit. But it's possible that it could be a regression
(versus the pre-f2f0267 state) if all of the following are
true:

   1. You have a file between 8GB and 64GB.

   2. Your tar implementation _doesn't_ know about pax
  extended headers.

   3. Your tar implementation _does_ parse 12-byte sizes from
  the ustar header without a delimiter.

It's probably not worth worrying about such an obscure set
of conditions, but I'm documenting it here just in case.

There's no test included here. I did confirm that this works
(using GNU tar) with:

   $ dd if=/dev/zero seek=64G bs=1 count=1 of=huge
   $ git add huge
   $ git commit -q -m foo
   $ git archive HEAD | head -c 1 | tar tvf - 2>/dev/null
   -rw-rw-r-- root/root 68719476737 2016-06-15 21:07 huge

Pre-f2f0267, this would yield a bogus size of 8GB, and
post-f2f0267, git-archive simply dies.

Unfortunately, it's quite an expensive test to run. For one
thing, unless your filesystem supports files with holes, it
takes 64GB of disk space (you might think piping straight to
`hash-object --stdin` would be better, but it's not; that
tries to buffer all 64GB in RAM!). Furthermore, hashing and
compressing the object takes several minutes of CPU time.

We could ship just the resulting compressed object data as a
loose object, but even that takes 64MB. So sadly, this code
path remains untested in the test suite.


If we could set the limit to a lower value than 8GB for testing then we 
could at least check if the extended header is written, e.g. if 
ustar_size() could be convinced to return 0 every time using a hidden 
command line parameter or an environment variable or something better.




Signed-off-by: Jeff King 
---
  archive-tar.c | 28 +++++++++++++++++++++++++++-
  1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/archive-tar.c b/archive-tar.c
index cb99df2..7340b64 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -137,6 +137,20 @@ static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
strbuf_addch(sb, '\n');
  }

+/*
+ * Like strbuf_append_ext_header, but for numeric values.
+ */
+static void strbuf_append_ext_header_uint(struct strbuf *sb,
+ const char *keyword,
+ uintmax_t value)
+{
+   char buf[40]; /* big enough for 2^128 in decimal, plus NUL */
+   int len;
+
+   len = xsnprintf(buf, sizeof(buf), "%"PRIuMAX, value);
+   strbuf_append_ext_header(sb, keyword, buf, len);
+}
+
  static unsigned int ustar_header_chksum(const struct ustar_header *header)
  {
const unsigned char *p = (const unsigned char *)header;
@@ -163,12 +177,21 @@ static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
return i;
  }

+static inline unsigned long ustar_size(uintmax_t size)
+{
+   if (size < 077777777777UL)


Shouldn't that be less-or-equal?


+   return size;
+   else
+   return 0;
+}
+
  static void prepare_header(struct archiver_args *args,
   struct ustar_header *header,