Re: [PATCH 0/18] snprintf cleanups

2017-03-30 Thread Jeff King
On Thu, Mar 30, 2017 at 10:24:36AM -0700, Junio C Hamano wrote:

> > I'm OK either with the series I posted, or wrapping up the alternative
> > in a commit message.
> 
> I do find the updated one easier to follow (if anything it is more
> compact); I do not think it is worth a reroll, but it is easy enough
> to replace the patch part of the original with the updated patch and
> tweak "it's easy to fix by moving to a strbuf" in its log message to
> something like "But it's easy to eliminate the allocation with a few
> helper functions, and it makes the result easier to follow", so I am
> tempted to go that route myself...

Yeah, I think that's all that would be needed. Here it is wrapped up as
a patch:

-- >8 --
Subject: [PATCH] diff: avoid fixed-size buffer for patch-ids

To generate a patch id, we format the diff header into a
fixed-size buffer, and then feed the result to our sha1
computation. The fixed buffer has size '4*PATH_MAX + 20',
which in theory accommodates the four filenames plus some
extra data. Except:

  1. The filenames may not be constrained to PATH_MAX. The
 static value may not be a real limit on the current
 filesystem. Moreover, we may compute patch-ids for
 names stored only in git, without touching the current
 filesystem at all.

  2. The 20 bytes is not nearly enough to cover the
 extra content we put in the buffer.

As a result, the data we feed to the sha1 computation may be
truncated, and it's possible that a commit with a very long
filename could erroneously collide in the patch-id space
with another commit. For instance, if one commit modified
"really-long-filename/foo" and another modified "bar" in the
same directory.

In practice this is unlikely. Because the filenames are
repeated, and because there's a single cutoff at the end of
the buffer, the offending filename would have to be on the
order of four times larger than PATH_MAX.

We could fix this by moving to a strbuf. However, we can
observe that the purpose of formatting this in the first
place is to feed it to git_SHA1_Update(). So instead, let's
just feed each part of the formatted string directly. This
actually ends up more readable, and we can even factor out
some duplicated bits from the various conditional branches.

Technically this may change the output of patch-id for very
long filenames, but it's not worth making an exception for
this in the --stable output. It was a bug, and one that only
affected an unlikely set of paths.  And anyway, the exact
value would have varied from platform to platform depending
on the value of PATH_MAX, so there is no "stable" value.

Signed-off-by: Jeff King 
---
 diff.c | 68 --
 1 file changed, 37 insertions(+), 31 deletions(-)

diff --git a/diff.c b/diff.c
index 58cb72d7e..92d78e459 100644
--- a/diff.c
+++ b/diff.c
@@ -4570,6 +4570,19 @@ static void patch_id_consume(void *priv, char *line, 
unsigned long len)
data->patchlen += new_len;
 }
 
+static void patch_id_add_string(git_SHA_CTX *ctx, const char *str)
+{
+   git_SHA1_Update(ctx, str, strlen(str));
+}
+
+static void patch_id_add_mode(git_SHA_CTX *ctx, unsigned mode)
+{
+   /* large enough for 2^32 in octal */
+   char buf[12];
+   int len = xsnprintf(buf, sizeof(buf), "%06o", mode);
+   git_SHA1_Update(ctx, buf, len);
+}
+
 /* returns 0 upon success, and writes result into sha1 */
 static int diff_get_patch_id(struct diff_options *options, unsigned char 
*sha1, int diff_header_only)
 {
@@ -4577,7 +4590,6 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1,
int i;
git_SHA_CTX ctx;
struct patch_id_t data;
-   char buffer[PATH_MAX * 4 + 20];
 
git_SHA1_Init(&ctx);
memset(&data, 0, sizeof(struct patch_id_t));
@@ -4609,36 +4621,30 @@ static int diff_get_patch_id(struct diff_options 
*options, unsigned char *sha1,
 
len1 = remove_space(p->one->path, strlen(p->one->path));
len2 = remove_space(p->two->path, strlen(p->two->path));
-   if (p->one->mode == 0)
-   len1 = snprintf(buffer, sizeof(buffer),
-   "diff--gita/%.*sb/%.*s"
-   "newfilemode%06o"
-   "---/dev/null"
-   "+++b/%.*s",
-   len1, p->one->path,
-   len2, p->two->path,
-   p->two->mode,
-   len2, p->two->path);
-   else if (p->two->mode == 0)
-   len1 = snprintf(buffer, sizeof(buffer),
-   "diff--gita/%.*sb/%.*s"
-   "deletedfilemode%06o"
-   "---a/%.*s"
-   "+++/dev/nu

Re: [PATCH 0/18] snprintf cleanups

2017-03-30 Thread Junio C Hamano
Jeff King  writes:

> On Wed, Mar 29, 2017 at 09:05:33AM -0700, Junio C Hamano wrote:
>
>> > I think there are two things going on in your example.
>> >
>> > One is that obviously patch_id_addf() removes the spaces from the
>> > result. But we could do that now by keeping the big strbuf_addf(), and
>> > then just walking the result and feeding non-spaces.
>> >
>> > The second is that your addf means we are back to formatting everything
>> > into a buffer again
>> 
>> You are right to point out that I was blinded by the ugliness of
>> words stuck together without spaces in between, which was inherited
>> from the original code, and failed to see the sole point of this
>> series, which is to remove truncation without adding unnecessary
>> allocation and freeing.
>> 
>> Thanks for straighten my thinking out.  I think the seeming
>> ugliness, if it ever becomes a real problem, should be handled
>> outside this series after the dust settles.
>
> Yeah, the no-spaces thing should almost certainly wait.
>
> There is still the minor question of whether skipping the strbuf
> entirely is nicer, even if you still have to feed it strings without
> spaces (i.e., what I posted in my initial reply).
>
> I'm OK either with the series I posted, or wrapping up the alternative
> in a commit message.

I do find the updated one easier to follow (if anything it is more
compact); I do not think it is worth a reroll, but it is easy enough
to replace the patch part of the original with the updated patch and
tweak "it's easy to fix by moving to a strbuf" in its log message to
something like "But it's easy to eliminate the allocation with a few
helper functions, and it makes the result easier to follow", so I am
tempted to go that route myself...


Re: [PATCH 0/18] snprintf cleanups

2017-03-29 Thread Jeff King
On Wed, Mar 29, 2017 at 09:05:33AM -0700, Junio C Hamano wrote:

> > I think there are two things going on in your example.
> >
> > One is that obviously patch_id_addf() removes the spaces from the
> > result. But we could do that now by keeping the big strbuf_addf(), and
> > then just walking the result and feeding non-spaces.
> >
> > The second is that your addf means we are back to formatting everything
> > into a buffer again
> 
> You are right to point out that I was blinded by the ugliness of
> words stuck together without spaces in between, which was inherited
> from the original code, and failed to see the sole point of this
> series, which is to remove truncation without adding unnecessary
> allocation and freeing.
> 
> Thanks for straighten my thinking out.  I think the seeming
> ugliness, if it ever becomes a real problem, should be handled
> outside this series after the dust settles.

Yeah, the no-spaces thing should almost certainly wait.

There is still the minor question of whether skipping the strbuf
entirely is nicer, even if you still have to feed it strings without
spaces (i.e., what I posted in my initial reply).

I'm OK either with the series I posted, or wrapping up the alternative
in a commit message.

-Peff


Re: [PATCH 0/18] snprintf cleanups

2017-03-29 Thread Junio C Hamano
Jeff King  writes:

> On Tue, Mar 28, 2017 at 03:33:48PM -0700, Junio C Hamano wrote:
>
>> Jeff King  writes:
>> 
>> > It's a lot of patches, but hopefully they're all pretty straightforward
>> > to read.
>> 
>> Yes, quite a lot of changes.  I didn't see anything questionable in
>> there.
>> 
>> As to the "patch-id" thing, I find the alternate one slightly easier
>> to read.  Also, exactly because this is not a performance critical
>> codepath, it may be better if patch_id_add_string() filtered out
>> whitespaces; that would allow the source to express things in more
>> natural way, e.g.
>> 
>>  patch_id_addf(&ctx, "new file mode");
>>  patch_id_addf(&ctx, "%06o", p->two->mode);
>>  patch_id_addf(&ctx, "--- /dev/null");
>>  patch_id_addf(&ctx, "+++ b/%.*s", len2, p->two->path);
>> 
>> Or I may be going overboard by bringing "addf" into the mix X-<.
>
> I think there are two things going on in your example.
>
> One is that obviously patch_id_addf() removes the spaces from the
> result. But we could do that now by keeping the big strbuf_addf(), and
> then just walking the result and feeding non-spaces.
>
> The second is that your addf means we are back to formatting everything
> into a buffer again

You are right to point out that I was blinded by the ugliness of
words stuck together without spaces in between, which was inherited
from the original code, and failed to see the sole point of this
series, which is to remove truncation without adding unnecessary
allocation and freeing.

Thanks for straighten my thinking out.  I think the seeming
ugliness, if it ever becomes a real problem, should be handled
outside this series after the dust settles.





Re: [PATCH 0/18] snprintf cleanups

2017-03-28 Thread Jeff King
On Tue, Mar 28, 2017 at 03:33:48PM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > It's a lot of patches, but hopefully they're all pretty straightforward
> > to read.
> 
> Yes, quite a lot of changes.  I didn't see anything questionable in
> there.
> 
> As to the "patch-id" thing, I find the alternate one slightly easier
> to read.  Also, exactly because this is not a performance critical
> codepath, it may be better if patch_id_add_string() filtered out
> whitespaces; that would allow the source to express things in more
> natural way, e.g.
> 
>   patch_id_addf(&ctx, "new file mode");
>   patch_id_addf(&ctx, "%06o", p->two->mode);
>   patch_id_addf(&ctx, "--- /dev/null");
>   patch_id_addf(&ctx, "+++ b/%.*s", len2, p->two->path);
> 
> Or I may be going overboard by bringing "addf" into the mix X-<.

I think there are two things going on in your example.

One is that obviously patch_id_addf() removes the spaces from the
result. But we could do that now by keeping the big strbuf_addf(), and
then just walking the result and feeding non-spaces.

The second is that your addf means we are back to formatting everything
into a buffer again. And it has to be dynamic to handle the final line
there, because "len2" isn't bounded. At which point we may as well go
back to sticking it all in one big strbuf (your example also breaks it
down line by line, but we could do that with separte strbuf_addf calls,
too).

Or you have to reimplement the printf format-parsing yourself, and write
into the sha1 instead of into the buffers. But that's probably insane.

I think the "no extra buffer with whitespace" combo is more like:

  void patch_id_add_buf(git_SHA1_CTX *ctx, const char *buf, size_t len)
  {
for (; len > 0; buf++, len--) {
if (!isspace(*buf))
git_SHA1_Update(ctx, buf, 1);
}
  }

  void patch_id_add_str(git_SHA1_CTX *ctx, const char *str)
  {
patch_id_add_buf(ctx, strlen(str));
  }

  void patch_id_add_mode(git_SHA1_CTX *ctx, unsigned mode)
  {
char buf[16]; /* big enough... */
int len = xsnprintf(buf, "%06o", mode);
patch_id_add_buf(ctx, buf, len);
  }

  patch_id_add_str(&ctx, "new file mode");
  patch_id_add_mode(&ctx, p->two->mode);
  patch_id_add_str(&ctx, "--- /dev/null");
  patch_id_add_str(&ctx, "+++ b/");
  patch_id_add_buf(&ctx, p->two->path, len2);

I dunno. I wondered if feeding single bytes to the sha1 update might
actually be noticeably slower, because I would assume that internally it
generally copies data in larger chunks. I didn't measure it, though.

-Peff


Re: [PATCH 0/18] snprintf cleanups

2017-03-28 Thread Junio C Hamano
Jeff King  writes:

> It's a lot of patches, but hopefully they're all pretty straightforward
> to read.

Yes, quite a lot of changes.  I didn't see anything questionable in
there.

As to the "patch-id" thing, I find the alternate one slightly easier
to read.  Also, exactly because this is not a performance critical
codepath, it may be better if patch_id_add_string() filtered out
whitespaces; that would allow the source to express things in more
natural way, e.g.

patch_id_addf(&ctx, "new file mode");
patch_id_addf(&ctx, "%06o", p->two->mode);
patch_id_addf(&ctx, "--- /dev/null");
patch_id_addf(&ctx, "+++ b/%.*s", len2, p->two->path);

Or I may be going overboard by bringing "addf" into the mix X-<.


[PATCH 0/18] snprintf cleanups

2017-03-28 Thread Jeff King
Our code base calls snprintf() into a fixed-size buffer in a bunch of
places. Sometimes we check the result, and sometimes we accept a silent
truncation. In some cases an overflow is easy given long input. In some
cases it's impossible. And in some cases it depends on how big PATH_MAX
is on your filesystem, and whether it's actually enforced. :)

This series attempts to give more predictable and consistent results by
removing arbitrary buffer limitations. It also tries to make further
audits of snprintf() easier by converting to xsnprintf() where
appropriate.

There are still some snprintf() calls left after this. A few are in code
that's in flux, or is being cleaned up in nearby series (several of my
recent cleanup series were split off from this). A few should probably
remain (e.g., git-daemon will refuse to consider a repo name larger than
PATH_MAX, which may be a reasonable defense against weird memory tricks.
I wouldn't be sad to see this turned into a strbuf with an explicit
length policy enforced separately, though). And there were a few that I
just didn't get around to converting (the dumb-http walker, for example,
but I think it may need a pretty involved audit overall).

It's a lot of patches, but hopefully they're all pretty straightforward
to read.

  [01/18]: do not check odb_mkstemp return value for errors
  [02/18]: odb_mkstemp: write filename into strbuf
  [03/18]: odb_mkstemp: use git_path_buf
  [04/18]: diff: avoid fixed-size buffer for patch-ids
  [05/18]: tag: use strbuf to format tag header
  [06/18]: fetch: use heap buffer to format reflog
  [07/18]: avoid using fixed PATH_MAX buffers for refs
  [08/18]: avoid using mksnpath for refs
  [09/18]: create_branch: move msg setup closer to point of use
  [10/18]: create_branch: use xstrfmt for reflog message
  [11/18]: name-rev: replace static buffer with strbuf
  [12/18]: receive-pack: print --pack-header directly into argv array
  [13/18]: replace unchecked snprintf calls with heap buffers
  [14/18]: combine-diff: replace malloc/snprintf with xstrfmt
  [15/18]: convert unchecked snprintf into xsnprintf
  [16/18]: transport-helper: replace checked snprintf with xsnprintf
  [17/18]: gc: replace local buffer with git_path
  [18/18]: daemon: use an argv_array to exec children

 bisect.c   |  8 +---
 branch.c   | 16 
 builtin/checkout.c |  5 ++---
 builtin/fetch.c|  6 --
 builtin/gc.c   |  8 +---
 builtin/index-pack.c   | 22 --
 builtin/ls-remote.c| 10 ++
 builtin/name-rev.c | 21 -
 builtin/notes.c|  9 -
 builtin/receive-pack.c | 17 ++---
 builtin/replace.c  | 50 +++---
 builtin/rev-parse.c|  5 +++--
 builtin/tag.c  | 42 ++
 cache.h|  7 +--
 combine-diff.c |  7 ---
 daemon.c   | 38 +-
 diff.c | 20 +---
 environment.c  | 14 ++
 fast-import.c  |  9 +
 grep.c |  4 ++--
 http.c | 10 +-
 imap-send.c|  2 +-
 pack-bitmap-write.c| 14 +++---
 pack-write.c   | 16 
 refs.c | 44 ++--
 sha1_file.c|  4 ++--
 submodule.c|  2 +-
 transport-helper.c |  5 +
 28 files changed, 215 insertions(+), 200 deletions(-)