[PATCH] shortlog: skip format/parse roundtrip for internal traversal

2017-09-08 Thread Jeff King
On Fri, Sep 08, 2017 at 03:39:36PM +0900, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > IOW, something like the patch below, which pushes the re-parsing out to
> > the stdin code-path, and lets the internal traversal format directly
> > into the final buffer. It seems to be about 3% faster than the existing
> > code, and fixes the leak (by dropping that variable entirely).
> 
> Wow, that is so logical a conclusion that I somewhat feel ashamed
> that I didn't think of it myself.
> 
> Nicely done.

Thanks. Here it is with a commit message.

Note that the non-stdin path no longer looks at the "mailmap" entry of
"struct shortlog" (instead we use the one cached inside pretty.c). But
we still waste time loading it. I'm not sure if it's worth addressing
that. It's only once per program invocation, and it's a little tricky to
fix (we do shortlog_init() before we know whether or not we're using
stdin). We could just load it lazily, though, which would cover the
stdin case.
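
As a rough illustration of the lazy-loading idea, here is a minimal sketch. Everything prefixed with toy_ is hypothetical and stands in for the real mailmap code; it only shows that init stays cheap and the load cost is paid on first use, which only the stdin path would trigger.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the mailmap storage; not git's real API. */
struct toy_mailmap {
	int loaded;      /* has the map been read yet? */
	int load_count;  /* how many times we paid the loading cost */
};

struct toy_shortlog {
	struct toy_mailmap mailmap;
};

/* Pretend to read the mailmap file; only bookkeeping happens here. */
static void toy_read_mailmap(struct toy_mailmap *map)
{
	map->load_count++;
	map->loaded = 1;
}

/* Initialization no longer loads anything. */
static void toy_shortlog_init(struct toy_shortlog *log)
{
	assert(log != NULL);
	log->mailmap.loaded = 0;
	log->mailmap.load_count = 0;
}

/* Load on first use; later calls reuse the already-loaded map. */
static struct toy_mailmap *toy_get_mailmap(struct toy_shortlog *log)
{
	assert(log != NULL);
	if (!log->mailmap.loaded)
		toy_read_mailmap(&log->mailmap);
	return &log->mailmap;
}
```

With this shape, an internal traversal that never calls toy_get_mailmap() never loads the map at all.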

-- >8 --
Subject: shortlog: skip format/parse roundtrip for internal traversal

The original git-shortlog command parsed the output of
git-log, and the logic went something like this:

  1. Read stdin looking for "author" lines.

  2. Parse the identity into its name/email bits.

  3. Apply mailmap to the name/email.

  4. Reformat the identity into a single buffer that is our
     "key" for grouping entries (either a name by default,
     or "name <email>" if --email was given).

The first part happens in read_from_stdin(), and the other
three steps are part of insert_one_record().
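
Steps 2-4 can be sketched in standalone C roughly as follows. This is a simplified illustration, not the code in builtin/shortlog.c: toy_map_entry and toy_make_key() are hypothetical names, the "mailmap" is a single exact-match entry keyed on the email, and the parsing is far less careful than split_ident_line().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical one-entry mailmap: maps an old email to a canonical identity. */
struct toy_map_entry {
	const char *old_email;
	const char *new_name;
	const char *new_email;
};

/*
 * Steps 2-4 for one "Name <email>" identity: split it, apply the toy
 * mailmap, and write the grouping key into out. Returns 0 on success,
 * -1 if the identity cannot be parsed or the key does not fit.
 */
static int toy_make_key(const struct toy_map_entry *map, const char *ident,
			int with_email, char *out, size_t outlen)
{
	const char *lt, *gt, *name, *email;
	size_t namelen, maillen;
	int n;

	assert(ident != NULL && out != NULL);

	/* Step 2: locate the <email> part and the name before it. */
	lt = strchr(ident, '<');
	gt = lt ? strchr(lt, '>') : NULL;
	if (!lt || !gt)
		return -1;
	name = ident;
	namelen = (size_t)(lt - ident);
	while (namelen > 0 && name[namelen - 1] == ' ')
		namelen--;
	email = lt + 1;
	maillen = (size_t)(gt - email);

	/* Step 3: replace the identity if the email matches the map entry. */
	if (map && strlen(map->old_email) == maillen &&
	    !strncmp(map->old_email, email, maillen)) {
		name = map->new_name;
		namelen = strlen(name);
		email = map->new_email;
		maillen = strlen(email);
	}

	/* Step 4: format the key, with the address only when requested. */
	if (with_email)
		n = snprintf(out, outlen, "%.*s <%.*s>",
			     (int)namelen, name, (int)maillen, email);
	else
		n = snprintf(out, outlen, "%.*s", (int)namelen, name);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The with_email flag plays the role of --email: without it, two identities that share a name collapse into the same key.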

When we do an internal traversal, we just swap out the stdin
read in step 1 for reading the commit objects ourselves.
Prior to 2db6b83d18 (shortlog: replace hand-parsing of
author with pretty-printer, 2016-01-18), that made sense; we
still had to parse the ident in the commit message.

But after that commit, we use pretty.c's "%an <%ae>" to get
the author ident (for simplicity). Which means that the
pretty printer is doing a parse/format under the hood, and
then we parse the result, apply the mailmap, and format the
result again.

Instead, we can just ask pretty.c to do all of those steps
for us (including the mailmap via "%aN <%aE>", and not
formatting the address when --email is missing).
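
One way the format choice could look, as a hedged sketch: the helper name toy_ident_format() and its two flags are assumptions made for illustration, while "%aN"/"%aE" and the committer variants "%cN"/"%cE" are the mailmap-respecting pretty placeholders.

```c
/*
 * Pick a pretty format for the grouping key. Leaving out the address
 * when with_email is 0 means pretty.c never formats it at all.
 * The function and its parameters are illustrative only.
 */
static const char *toy_ident_format(int use_committer, int with_email)
{
	if (use_committer)
		return with_email ? "%cN <%cE>" : "%cN";
	return with_email ? "%aN <%aE>" : "%aN";
}
```

The string returned here would be handed to the pretty printer, and its output used directly as the key, with no second parse.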

And then we can push steps 2-4 into read_from_stdin(). This
speeds up "git shortlog -ns" on linux.git by about 3%, and
eliminates a leak in insert_one_record() of the namemailbuf
strbuf.

Signed-off-by: Jeff King 
---
 builtin/shortlog.c | 56 ++
 1 file changed, 35 insertions(+), 21 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 43c4799ea9..e29875b843 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -52,26 +52,8 @@ static void insert_one_record(struct shortlog *log,
  const char *oneline)
 {
    struct string_list_item *item;
-   const char *mailbuf, *namebuf;
-   size_t namelen, maillen;
-   struct strbuf namemailbuf = STRBUF_INIT;
-   struct ident_split ident;
 
-   if (split_ident_line(&ident, author, strlen(author)))
-       return;
-
-   namebuf = ident.name_begin;
-   mailbuf = ident.mail_begin;
-   namelen = ident.name_end - ident.name_begin;
-   maillen = ident.mail_end - ident.mail_begin;
-
-   map_user(&log->mailmap, &mailbuf, &maillen, &namebuf, &namelen);
-   strbuf_add(&namemailbuf, namebuf, namelen);
-
-   if (log->email)
-       strbuf_addf(&namemailbuf, " <%.*s>", (int)maillen, mailbuf);
-
-   item = string_list_insert(&log->list, namemailbuf.buf);
+   item = string_list_insert(&log->list, author);
 
    if (log->summary)
        item->util = (void *)(UTIL_TO_INT(item) + 1);
@@ -114,9 +96,33 @@ static void insert_one_record(struct shortlog *log,
    }
 }
 
+static int parse_stdin_author(struct shortlog *log,
+                              struct strbuf *out, const char *in)
+{
+   const char *mailbuf, *namebuf;
+   size_t namelen, maillen;
+   struct ident_split ident;
+
+   if (split_ident_line(&ident, in, strlen(in)))
+       return -1;
+
+   namebuf = ident.name_begin;
+   mailbuf = ident.mail_begin;
+   namelen = ident.name_end - ident.name_begin;
+   maillen = ident.mail_end - ident.mail_begin;
+
+   map_user(&log->mailmap, &mailbuf, &maillen, &namebuf, &namelen);
+   strbuf_add(out, namebuf, namelen);
+   if (log->email)
+       strbuf_addf(out, " <%.*s>", (int)maillen, mailbuf);
+
+   return 0;
+}
+
 static void read_from_stdin(struct shortlog *log)
 {
    struct strbuf author = STRBUF_INIT;
+   struct strbuf mapped_author = STRBUF_INIT;
    struct strbuf oneline = STRBUF_INIT;
    static const char *author_match[2] = { "Author: ", "author " };
    static const char *committer_match[2] = { "Commit: ", "committer " };
@@ -134,9 +140,15 @@ static void read_from_stdin(st

Re: [PATCH] shortlog: skip format/parse roundtrip for internal traversal

2017-09-10 Thread René Scharfe
Am 08.09.2017 um 11:21 schrieb Jeff King:
> Note that the non-stdin path no longer looks at the "mailmap" entry of
> "struct shortlog" (instead we use the one cached inside pretty.c). But
> we still waste time loading it. I'm not sure if it's worth addressing
> that. It's only once per program invocation, and it's a little tricky to
> fix (we do shortlog_init() before we know whether or not we're using
> stdin). We could just load it lazily, though, which would cover the
> stdin case.

The difference in performance and memory usage will only be measurable
with really big mailmap files.  However, it may be an opportunity for
simplifying the mailmap API in general.  Conceptually the map data
should fit into struct repository instead of being read and stored by
each user, right?

> -- >8 --
> Subject: shortlog: skip format/parse roundtrip for internal traversal
> 
> The original git-shortlog command parsed the output of
> git-log, and the logic went something like this:
> 
>1. Read stdin looking for "author" lines.
> 
>2. Parse the identity into its name/email bits.
> 
>3. Apply mailmap to the name/email.
> 
>4. Reformat the identity into a single buffer that is our
>   "key" for grouping entries (either a name by default,
>   or "name <email>" if --email was given).
> 
> The first part happens in read_from_stdin(), and the other
> three steps are part of insert_one_record().
> 
> When we do an internal traversal, we just swap out the stdin
> read in step 1 for reading the commit objects ourselves.
> Prior to 2db6b83d18 (shortlog: replace hand-parsing of
> author with pretty-printer, 2016-01-18), that made sense; we
> still had to parse the ident in the commit message.
> 
> But after that commit, we use pretty.c's "%an <%ae>" to get
> the author ident (for simplicity). Which means that the
> pretty printer is doing a parse/format under the hood, and
> then we parse the result, apply the mailmap, and format the
> result again.
> 
> Instead, we can just ask pretty.c to do all of those steps
> for us (including the mailmap via "%aN <%aE>", and not
> formatting the address when --email is missing).
> 
> And then we can push steps 2-4 into read_from_stdin(). This
> speeds up "git shortlog -ns" on linux.git by about 3%, and
> eliminates a leak in insert_one_record() of the namemailbuf
> strbuf.

Great!  Thanks for stepping back, looking at the bigger
picture and making it prettier.

> 
> Signed-off-by: Jeff King 
> ---
>   builtin/shortlog.c | 56 ++
>   1 file changed, 35 insertions(+), 21 deletions(-)

While longer, the resulting code is split up into more
digestible chunks.

René


Re: [PATCH] shortlog: skip format/parse roundtrip for internal traversal

2017-09-10 Thread Jeff King
On Sun, Sep 10, 2017 at 10:44:46AM +0200, René Scharfe wrote:

> Am 08.09.2017 um 11:21 schrieb Jeff King:
> > Note that the non-stdin path no longer looks at the "mailmap" entry of
> > "struct shortlog" (instead we use the one cached inside pretty.c). But
> > we still waste time loading it. I'm not sure if it's worth addressing
> > that. It's only once per program invocation, and it's a little tricky to
> > fix (we do shortlog_init() before we know whether or not we're using
> > stdin). We could just load it lazily, though, which would cover the
> > stdin case.
> 
> The difference in performance and memory usage will only be measurable
> with really big mailmap files.  However, it may be an opportunity for
> simplifying the mailmap API in general.  Conceptually the map data
> should fit into struct repository instead of being read and stored by
> each user, right?

Yes, I think it would be fine to have a single "the_mailmap", and in
a post-struct-repository world, that's where it should go.

-Peff