Hi Josh,

On Mon, 6 Feb 2017, Johannes Schindelin wrote:

> as discussed at the GitMerge, I am trying to come up with tooling that
> will allow for substantially less tedious navigation between the local
> repository, the mailing list, and what ends up in the `pu` branch.

I found a little bit more time last Friday to play with the
cross-correlation between commits in `pu` and mails in
public-inbox/git.git and it is worse than I previously assumed.

Just as a reminder: my plan was to start developing tools that will
ultimately help me as well as other contributors with the arcane mailing
list model of patch submission. And my first target was the seemingly
simple task of figuring out the mail corresponding to any given commit in
`pu` (i.e. the mail that contained the patch, and whose mail thread is
hence expected to have the entire patch review, and to which I would be
expected to respond if I find a problem with that commit).

And since it is all-too-common that the oneline is adjusted before
applying the patch, the Subject:/oneline pair is not a good candidate to
find matches.

My next best guess was that the author date would not be touched, so the
pair of Date: and authordate should make a good candidate.

My initial finding was that this is not without problems, as some mails
were sent with identical Date: lines (most likely due to bugs in the
tools, e.g. the well-known and already fixed bug in git-am, and hence
git-rebase, where it would apply all patches using the first patch's
author date), and worse: some of those mails contained actual patch series
that actually made it into Git's commit history.

But those are not the only problems.

For starters, I tried to cross-correlate *just* the commits that entered
`pu` since one week ago (git rev-list --since=1.week.ago upstream/pu) with
mails of the past month in the mailing list archive.

One obvious caveat is that RFC 2822 is ambiguous when it comes to the date
format. While it seems nice that you *can* write single-digit day numbers
as single digit if you want, or with a leading zero, or with a leading
space, it makes it impossible to get away with exact matching. I did not
really want to complicate my research by parsing the dates and normalizing
them to epoch + timezone, also because I wanted results quickly, so I simply
normalized the dates to have leading zeroes for single-digit day numbers
(that seems to work for the moment).
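
To illustrate the normalization, here is a minimal sketch as a standalone
helper. The function name normalize_day is made up for this example, and it
assumes the date starts with a day-of-week followed by a comma (RFC 2822
allows omitting that, which this sketch does not handle):

```shell
#!/bin/sh

# Hypothetical helper: pad a single-digit day-of-month in an RFC 2822
# date with a leading zero and collapse extra spaces before it.
# Dates that already carry a two-digit day pass through unchanged.
normalize_day () {
        sed -e 's/^\([A-Za-z]*,\) *\([1-9]\) /\1 0\2 /'
}

printf '%s\n' 'Thu,  9 Feb 2017 21:53:52 +0100' | normalize_day
printf '%s\n' 'Fri, 10 Feb 2017 12:16:19 +0100' | normalize_day
```

Applying the same helper to both the Date: headers and the author dates
should make exact string comparison workable, as long as both sides agree
on the timezone notation.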

The first category of problematic commits comes as no surprise: merges. We
do not even have a way to represent them as mails. I simply excluded them
from the remainder of this study.

The second category should not be all that surprising, either: Junio often
adjusts the release notes without sending those patches out for review.
Those commits are:

363588f (### match next, Junio C Hamano 2017-02-17)
2076907 (Git 2.12-rc2, Junio C Hamano 2017-02-17)
076c053 (Hopefully the final batch of mini-topics before the final,
        Junio C Hamano 2017-02-16)
ae86372 (Revert "reset: add an example of how to split a commit into two",
        Junio C Hamano 2017-02-16)
d09b692 (A bit more for -rc2, Junio C Hamano 2017-02-15)

There is a third category, and this one *does* come as a surprise to me.
It appears that at least *some* patches' Date: lines are either ignored or
overridden or changed on their way from the mailing list into Git's commit
history. There was only one commit in that commit range:

3c0cb0c (read_loose_refs(): read refs using resolve_ref_recursively(),
        Michael Haggerty 2017-02-09)

This one was committed with an author date "Thu, 09 Feb 2017 21:53:52
+0100" but it appears that there was no mail sent to the Git mailing list
with that particular Date: header and the *actual* mail containing the
patch was sent with a Date: header "Fri, 10 Feb 2017 12:16:19 +0100"
(Message-ID:
d8e906d969700acbca8dc717673d0a9cdc910f62.1486724698.git.mhag...@alum.mit.edu).

It is labor-intensive, but possible to find the correlation manually in
this case because the Subject: line has been left intact.
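
For such cases, a Subject:-based fallback could be sketched roughly as
follows. This is only an illustration, not something I ran as part of the
study: the helper name strip_subject_prefix is made up, and it assumes the
Subject: header is not folded across several lines.

```shell
#!/bin/sh

# Hypothetical helper: reduce a Subject: header line to the bare oneline by
# dropping the header name and any leading "[...]" tags such as "[PATCH 1/5]".
strip_subject_prefix () {
        sed -e 's/^[Ss]ubject:[ ]*//' -e 's/^\(\[[^]]*\] *\)*//'
}

# The "[PATCH v2 3/5]" tag below is made up for illustration.
printf '%s\n' \
        'Subject: [PATCH v2 3/5] read_loose_refs(): read refs using resolve_ref_recursively()' |
strip_subject_prefix
```

The result can then be compared against the output of
"git show -s --format=%s $commit". It only helps when the oneline was not
edited before the patch was applied.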

However, this points to a serious problem with my approach: I try to
re-create information that is actually not available (which Message-ID
corresponds to a given commit name). Since that information is not
available, it is quite possible that this information cannot be retrieved
accurately (and Michael's commit demonstrates that this is not a merely
theoretical consideration). I do not know that I can fix this on my side.

> P.S.: I used public-inbox.org links instead of commit references to the
> Git repository containing the mailing list archive, because the format
> of said Git repository is so unfavorable that it was determined very
> quickly in a discussion between Patrick Reynolds (GitHub) and myself
> that it would put totally undue burden on GitHub to mirror it there
> (compare also Carlos Nieto's talk at GitMerge titled "Top Ten Worst
> Repositories to host on GitHub").

Since the main problem was the unfavorable commit history structure, I
*think* that it may be possible to auto-process public-inbox.org/git.git
into a frequently-rewritten branch that squashes all commits from past
years into single, per-year commits (and likewise for recent months and
past days, plus a single commit accumulating the current day's commits).
That may solve the problematic structure. The blob names would remain
identical to what is on public-inbox, of course.

Ciao,
Johannes

P.S.: The *mini* scripts I used were

cat >generate-date-index.sh <<\EOF
#! /bin/sh

cd public-inbox-git

since_commit="$1"
test -n "$since_commit" ||
since_commit=$(git rev-list --since=1.month.ago master --reverse | head -n 1)
for sha1 in $(git diff --raw --no-abbrev $since_commit..master | cut -f 4 -d \ )
do
        printf '%s\t%s\n' \
                "$(git cat-file blob $sha1 |
                sed -n \
                        -e 's/^Date:[   ]*\([^,]*,\) *\([1-9] .*\)/\1 0\2/p' \
                        -e 's/^Date:[   ]*\([^,]*,\) *\([0-9][0-9] .*\)/\1 \2/p' \
                        -e '/^$/q')" \
                $sha1
done | less -S
EOF

to generate a file date-index.txt containing "date\tblob" pairs where the blob
refers to the SHA-1 of the mail in public-inbox/git.git, and

cat >match-pu.sh <<\EOF
#! /bin/sh

for commit in $(git rev-list --since=1.week.ago --no-merges upstream/pu)
do
        date="$(git show -s --format=%aD $commit |
                sed 's/, \([1-9]\) /, 0\1 /')" # fix up Git's idea of RFC 2822
        mail_id=$(grep "^$date" date-index.txt | cut -f 2) # blob is the second tab-separated field
        case "$mail_id" in
        '')
                echo "ERROR: no mail found for $commit (date $date)" >&2
                git show -s --pretty='tformat:%h (%s, %an %ad)' --date=short \
                        $commit >&2
                ;;
        *' '*)
                echo "ERROR: multiple candidates found for $commit ($mail_id)" >&2
                ;;
        *)
                echo "$date $mail_id"
                ;;
        esac
done
EOF

to try to match the author dates with the ones in date-index.txt. The
obvious next improvement is to list also Message-ID in date-index.txt.
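
A rough sketch of that improvement, reading one mail on stdin and printing
"date\tmessage-id". The function name date_and_message_id and the sample
Message-ID are made up for this example, and it assumes that neither header
is folded across several lines:

```shell
#!/bin/sh

# Hypothetical helper: print "<date>\t<message-id>" for the mail on stdin,
# looking only at the header block (everything before the first empty line).
date_and_message_id () {
        awk '
                /^$/ { exit }   # an empty line ends the headers; END still runs
                !have_d && tolower($0) ~ /^date:/ {
                        d = $0; sub(/^[^:]*:[ \t]*/, "", d); have_d = 1
                }
                !have_m && tolower($0) ~ /^message-id:/ {
                        m = $0; sub(/^[^:]*:[ \t]*/, "", m); have_m = 1
                }
                END { printf "%s\t%s\n", d, m }
        '
}

# Made-up sample mail; the "Date:" line in the body must be ignored.
printf '%s\n' \
        'From: someone@example.invalid' \
        'Date: Fri, 10 Feb 2017 12:16:19 +0100' \
        'Message-ID: <sample.1@example.invalid>' \
        '' \
        'Date: not a header' |
date_and_message_id
```

In generate-date-index.sh this would take the place of the sed invocation,
i.e. "git cat-file blob $sha1 | date_and_message_id", with the day-number
fix-up still applied to the date field.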
