Hi Philip,

On Wed, 9 Oct 2019, Philip Oakley wrote:

> On 08/10/2019 13:46, Johannes Schindelin wrote:
> > Hi Junio,
> >
> > On Tue, 8 Oct 2019, Junio C Hamano wrote:
> >
> > > Johannes Schindelin <johannes.schinde...@gmx.de> writes:
> > >
> > > > > I didn't quite understand this part, though.
> > > > >
> > > > >      The default creation factor is 60 (roughly speaking, it wants 60%
> > > > >      of
> > > > >      the lines to match between two patches, otherwise it considers
> > > > >      the
> > > > >      patches to be unrelated).
> > > > >
> > > > > Would the updated creation factor used which is 95 (roughly
> > > > > speaking) want 95% of the lines to match between two patches?
> > > > >
> > > > > That would make the matching logic even pickier and reject more
> > > > > paring, so I must be reading the statement wrong X-<.
> > > > No, I must have written the opposite of what I tried to say, is all.
> > > So, cfactor of 60 means at most 60% is allowed to differ and the
> > > two patches are still considered to be related, while 95 means only
> > > 5% needs to be common?  That would make more sense to me.
> > Okay, I not only wrote the opposite of what I wanted to say, I also
> > misremembered.
> >
> > When `range-diff` tries to determine matching pairs of patches, it
> > builds an `(m+n)x(m+n)` cost matrix, where `m` is the number of patches
> > in the first commit range and `n` is the number of patches in the second
> > one.
> >
> > Why not `m x n`? Well, that's the obvious matrix, and that's what it
> > starts with, essentially assigning the number of lines of the diff
> > between the diffs as "cost".
> >
> > But then `git range-diff` extends the cost matrix to allow for _all_ of
> > the `m` patches to be considered deleted, and _all_ of the `n` patches
> > to be added. As cost, it cannot use a "diff of diffs" because there is
> > no second diff. So it uses the number of lines of the one diff it has,
> > multiplied by the creation factor interpreted as a percentage.
> >
> > The naive creation factor would be 100%, which is (almost) as if we
> > assumed an empty diff for the missing diff. But that would make the
> > range-diff too eager to dismiss rewrites, as experience obviously showed
> > (not my experience, but Thomas Rast's, who came up with `tbdiff` after
> > all): the diff of diffs includes a diff header, for example.
> >
> > The interpretation I offered (although I inverted what I wanted to say)
> > is similar in spirit to that metric (which is not actually a metric, I
> > believe, because I expect it to violate the triangle inequality) is
> > obviously inaccurate: the number of lines of the diff of diffs does not
> > say anything about the number of matching lines, quite to the contrary,
> > it correlates somewhat to the number of non-matching lines.
> >
> > So a better interpretation would have been:
> >
> >  The default creation factor is 60 (roughly speaking, it wants at
> >  most 60% of the diffs' lines to differ, otherwise it considers
> >  them not to be a match.
> >
> > This is still inaccurate, but at least it gets the idea of the
> > range-diff across.
> >
> > Of course, I will never be able to amend the commit message in
> > GitGitGadget anyway, as I have merged that PR already.
> >
> > Ciao,
> > Dscho
> Medium term, is this something that could go in the algorithms section of the
> range-diff man page, especially if the upstream commit message is already in
> place.
>
> #leftoverdocs ?

Sure. How about giving it a try while our memory is still fresh? You
would help me immensely if you could take that task off of my plate...

Thanks,
Dscho

Reply via email to