Re: [PATCH 2/2] pickaxe: use textconv for -S counting

2012-11-21 Thread Jeff King
On Mon, Nov 19, 2012 at 04:48:22PM -0800, Junio C Hamano wrote:

> Junio C Hamano  writes:
> 
> >> Exact renames are the obvious one, but they are not handled here.
> >
> > That is half true.  Before this change, we will find the same number
> > of needles and this function would have said "no differences" in a
> > very inefficient way.  After this change, we may apply different
> > textconv filters and this function will say "there is a difference",
> > even though we wouldn't see such a difference at the content level
> > if there wasn't any rename.
> 
> ... but I think that is a good thing anyway.
> 
> If you renamed foo.c to foo.cc with different conversions from C
> code to the text that explains what the code does, if we special case
> only the exact rename case but let pickaxe examine the converted
> result in a case where blobs are modified only by one byte, we would
> get drastically different results between the two cases.

Right, exactly. I think the only sane thing is to always textconv or
always not textconv (whether they are identical renames or not), and any
"these are the same" optimization for identical content needs to take
into account whether we _would have_ done a different textconv (which
most of the time is going to be "no", as textconv is either not in use,
or both paths use the same diff driver; but it is not too expensive to
look up).

The diff_unmodified_pair at the top of diff_flush_patch is correct,
because it treats renames as interesting (because we have to show the
diff header, anyway). I do not know offhand if we avoid feeding
identical content to xdiff at all, but if so, we should be doing so only
after checking that the textconv filters are identical.
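A minimal sketch of that rule in C, for illustration only. The types
sketch_file and sketch_pair and the function can_skip_pickaxe are
hypothetical stand-ins, not git's diff_filespec, diff_filepair, or any
existing git function. The textconv pointer plays the role of whatever
get_textconv() would return for that side (NULL when no textconv applies).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one side of a file pair; not git's diff_filespec. */
struct sketch_file {
	const char *path;      /* path on this side */
	const char *blob_id;   /* object name of the content, as a hex string */
	const void *textconv;  /* driver that would be applied, or NULL for none */
};

/* Hypothetical stand-in for git's diff_filepair. */
struct sketch_pair {
	struct sketch_file one;
	struct sketch_file two;
};

/*
 * Return 1 when pickaxe could skip the pair without loading any blob:
 * the raw content is identical and both sides would go through the same
 * textconv driver (or through none), so the text searched by -S is the
 * same on both sides.  Paths are deliberately not compared, so an exact
 * rename whose two paths share a driver is skipped as well.
 */
int can_skip_pickaxe(const struct sketch_pair *p)
{
	if (strcmp(p->one.blob_id, p->two.blob_id) != 0)
		return 0; /* content changed: must look at it */
	if (p->one.textconv != p->two.textconv)
		return 0; /* same bytes, but converted differently */
	return 1;
}
```

Under these assumptions, a pure rename from foo.c to bar.c with one shared
driver can be skipped, while a rename from foo.c to foo.cc with two
different drivers cannot.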

> Corollary to this is what should happen when you update the attributes
> between two trees so that the textconv filters for a path that did not
> change between preimage and postimage are different.  Ideally, we
> should notice that the two converted results are different, perhaps,
> but I do not like the performance implications very much.

The content to compare cannot be different unless either the input
content changed or the path changed, and we treat either as
"interesting" in most code paths. So I do not think there are any
performance implications, except that we may need to make sure to look
up textconvs a few lines sooner in some cases.

I'll re-roll the series next week and break out the rename-optimization
bits separately so it is more obvious that it is doing the right thing.

As an aside, I also need to revisit the regex half of that code, which
is still buggy (before and after my patch, due to the expecting-a-NUL
behavior we talked about a week or two ago).  That is a separate topic,
but the same area of code.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] pickaxe: use textconv for -S counting

2012-11-19 Thread Junio C Hamano
Junio C Hamano  writes:

>> Exact renames are the obvious one, but they are not handled here.
>
> That is half true.  Before this change, we will find the same number
> of needles and this function would have said "no differences" in a
> very inefficient way.  After this change, we may apply different
> textconv filters and this function will say "there is a difference",
> even though we wouldn't see such a difference at the content level
> if there wasn't any rename.

... but I think that is a good thing anyway.

If you renamed foo.c to foo.cc with different conversions from C
code to the text that explains what the code does, if we special case
only the exact rename case but let pickaxe examine the converted
result in a case where blobs are modified only by one byte, we would
get drastically different results between the two cases.

Corollary to this is what should happen when you update the attributes
between two trees so that the textconv filters for a path that did not
change between preimage and postimage are different.  Ideally, we
should notice that the two converted results are different, perhaps,
but I do not like the performance implications very much.



Re: [PATCH 2/2] pickaxe: use textconv for -S counting

2012-11-19 Thread Junio C Hamano
Jeff King  writes:

> On Tue, Nov 13, 2012 at 03:13:19PM -0800, Junio C Hamano wrote:
>
>> >  static int has_changes(struct diff_filepair *p, struct diff_options *o,
>> >                         regex_t *regexp, kwset_t kws)
>> >  {
>> > +        struct userdiff_driver *textconv_one = get_textconv(p->one);
>> > +        struct userdiff_driver *textconv_two = get_textconv(p->two);
>> > +        mmfile_t mf1, mf2;
>> > +        int ret;
>> > +
>> >          if (!o->pickaxe[0])
>> >                  return 0;
>> >
>> > -        if (!DIFF_FILE_VALID(p->one)) {
>> > -                if (!DIFF_FILE_VALID(p->two))
>> > -                        return 0; /* ignore unmerged */
>> 
>> What happened to this part that avoids showing nonsense for unmerged
>> paths?
>
> It's moved down. fill_one will return an empty mmfile if
> !DIFF_FILE_VALID, so we end up here:
>
>     fill_one(p->one, &mf1, &textconv_one);
>     fill_one(p->two, &mf2, &textconv_two);
>
>     if (!mf1.ptr) {
>             if (!mf2.ptr)
>                     ret = 0; /* ignore unmerged */
>
> Prior to this change, we didn't use fill_one, so we had to check manually.
>
>> > +        /*
>> > +         * If we have an unmodified pair, we know that the count will be the
>> > +         * same and don't even have to load the blobs. Unless textconv is in
>> > +         * play, _and_ we are using two different textconv filters (e.g.,
>> > +         * because a pair is an exact rename with different textconv attributes
>> > +         * for each side, which might generate different content).
>> > +         */
>> > +        if (textconv_one == textconv_two && diff_unmodified_pair(p))
>> > +                return 0;
>> 
>> I am not sure about this part that cares about the textconv.
>> 
>> Wouldn't the normal "git diff A B" skip the filepairs that are
>> unmodified in the first place at the object name level without even
>> looking at the contents (see e.g. diff_flush_patch())?
>
> Hmph. The point was to find the case when the paths are different (e.g.,
> in a rename), and therefore the textconvs might be different. But I
> think I missed the fact that diff_unmodified_pair will note the
> difference in paths. So just calling diff_unmodified_pair would be
> sufficient, as the code prior to my patch does.
>
> I thought the point was an optimization to avoid comparing contains() on
> the same data (which we can know will match without looking at it).

Yes.

> Exact renames are the obvious one, but they are not handled here.

That is half true.  Before this change, we will find the same number
of needles and this function would have said "no differences" in a
very inefficient way.  After this change, we may apply different
textconv filters and this function will say "there is a difference",
even though we wouldn't see such a difference at the content level
if there wasn't any rename.
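
The effect described above can be sketched in C. This is an illustration
under stated assumptions, not git's implementation: count_needle,
sketch_textconv, conv_identity, conv_upper and sketch_counts_differ are
all hypothetical names, and the "filters" are toy in-process functions,
whereas a real textconv filter is an external command configured
through attributes.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in a NUL-terminated text. */
size_t count_needle(const char *text, const char *needle)
{
	size_t n = 0, len = strlen(needle);

	if (!len)
		return 0;
	while ((text = strstr(text, needle)) != NULL) {
		n++;
		text += len;
	}
	return n;
}

/* Hypothetical in-process "textconv": converts in into out (truncating). */
typedef void (*sketch_textconv)(const char *in, char *out, size_t outsz);

/* Toy filter that leaves the content unchanged. */
void conv_identity(const char *in, char *out, size_t outsz)
{
	size_t i;

	if (!outsz)
		return;
	for (i = 0; in[i] && i + 1 < outsz; i++)
		out[i] = in[i];
	out[i] = '\0';
}

/* Toy filter that uppercases the content. */
void conv_upper(const char *in, char *out, size_t outsz)
{
	size_t i;

	if (!outsz)
		return;
	for (i = 0; in[i] && i + 1 < outsz; i++)
		out[i] = (char)toupper((unsigned char)in[i]);
	out[i] = '\0';
}

/* Report a change when the needle counts of the converted sides differ. */
int sketch_counts_differ(const char *blob_one, sketch_textconv conv_one,
			 const char *blob_two, sketch_textconv conv_two,
			 const char *needle)
{
	char text_one[256], text_two[256];

	conv_one(blob_one, text_one, sizeof(text_one));
	conv_two(blob_two, text_two, sizeof(text_two));
	return count_needle(text_one, needle) != count_needle(text_two, needle);
}
```

With the same filter on both sides, identical blobs always give equal
counts, so loading them is wasted work. With conv_identity on one side
and conv_upper on the other, the same blob can give different counts for
the needle "foo", which is the exact-rename case discussed above.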


Re: [PATCH 2/2] pickaxe: use textconv for -S counting

2012-11-14 Thread Jeff King
On Tue, Nov 13, 2012 at 03:13:19PM -0800, Junio C Hamano wrote:

> >  static int has_changes(struct diff_filepair *p, struct diff_options *o,
> >                         regex_t *regexp, kwset_t kws)
> >  {
> > +        struct userdiff_driver *textconv_one = get_textconv(p->one);
> > +        struct userdiff_driver *textconv_two = get_textconv(p->two);
> > +        mmfile_t mf1, mf2;
> > +        int ret;
> > +
> >          if (!o->pickaxe[0])
> >                  return 0;
> >
> > -        if (!DIFF_FILE_VALID(p->one)) {
> > -                if (!DIFF_FILE_VALID(p->two))
> > -                        return 0; /* ignore unmerged */
> 
> What happened to this part that avoids showing nonsense for unmerged
> paths?

It's moved down. fill_one will return an empty mmfile if
!DIFF_FILE_VALID, so we end up here:

    fill_one(p->one, &mf1, &textconv_one);
    fill_one(p->two, &mf2, &textconv_two);

    if (!mf1.ptr) {
            if (!mf2.ptr)
                    ret = 0; /* ignore unmerged */

Prior to this change, we didn't use fill_one, so we had to check manually.
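
A minimal sketch of how a NULL ptr can stand in for the old
DIFF_FILE_VALID checks. Everything here is hypothetical: sketch_mmfile
only mimics the ptr/size shape of mmfile_t, and sketch_contains and
sketch_has_changes are illustrative names. Only the "ignore unmerged"
branch comes from the snippet above; the handling of a side that exists
on one end only is an assumption about what the rest of the function
would do.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for mmfile_t: ptr is NULL when the side is missing. */
struct sketch_mmfile {
	const char *ptr;
	size_t size;
};

/* Count non-overlapping occurrences of needle in a length-bounded buffer. */
size_t sketch_contains(const struct sketch_mmfile *mf, const char *needle)
{
	size_t len = strlen(needle), i = 0, n = 0;

	if (!len || !mf->ptr || mf->size < len)
		return 0;
	while (i + len <= mf->size) {
		if (!memcmp(mf->ptr + i, needle, len)) {
			n++;
			i += len;
		} else {
			i++;
		}
	}
	return n;
}

/* Decide whether the pair is interesting to a -S style search. */
int sketch_has_changes(const struct sketch_mmfile *one,
		       const struct sketch_mmfile *two,
		       const char *needle)
{
	if (!one->ptr && !two->ptr)
		return 0; /* ignore unmerged */
	if (!one->ptr) /* assumed: created, interesting if the needle appears */
		return sketch_contains(two, needle) != 0;
	if (!two->ptr) /* assumed: removed, interesting if the needle was there */
		return sketch_contains(one, needle) != 0;
	/* both sides present: interesting when the counts differ */
	return sketch_contains(one, needle) != sketch_contains(two, needle);
}
```

In this sketch the two missing sides fall out of the pointer checks, so no
separate validity test is needed before the blobs are filled.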

> > +        /*
> > +         * If we have an unmodified pair, we know that the count will be the
> > +         * same and don't even have to load the blobs. Unless textconv is in
> > +         * play, _and_ we are using two different textconv filters (e.g.,
> > +         * because a pair is an exact rename with different textconv attributes
> > +         * for each side, which might generate different content).
> > +         */
> > +        if (textconv_one == textconv_two && diff_unmodified_pair(p))
> > +                return 0;
> 
> I am not sure about this part that cares about the textconv.
> 
> Wouldn't the normal "git diff A B" skip the filepairs that are
> unmodified in the first place at the object name level without even
> looking at the contents (see e.g. diff_flush_patch())?

Hmph. The point was to find the case when the paths are different (e.g.,
in a rename), and therefore the textconvs might be different. But I
think I missed the fact that diff_unmodified_pair will note the
difference in paths. So just calling diff_unmodified_pair would be
sufficient, as the code prior to my patch does.

I thought the point was an optimization to avoid comparing contains() on
the same data (which we can know will match without looking at it).
Exact renames are the obvious one, but they are not handled here. So I
am not sure of the point (to catch "git diff $blob1 $blob2" when the two
are identical? I am not sure at what layer we cull that from the diff
queue).

So there is room for optimization here on exact renames, but
diff_unmodified_pair is too forgiving of what is interesting (a rename
is interesting to diff_flush_patch, because it wants to mention the
rename, but it is not interesting to pickaxe, because we did not change
the content, and it could be culled here).

I don't know that it is that big a deal in general. Pure renames are
going to be the minority of blobs we look at, so it is probably not even
measurable. You could construct a pathological case (e.g., an otherwise
small repo with a 2G file, rename the 2G file without modification, then
running "git log -Sfoo" will unnecessarily load the giant blob while
examining the rename commit).

> Shouldn't this part of the code be emulating that behaviour no matter
> what textconv filter(s) are configured for these paths?

Yeah, I just missed that it is checking the path already. It may still
make sense to tighten the optimization, but that is a separate issue. It
should just check diff_unmodified_pair as before; textconv only matters
if you are trying to optimize out exact renames.

-Peff


Re: [PATCH 2/2] pickaxe: use textconv for -S counting

2012-11-13 Thread Junio C Hamano
Jeff King  writes:

> We currently just look at raw blob data when using "-S" to
> pickaxe. This is mostly historical, as pickaxe predates the
> textconv feature. If the user has bothered to define a
> textconv filter, it is more likely that their search string will be
> on the textconv output, as that is what they will see in the
> diff (and we do not even provide a mechanism for them to
> search for binary needles that contain NUL characters).

Oookay, I suppose...

>  static int has_changes(struct diff_filepair *p, struct diff_options *o,
>                         regex_t *regexp, kwset_t kws)
>  {
> +        struct userdiff_driver *textconv_one = get_textconv(p->one);
> +        struct userdiff_driver *textconv_two = get_textconv(p->two);
> +        mmfile_t mf1, mf2;
> +        int ret;
> +
>          if (!o->pickaxe[0])
>                  return 0;
>
> -        if (!DIFF_FILE_VALID(p->one)) {
> -                if (!DIFF_FILE_VALID(p->two))
> -                        return 0; /* ignore unmerged */

What happened to this part that avoids showing nonsense for unmerged
paths?

> +        /*
> +         * If we have an unmodified pair, we know that the count will be the
> +         * same and don't even have to load the blobs. Unless textconv is in
> +         * play, _and_ we are using two different textconv filters (e.g.,
> +         * because a pair is an exact rename with different textconv attributes
> +         * for each side, which might generate different content).
> +         */
> +        if (textconv_one == textconv_two && diff_unmodified_pair(p))
> +                return 0;

I am not sure about this part that cares about the textconv.

Wouldn't the normal "git diff A B" skip the filepairs that are
unmodified in the first place at the object name level without even
looking at the contents (see e.g. diff_flush_patch())?

Shouldn't this part of the code be emulating that behaviour no matter
what textconv filter(s) are configured for these paths?