Jeff King <[email protected]> writes:
> @@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src,
> if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
> return 0;
>
> + hashcpy(pair.one, src->sha1);
> + hashcpy(pair.two, dst->sha1);
> + if (rename_cache_get(&pair, &score))
> + return score;
> +
Random thoughts.
Even though your "rename cache" could be used to reject a pairing
that the similarity estimator would otherwise give a high score, I
would imagine that in practice people would always use the mechanism
to boost the similarity score of a desired pairing. This conjecture
has a few interesting implications.
- As we track only the top NUM_CANDIDATE_PER_DST rename srcs for
  each dst (see record_if_better()), you should be able to first
  check whether any pairs with that dst exist in your rename cache,
  and iterate over those <src,dst> pairs, filling m[] with the srcs
  that appear in this particular invocation of diff.
- If you find NUM_CANDIDATE_PER_DST srcs in your rename cache, you
  wouldn't have to run estimate_similarity() at all, but that is
  very unlikely. We could, however, declare that a user-configured
  similarity boost always wins over a computed score, and skip the
  estimation for any dst for which you find an entry in the rename
  cache.
- As entries in the rename cache that record high scores name pairs
  of "similar" blobs, pack-objects may be able to take advantage of
  this information.
- If you declare that blobs A and B are similar, it is likely that
  blobs C, D, E, ... that are created by making a series of small
  tweaks to B are also similar. Would it make more sense to
  introduce the concept of a "set of similar blobs", instead of
  recording pairwise scores for (A,B), (A,C), (A,D), ..., (B,C),
  (B,D), ...? If so, the body of the per-dst loop in
  diffcore_rename() may become:
	if (we know where dst came from)
		continue;
	if (dst belongs to a known blob family) {
		for (each src in rename_src[]) {
			if (src belongs to the same blob family as dst)
				record it in m[];
		}
	}
	if (the above didn't record anything in m[]) {
		... existing estimate_similarity() code ...
	}
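The "set of similar blobs" bookkeeping above is essentially
union-find: each pairwise declaration merges two families, so
declaring A~B and then B~C transitively places A and C in the same
family without recording the (A,C) score. A toy standalone sketch of
that idea (small integer ids stand in for object names; none of
these function names are git code):

```c
#include <assert.h>

/* Toy "blob family" registry built on union-find with path halving.
 * Blobs are small integer ids here instead of object names. */
#define MAX_BLOBS 64
static int parent[MAX_BLOBS];

void family_init(void)
{
	for (int i = 0; i < MAX_BLOBS; i++)
		parent[i] = i;		/* each blob starts in its own family */
}

int family_of(int blob)
{
	while (parent[blob] != blob) {
		parent[blob] = parent[parent[blob]];	/* path halving */
		blob = parent[blob];
	}
	return blob;
}

void declare_similar(int a, int b)
{
	/* merging families makes similarity transitive */
	parent[family_of(a)] = family_of(b);
}

int same_family(int a, int b)
{
	return family_of(a) == family_of(b);
}
```

With this, the per-dst loop sketched above only needs one
family_of() lookup per dst plus a membership test per src, instead
of a full pairwise score matrix.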
Regarding your rename-and-tweak-exif photo sets, is the issue that
there are too many rename src/dst candidates and filling a large
matrix takes a lot of time, or that tweaking the EXIF data makes the
contents unnecessarily dissimilar and causes the similarity
detection to fail? As we still have the pathname in this codepath, I
am wondering if we would benefit from a custom "content hash" that
knows the nature of the payload better than the built-in similarity
estimator does, driven by the attribute mechanism (if the latter is
the case, that is).
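To illustrate what such a payload-aware "content hash" could look
like for the EXIF case: JPEG metadata lives in APPn/COM segments, so
a hash that skips those segments would be stable across EXIF tweaks.
A minimal standalone sketch, assuming a simplified JPEG layout and
using FNV-1a as a stand-in for a real hash (the function and its
name are hypothetical, not git code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over a byte range, folded into a running hash state. */
static uint64_t fnv1a(uint64_t h, const unsigned char *p, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Toy content hash for JPEG data: hash the structural and image
 * segments but skip APP0..APP15 and COM segments, where EXIF and
 * other metadata live. Real JPEGs have more corner cases.
 */
uint64_t jpeg_content_hash(const unsigned char *d, size_t len)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV offset basis */
	size_t i;

	if (len < 2 || d[0] != 0xff || d[1] != 0xd8)
		return fnv1a(h, d, len);	/* not a JPEG: hash it all */
	h = fnv1a(h, d, 2);			/* keep the SOI marker */
	i = 2;
	while (i + 4 <= len && d[i] == 0xff) {
		unsigned marker = d[i + 1];
		size_t seglen = ((size_t)d[i + 2] << 8) | d[i + 3];

		if (marker == 0xda)		/* SOS: rest is image data */
			return fnv1a(h, d + i, len - i);
		if ((marker >= 0xe0 && marker <= 0xef) || marker == 0xfe) {
			i += 2 + seglen;	/* skip APPn/COM metadata */
		} else {
			h = fnv1a(h, d + i, 2 + seglen);
			i += 2 + seglen;
		}
	}
	return fnv1a(h, d + i, len - i);
}
```

Two files whose bytes differ only inside an APP1 (EXIF) segment hash
identically, while a change to the image data changes the hash, which
is exactly the property the built-in byte-oriented similarity
estimator cannot give you.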