> > but now i get to step into the git xdiff code and see !!
here's the call to the xdiff implementation:
887 mustbe0(xdl_recs_cmp(&dd1, 0, dd1.nrec, &dd2, 0,
dd2.nrec,
888 kvdf, kvdb, (xp.flags & XDF_NEED_MINIMAL)
!= 0,
889 &xenv));
git added other implementations (patience, histogram) and patched different
infrastructure on for those.
but i think this is the core xdiff implementation that uses the first-class
infrastructure in the xdiff structures.
all the variables from xdf1 and xdf2 have been copied into dd1 and dd2:
878 dd1.nrec = xe->xdf1.nreff;
879 dd1.ha = xe->xdf1.ha;
880 dd1.rchg = xe->xdf1.rchg;
881 dd1.rindex = xe->xdf1.rindex;
882 dd2.nrec = xe->xdf2.nreff;
883 dd2.ha = xe->xdf2.ha;
884 dd2.rchg = xe->xdf2.rchg;
885 dd2.rindex = xe->xdf2.rindex;
So basically they are presenting the data to xdiff as if all the preprocessed
lines never existed.
878 dd1.nrec = xe->xdf1.nreff; // this is the number of
lines in file 1
879 dd1.ha = xe->xdf1.ha; // this is the "hash" of
every line. it's actually an index into a hash list, same thing
880 dd1.rchg = xe->xdf1.rchg; // this is the change vector
-- the algorithm sets 1 to indicate the line does not match
881 dd1.rindex = xe->xdf1.rindex; // this is the vector of
pointers to line data including the actual content
The goal of the algorithm is to correctly fill rchg, which is its output.
In this buggy case, rchg already has some 1s in it (hummm i could just set it
to 0s and see if there's a difference ... well if i do that i'd better not stop
my debugging session in case it doesn't help!) maybe try that in a bit unsure
(gdb) s
xdl_recs_cmp (dd1=0x7ffff4b08880, off1=0, lim1=2, dd2=0x7ffff4b088c0, off2=0,
lim2=1, kvdf=0x50c000000290, kvdb=0x50c0000002c0, need_min=1,
xenv=0x7ffff4b08840) at
/home/karl3/projects/zinc/third_party/xdiff/xdiffi.c:259
259 unsigned long const *ha1 = dd1->ha, *ha2 = dd2->ha;
(gdb) list
254 * (marking changed lines) is done in the two boundary reaching checks.
255 */
256 int xdl_recs_cmp(diffdata_t *dd1, long off1, long lim1,
257 diffdata_t *dd2, long off2, long lim2,
258 long *kvdf, long *kvdb, int need_min, xdalgoenv_t
*xenv) {
259 unsigned long const *ha1 = dd1->ha, *ha2 = dd2->ha;
260
261 /*
262 * Shrink the box by walking through each diagonal snake (SW
and NE).
263 */
This function is supported by xdiffi.c it's kind of the focus of the xdiff
library. (previously i was working with xprepare.c which sets up all those
structures).
i kind of don't expect it to be that long and complex, but i do expect to be
highly confused by it.
let's list some more lines
261 /*
262 * Shrink the box by walking through each diagonal snake (SW
and NE).
263 */
(gdb) list
264 for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2];
off1++, off2++);
265 for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2
- 1]; lim1--, lim2--);
266
267 /*
268 * If one dimension is empty, then all records on the other one
must
269 * be obviously changed.
270 */
271 if (off1 == lim1) {
272 char *rchg2 = dd2->rchg;
273 long *rindex2 = dd2->rindex;
(gdb) list
274
275 for (; off2 < lim2; off2++)
276 rchg2[rindex2[off2]] = 1;
277 } else if (off2 == lim2) {
278 char *rchg1 = dd1->rchg;
279 long *rindex1 = dd1->rindex;
280
281 for (; off1 < lim1; off1++)
282 rchg1[rindex1[off1]] = 1;
283 } else {
(gdb) list
284 xdpsplit_t spl;
285 spl.i1 = spl.i2 = 0;
286
287 /*
288 * Divide ...
289 */
290 if (xdl_split(ha1, off1, lim1, ha2, off2, lim2, kvdf,
kvdb,
291 need_min, &spl, xenv) < 0) {
292
293 return -1;
(gdb) list
294 }
295
296 /*
297 * ... et Impera.
298 */
299 if (xdl_recs_cmp(dd1, off1, spl.i1, dd2, off2, spl.i2,
300 kvdf, kvdb, spl.min_lo, xenv) < 0 ||
301 xdl_recs_cmp(dd1, spl.i1, lim1, dd2, spl.i2, lim2,
302 kvdf, kvdb, spl.min_hi, xenv) < 0) {
303
(gdb) list
304 return -1;
305 }
306 }
307
308 return 0;
309 }
see? not very long or complex-looking. it looks like its recursive!
I'm surprised to encounter a recursive function here, I remember when I wrote
recursive functions before I knew about state stacks ... didn't learn too much
new after that ... but i used to get a lot of stack overflows, surprised now to
see recursive functions
261 /*
262 * Shrink the box by walking through each diagonal snake (SW
and NE).
263 */
(gdb) list
264 for (; off1 < lim1 && off2 < lim2 && ha1[off1] == ha2[off2];
off1++, off2++);
265 for (; off1 < lim1 && off2 < lim2 && ha1[lim1 - 1] == ha2[lim2
- 1]; lim1--, lim2--);
"snakes" are a term from myers first paper on LCS. i don't immediately recall
what it refers to, and it's quite important for understanding the terminology
of these algorithms (and this is likely that first one as it is still very
popular) ... i think a snake is a set of matching lines that can be thought of
as a diagonal path through a grid of potential line matchings, one file's line
numbers on one axis, and one file's line numbers on another.
so in 264 and 265 above, it's looking for starting and ending lines that match,
and increasing off and decreasing lim to narrow the problem domain
after these lines it then runs some quick checks for unchangedness, if either
pair of offs and lims end up touching. we won't have that condition as the
window is only partly filled. the ending lines will mismatch; the condition
ha1[lim1 - 1] == ha2[lim2 - 1] will be false.
i guess i'd better step into that and verify i'm right!