On 19 March 2013 04:18, pintuxgu <[email protected]> wrote:
> I have an idea for resynchronizing matches between files and I'm curious what
> you think of it. I'm not sure about the quality of this idea, nor of the
> feasibility of implementation.
>
> The basic algorithm is this:
>
> 1). Take a limited number of random (sub)strings from file A (20, every 5%,
> for example).
> 2). Make 20 linked lists of matches with file B. (Samples "unique" enough,
> short lists.)
> 3). Expand all unique matches (linked lists with 2 items) maximally. (Is this
> the "snake" in action?)
> 4). Mark all the matched text found, and the matched links between the files.
> 5). Repeat step 1 once (twice, recursively?), but not for the whole file:
> only for the parts between two previous matches.
> 6). We now have a maximum of 20x20 = 400 matches between the files
> (8000 for 3 passes...).
> 7). All text not yet marked is now assumed to be only in file A or only in
> file B.
> 8). Maybe use a different algorithm for further refinement.
>
> Some thoughts:
> - This works beautifully if, for example, the order of complete functions is
> changed in a source file.
> - I think it's relatively fast because of the limited number of scans through
> the files.
> - Files which don't match at all can easily be recognised (if the samples in
> step 1 are chosen properly).
> - Make a binary tree with text snippets marked "matched", "only in A",
> "only in B"?
> - How difficult would it be to implement this?
> - Lots of room for all kinds of optimisations:
>   - Don't read file A.
>   - Smart "guesses" in file B.
>   - Offsets from start or end of line for matches.
>   - Etc.
> - Maybe a variant of this idea can be integrated into the existing matching
> algorithms.
> - I'm curious what you think of this; maybe I'm just being silly.
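If I'm reading steps 1-3 right, a minimal Python sketch might look like the
following. All names and parameters here (sample_matches, sample_len, the
brute-force find loop) are my own invention, purely to make the discussion
concrete; this isn't anything from an existing implementation:

    import random

    def sample_matches(a, b, samples=20, sample_len=16, seed=None):
        """Rough sketch of steps 1-3: pull random substrings from A,
        list where each occurs in B, and keep the unique hits.
        The parameter values are illustrative guesses, not tuned."""
        rng = random.Random(seed)
        matches = []
        for _ in range(samples):
            if len(a) <= sample_len:
                break
            start = rng.randrange(len(a) - sample_len)
            needle = a[start:start + sample_len]
            # Step 2: collect every position of the sample in B
            # (a plain list standing in for the linked list).
            hits = []
            i = b.find(needle)
            while i != -1:
                hits.append(i)
                i = b.find(needle, i + 1)
            # Step 3: only expand samples that match exactly once in B.
            if len(hits) == 1:
                matches.append(expand(a, b, start, hits[0], sample_len))
        return matches

    def expand(a, b, ai, bi, length):
        """Grow a unique match left and right as far as the texts
        agree (roughly the 'snake' of the Myers algorithm)."""
        while ai > 0 and bi > 0 and a[ai - 1] == b[bi - 1]:
            ai, bi, length = ai - 1, bi - 1, length + 1
        while (ai + length < len(a) and bi + length < len(b)
               and a[ai + length] == b[bi + length]):
            length += 1
        return (ai, bi, length)
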
It's difficult to tell how such an algorithm would work in practice. However,
it does look like you're doing a fair bit of premature optimisation that may
well not be necessary. The random sampling is going to cause issues because,
in the vast majority of cases, a diff is between two files that are largely
identical; this is why many diff-algorithm optimisations are based on unique
matching lines and similar constructs. I like that you'd have a fairly
natural method of detecting moved chunks, but that sort of thing can probably
be retrofitted onto existing algorithms.

Anyway, if you're interested, I'd encourage you to take a look at the various
Myers diff papers - most are freely available - and check out some other
resources. The author of the diff-match-patch library has some good material
on pre- and post-processing strategies for diffing (see
http://neil.fraser.name/writing/diff/).

cheers,
Kai
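P.S. To show what I mean by unique matching lines: this is roughly the
anchoring trick that patience diff uses. A toy version (my own sketch, not
Meld's actual code):

    from collections import Counter

    def unique_line_anchors(a_lines, b_lines):
        """Pair up lines that occur exactly once in each file, as
        (index_in_a, index_in_b) anchors. Patience diff then keeps the
        longest increasing subsequence of these and recurses on the
        regions between them."""
        count_a = Counter(a_lines)
        count_b = Counter(b_lines)
        pos_a = {line: i for i, line in enumerate(a_lines)}
        pos_b = {line: i for i, line in enumerate(b_lines)}
        return [(pos_a[line], pos_b[line])
                for line in a_lines
                if count_a[line] == 1 and count_b[line] == 1]

Because the anchors are unique on both sides, they're cheap to find and very
unlikely to be spurious, which is exactly what random sampling can't
guarantee.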
