On 19 March 2013 04:18, pintuxgu <[email protected]> wrote:
> I have an idea for resynchronizing matches between files and I'm curious what 
> you think of it. I'm not sure about the quality of this idea, nor of the 
> feasibility of implementation.
>
> The basic algorithm is this:
>
> 1). Take a limited number of random (sub)strings from file A (for example 
> 20, one every 5% of the file).
> 2). Make 20 linked lists of matches against file B. (If the samples are 
> "unique" enough, the lists stay short.)
> 3). Expand all unique matches (linked lists with 2 items) maximally. (Is this 
> the "snake" in action?)
> 4). Mark all the matched text found, and the matched links between the files.
> 5). Repeat step 1 once (twice, recursively?), but not for the whole file, 
> only for the parts between 2 previous matches.
> 6). We now have a maximum of 20x20=400 (8000 for 3 passes...) matches 
> between the files.
> 7). All text not marked yet is now assumed to be only in file A or in file B.
> 8). Maybe use a different algorithm for further refinement.
>
> Some thoughts:
> - This works beautifully if, for example, the order of complete functions is 
> changed in a source file.
> - I think it's relatively fast because of the limited number of scans through 
> the files.
> - Files which don't match at all can easily be recognized (if the samples in 
> step 1 are chosen properly).
> - Make a binary tree of text snippets marked "matched", "only in A", "only 
> in B"?
> - How difficult would it be to implement this?
> - Lots of room for all kinds of optimisations
>         - Don't read file A.
>         - Smart "guesses" in file B.
>         - Offsets from start or end of line for matches.
>         - Etc.
> - Maybe a variant of this idea can be integrated into the existing matching 
> algorithms.
> - I'm curious what you think of this; maybe I'm just being silly.
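
For concreteness, here's roughly how I read steps 1-5. This is only a
sketch in Python - every name and parameter in it is mine - and for
simplicity it keeps anchors in file order, so as written it wouldn't
catch your moved-function case:

import random

def find_anchors(a, b, samples=20, sample_len=16, depth=2, seed=0):
    """Return (pos_a, pos_b, length) anchor matches between strings a and b."""
    rng = random.Random(seed)

    def expand(pos_a, pos_b, a_lo, a_hi, b_lo, b_hi):
        # Step 3: grow a unique match outwards while the texts stay equal
        # (essentially the "snake" you're asking about).
        lo_a, lo_b = pos_a, pos_b
        while lo_a > a_lo and lo_b > b_lo and a[lo_a - 1] == b[lo_b - 1]:
            lo_a, lo_b = lo_a - 1, lo_b - 1
        hi_a, hi_b = pos_a + sample_len, pos_b + sample_len
        while hi_a < a_hi and hi_b < b_hi and a[hi_a] == b[hi_b]:
            hi_a, hi_b = hi_a + 1, hi_b + 1
        return (lo_a, lo_b, hi_a - lo_a)

    def scan(a_lo, a_hi, b_lo, b_hi, depth):
        if depth == 0 or a_hi - a_lo < sample_len or b_hi - b_lo < sample_len:
            return []
        found = []
        for _ in range(samples):
            # Step 1: take a random sample from this region of file A.
            pos_a = rng.randrange(a_lo, a_hi - sample_len + 1)
            sample = a[pos_a:pos_a + sample_len]
            # Step 2: list every occurrence in the matching region of B.
            occ, i = [], b.find(sample, b_lo, b_hi)
            while i != -1:
                occ.append(i)
                i = b.find(sample, i + 1, b_hi)
            # Keep only samples that are "unique" in B (exactly one hit).
            if len(occ) == 1:
                found.append(expand(pos_a, occ[0], a_lo, a_hi, b_lo, b_hi))
        # Step 5: recurse into the gaps between consecutive anchors.
        result, prev_a, prev_b = [], a_lo, b_lo
        for pa, pb, n in sorted(set(found)):
            if pa >= prev_a and pb >= prev_b:  # drop overlapping/crossing hits
                result += scan(prev_a, pa, prev_b, pb, depth - 1)
                result.append((pa, pb, n))
                prev_a, prev_b = pa + n, pb + n
        result += scan(prev_a, a_hi, prev_b, b_hi, depth - 1)
        return result

    # Step 7: anything not covered by the returned anchors is only in A or B.
    return scan(0, len(a), 0, len(b), depth)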

It's difficult to tell how such an algorithm would work in practice.
However, it does look like you're doing a fair bit of early
optimisation that may well not be necessary. The random sampling is
going to cause issues because in the vast majority of cases, a diff is
between two files that are largely identical; this is why many diff
algorithm optimisations are based on unique matching lines and
similar constructs.
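
To make that concrete, this is roughly what the "unique lines" trick
looks like in Python (a patience-diff-style sketch; the function name
is mine):

from collections import Counter

def unique_line_anchors(a_lines, b_lines):
    """Lines occurring exactly once in each file make reliable sync points."""
    count_a, count_b = Counter(a_lines), Counter(b_lines)
    pos_b = {line: j for j, line in enumerate(b_lines)}
    return [(i, pos_b[line]) for i, line in enumerate(a_lines)
            if count_a[line] == 1 and count_b[line] == 1]

Unlike random sampling, that's deterministic, costs a couple of linear
passes, and exploits the common case where the two files are mostly
identical.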

I like that you'd have a fairly natural method of detecting moved
chunks, but then that sort of thing can probably be retrofitted onto
existing algorithms.
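
As a naive illustration of such a retrofit, you could run an ordinary
diff and then pair up deleted blocks in A with identical inserted
blocks in B (this uses Python's difflib; the pairing heuristic is
entirely mine and ignores near-matches):

import difflib

def detect_moves(a_lines, b_lines):
    """Report (text, pos_in_a, pos_in_b) for blocks deleted in A that
    reappear verbatim somewhere in B."""
    sm = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    deleted, inserted = {}, {}
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("delete", "replace"):
            deleted[tuple(a_lines[i1:i2])] = i1
        if tag in ("insert", "replace"):
            inserted[tuple(b_lines[j1:j2])] = j1
    return [(text, deleted[text], inserted[text])
            for text in deleted.keys() & inserted.keys() if text]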

Anyway, if you're interested, I'd encourage you to take a look at the
various Myers diff papers - most are freely available - and check out
some other resources. The author of the diff-match-patch library has
some good material on pre- and post-processing strategies for diffing
(see http://neil.fraser.name/writing/diff/).
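
If you want to play with that, a minimal example of the semantic
post-processing pass (assuming the Python port's diff_match_patch
module is on your path):

from diff_match_patch import diff_match_patch

dmp = diff_match_patch()
diffs = dmp.diff_main("The quick brown fox.", "The quick red fox.")
dmp.diff_cleanupSemantic(diffs)  # post-process: merge trivial equalities
for op, text in diffs:           # op: -1 = delete, 0 = equal, 1 = insert
    print(op, repr(text))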

cheers,
Kai