On Fri, Apr 1, 2022 at 4:37 PM Stefan Sperling <s...@elego.de> wrote:
> Yes, roughly, Patience diff involves two algorithms, the grouping of
> lines along similar-line boundaries performed by Patience, and an
> LCS for parts of the files which Patience cannot deal with by itself.
>
> But LCS needs to complete its work before Patience can do its thing,
> and vice-versa. For a given input, each algorithm might run multiple
> times, switching whenever the current algorithm cannot make any progress.
> So if LCS still has a fundamental performance issue for some inputs,
> adding Patience diff probably won't help. It could help in cases where
> LCS is now fed with different inputs as determined by Patience, such
> that bad inputs for LCS are avoided. But I don't know if that would
> always be the case.

Yes, I suppose this is the case: Patience feeds different (smaller)
inputs to LCS. Because, as far as I understand, Myers' way of
calculating the LCS is quadratic in the worst case (according
to the Myers paper from 1986 [1], titled "An O(ND) Difference
Algorithm and Its Variations").
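To make the O(ND) shape concrete, here is a minimal sketch of the
forward pass from that paper, computing just the length D of the
shortest edit script (insertions + deletions). This is an illustrative
toy in Python, not Subversion's actual implementation; the point is
that the outer loop runs D times, so small diffs are cheap even when
N is large:

```python
def myers_distance(a, b):
    """Length of the shortest edit script between sequences a and b,
    per the forward pass of Myers' O(ND) algorithm (sketch only)."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] = furthest x reached so far on diagonal k = x - y
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # pick the better of moving down (insert) or right (delete)
            if k == -d or (k != d and v.get(k - 1, 0) < v.get(k + 1, 0)):
                x = v.get(k + 1, 0)       # move down: insertion
            else:
                x = v.get(k - 1, 0) + 1   # move right: deletion
            y = x - k
            # follow the "snake": a diagonal run of matches is free
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

Running it on the paper's own example, myers_distance("ABCABBA",
"CBABAC") gives D = 5.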

That is: O(ND), where N = the sum of the lengths of both sides of the
diff and D = the size of the minimal diff. That's why eliminating the
common prefix and suffix (added as an optimization years ago) is
useful: it makes N much smaller.
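That prefix/suffix optimization is simple enough to sketch. A
hypothetical helper (for illustration only, not our actual code) that
trims the shared ends before the O(ND) core runs:

```python
def strip_common_affixes(a, b):
    """Trim the common prefix and suffix of sequences a and b, so the
    expensive diff core only sees the middle where they differ.
    Illustrative helper; not Subversion's implementation."""
    # Common prefix
    p = 0
    while p < len(a) and p < len(b) and a[p] == b[p]:
        p += 1
    # Common suffix, taking care not to overlap the prefix
    s = 0
    while s < len(a) - p and s < len(b) - p and a[-1 - s] == b[-1 - s]:
        s += 1
    return a[p:len(a) - s], b[p:len(b) - s]
```

For two long files that differ only in a small middle region, this
shrinks N to roughly the size of that region.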

Now, in cases where both texts are huge (even after prefix/suffix
stripping) but the minimal diff is small, the algorithm should in
theory be fast (because D is small). I'm just not sure what is going
wrong in our variant of this algorithm. Or have we implemented a
different algorithm than the one described by Myers?

Anyway, if Patience diff is able to split up the work for LCS into
smaller pieces (each with its own small O(ND) cost), then I can
certainly see how that would be faster.
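The splitting step itself can be sketched too. As I understand the
Patience approach, it takes lines that occur exactly once in each
file, keeps the longest subsequence of those matches that stays in
order on both sides (via patience sorting), and the gaps between
consecutive anchors are the smaller subproblems handed to LCS. A toy
version, assuming that description (not Subversion's actual code):

```python
from collections import Counter
import bisect

def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) anchor pairs: lines unique to
    both files, restricted to a longest in-order chain on both sides.
    The regions between consecutive anchors go to the LCS core.
    Illustrative sketch only."""
    ca, cb = Counter(a), Counter(b)
    pos_b = {line: i for i, line in enumerate(b)}
    # Candidate pairs, in order of appearance in a
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if ca[line] == 1 and cb.get(line, 0) == 1]
    # Longest increasing subsequence of b-indices ("patience sorting"):
    # piles[k] holds the current top card of pile k; back links record
    # the top of the previous pile at placement time.
    piles, back = [], {}
    for ia, ib in pairs:
        k = bisect.bisect_left([p[1] for p in piles], ib)
        back[(ia, ib)] = piles[k - 1] if k > 0 else None
        if k == len(piles):
            piles.append((ia, ib))
        else:
            piles[k] = (ia, ib)
    # Walk the back links from the top of the last pile
    out, node = [], piles[-1] if piles else None
    while node is not None:
        out.append(node)
        node = back[node]
    return out[::-1]
```

Each gap between two anchors is a much smaller (N, D) instance, which
matches the intuition above about why the combination can win.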

[1] http://www.xmailserver.org/diff2.pdf

-- 
Johan
