On Fri, Apr 1, 2022 at 4:37 PM Stefan Sperling <s...@elego.de> wrote:
> Yes, roughly, Patience diff involves two algorithms: the grouping of
> lines along similar-line boundaries performed by Patience, and an
> LCS for parts of the files which Patience cannot deal with by itself.
>
> But LCS needs to complete its work before Patience can do its thing,
> and vice versa. For a given input, each algorithm might run multiple
> times, switching whenever the current algorithm cannot make any progress.
> So if LCS still has a fundamental performance issue for some inputs,
> adding Patience diff probably won't help. It could help in cases where
> LCS is now fed with different inputs as determined by Patience, such
> that bad inputs for LCS are avoided. But I don't know if that would
> always be the case.
Yes, I suppose this is the case: Patience feeds different (smaller) pieces to LCS.

As far as I understand, Myers' way of calculating the LCS is fundamentally "somewhat" quadratic (according to Myers' 1986 paper [1], titled "An O(ND) Difference Algorithm and Its Variations"). That is: O(ND), where N is the sum of the lengths of both sides of the diff and D is the size of the minimal diff. That's why eliminating the common prefix and suffix (added as an optimization years ago) is useful: it makes N much smaller.

Now, in cases where both texts are huge (even after prefix/suffix stripping) but the minimal diff is small, the algorithm should in theory be fast (because D is small). I'm just not sure what is going wrong then in our variation of this algorithm. Or have we implemented a different algorithm from the one described by Myers?

Anyway, if Patience diff is able to split up the work for LCS into smaller pieces (each with its own small O(ND) cost), then I can totally understand that this can be faster.

[1] http://www.xmailserver.org/diff2.pdf

--
Johan
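P.S. To make the O(ND) point concrete, here is a minimal Python sketch of the greedy edit-distance pass from the Myers paper, plus the prefix/suffix trimming that shrinks N. This is only an illustration of the algorithm as described in [1], not Subversion's actual implementation; the function names are mine.

```python
def strip_common(a, b):
    """Trim the common prefix and suffix of two sequences.

    This is the N-shrinking optimization mentioned above: the stripped
    middles are what the O(ND) pass actually has to process.
    """
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    j = 0
    while j < len(a) - i and j < len(b) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[i:len(a) - j], b[i:len(b) - j]


def myers_distance(a, b):
    """Length D of the shortest edit script between a and b.

    The greedy algorithm from Myers 1986: O((N+M) * D) time, so it is
    fast when the minimal diff D is small, regardless of input length.
    """
    n, m = len(a), len(b)
    max_d = n + m
    # v[max_d + k] = furthest x reached so far on diagonal k = x - y
    # (+3 slack so the k+1 lookup is safe for degenerate inputs)
    v = [0] * (2 * max_d + 3)
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[max_d + k - 1] < v[max_d + k + 1]):
                x = v[max_d + k + 1]       # step down: insertion
            else:
                x = v[max_d + k - 1] + 1   # step right: deletion
            y = x - k
            # follow the "snake": consume matching elements for free
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[max_d + k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

If both texts share a huge identical region, `strip_common` removes it before `myers_distance` ever runs; and even on long stripped middles, the loop terminates after D rounds, which is why a small minimal diff should in theory stay cheap.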