On 06/29/2018 10:28 PM, Stefan Beller wrote:
> [...]
> Adds some threshold to avoid expensive cases, like:
>
> ```
> #!python
> open('a', 'w').write(" \n" * 1000000)
> open('b', 'w').write(" \n" * 1000001)
> ```
>
> The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
> makes diff much slower.
> [...]
> +/*
> + * For indentation heuristic, skip searching for better slide position after
> + * checking MAX_BORING lines without finding an improvement. This defends the
> + * indentation heuristic logic against pathological cases. The value is not
> + * picked scientifically but should be good enough.
> + */
> +#define MAX_BORING 100
This is an interesting case, and a speed difference of almost a factor of five seems impressive. But this is a pretty pathological case, isn't it? And I'm pretty sure that the algorithm is `O(N)` both before and after this change. Remember that to find `earliest_end` and `g.end`, there has already been a scan through all 1000000 lines. In other words, you're not improving how the overall algorithm scales with `N`; you're only changing the constant factor in front. So it's a little bit questionable whether it is worth complicating the code for this unusual case.

But *if* we want to improve this case, I think that we could be smarter about it.

By the time we get to this point in the code, we already know that there is a "slider" hunk of length `M` (`groupsize`) that can be slid up or down within a range of `N` (`g.end - earliest_end + 1`) possible positions. The interesting case here is `N ≫ M`, because then naively the number of positions to try out is a lot bigger than the size of the hunk itself. (In the case described above, `N` is 1000000 and `M` is 1.)

But how can that situation even arise? Remember, a hunk can only be slid down by a line if the first line *after* the hunk is identical to the first line *of* the hunk. It follows that if you shift a hunk down `M` lines, then it has the same contents as when you started; you've just rotated all of the hunk lines around once. So if `N ≫ M`, there is necessarily a lot of repetition among the `N + M` lines that the hunk could possibly overlay. Specifically, those lines must consist of `floor((N + M)/M)` identical copies of the hunk, plus possibly a few leftover lines constituting the start of another repetition.

Given this large amount of repetition, it seems to me that there is never a need to scan more than the bottom `M + 1` possible positions [1] plus the highest possible position [2] to be sure of finding the very best one. In the pathological case that you described above, where `M` is 1, only three positions have to be evaluated, not 100. (A short sketch of this idea appears after the footnotes.)

In fact, it *could* be that there is even more repetition, namely if the hunk itself contains multiple copies of an even shorter block of `K` lines. In that case, you would only have to scan `K + 1` positions at the bottom plus one at the top to be sure to find the best hunk position. This would be an interesting optimization for a case like

> open('a', 'w').write(" \n" * 1000000)
> open('b', 'w').write(" \n" * 1100000)

(`N = 1000000`, `M = 100000`, `K = 1`) or

> open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
> open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)

(`N = 3000000`, `M = 300000`, `K = 3`).

On the other hand, it's not entirely trivial to find periodicity in a group of lines (i.e., to find `K`), and I don't know offhand how that task scales with `M`. (A sketch of one standard approach appears below as well.)

Michael

[1] Actually, to be rigorously correct it might be necessary to check even a bit more than `M + 1` positions at the bottom because the heuristic looks a bit beyond the lines of the hunk.

[2] The position at the top has different predecessor lines than the other positions, so it could have a lower score than all of the others. It's worth checking it. Here too, to be rigorously correct it might be necessary to check more than one position at the top because the heuristic looks a bit beyond the lines of the hunk.
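To make the repetition argument concrete, here is a rough Python sketch in the spirit of the generators quoted above. The names `block`, `region`, and `candidates` are made up for illustration; this is not the xdiff code, just a demonstration of the invariant and of the reduced candidate set:

```
# A hunk of M lines that can occupy N positions forces the N + M lines
# it can overlay to be periodic with period M.
M = 3
N = 10
block = ["<item>", "MISSING", "</item>"]       # the hunk's contents
region = (block * ((N + M) // M + 1))[:N + M]  # floor((N+M)/M) copies + leftovers

# Sliding down by one step is only legal when the line just after the
# hunk equals its first line; chaining that over the whole slide range
# gives region[i] == region[i + M] for every position i in the range.
assert all(region[i] == region[i + M] for i in range(N))

# So only the bottom M + 1 positions plus the topmost position can
# differ in score (modulo the caveats in footnotes [1] and [2]).
candidates = sorted(set(range(max(0, N - 1 - M), N)) | {0})
print(candidates)  # at most M + 2 positions to evaluate instead of N
```

For the pathological case quoted at the top (`M` = 1), this candidate set has just three elements, matching the "only three positions" count above.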
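As for finding `K`: computing the smallest period of a sequence is a textbook application of the KMP prefix ("failure") function, which runs in time linear in the sequence length, i.e. `O(M)` here. A hedged sketch, again in plain Python, treating the hunk as a list of line strings (a real implementation would presumably operate on xdiff's line hashes instead):

```
def smallest_period(hunk_lines):
    # Returns the smallest K such that hunk_lines consists of whole
    # copies of its first K lines; returns len(hunk_lines) when the
    # hunk is not built from repetitions of a shorter block.
    # Assumes the hunk has at least one line.
    n = len(hunk_lines)
    pi = [0] * n  # pi[i] = longest proper border of hunk_lines[:i+1]
    for i in range(1, n):
        j = pi[i - 1]
        while j > 0 and hunk_lines[i] != hunk_lines[j]:
            j = pi[j - 1]
        if hunk_lines[i] == hunk_lines[j]:
            j += 1
        pi[i] = j
    p = n - pi[-1]  # smallest shift under which the hunk matches itself
    return p if n % p == 0 else n

print(smallest_period([" "] * 100000))                             # 1
print(smallest_period(["<item>", "MISSING", "</item>"] * 100000))  # 3
```

If that's right, finding `K` would not change the overall scaling at all: one `O(M)` pass over the hunk, after which only `K + 1` positions at the bottom plus one at the top need to be scored.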