On Thu, 28 Jan 2016 16:48:22 +0100 Markus Wichmann <nullp...@gmx.net> wrote:
> On Wed, Jan 27, 2016 at 11:22:55PM +0100, Mattias Andrée > wrote: > > Perhaps I should describe how the program works > > (although it is very simple.) The documents are > > compared just like of they were words, but with > > lines rather than characters, and only add and > > remove are valid changes, not substitution. This > > is implemented using dynamic programming. It > > procedure fills in two matrices. One that describes > > how many changes have to be made to get from the > > old document to the new document. In the other it > > is registered changes have to be made. This can be > > extrapolated from the first matrix, but to keep it > > simple it is written to the another matrix. > > For any index-pair (i, j) in the matrices, the > > corresponding element describes the changes to get > > from the i first lines of old document to the j first > > lines of the new document. The next part of the > > program is just print the differences. Any time an > > edit is about to be printed, it is cached, then when > > all directly subsequent edits have been cached, it > > prints the line removals followed by the line > > additions. This is to make the output more readable. > > > > Some comments on this: > > First, we allow C99 extensions, right? C99 has not only > variadic arrays, but pointers to them as well, so your > matrix could be declared and allocated as > > size_t (*matrix)[a + 2][b + 1] = malloc(sizeof(size_t[a + > 2][b + 1])); > > Then access array elements as (*matrix)[i][j] and you're > done. This saves you the trouble of allocating the > pointer block, or allocating a single-dimensional array > with enough elements and doing the arithmetic by hand. > > Second, wikipedia describes this algorithm to initialize > the topmost line and the leftmost column with zero, not > with their indices. I am using https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm > > Third, this algorithm needs quadratic space, i.e. on a > 32-bit machine, it is incapable of comparing two 200,000 > line files (it would need 40,000,000,000 machine words > for that, and that would just die). So instead I'm going > to explain what busybox diff does: > > The two files to be compared are loaded into memory, > whereby each line is annotated with its line number > (actually, that's a lie, busybox only saves a hash of the > line and really, really hopes, there are no collisions, > but hush). Then both files are sorted by line content > first and line number second. Then a mapping is > calculated, that maps each line of the old file to its > corresponding line in the new one, if any. This is done > by merging the two files, i. e. (pseudocode) > > memset(map, -1, sizeof map) > new file pointer = first new line > for every old line: > while (*new file pointer < old line) > new file pointer ++ > if (*new file pointer == old line) > map[old line number] = new line number > > Afterwards the files are reread (to undo any > transformation that might have happened to the lines. > E.g. if "ignore whitespace" was requested, the whitespace > was actually filtered out back in step 1) and then > basically the mapping is printed: If the current old line > has no mapping, or maps to a line before the current new > line, it was deleted, if it maps to the current new line > it stayed the same, and if it maps to a new line after > the current one, all new lines between the current one > and the mapped one were added. > > I can't prove that this algorithm will conclusively > always calculate the longest common subsequence, but as > long as the common subsequence arrived at in this way is > at least up there with the longest, the number of useless > changes (where a line was removed and then added > verbatim) should be pretty low, so this would only be a > minor problem over all. But that algorithm only requires > the memory to hold both files with line numbers, and the > map. > > Comments? > > Ciao, > Markus >
pgpwYvT_khGWq.pgp
Description: OpenPGP digital signature