Re: [dev] [sbase] diff

Mattias Andrée Thu, 28 Jan 2016 10:06:33 -0800

On Thu, 28 Jan 2016 16:48:22 +0100
Markus Wichmann <nullp...@gmx.net> wrote:


> On Wed, Jan 27, 2016 at 11:22:55PM +0100, Mattias Andrée
> wrote:
> > Perhaps I should describe how the program works
> > (although it is very simple.) The documents are
> > compared just like of they were words, but with
> > lines rather than characters, and only add and
> > remove are valid changes, not substitution. This
> > is implemented using dynamic programming. It
> > procedure fills in two matrices. One that describes
> > how many changes have to be made to get from the
> > old document to the new document. In the other it
> > is registered changes have to be made. This can be
> > extrapolated from the first matrix, but to keep it
> > simple it is written to the another matrix.
> > For any index-pair (i, j) in the matrices, the
> > corresponding element describes the changes to get
> > from the i first lines of old document to the j first
> > lines of the new document. The next part of the
> > program is just print the differences. Any time an
> > edit is about to be printed, it is cached, then when
> > all directly subsequent edits have been cached, it
> > prints the line removals followed by the line
> > additions. This is to make the output more readable.
> >   
> 
> Some comments on this:
> 
> First, we allow C99 extensions, right? C99 has not only
> variadic arrays, but pointers to them as well, so your
> matrix could be declared and allocated as
> 
> size_t (*matrix)[a + 2][b + 1] = malloc(sizeof(size_t[a +
> 2][b + 1]));
> 
> Then access array elements as (*matrix)[i][j] and you're
> done. This saves you the trouble of allocating the
> pointer block, or allocating a single-dimensional array
> with enough elements and doing the arithmetic by hand.
> 
> Second, wikipedia describes this algorithm to initialize
> the topmost line and the leftmost column with zero, not
> with their indices.

I am using
https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

> 
> Third, this algorithm needs quadratic space, i.e. on a
> 32-bit machine, it is incapable of comparing two 200,000
> line files (it would need 40,000,000,000 machine words
> for that, and that would just die). So instead I'm going
> to explain what busybox diff does:
> 
> The two files to be compared are loaded into memory,
> whereby each line is annotated with its line number
> (actually, that's a lie, busybox only saves a hash of the
> line and really, really hopes, there are no collisions,
> but hush). Then both files are sorted by line content
> first and line number second. Then a mapping is
> calculated, that maps each line of the old file to its
> corresponding line in the new one, if any. This is done
> by merging the two files, i. e. (pseudocode)
> 
> memset(map, -1, sizeof map)
> new file pointer = first new line
> for every old line:
>     while (*new file pointer < old line)
>         new file pointer ++
>     if (*new file pointer == old line)
>         map[old line number] = new line number
> 
> Afterwards the files are reread (to undo any
> transformation that might have happened to the lines.
> E.g. if "ignore whitespace" was requested, the whitespace
> was actually filtered out back in step 1) and then
> basically the mapping is printed: If the current old line
> has no mapping, or maps to a line before the current new
> line, it was deleted, if it maps to the current new line
> it stayed the same, and if it maps to a new line after
> the current one, all new lines between the current one
> and the mapped one were added.
> 
> I can't prove that this algorithm will conclusively
> always calculate the longest common subsequence, but as
> long as the common subsequence arrived at in this way is
> at least up there with the longest, the number of useless
> changes (where a line was removed and then added
> verbatim) should be pretty low, so this would only be a
> minor problem over all. But that algorithm only requires
> the memory to hold both files with line numbers, and the
> map.
> 
> Comments?
> 
> Ciao,
> Markus
>

pgpwYvT_khGWq.pgp
Description: OpenPGP digital signature

Re: [dev] [sbase] diff

Reply via email to