work on diff code in vim (was: Re: Updating runtime files without version control)

Olaf Dabrunz Sun, 10 Jan 2016 02:11:31 -0800

On 09-Jan-16, Justin M. Keyes wrote:
> On Sat, Jan 9, 2016 at 7:29 PM, Olaf Dabrunz <o...@fctrace.org> wrote:
> > On 09-Jan-16, Olaf Dabrunz wrote:
> >> Note that the existing diff tools that we all use every day bring in
> >> heuristical speed improvements, and do pre- and postprocessing to
> >> improve the results.
> >
> > To clarify: the pre- and post-processing improve the result of the diff,
> > regardless of whether any speedup heuristics are used in the comparison
> > itself.
> >
> > Basically, the pre-processing makes arrangements to skip uninteresting
> > lines when comparing the texts, so that the relative weight of matches
> > between interesting lines increases, making it more likely to match-up
> > interesting lines.
> >
> > The post-processing chooses between several same-weight solutions, and
> > tries to select a solution that is typically more likely to represent
> > the actual change.
> 
> Olaf, what you describe is very exciting. Why not post a pull request?


Thank you!  I am excited as well. :)

A pull request would mean that licensing issues need to be resolved, so
parts of the code need to be re-written from scratch.  Please see below.

> What exactly do you mean by "license issues": is some part of the code
> proprietary, or GPL?

I cannot write short answers.  Please bear with me.


 -- LCS Implementation (The Core Algorithm)

The LCS implementation that I use at the moment, that is the core
algorithm of the diff, is a verbatim copy of a file from a GPL project.

This needs to be replaced with a different file under a different
license.  I am not 100% sure (IANAL), but I believe the interface could
be re-used.  It consists of a few #defines and a single function call.

I want to re-implement this from Myers' 1985 paper, and I have
implemented and analyzed LCS algorithms before, and adapted them for
research purposes (including Myers' famous algorithm).

The boundary conditions and corner cases need to be done right and
tested well.  They are real brain teasers.

But

    - It is difficult to ignore the additional ideas that went into the
      GPL project's implementation of Myers' LCS algorithm, and which
      contribute to the quality of its results.  I have not seen any
      papers published about these, so a re-implementation must probably
      be based on a write-down of these ideas.  They include:

        - two optional speed heuristics, one is used by default: much
          faster on larger files

        - a simple way to create several instances of the algorithm for
          the comparison of different kinds of sequences (not required
          for vim though)

        - pre- and post-processing stages that improve the diff results
          -- these are not part of the core algorithm, but the
          post-processing is included in my integration code, see below

        - an elegant, readable and versatile implementation

      I have already written descriptions of how these ideas work,
      except for the pre-processing part.

      But I fear that a re-implementation would be similar to the GPLed
      implementation.

      But I may be wrong. :)

    - It takes time to implement this and to get it right, testing,
      debugging ...

    - Most of the other implementations I saw add ideas that are not
      really helpful, at least not for our purposes.  Some even add
      unnecessary overhead.  They are also more difficult to adapt or
      re-use, some rely on features in some other implementation
      language, and they are not as tried and tested.


That being said, the implementation Bram mentioned in his mail seems to
be worth a closer look.  It may be the best re-implementation so far,
and may be a good basis for a C version.

But it should also include the speed heuristics and the pre- and post-
processing done in the GPLed project.


 -- Vim Integration Code

The integration code is mostly written by me, aimed at inclusion in vim,
when usable for everyone.  It also contains 100 - 150 lines of code
adapted from the GPL project (and about 2400 new lines of code, not
counting large comments).  The GPLed parts need to be rewritten to make
all of it available under the vim license.

Also, the pre-processing needs to be implemented (and adapted to use my
"string sequences".)

I wrote the patch within one month last year, mainly because it was
there in my head and it eventually needed to get written.  I also needed
a fast lcs() function for several reasons, I was using a Perl module
before that.

I do not think I should publish any of this unless I can publish all of
it under a palatable license, to avoid license confusion.


But since this keeps coming up, the purpose of my mail was to show that
there is work in progress here.

Also, it would be interesting to know what people here think about it,
and how the license issues should be sorted out.


    By the way, I can also use the code as a back end for vim diff: the
    patch implements the vim function lcs() that I can call from a small
    vim script function that can be invoked by 'diffexpr'.

    It works and is relatively fast, but since the pre-processing is not
    implemented yet, calling aforementioned well-known diff program
    still gives better results, at least in some cases.

    Other than that, the lcs() interface exposes all configuration
    options of the diff algorithm and the post-processing.  And it adds
    the optional tokenizer layer.

    When the implementation is more complete (pre-processing included,
    adapted code rewritten), a direct integration into vim's diff code
    becomes interesting.  Bram already noted that the LCS results can
    directly be written into diff_T blocks.  I have worked with diff_T
    blocks for another patch (for my research project, see below), so I
    do not expect any major problems with this part of the integration.

    And my code already supports the comparison of buffer text and the
    comparison of characters in two strings (read: comparison of two
    lines).

    The update strategy code needs to be added though, as noted by Bram:
    keep calling the diff for every update, unless that takes too long,
    then fall back on the faster workarounds.

    I also have one or two ideas in that area.


 -- Diff Research

The last part I was writing about is about proprietary research.

This is also not ready for publishing yet.

I mentioned this mainly to give Bram a reason to keep the code that
interfaces with an external diff program, because he said it could
probably be dropped.

If my research is any good, and the results look promising so far (and I
am already using the prototype for my daily work), then there will be
an external diff at some point and people may want to call it from vim.

-- 
Olaf Dabrunz (oda <at> fctrace.org)

-- 
-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

--- 
You received this message because you are subscribed to the Google Groups 
"vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to vim_dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

work on diff code in vim (was: Re: Updating runtime files without version control)

Raspunde prin e-mail lui