On 09-Jan-16, Justin M. Keyes wrote: > On Sat, Jan 9, 2016 at 7:29 PM, Olaf Dabrunz <o...@fctrace.org> wrote: > > On 09-Jan-16, Olaf Dabrunz wrote: > >> Note that the existing diff tools that we all use every day bring in > >> heuristical speed improvements, and do pre- and postprocessing to > >> improve the results. > > > > To clarify: the pre- and post-processing improve the result of the diff, > > regardless of whether any speedup heuristics are used in the comparison > > itself. > > > > Basically, the pre-processing makes arrangements to skip uninteresting > > lines when comparing the texts, so that the relative weight of matches > > between interesting lines increases, making it more likely to match-up > > interesting lines. > > > > The post-processing chooses between several same-weight solutions, and > > tries to select a solution that is typically more likely to represent > > the actual change. > > Olaf, what you describe is very exciting. Why not post a pull request?
Thank you! I am excited as well. :) A pull request would mean that licensing issues need to be resolved, so parts of the code need to be re-written from scratch. Please see below. > What exactly do you mean by "license issues": is some part of the code > proprietary, or GPL? I cannot write short answers. Please bear with me. -- LCS Implementation (The Core Algorithm) The LCS implementation that I use at the moment, that is the core algorithm of the diff, is a verbatim copy of a file from a GPL project. This needs to be replaced with a different file under a different license. I am not 100% sure (IANAL), but I believe the interface could be re-used. It consists of a few #defines and a single function call. I want to re-implement this from Myers' 1985 paper, and I have implemented and analyzed LCS algorithms before, and adapted them for research purposes (including Myers' famous algorithm). The boundary conditions and corner cases need to be done right and tested well. They are real brain teasers. But - It is difficult to ignore the additional ideas that went into the GPL project's implementation of Myers' LCS algorithm, and which contribute to the quality of its results. I have not seen any papers published about these, so a re-implementation must probably be based on a write-down of these ideas. They include: - two optional speed heuristics, one is used by default: much faster on larger files - a simple way to create several instances of the algorithm for the comparison of different kinds of sequences (not required for vim though) - pre- and post-processing stages that improve the diff results -- these are not part of the core algorithm, but the post-processing is included in my integration code, see below - an elegant, readable and versatile implementation I have already written descriptions of how these ideas work, except for the pre-processing part. But I fear that a re-implementation would be similar to the GPLed implementation. But I may be wrong. :) - It takes time to implement this and to get it right, testing, debugging ... - Most of the other implementations I saw add ideas that are not really helpful, at least not for our purposes. Some even add unnecessary overhead. They are also more difficult to adapt or re-use, some rely on features in some other implementation language, and they are not as tried and tested. That being said, the implementation Bram mentioned in his mail seems to be worth a closer look. It may be the best re-implementation so far, and may be a good basis for a C version. But it should also include the speed heuristics and the pre- and post- processing done in the GPLed project. -- Vim Integration Code The integration code is mostly written by me, aimed at inclusion in vim, when usable for everyone. It also contains 100 - 150 lines of code adapted from the GPL project (and about 2400 new lines of code, not counting large comments). The GPLed parts need to be rewritten to make all of it available under the vim license. Also, the pre-processing needs to be implemented (and adapted to use my "string sequences".) I wrote the patch within one month last year, mainly because it was there in my head and it eventually needed to get written. I also needed a fast lcs() function for several reasons, I was using a Perl module before that. I do not think I should publish any of this unless I can publish all of it under a palatable license, to avoid license confusion. But since this keeps coming up, the purpose of my mail was to show that there is work in progress here. Also, it would be interesting to know what people here think about it, and how the license issues should be sorted out. By the way, I can also use the code as a back end for vim diff: the patch implements the vim function lcs() that I can call from a small vim script function that can be invoked by 'diffexpr'. It works and is relatively fast, but since the pre-processing is not implemented yet, calling aforementioned well-known diff program still gives better results, at least in some cases. Other than that, the lcs() interface exposes all configuration options of the diff algorithm and the post-processing. And it adds the optional tokenizer layer. When the implementation is more complete (pre-processing included, adapted code rewritten), a direct integration into vim's diff code becomes interesting. Bram already noted that the LCS results can directly be written into diff_T blocks. I have worked with diff_T blocks for another patch (for my research project, see below), so I do not expect any major problems with this part of the integration. And my code already supports the comparison of buffer text and the comparison of characters in two strings (read: comparison of two lines). The update strategy code needs to be added though, as noted by Bram: keep calling the diff for every update, unless that takes too long, then fall back on the faster workarounds. I also have one or two ideas in that area. -- Diff Research The last part I was writing about is about proprietary research. This is also not ready for publishing yet. I mentioned this mainly to give Bram a reason to keep the code that interfaces with an external diff program, because he said it could probably be dropped. If my research is any good, and the results look promising so far (and I am already using the prototype for my daily work), then there will be an external diff at some point and people may want to call it from vim. -- Olaf Dabrunz (oda <at> fctrace.org) -- -- You received this message from the "vim_dev" maillist. Do not top-post! Type your reply below the text you are replying to. For more information, visit http://www.vim.org/maillist.php --- You received this message because you are subscribed to the Google Groups "vim_dev" group. To unsubscribe from this group and stop receiving emails from it, send an email to vim_dev+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.