On 22.05.2016 00:07, Walter Bright wrote:
On 5/21/2016 2:37 PM, Timon Gehr wrote:
Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?

I don't understand the terms you use, but as to the "why" it is based on
what I knew about LZ77 compression.  I don't pretend to be an expert on
compression, and this is what I churned out. As mentioned elsewhere, the
C++ mangling scheme has a primitive and ineffective LZ77 scheme built
in, so I wouldn't waste time on that.

A quick Google search on finding the longest match in a string did not
turn up anything obvious.
...

E.g. the Knuth-Morris-Pratt (KMP) string search algorithm can be modified to compute the longest match instead of a full match (it just efficiently builds and runs a finite state machine whose state is the length of the longest match ending at the current position).

https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Implementation:

auto longestMatch(string x, string y)
{
    // Only prefixes of y up to x.length can occur in x.
    if (y.length > x.length) y = y[0 .. x.length];
    // KMP failure function: f[i] is the length of the longest proper
    // border of y[0 .. i]; f[0] = -1 serves as a sentinel (~j == 0).
    auto f = new size_t[y.length + 1];
    for (size_t i = 0, j = f[0] = -1; i < y.length; f[++i] = ++j)
        while (~j && y[i] != y[j]) j = f[j];
    // Scan x; j is the length of the prefix of y matched so far.
    auto r = x[0 .. 0];
    for (size_t i = 0, j = 0; i < x.length && j < y.length; ++i, ++j)
    {
        while (~j && x[i] != y[j]) j = f[j];
        if (j + 1 > r.length) r = x[i - j .. i + 1]; // longer match found
    }
    return r;
}

This returns a slice of x holding the leftmost longest substring of x that matches a prefix of y. Running time is O(x.length). (In practice, if this needs to be really fast, the allocation for 'f' should probably be shared among multiple calls.)
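For example, a quick check of the behavior described above (using the function as given):

unittest
{
    // "abcab" occurs in x and is the longest prefix of y that does.
    assert(longestMatch("xabcabyabc", "abcabc") == "abcab");
    // No common characters: the empty slice is returned.
    assert(longestMatch("aaa", "bbb") == "");
}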

(This probably only improves running time when there are sufficiently many sufficiently long matches.)

But this just improves longest_match. It should be possible to bring down the running time of the entire compression algorithm significantly using a suffix tree (the resulting running time bound is linear in the input size and independent of the window size).
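To illustrate the kind of query involved, here is a minimal sketch using a suffix automaton instead of a suffix tree (it answers the same "longest prefix of the lookahead occurring in the window" question, with construction linear in the window and queries linear in the match length). Maintaining the structure over a sliding window, which the linear overall bound requires, is more involved and omitted here; all names are illustrative.

struct SuffixAutomaton
{
    static struct State
    {
        size_t len;           // length of the longest substring in this class
        ptrdiff_t link = -1;  // suffix link
        size_t[char] next;    // transitions
    }
    State[] st;
    size_t last;

    this(string w)
    {
        st ~= State.init; // initial state: len 0, link -1
        foreach (c; w) extend(c);
    }

    // Standard online construction, one character at a time.
    void extend(char c)
    {
        auto cur = st.length;
        st ~= State(st[last].len + 1);
        ptrdiff_t p = last;
        while (p != -1 && c !in st[p].next)
        {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1)
            st[cur].link = 0;
        else
        {
            auto q = st[p].next[c];
            if (st[p].len + 1 == st[q].len)
                st[cur].link = q;
            else
            {
                // Split q: clone it with the shorter length.
                auto clone = st.length;
                auto copy = st[q];
                copy.len = st[p].len + 1;
                copy.next = copy.next.dup; // AAs are references; deep-copy
                st ~= copy;
                while (p != -1 && st[p].next[c] == q)
                {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

    // Length of the longest prefix of y occurring somewhere in w.
    size_t longestPrefixMatch(string y)
    {
        size_t s = 0, n = 0;
        foreach (c; y)
        {
            auto t = c in st[s].next;
            if (t is null) break;
            s = *t;
            ++n;
        }
        return n;
    }
}

unittest
{
    auto sam = SuffixAutomaton("banana");
    assert(sam.longestPrefixMatch("nanxyz") == 3); // "nan" occurs in "banana"
    assert(sam.longestPrefixMatch("xyz") == 0);
}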


If you want to throw your hat (i.e. expertise) into the ring and post a
faster compressor, please do so!

As far as I understand, the use case for this is compressing mangled names? I don't think that generating huge mangled names explicitly just to compress them afterwards is a viable strategy. Mangling first and compressing afterwards throws away a lot of structural information that could be used for fast and effective compression.

Instead, compression should be performed while generating the mangled string. As far as I understand, mangled names only grow unmanageably large because the same symbols are mangled into them multiple times? Isn't it enough to emit references to previously embedded mangled symbols ("the i-th symbol already mangled") while generating the mangled string?
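Something along these lines (a minimal sketch of the idea; the names and the back-reference format are made up for illustration and are not DMD's actual mangling scheme):

import std.conv : to;

struct Mangler
{
    char[] buf;
    size_t[string] seen; // symbol -> index of its first occurrence

    void putSymbol(string sym)
    {
        if (auto idx = sym in seen)
        {
            // Already mangled: emit a back-reference to the i-th symbol.
            buf ~= "Q" ~ to!string(*idx);
        }
        else
        {
            auto i = seen.length;
            seen[sym] = i;
            buf ~= to!string(sym.length) ~ sym; // length-prefixed, as usual
        }
    }
}

unittest
{
    Mangler m;
    m.putSymbol("std");
    m.putSymbol("algorithm");
    m.putSymbol("std"); // repeated symbol becomes a reference
    assert(m.buf == "3std9algorithmQ0");
}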
