On 22.05.2016 00:07, Walter Bright wrote:
On 5/21/2016 2:37 PM, Timon Gehr wrote:
Why is longest_match Ω(nm) instead of O(n+m) (e.g. KMP)?

I don't understand the terms you use, but as to the "why" it is based on
what I knew about LZ77 compression.  I don't pretend to be an expert on
compression, and this is what I churned out. As mentioned elsewhere, the
C++ mangling scheme has a primitive and ineffective LZ77 scheme built
in, so I wouldn't waste time on that.

A quick Google search on finding the longest match in a string did not
turn up anything obvious.
...

E.g. the Knuth-Morris-Pratt (KMP) string search algorithm can be modified to compute the longest match instead of a full match (it just efficiently builds and runs a finite state machine whose state is the length of the longest match ending at the current position).

https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Implementation:

auto longestMatch(string x, string y)
{
    // Only prefixes of y up to x.length can occur in x.
    if (y.length > x.length) y = y[0 .. x.length];
    // KMP failure function: f[i] is the length of the longest proper
    // border of y[0 .. i]; f[0] = -1 serves as a sentinel (~j == 0).
    auto f = new size_t[y.length + 1];
    for (size_t i = 0, j = f[0] = -1; i < y.length; f[++i] = ++j)
        while (~j && y[i] != y[j]) j = f[j];
    // Scan x; j is the length of the prefix of y matched so far.
    auto r = x[0 .. 0];
    for (size_t i = 0, j = 0; i < x.length && j < y.length; ++i, ++j)
    {
        while (~j && x[i] != y[j]) j = f[j];
        if (j + 1 > r.length) r = x[i - j .. i + 1]; // longer match found
    }
    return r;
}

This returns a slice of x holding the leftmost longest substring of x that matches a prefix of y. Running time is O(x.length). (In practice, if this needs to be really fast, the allocation for 'f' should probably be shared among multiple calls.)
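For example, a quick check of the behavior described above (using the function as given):

unittest
{
    // "abcab" occurs in x and is the longest prefix of y that does.
    assert(longestMatch("xabcabyabc", "abcabc") == "abcab");
    // No common characters: the empty slice is returned.
    assert(longestMatch("aaa", "bbb") == "");
}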

(This probably only improves running time when there are sufficiently many sufficiently long matches.)

But this just improves longest_match. It should be possible to bring down the running time of the entire compression algorithm significantly using a suffix tree (the resulting running time bound is linear in the input size and independent of the window size).
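To illustrate the kind of query involved, here is a minimal sketch using a suffix automaton instead of a suffix tree (it answers the same "longest prefix of the lookahead occurring in the window" question, with construction linear in the window and queries linear in the match length). Maintaining the structure over a sliding window, which the linear overall bound requires, is more involved and omitted here; all names are illustrative.

struct SuffixAutomaton
{
    static struct State
    {
        size_t len;           // length of the longest substring in this class
        ptrdiff_t link = -1;  // suffix link
        size_t[char] next;    // transitions
    }
    State[] st;
    size_t last;

    this(string w)
    {
        st ~= State.init; // initial state: len 0, link -1
        foreach (c; w) extend(c);
    }

    // Standard online construction, one character at a time.
    void extend(char c)
    {
        auto cur = st.length;
        st ~= State(st[last].len + 1);
        ptrdiff_t p = last;
        while (p != -1 && c !in st[p].next)
        {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1)
            st[cur].link = 0;
        else
        {
            auto q = st[p].next[c];
            if (st[p].len + 1 == st[q].len)
                st[cur].link = q;
            else
            {
                // Split q: clone it with the shorter length.
                auto clone = st.length;
                auto copy = st[q];
                copy.len = st[p].len + 1;
                copy.next = copy.next.dup; // AAs are references; deep-copy
                st ~= copy;
                while (p != -1 && st[p].next[c] == q)
                {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

    // Length of the longest prefix of y occurring somewhere in w.
    size_t longestPrefixMatch(string y)
    {
        size_t s = 0, n = 0;
        foreach (c; y)
        {
            auto t = c in st[s].next;
            if (t is null) break;
            s = *t;
            ++n;
        }
        return n;
    }
}

unittest
{
    auto sam = SuffixAutomaton("banana");
    assert(sam.longestPrefixMatch("nanxyz") == 3); // "nan" occurs in "banana"
    assert(sam.longestPrefixMatch("xyz") == 0);
}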


If you want to throw your hat (i.e. expertise) into the ring and post a
faster compressor, please do so!

As far as I understand, the use case for this is compressing mangled names? I don't think that generating huge mangled names explicitly just to compress them afterwards is a viable strategy. Mangling first and compressing afterwards throws away a lot of structural information that could be used for fast and effective compression.

Instead, compression should be performed while generating the mangled string. As far as I understand, mangled names only grow unmanageably large because the same symbols are mangled into them multiple times? Isn't it enough to emit references to previously embedded mangled symbols ("the i-th symbol already mangled") while generating the mangled string?
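Something along these lines (a minimal sketch of the idea; the names and the back-reference format are made up for illustration and are not DMD's actual mangling scheme):

import std.conv : to;

struct Mangler
{
    char[] buf;
    size_t[string] seen; // symbol -> index of its first occurrence

    void putSymbol(string sym)
    {
        if (auto idx = sym in seen)
        {
            // Already mangled: emit a back-reference to the i-th symbol.
            buf ~= "Q" ~ to!string(*idx);
        }
        else
        {
            auto i = seen.length;
            seen[sym] = i;
            buf ~= to!string(sym.length) ~ sym; // length-prefixed, as usual
        }
    }
}

unittest
{
    Mangler m;
    m.putSymbol("std");
    m.putSymbol("algorithm");
    m.putSymbol("std"); // repeated symbol becomes a reference
    assert(m.buf == "3std9algorithmQ0");
}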
