I've got my specialized match-finder working (it minimizes cache misses while processing blocks of size 2**17). Now I find that it is slow, mainly because it finds *all* matches, even those that "obviously" are not good candidates for encoding.
It seems to me that the match finder is missing two parameters: 1) the current four offsets, which have very low encoding cost (many bits fewer than other nearby offsets); 2) for each of the four match lengths 2, 3, 4, and 5, the maximum offset that still yields a savings when encoded, ignoring the possibility of special savings from using one of the four most-recent offsets. For instance, gzip won't even consider any offset greater than 4096 ("TOO_FAR") for gzip's minimum match length of 3. If the match finder knows that info, then it can avoid or suppress the "obviously" bad candidates. Comments? --