I would ignore the stuff about "hashes", "hash chains" and "singly linked". Hashes are nice sometimes, but other times they're just a disaster. Sometimes they're still better than the alternative, but their marketing is significantly better than their actual behavior.
Anyways, what's really happening here is an exact match search and some kind of constraint on the size of the table. I think the best way to approach that in J is to search in sorted lists. The problem there is the incremental update: you don't want to re-sort a large list every time you add something to it.

The way to approach that, I think, is to have several generations of lists, each larger than the last, and each slower to update than the last. So, for example, have one list which is unsorted. Use i. to look things up in it, and merge it into the next list when it gets too long (maybe when it has 30 items in it). The next list gets sorted every time stuff gets added to it. Use I. to look things up in it. Merge it into the next list when it gets too long (maybe when it has 1000 items in it). Same for the next list, except that when this one gets too long, you arbitrarily discard items from it. (There's a rough sketch of this below, after the quoted message.)

Now the other thing that's happening here is that J doesn't really do all that well with incremental update algorithms. You'll still be slow if you do things one group of three bytes at a time. But - hypothetically at least - you don't have to. The ideal would be an algorithm which works on a whole block of text at a time, complies (at least in terms of data structures) with the rfc, and behaves reasonably close to the behavior the rfc describes. (There's a starting point for that below, too.)

But having a slow implementation that works is a really important step. That (with some test cases - like the one you described here) gives you a baseline to test against.

Hopefully this helps?

Thanks,

--
Raul

On Wed, Sep 17, 2014 at 5:23 AM, bill lam <[email protected]> wrote:
> The LZ77 variation used in deflate is at the bottom (copied from rfc
> 1951). I tried to translate it into J using a hash function which
> interprets XYZ as the lower 3 bytes of a 4 byte integer. My J
> implementation worked but efficiency is rather poor. I think the
> inefficiency of my implementation came from the need to
> 1. build a vector of pre-computed hash of size equal to the sliding
> window. (although the hash of all bytes had been pre-computed)
> 2. find the last index of the hash of the current XYZ in the vector.
> (since the left argument changes, special code of the form m&i: does
> not apply)
>
> If data are truly random, the chance of matching would be very small,
> thus the time spent should be roughly the sum of the above 1 & 2 for
> every byte; however it took a very long time even for a data size of
> only about 500KB. Any idea for a better J implementation?
>
> NB. for deflate, minimum length for match is 3, max length is 285,
> maximum distance is 32K.
>
> eg. LZ77 compression of the input 500#'abc'
> is
> aa(285,1)(213,1)b(285,1)(214,1)c(285,1)(214,1)
>
> inside each bracket is a (length,distance) pair applied to the history
> buffer. Untested but should be something like this.
>
> (text from rfc)
> The compressor uses a chained hash table to find duplicated strings,
> using a hash function that operates on 3-byte sequences. At any
> given point during compression, let XYZ be the next 3 input bytes to
> be examined (not necessarily all different, of course). First, the
> compressor examines the hash chain for XYZ. If the chain is empty,
> the compressor simply writes out X as a literal byte and advances one
> byte in the input.
> If the hash chain is not empty, indicating that
> the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
> same hash function value) has occurred recently, the compressor
> compares all strings on the XYZ hash chain with the actual input data
> sequence starting at the current point, and selects the longest
> match.
>
> The compressor searches the hash chains starting with the most recent
> strings, to favor small distances and thus take advantage of the
> Huffman encoding. The hash chains are singly linked. There are no
> deletions from the hash chains; the algorithm simply discards matches
> that are too old. To avoid a worst-case situation, very long hash
> chains are arbitrarily truncated at a certain length, determined by a
> run-time parameter.
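
PS - here's the rough sketch of the generational idea I mentioned
above. It's untested, collapsed to two generations to keep it short,
and the verb names (has, add) and the 30 and 1000 thresholds are just
placeholders:

   NB. G0: small unsorted list, searched linearly with i.
   NB. G1: larger list, kept sorted, searched by binary search with I.
   G0 =: G1 =: 0 $ 0

   has =: 3 : 0
     if. (G0 i. y) < #G0 do. 1 return. end.  NB. found among recent items
     if. 0 = #G1 do. 0 return. end.
     j =. G1 I. y                      NB. how many items of G1 are <= y
     (j > 0) *. y = (0 >. j - 1) { G1  NB. is y actually at index j-1?
   )

   add =: 3 : 0
     G0 =: G0 , y
     if. 30 < #G0 do.        NB. generation 0 too long: merge and sort
       G1 =: /:~ G1 , G0
       G0 =: 0 $ 0
     end.
     if. 1000 < #G1 do.      NB. too long: arbitrarily discard items
       G1 =: 1000 {. G1
     end.
     y
   )

For example:

   add"0 ] 3 1 4 1 5 9 2 6
3 1 4 1 5 9 2 6
   has 4
1
   has 7
0

For deflate you'd want to carry positions alongside the packed 3-byte
values, so that a hit also tells you the distance, but the generational
mechanics would be the same.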
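And for the whole-block direction, something like this finds every
position whose 3-byte sequence also occurred earlier, for the entire
input at once. Again untested; text is a placeholder for the input,
i. finds the earliest prior occurrence where deflate prefers the most
recent one, and the 32K window limit is ignored:

   h =. 256 #. 3 ]\ a. i. text  NB. each 3-byte window packed into one integer
   f =. h i. h                  NB. position of first occurrence of each value
   c =. I. f < i. # h           NB. positions which repeat an earlier window
   d =. c - c { f               NB. distance back to that earlier occurrence

Packing the bytes this way is exact (24 bits fit in an integer), so
unlike the rfc's hash there are no collisions to re-check. From there
you'd still need to extend each 3-byte match to its full length and
emit literals for the rest, but the per-byte work happens inside the
primitives instead of in an interpreted loop, which is, I suspect,
where your time was going.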
