Re: [Jprogramming] LZ77

bill lam Thu, 18 Sep 2014 02:46:55 -0700

Thanks Raul,

I followed you suggestion to use 2 generations of hash vector,
Its speed is faster now although still far behind zlib.so.


Ср, 17 сен 2014, Raul Miller написал(а):
> I would ignore the stuff about "hashes" and "hash chains" and "singly
> linked". Hashes are nice sometimes but other times they're just a
> disaster. Sometimes that's better than the alternative, but their
> marketing is sometimes significantly better than their actual
> behavior.
> 
> Anyways, what's really happening here is an exact match search and
> some kind of constraint on the size of the table.
> 
> I think the best way to approach that in J is to search in sorted
> lists. The problem here is the incremental update. You don't want to
> re-sort a large list every time you add something to the list.
> 
> The way to approach that, I think, is to have several generations of
> lists, with each larger than the last, and each slower to be update
> than the last.
> 
> So, for example, have on list which is unsorted. Use i. to look things
> up in it, and merge it into the next list when it gets too long (maybe
> when it has 30 items in it).
> 
> The next list gets sorted every time it gets stuff added to it. Use I.
> to look things up in it. Merge it into the next list when it gets too
> long (maybe when it has 1000 items in it).
> 
> Same for the next list, but this one, when it gets too long, you
> arbitrarily discard items from it.
> 
> Now the other thing that's happening, here, is that J doesn't really
> do all that well with incremental update algorithms. You'll still be
> slow if you do things "one group of three" at a time. But -
> hypothetically at least - you don't have to. The ideal would be to
> come up with an algorithm for a block of text which complies (at least
> in terms of data structures) with the rfc and which behaves reasonably
> close to the behavior of the rfc described behavior.
> 
> But having a slow implementation that works is a really important
> step. That (with some test cases - like the one you described here)
> gives you a baseline to test against.
> 
> Hopefully this helps?
> 
> Thanks,
> 
> -- 
> Raul
> 
> 
> 
> 
> On Wed, Sep 17, 2014 at 5:23 AM, bill lam <[email protected]> wrote:
> > The LZ77 variation used in deflate is at the bottom (copied from rfc
> > 1951).  I tried to translate it into J using a hash function which
> > interprets XYZ as the lower 3 bytes of a 4 byte integer.  My J
> > implementation worked but efficient is rather poor.  I think the
> > inefficiency of my implementation came from the need to
> > 1. build a vector of pre-computed hash of size equal to sliding
> > window. (although the hash of all bytes had been pre-computed)
> > 2. find the last index of the hash of current XYZ in the vector.
> > (since the left argument changes, special code of the form m&i: does not 
> > apply)
> >
> > If data are truly random, the chance of matching would be very small,
> > thus the time spent should roughly the sum of the above 1 & 2 for
> > every byte, however it took a very long time even for data size of
> > only about 500KB.  Any idea for a better J implementation.
> >
> > NB. for deflate, minimum length for match is 3, max length is 285,
> > maximum distance is 32K.
> >
> > eg. LZ77 compression of the input 500#'abc'
> > is
> > aa(285,1)(213,1)b(285,1)(214,1)c(285,1)(214,1)
> >
> > inside each bracket is (length,distance) pair applied to the  history
> > buffer. Untested but should be something like this.
> >
> > (text from rfc)
> > The compressor uses a chained hash table to find duplicated strings,
> >    using a hash function that operates on 3-byte sequences.  At any
> >    given point during compression, let XYZ be the next 3 input bytes to
> >    be examined (not necessarily all different, of course).  First, the
> >    compressor examines the hash chain for XYZ.  If the chain is empty,
> >    the compressor simply writes out X as a literal byte and advances one
> >    byte in the input.  If the hash chain is not empty, indicating that
> >    the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
> >    same hash function value) has occurred recently, the compressor
> >    compares all strings on the XYZ hash chain with the actual input data
> >    sequence starting at the current point, and selects the longest
> >    match.
> >
> >    The compressor searches the hash chains starting with the most recent
> >    strings, to favor small distances and thus take advantage of the
> >    Huffman encoding.  The hash chains are singly linked. There are no
> >    deletions from the hash chains; the algorithm simply discards matches
> >    that are too old.  To avoid a worst-case situation, very long hash
> >    chains are arbitrarily truncated at a certain length, determined by a
> >    run-time parameter.
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm

-- 
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] LZ77

Reply via email to