Re: [Jprogramming] LZ77

bill lam Thu, 18 Sep 2014 08:53:23 -0700

AFAIK whether zlib/deflate itself is patent-free is still 
an unknown, because it has not yet trial tested.


patent for LZW expired but other variations of LZ77 might not.

Чт, 18 сен 2014, robert therriault написал(а):
> Hi Bill
> 
> Nice joke in your last two sentences. 
> 
> You wouldn't want to do this on the chance that J ends up with revenue and 
> could become a target (especially if your joke were produced as evidence). 
> Legal action ends up being very expensive to defend in time as well as money. 
> Also, things tend to work better if we play nice and it sounds like they have 
> been nice enough to provide a way to avoid the patent issue. 
> 
> Just my two bits, but your joke set off my program manager alarm bells.  :-)
> 
> cheers, bob
> 
> On Sep 18, 2014, at 3:09 AM, bill lam <[email protected]> wrote:
> 
> > Thanks. This is not urgent because zlib.so should usually
> > available.
> > 
> > Also as stated in rfc, many variation of LZ77 were patented so
> > it suggested implementors to follow the guidelines (3 bytes at a
> > time, hash etc) which is patent free.  If the improvement in 
> > speed is signaficant then we may ignore its guideline.  I guess
> > there would not be any revenue in sueing Jsoftware for damage.
> > 
> > Чт, 18 сен 2014, Raul Miller написал(а):
> >> I can try taking a look at the code, one of these days, if you want me to.
> >> 
> >> (Not right this second though.)
> >> 
> >> Thanks,
> >> 
> >> -- 
> >> Raul
> >> 
> >> 
> >> On Thu, Sep 18, 2014 at 5:45 AM, bill lam <[email protected]> wrote:
> >>> Thanks Raul,
> >>> 
> >>> I followed you suggestion to use 2 generations of hash vector,
> >>> Its speed is faster now although still far behind zlib.so.
> >>> 
> >>> Ср, 17 сен 2014, Raul Miller написал(а):
> >>>> I would ignore the stuff about "hashes" and "hash chains" and "singly
> >>>> linked". Hashes are nice sometimes but other times they're just a
> >>>> disaster. Sometimes that's better than the alternative, but their
> >>>> marketing is sometimes significantly better than their actual
> >>>> behavior.
> >>>> 
> >>>> Anyways, what's really happening here is an exact match search and
> >>>> some kind of constraint on the size of the table.
> >>>> 
> >>>> I think the best way to approach that in J is to search in sorted
> >>>> lists. The problem here is the incremental update. You don't want to
> >>>> re-sort a large list every time you add something to the list.
> >>>> 
> >>>> The way to approach that, I think, is to have several generations of
> >>>> lists, with each larger than the last, and each slower to be update
> >>>> than the last.
> >>>> 
> >>>> So, for example, have on list which is unsorted. Use i. to look things
> >>>> up in it, and merge it into the next list when it gets too long (maybe
> >>>> when it has 30 items in it).
> >>>> 
> >>>> The next list gets sorted every time it gets stuff added to it. Use I.
> >>>> to look things up in it. Merge it into the next list when it gets too
> >>>> long (maybe when it has 1000 items in it).
> >>>> 
> >>>> Same for the next list, but this one, when it gets too long, you
> >>>> arbitrarily discard items from it.
> >>>> 
> >>>> Now the other thing that's happening, here, is that J doesn't really
> >>>> do all that well with incremental update algorithms. You'll still be
> >>>> slow if you do things "one group of three" at a time. But -
> >>>> hypothetically at least - you don't have to. The ideal would be to
> >>>> come up with an algorithm for a block of text which complies (at least
> >>>> in terms of data structures) with the rfc and which behaves reasonably
> >>>> close to the behavior of the rfc described behavior.
> >>>> 
> >>>> But having a slow implementation that works is a really important
> >>>> step. That (with some test cases - like the one you described here)
> >>>> gives you a baseline to test against.
> >>>> 
> >>>> Hopefully this helps?
> >>>> 
> >>>> Thanks,
> >>>> 
> >>>> --
> >>>> Raul
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> On Wed, Sep 17, 2014 at 5:23 AM, bill lam <[email protected]> wrote:
> >>>>> The LZ77 variation used in deflate is at the bottom (copied from rfc
> >>>>> 1951).  I tried to translate it into J using a hash function which
> >>>>> interprets XYZ as the lower 3 bytes of a 4 byte integer.  My J
> >>>>> implementation worked but efficient is rather poor.  I think the
> >>>>> inefficiency of my implementation came from the need to
> >>>>> 1. build a vector of pre-computed hash of size equal to sliding
> >>>>> window. (although the hash of all bytes had been pre-computed)
> >>>>> 2. find the last index of the hash of current XYZ in the vector.
> >>>>> (since the left argument changes, special code of the form m&i: does 
> >>>>> not apply)
> >>>>> 
> >>>>> If data are truly random, the chance of matching would be very small,
> >>>>> thus the time spent should roughly the sum of the above 1 & 2 for
> >>>>> every byte, however it took a very long time even for data size of
> >>>>> only about 500KB.  Any idea for a better J implementation.
> >>>>> 
> >>>>> NB. for deflate, minimum length for match is 3, max length is 285,
> >>>>> maximum distance is 32K.
> >>>>> 
> >>>>> eg. LZ77 compression of the input 500#'abc'
> >>>>> is
> >>>>> aa(285,1)(213,1)b(285,1)(214,1)c(285,1)(214,1)
> >>>>> 
> >>>>> inside each bracket is (length,distance) pair applied to the  history
> >>>>> buffer. Untested but should be something like this.
> >>>>> 
> >>>>> (text from rfc)
> >>>>> The compressor uses a chained hash table to find duplicated strings,
> >>>>>   using a hash function that operates on 3-byte sequences.  At any
> >>>>>   given point during compression, let XYZ be the next 3 input bytes to
> >>>>>   be examined (not necessarily all different, of course).  First, the
> >>>>>   compressor examines the hash chain for XYZ.  If the chain is empty,
> >>>>>   the compressor simply writes out X as a literal byte and advances one
> >>>>>   byte in the input.  If the hash chain is not empty, indicating that
> >>>>>   the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
> >>>>>   same hash function value) has occurred recently, the compressor
> >>>>>   compares all strings on the XYZ hash chain with the actual input data
> >>>>>   sequence starting at the current point, and selects the longest
> >>>>>   match.
> >>>>> 
> >>>>>   The compressor searches the hash chains starting with the most recent
> >>>>>   strings, to favor small distances and thus take advantage of the
> >>>>>   Huffman encoding.  The hash chains are singly linked. There are no
> >>>>>   deletions from the hash chains; the algorithm simply discards matches
> >>>>>   that are too old.  To avoid a worst-case situation, very long hash
> >>>>>   chains are arbitrarily truncated at a certain length, determined by a
> >>>>>   run-time parameter.
> >>>>> ----------------------------------------------------------------------
> >>>>> For information about J forums see http://www.jsoftware.com/forums.htm
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm
> >>> 
> >>> --
> >>> regards,
> >>> ====================================================
> >>> GPG key 1024D/4434BAB3 2008-08-24
> >>> gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
> >>> gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
> >>> ----------------------------------------------------------------------
> >>> For information about J forums see http://www.jsoftware.com/forums.htm
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
> > 
> > -- 
> > regards,
> > ====================================================
> > GPG key 1024D/4434BAB3 2008-08-24
> > gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
> > gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> 
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm

-- 
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] LZ77

Reply via email to