Hi Bill

Nice joke in your last two sentences. 

You wouldn't want to do this on the chance that J ends up with revenue and 
could become a target (especially if your joke were produced as evidence). 
Legal action ends up being very expensive to defend in time as well as money. 
Also, things tend to work better if we play nice, and it sounds like they 
have been nice enough to provide a way to avoid the patent issue. 

Just my two bits, but your joke set off my program manager alarm bells.  :-)

cheers, bob

On Sep 18, 2014, at 3:09 AM, bill lam <[email protected]> wrote:

> Thanks. This is not urgent because zlib.so should usually be
> available.
> 
> Also, as stated in the RFC, many variations of LZ77 were patented, so
> it suggests that implementors follow its guidelines (3 bytes at a
> time, hashing, etc.), which are patent-free.  If the improvement in
> speed is significant then we may ignore its guidelines.  I guess
> there would not be any revenue in suing Jsoftware for damages.
> 
> Thu, 18 Sep 2014, Raul Miller wrote:
>> I can try taking a look at the code, one of these days, if you want me to.
>> 
>> (Not right this second though.)
>> 
>> Thanks,
>> 
>> -- 
>> Raul
>> 
>> 
>> On Thu, Sep 18, 2014 at 5:45 AM, bill lam <[email protected]> wrote:
>>> Thanks Raul,
>>> 
>>> I followed your suggestion to use 2 generations of hash vectors.
>>> It is faster now, although still far behind zlib.so.
>>> 
>>> Wed, 17 Sep 2014, Raul Miller wrote:
>>>> I would ignore the stuff about "hashes" and "hash chains" and "singly
>>>> linked". Hashes are nice sometimes, but other times they're just a
>>>> disaster. Sometimes they're better than the alternative, but their
>>>> marketing is sometimes significantly better than their actual
>>>> behavior.
>>>> 
>>>> Anyways, what's really happening here is an exact match search and
>>>> some kind of constraint on the size of the table.
>>>> 
>>>> I think the best way to approach that in J is to search in sorted
>>>> lists. The problem here is the incremental update. You don't want to
>>>> re-sort a large list every time you add something to the list.
>>>> 
>>>> The way to approach that, I think, is to have several generations of
>>>> lists, with each larger than the last, and each slower to update
>>>> than the last.
>>>> 
>>>> So, for example, have one list which is unsorted. Use i. to look things
>>>> up in it, and merge it into the next list when it gets too long (maybe
>>>> when it has 30 items in it).
>>>> 
>>>> The next list gets sorted every time it gets stuff added to it. Use I.
>>>> to look things up in it. Merge it into the next list when it gets too
>>>> long (maybe when it has 1000 items in it).
>>>> 
>>>> Same for the next list, except that when this one gets too long, you
>>>> arbitrarily discard items from it.
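>>>>
>>>> Here is an untested sketch of what I mean, with made-up names. The 30
>>>> and 1000 thresholds are the ones above; the 32768 cap on the last list
>>>> is just a guess at a reasonable run-time parameter:
>>>>
>>>> NB. g0 is small and unsorted; g1 and g2 are kept sorted.
>>>> NB. Keys are assumed to be integers (e.g. packed 3-byte values).
>>>> g0=: g1=: g2=: i.0
>>>>
>>>> insert=: 3 : 0
>>>>   g0=: g0 , y
>>>>   if. 30 < #g0 do.             NB. generation 0 full: merge into g1
>>>>     g1=: /:~ g1 , g0
>>>>     g0=: i.0
>>>>   end.
>>>>   if. 1000 < #g1 do.           NB. generation 1 full: merge into g2
>>>>     g2=: /:~ g2 , g1
>>>>     g1=: i.0
>>>>     g2=: (32768 <. #g2) {. g2  NB. arbitrarily discard the excess
>>>>   end.
>>>>   i.0 0
>>>> )
>>>>
>>>> sfind=: 4 : 0                  NB. is key y in the sorted list x?
>>>>   if. 0 = #x do. 0 return. end.
>>>>   j=. x I. y                   NB. count of items <: y
>>>>   (j > 0) *. y = x {~ 0 >. j - 1
>>>> )
>>>>
>>>> find=: 3 : 0                   NB. search newest generation first
>>>>   if. (#g0) > g0 i. y do. 1 return. end.  NB. linear lookup via i.
>>>>   if. g1 sfind y do. 1 return. end.
>>>>   g2 sfind y
>>>> )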
>>>> 
>>>> Now the other thing that's happening here is that J doesn't really
>>>> do all that well with incremental update algorithms. You'll still be
>>>> slow if you do things "one group of three" at a time. But -
>>>> hypothetically at least - you don't have to. The ideal would be to
>>>> come up with an algorithm for a block of text which complies (at least
>>>> in terms of data structures) with the rfc and which behaves reasonably
>>>> close to the behavior the rfc describes.
>>>> 
>>>> But having a slow implementation that works is a really important
>>>> step. That (with some test cases - like the one you described here)
>>>> gives you a baseline to test against.
>>>> 
>>>> Hopefully this helps?
>>>> 
>>>> Thanks,
>>>> 
>>>> --
>>>> Raul
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Wed, Sep 17, 2014 at 5:23 AM, bill lam <[email protected]> wrote:
>>>>> The LZ77 variation used in deflate is at the bottom (copied from rfc
>>>>> 1951).  I tried to translate it into J using a hash function which
>>>>> interprets XYZ as the lower 3 bytes of a 4-byte integer.  My J
>>>>> implementation worked, but its efficiency is rather poor.  I think the
>>>>> inefficiency of my implementation came from the need to
>>>>> 1. build a vector of pre-computed hashes, of size equal to the sliding
>>>>> window (although the hashes of all bytes had been pre-computed), and
>>>>> 2. find the last index of the hash of the current XYZ in that vector
>>>>> (since the left argument changes, special code of the form m&i: does not
>>>>> apply).
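>>>>>
>>>>> (The per-position hash itself is cheap to precompute; untested, but
>>>>> something like this packs each 3-byte window into a single integer:)
>>>>>
>>>>> NB. untested sketch: one hash per position, i.e. (#y)-2 of them
>>>>> hash3=: 3 (256 #. a. i. ])\ ]
>>>>>    hash3 'abcd'
>>>>> 6382179 6447972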
>>>>> 
>>>>> If the data are truly random, the chance of matching would be very
>>>>> small, so the time spent should be roughly the sum of the above 1 and 2
>>>>> for every byte; however, it took a very long time even for a data size
>>>>> of only about 500KB.  Any ideas for a better J implementation?
>>>>> 
>>>>> NB. for deflate, the minimum match length is 3, the max length is 258,
>>>>> and the maximum distance is 32K.
>>>>> 
>>>>> e.g. the LZ77 compression of the input 500#'abc'
>>>>> is
>>>>> aa(258,1)(240,1)b(258,1)(241,1)c(258,1)(241,1)
>>>>> 
>>>>> Inside each pair of parentheses is a (length,distance) pair applied to
>>>>> the history buffer.  Untested, but it should be something like this.
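>>>>>
>>>>> (Expanding one such pair is just a cyclic copy from the tail of the
>>>>> history buffer; untested sketch:)
>>>>>
>>>>> NB. x is a (length,distance) pair; y is the history so far
>>>>> lz=: 4 : 0
>>>>>   'len dist'=. x
>>>>>   y , len $ (-dist) {. y   NB. $ cycles, so dist < len overlaps
>>>>> )
>>>>>    # 240 1 lz 258 1 lz 'aa'   NB. the a's of the example above
>>>>> 500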
>>>>> 
>>>>> (text from rfc)
>>>>> The compressor uses a chained hash table to find duplicated strings,
>>>>>   using a hash function that operates on 3-byte sequences.  At any
>>>>>   given point during compression, let XYZ be the next 3 input bytes to
>>>>>   be examined (not necessarily all different, of course).  First, the
>>>>>   compressor examines the hash chain for XYZ.  If the chain is empty,
>>>>>   the compressor simply writes out X as a literal byte and advances one
>>>>>   byte in the input.  If the hash chain is not empty, indicating that
>>>>>   the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
>>>>>   same hash function value) has occurred recently, the compressor
>>>>>   compares all strings on the XYZ hash chain with the actual input data
>>>>>   sequence starting at the current point, and selects the longest
>>>>>   match.
>>>>> 
>>>>>   The compressor searches the hash chains starting with the most recent
>>>>>   strings, to favor small distances and thus take advantage of the
>>>>>   Huffman encoding.  The hash chains are singly linked. There are no
>>>>>   deletions from the hash chains; the algorithm simply discards matches
>>>>>   that are too old.  To avoid a worst-case situation, very long hash
>>>>>   chains are arbitrarily truncated at a certain length, determined by a
>>>>>   run-time parameter.
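>>>>>
>>>>> (For reference, a brute-force J rendering of that chain lookup - every
>>>>> earlier position in the 32K window whose hash matches the current one,
>>>>> newest first; untested:)
>>>>>
>>>>> NB. x: hash vector (e.g. from hash3 above); y: current position
>>>>> chain=: 4 : 0
>>>>>   w=. 0 >. y - 32768                     NB. window start
>>>>>   |. w + I. (y { x) = x {~ w + i. y - w  NB. matching positions
>>>>> )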
> 
> -- 
> regards,
> ====================================================
> GPG key 1024D/4434BAB3 2008-08-24
> gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
> gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
