Thanks Raul, I followed you suggestion to use 2 generations of hash vector, Its speed is faster now although still far behind zlib.so.
Ср, 17 сен 2014, Raul Miller написал(а): > I would ignore the stuff about "hashes" and "hash chains" and "singly > linked". Hashes are nice sometimes but other times they're just a > disaster. Sometimes that's better than the alternative, but their > marketing is sometimes significantly better than their actual > behavior. > > Anyways, what's really happening here is an exact match search and > some kind of constraint on the size of the table. > > I think the best way to approach that in J is to search in sorted > lists. The problem here is the incremental update. You don't want to > re-sort a large list every time you add something to the list. > > The way to approach that, I think, is to have several generations of > lists, with each larger than the last, and each slower to be update > than the last. > > So, for example, have on list which is unsorted. Use i. to look things > up in it, and merge it into the next list when it gets too long (maybe > when it has 30 items in it). > > The next list gets sorted every time it gets stuff added to it. Use I. > to look things up in it. Merge it into the next list when it gets too > long (maybe when it has 1000 items in it). > > Same for the next list, but this one, when it gets too long, you > arbitrarily discard items from it. > > Now the other thing that's happening, here, is that J doesn't really > do all that well with incremental update algorithms. You'll still be > slow if you do things "one group of three" at a time. But - > hypothetically at least - you don't have to. The ideal would be to > come up with an algorithm for a block of text which complies (at least > in terms of data structures) with the rfc and which behaves reasonably > close to the behavior of the rfc described behavior. > > But having a slow implementation that works is a really important > step. That (with some test cases - like the one you described here) > gives you a baseline to test against. > > Hopefully this helps? > > Thanks, > > -- > Raul > > > > > On Wed, Sep 17, 2014 at 5:23 AM, bill lam <[email protected]> wrote: > > The LZ77 variation used in deflate is at the bottom (copied from rfc > > 1951). I tried to translate it into J using a hash function which > > interprets XYZ as the lower 3 bytes of a 4 byte integer. My J > > implementation worked but efficient is rather poor. I think the > > inefficiency of my implementation came from the need to > > 1. build a vector of pre-computed hash of size equal to sliding > > window. (although the hash of all bytes had been pre-computed) > > 2. find the last index of the hash of current XYZ in the vector. > > (since the left argument changes, special code of the form m&i: does not > > apply) > > > > If data are truly random, the chance of matching would be very small, > > thus the time spent should roughly the sum of the above 1 & 2 for > > every byte, however it took a very long time even for data size of > > only about 500KB. Any idea for a better J implementation. > > > > NB. for deflate, minimum length for match is 3, max length is 285, > > maximum distance is 32K. > > > > eg. LZ77 compression of the input 500#'abc' > > is > > aa(285,1)(213,1)b(285,1)(214,1)c(285,1)(214,1) > > > > inside each bracket is (length,distance) pair applied to the history > > buffer. Untested but should be something like this. > > > > (text from rfc) > > The compressor uses a chained hash table to find duplicated strings, > > using a hash function that operates on 3-byte sequences. At any > > given point during compression, let XYZ be the next 3 input bytes to > > be examined (not necessarily all different, of course). First, the > > compressor examines the hash chain for XYZ. If the chain is empty, > > the compressor simply writes out X as a literal byte and advances one > > byte in the input. If the hash chain is not empty, indicating that > > the sequence XYZ (or, if we are unlucky, some other 3 bytes with the > > same hash function value) has occurred recently, the compressor > > compares all strings on the XYZ hash chain with the actual input data > > sequence starting at the current point, and selects the longest > > match. > > > > The compressor searches the hash chains starting with the most recent > > strings, to favor small distances and thus take advantage of the > > Huffman encoding. The hash chains are singly linked. There are no > > deletions from the hash chains; the algorithm simply discards matches > > that are too old. To avoid a worst-case situation, very long hash > > chains are arbitrarily truncated at a certain length, determined by a > > run-time parameter. > > ---------------------------------------------------------------------- > > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm -- regards, ==================================================== GPG key 1024D/4434BAB3 2008-08-24 gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3 gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3 ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
