Is ForDecompressImpl generated by a code generator or manually written? I am frustrated by http://code.google.com/p/integer-array-compress-kit/ which contains too many bugs (I fixed more than 20, but bugs still remain), because the decoder has too many branches and, in normal situations, some branches seldom occur.

The decoders implemented in the branch assume blockSize==128, so they have fewer branches than a decoder that supports an arbitrary blockSize. How do you test these decoders to ensure every branch is exercised?
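A minimal sketch of the kind of exhaustive round-trip test that would cover every decode branch by forcing every frame bit width from 1 to 31; the ForRoundTripTest class and the encode()/decode() placeholders are illustrative, not code from the branch:

    import java.util.Random;

    // Sketch of an exhaustive round-trip test: every frame bit width from 1 to 31 is
    // exercised once, so every per-width decode path (branch or generated method) gets hit.
    // encode()/decode() are placeholders for the codec under test.
    public class ForRoundTripTest {
      public static void main(String[] args) {
        final int blockSize = 128;
        Random rnd = new Random(42);
        for (int bits = 1; bits <= 31; bits++) {
          int[] original = new int[blockSize];
          original[0] = (1 << bits) - 1;                 // force the maximum value for this width
          for (int i = 1; i < blockSize; i++) {
            original[i] = rnd.nextInt() >>> (32 - bits); // random value that fits in 'bits' bits
          }
          int[] compressed = encode(original);
          int[] restored = decode(compressed, blockSize);
          for (int i = 0; i < blockSize; i++) {
            if (original[i] != restored[i]) {
              throw new AssertionError("mismatch: bits=" + bits + " i=" + i);
            }
          }
        }
        System.out.println("all bit widths round-trip correctly");
      }

      // Plug the encoder/decoder being debugged in here.
      static int[] encode(int[] values) { throw new UnsupportedOperationException("wire up codec"); }
      static int[] decode(int[] compressed, int numValues) { throw new UnsupportedOperationException("wire up codec"); }
    }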
2010/12/16 Michael McCandless <luc...@mikemccandless.com>:
> On the bulkpostings branch you can do something like this:
>
>   CodecProvider cp = new CodecProvider();
>   cp.register(new PatchedFrameOfRefCodec());
>   cp.setDefaultFieldCodec("PatchedFrameOfRef");
>
> Then whenever you create an IW or IR, use the advanced method that
> accepts a CodecProvider. Then the index will always use PForDelta to
> write/read.
>
> I suspect conjunction queries got faster because we no longer skip if
> the docID we seek is already in the current buffer (currently sized
> 64). Ie, skip is very costly when the target isn't far. This was
> sort of an accidental byproduct of forcing even conjunction queries
> using the Standard (vInt) codec to work on block buffers, but I think it's
> an important opto that we should more generally apply.
>
> Skipping for block codecs and Standard/vInt is done with the same class
> now. It's just that the block codec must store the long filePointer
> where the block starts *and* the int offset into the block, vs the
> Standard codec, which just stores a filePointer.
>
> On "how do we implement bulk read": this is the core change on the
> bulkpostings branch -- we have a new API to separately bulk-read
> docDeltas, freqs, and positionDeltas. But we are rapidly iterating on
> improving this (and getting to a clean PFor/For impl) now...
>
> Mike
>
> On Thu, Dec 16, 2010 at 4:29 AM, Li Li <fancye...@gmail.com> wrote:
>> hi Michael,
>> lucene 4 has so many changes that I don't know how to index and
>> search with a specified codec. Could you please give me some code
>> snippets that use the PFor codec so I can trace the code?
>> In your blog
>> http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
>> you said "The AND query, curiously, got faster; I think this is
>> because I modified its scoring to first try seeking within the block
>> of decoded ints".
>> I am also curious about that result, because VInt only needs to decode
>> part of the doc list while PFor needs to decode the whole block. But I
>> think that with conjunction queries, most of the time is spent searching
>> the skip list. I haven't read your code yet, but I guess the skip list
>> for VInt and the skip list for PFor are different.
>> e.g. lucene 2.9's default skipInterval is 16, so it looks like:
>>   level 1:  256
>>   level 0:  16 32 48 64 80 96 112 128 ... 256
>> When we skipTo(60) we need to read 0 16 32 48 64 at level 0.
>> But when using blocks, e.g. a block size of 128, my implementation of
>> the skip list is:
>>   level 1:  256
>>   level 0:  128 256
>> When we skipTo(60) we only read 2 items at level 0 and decode the first
>> block, which contains 128 docIDs.
>>
>> How do you implement bulk read?
>> I did it like this: I decode a block and cache it in an int array. I
>> think I can buffer up to 100K docIDs and tfs for disjunction queries (it
>> costs less than 1MB of memory for each term).
>>
>> SegmentTermDocs.read(final int[] docs, final int[] freqs)
>>   ...........
>>   while (i < length && count < df) {
>>     if (curBlockIdx >= curBlockSize) {  // this condition is often false,
>>                                         // we may optimize it, but the JVM
>>                                         // hotspot will cache "hot" code, so...
>>       int idBlockBytes = freqStream.readVInt();
>>       curBlockIdx = 0;
>>       for (int k = 0; k < idBlockBytes; k++) {
>>         buffer[k] = freqStream.readInt();
>>       }
>>       blockIds = code.decode(buffer, idBlockBytes);
>>       curBlockSize = blockIds.length;
>>
>>       int tfBlockBytes = freqStream.readVInt();
>>       for (int k = 0; k < tfBlockBytes; k++) {
>>         buffer[k] = freqStream.readInt();
>>       }
>>       blockTfs = code.decode(buffer, tfBlockBytes);
>>       assert curBlockSize == blockTfs.length;
>>     }
>>     freq = blockTfs[curBlockIdx];
>>     doc += blockIds[curBlockIdx++];
>>     count++;
>>
>>     if (deletedDocs == null || !deletedDocs.get(doc)) {
>>       docs[i] = doc;
>>       freqs[i] = freq;
>>       ++i;
>>     }
>>   }
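A rough sketch of the skip-entry difference Mike describes above (the class and field names are illustrative only, not the branch's actual skip reader/writer classes):

    // Sketch only: what a skip point must record for the Standard (vInt) codec
    // vs. a block codec such as For/PFor.
    class VIntSkipEntry {
      int lastDoc;      // docID at this skip point
      long filePointer; // where to seek in the postings file
    }

    class BlockSkipEntry {
      int lastDoc;           // docID at this skip point
      long blockFilePointer; // where the compressed block starts in the file
      int offsetInBlock;     // position of this entry inside the decoded 128-int block
    }

    class BlockAdvance {
      // The conjunction speedup: if the target docID is still inside the block that is
      // already decoded, scan forward within the block instead of consulting the skip list.
      static boolean targetInCurrentBlock(int target, int[] decodedDocs, int count) {
        return count > 0 && target <= decodedDocs[count - 1];
      }
    }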
>>
>> 2010/12/15 Michael McCandless <luc...@mikemccandless.com>:
>>> Hi Li Li,
>>>
>>> That issue has such a big patch, and enough of us are now iterating on
>>> it, that we cut a dedicated branch for it.
>>>
>>> But note that this branch is off of trunk (to be 4.0).
>>>
>>> You should be able to do this:
>>>
>>>   svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings
>>>
>>> And then run things in there. I just committed FOR/PFOR prototype
>>> codecs from LUCENE-1410 onto that branch, so eg you can run unit tests
>>> using those codecs by running "ant test -Dtests.codec=PatchedFrameOfRef".
>>>
>>> Please post patches back if you improve things! We need all the help
>>> we can get :)
>>>
>>> Mike
>>>
>>> On Wed, Dec 15, 2010 at 5:54 AM, Li Li <fancye...@gmail.com> wrote:
>>>> hi Michael
>>>> you posted a patch here
>>>> https://issues.apache.org/jira/browse/LUCENE-2723
>>>> I am not familiar with patches. Do I need to download
>>>> LUCENE-2723.patch (there are many patches under this name; do I need
>>>> the latest one?) and LUCENE-2723_termscorer.patch, and apply them
>>>> (patch -p1 <LUCENE-2723.patch)? I just checked out the latest source
>>>> code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene
>>>>
>>>> 2010/12/14 Michael McCandless <luc...@mikemccandless.com>:
>>>>> Likely you are seeing the startup cost of hotspot compiling the PFOR code?
>>>>>
>>>>> Ie, does your test first "warmup" the JRE and then do the real test?
>>>>>
>>>>> I've also found that running -Xbatch produces more consistent results
>>>>> from run to run, however, those results may not be as fast as running
>>>>> w/o -Xbatch.
>>>>>
>>>>> Also, it's better to test on actual data (ie a Lucene index's
>>>>> postings), and in the full context of searching, because then we get a
>>>>> sense of what speedups a real app will see... micro-benchmarking is nearly
>>>>> impossible in Java since Hotspot acts very differently vs the "real"
>>>>> test.
>>>>>
>>>>> Mike
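A minimal sketch of the warmup idea Mike describes: time the decoder only after untimed warmup rounds so HotSpot has already compiled the decode loop. Here decodeAll is a placeholder for whichever decoder (PFor, NewPFD, VInt, S9) is being measured:

    // Sketch of a micro-benchmark harness that warms up HotSpot before timing. Without
    // the warmup pass, whichever decoder runs first pays the JIT-compilation cost and
    // looks far slower than the others.
    public class WarmedBenchmark {
      static void measure(String name, Runnable decodeAll) {
        for (int i = 0; i < 20000; i++) {   // untimed warmup: let HotSpot compile the hot loop
          decodeAll.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < 10000; i++) {   // timed runs
          decodeAll.run();
        }
        System.out.println(name + " decode: " + (System.nanoTime() - start) / 1000 + " us");
      }
    }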
>>>>>
>>>>> On Tue, Dec 14, 2010 at 2:50 AM, Li Li <fancye...@gmail.com> wrote:
>>>>>> Hi
>>>>>> I tried to integrate PForDelta into lucene 2.9 but ran into a
>>>>>> problem.
>>>>>> I use the implementation in
>>>>>> http://code.google.com/p/integer-array-compress-kit/
>>>>>> It implements a basic PForDelta algorithm and an improved one (called
>>>>>> NewPForDelta, though it had many bugs and I have fixed them).
>>>>>> But compared with VInt and S9, its speed is very slow when it only
>>>>>> decodes a small number of integer arrays.
>>>>>> e.g. when I decode int[256] arrays whose values are randomly
>>>>>> generated between 0 and 100, and decode just one array, PFor (or
>>>>>> NewPFor) is very slow. When it continuously decodes many arrays, such
>>>>>> as 10000, it's faster than S9 and VInt.
>>>>>> Another strange phenomenon is that when I call the PFor decoder twice,
>>>>>> it's faster the 2nd time. Or if I call PFor first and then NewPFor,
>>>>>> NewPFor is faster; reverse the call sequence and whichever decoder is
>>>>>> called 2nd is faster.
>>>>>> e.g.
>>>>>>   ct.testNewPFDCodes(list);
>>>>>>   ct.testPFor(list);
>>>>>>   ct.testVInt(list);
>>>>>>   ct.testS9(list);
>>>>>>
>>>>>>   NewPFD decode: 3614705
>>>>>>   PForDelta decode: 17320
>>>>>>   VINT decode: 16483
>>>>>>   S9 decode: 19835
>>>>>>
>>>>>> When I call them in the following sequence:
>>>>>>
>>>>>>   ct.testPFor(list);
>>>>>>   ct.testNewPFDCodes(list);
>>>>>>   ct.testVInt(list);
>>>>>>   ct.testS9(list);
>>>>>>
>>>>>>   PForDelta decode: 3212140
>>>>>>   NewPFD decode: 19556
>>>>>>   VINT decode: 16762
>>>>>>   S9 decode: 16483
>>>>>>
>>>>>> My implementation is: group docIDs and termDocFreqs into blocks that
>>>>>> contain 128 integers. When SegmentTermDocs's next method is called (or
>>>>>> read, or readNoTf), it decodes a block and saves it into a cache.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org