[ https://issues.apache.org/jira/browse/LUCENE-866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Busch resolved LUCENE-866. ---------------------------------- Resolution: Fixed Committed. > Multi-level skipping on posting lists > ------------------------------------- > > Key: LUCENE-866 > URL: https://issues.apache.org/jira/browse/LUCENE-866 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Reporter: Michael Busch > Assignee: Michael Busch > Priority: Minor > Fix For: 2.2 > > Attachments: fileformats.patch, lucene-866.patch, lucene-866.patch > > > To accelerate posting list skips (TermDocs.skipTo(int)) Lucene uses skip > lists. > The default skip interval is set to 16. If we want to skip e. g. 100 > documents, > then it is not necessary to read 100 entries from the posting list, but only > 100/16 = 6 skip list entries plus 100%16 = 4 entries from the posting list. > This > speeds up conjunction (AND) and phrase queries significantly. > However, the skip interval is always a compromise. If you have a very big > index > with huge posting lists and you want to skip over lets say 100k documents, > then > it is still necessary to read 100k/16 = 6250 entries from the skip list. For > big > indexes the skip interval could be set to a higher value, but then after a > big > skip a long scan to the target doc might be necessary. > A solution for this compromise is to have multi-level skip lists that > guarantee a > logarithmic amount of skips to any target in the posting list. This patch > implements such an approach in the following way: > Example for skipInterval = 3: > c (skip > level 2) > c c c (skip > level 1) > x x x x x x x x x x (skip > level 0) > d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d (posting > list) > 3 6 9 12 15 18 21 24 27 30 (df) > > d - document > x - skip data > c - skip data with child pointer > > Skip level i contains every skipInterval-th entry from skip level i-1. > Therefore the > number of entries on level i is: floor(df / ((skipInterval ^ (i + 1))). > > Each skip entry on a level i>0 contains a pointer to the corresponding skip > entry in > list i-1. This guarantees a logarithmic amount of skips to find the target > document. > Implementations details: > * I factored the skipping code out of SegmentMerger and SegmentTermDocs to > simplify those classes. The two new classes AbstractSkipListReader and > AbstractSkipListWriter implement the skipping functionality. > * While AbstractSkipListReader and Writer take care of writing and reading > the > multiple skip levels, they do not implement an actual skip data format. > The two > new subclasses DefaultSkipListReader and Writer implement the skip > data format > that is currently used in Lucene (with two file pointers for the freq > and prox > file and with payload length information). I added this extra layer to > be > prepared for flexible indexing and different posting list formats. > > > File format changes: > * I added the new parameter 'maxSkipLevels' to the term dictionary and > increased the > version of this file. If maxSkipLevels is set to one, then the format of > the freq > file does not change at all, because we only have one skip level as > before. For > backwards compatibility maxSkipLevels is set to one automatically if > an index > without the new parameter is read. > * In case maxSkipLevels > 1, then the frq file changes as follows: > FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount > SkipData --> <<SkipLevelLength, SkipLevel>^(Min(maxSkipLevels, > floor(log(DocFreq/log(skipInterval))) - 1)>, > SkipLevel> > SkipLevel --> <SkipDatum>^DocFreq/(SkipInterval^(Level + 1)) > Remark: The length of the SkipLevel is not stored for level 0, because > 1) it is not > needed, and 2) the format of this file does not change for > maxSkipLevels=1 then. > > > All unit tests pass with this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]