[jira] Commented: (LUCENE-866) Multi-level skipping on posting lists

Michael Busch (JIRA) Fri, 20 Apr 2007 20:12:02 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490497
 ]


Michael Busch commented on LUCENE-866:
--------------------------------------

> and worst case would be skipTo(currDoc+2) in a loop 
> (or a conjunction across terms that appear in almost all docs).

True that should be the worst case. A query with 3 stop words
takes about 2% longer:

   Query: +the +and +a
   682737 total matching documents
   VInt reads: old: 481897700, new: 481416800, -0.09979296435737295%
   Time: old: 27812ms, new: 28406ms, 2.1357687329210413%

Maybe a possible optimization could be to avoid using the higher levels
after a certain amount of small skips have been performed. I will try out if 
this will improve things.

> do you avoid storing extra levels when the docfreq is small?
> What is the size cost for terms with docfreq=1? 

I only store another level if it contains at least one SkipDatum.
So there is no overhead for terms with df=1.

The .frq file in my index grew by 1.3% in the multi-level case
for an index with about 170MB.



> Multi-level skipping on posting lists
> -------------------------------------
>
>                 Key: LUCENE-866
>                 URL: https://issues.apache.org/jira/browse/LUCENE-866
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>         Attachments: lucene-866.patch
>
>
> To accelerate posting list skips (TermDocs.skipTo(int)) Lucene uses skip 
> lists. 
> The default skip interval is set to 16. If we want to skip e. g. 100 
> documents, 
> then it is not necessary to read 100 entries from the posting list, but only 
> 100/16 = 6 skip list entries plus 100%16 = 4 entries from the posting list. 
> This 
> speeds up conjunction (AND) and phrase queries significantly.
> However, the skip interval is always a compromise. If you have a very big 
> index 
> with huge posting lists and you want to skip over lets say 100k documents, 
> then 
> it is still necessary to read 100k/16 = 6250 entries from the skip list. For 
> big 
> indexes the skip interval could be set to a higher value, but then after a 
> big 
> skip a long scan to the target doc might be necessary.
> A solution for this compromise is to have multi-level skip lists that 
> guarantee a 
> logarithmic amount of skips to any target in the posting list. This patch 
> implements such an approach in the following way:
>   Example for skipInterval = 3:
>                                                       c            (skip 
> level 2)
>                   c                 c                 c            (skip 
> level 1) 
>       x     x     x     x     x     x     x     x     x     x      (skip 
> level 0)
>   d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting 
> list)
>       3     6     9     12    15    18    21    24    27    30     (df)
>  
>   d - document
>   x - skip data
>   c - skip data with child pointer
>  
> Skip level i contains every skipInterval-th entry from skip level i-1. 
> Therefore the 
> number of entries on level i is: floor(df / ((skipInterval ^ (i + 1))).
>  
> Each skip entry on a level i>0 contains a pointer to the corresponding skip 
> entry in 
> list i-1. This guarantees a logarithmic amount of skips to find the target 
> document.
> Implementations details:
>    * I factored the skipping code out of SegmentMerger and SegmentTermDocs to 
>      simplify those classes. The two new classes AbstractSkipListReader and 
>        AbstractSkipListWriter implement the skipping functionality.
>    * While AbstractSkipListReader and Writer take care of writing and reading 
> the 
>      multiple skip levels, they do not implement an actual skip data format. 
> The two 
>        new subclasses DefaultSkipListReader and Writer implement the skip 
> data format 
>        that is currently used in Lucene (with two file pointers for the freq 
> and prox 
>        file and with payload length information). I added this extra layer to 
> be 
>        prepared for flexible indexing and different posting list formats. 
>       
>    
> File format changes: 
>    * I added the new parameter 'maxSkipLevels' to the term dictionary and 
> increased the
>      version of this file. If maxSkipLevels is set to one, then the format of 
> the freq 
>        file does not change at all, because we only have one skip level as 
> before. For 
>        backwards compatibility maxSkipLevels is set to one automatically if 
> an index 
>        without the new parameter is read. 
>    * In case maxSkipLevels > 1, then the frq file changes as follows:
>      FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
>        SkipData        --> <<SkipLevelLength, SkipLevel>^(Min(maxSkipLevels, 
>                              floor(log(DocFreq/log(skipInterval))) - 1)>, 
> SkipLevel>
>        SkipLevel       --> <SkipDatum>^DocFreq/(SkipInterval^(Level + 1))
>        Remark: The length of the SkipLevel is not stored for level 0, because 
> 1) it is not 
>        needed, and 2) the format of this file does not change for 
> maxSkipLevels=1 then.
>        
>        
> All unit tests pass with this patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-866) Multi-level skipping on posting lists

Reply via email to