[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

Aleksandra Wozniak (JIRA) Thu, 06 Jun 2013 03:52:34 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676914#comment-13676914
 ]


Aleksandra Wozniak commented on LUCENE-2082:
--------------------------------------------

Michael,

thank you for your response. It really seems like a lot of work but hopefully I 
could do at least a part of it.

After digging a little bit in the MultiLevelSkipListReader/Writer 
implementation, I have a doubt about the first point of the above description:

"Currently the multilevel skiplist reader/writer can only deal with fixed-size 
skips (16 on the lowest level). It would be an easy change to allow 
variable-size skips"

Does "variable-size skips" mean that on each skip level we want to allow for 
different intervals between skip entries? For example, does a skip list like 
this would be correct?

posting list: doc0-doc1-doc2-doc3-...-doc29 (30 entries)
skip level 0: doc3-doc5-doc10-doc12-doc16-doc21-doc27 (intervals range from 2 
to 6)
skip level 1: doc5-doc27 (2nd and 7th position of the previous level)
                
> Performance improvement for merging posting lists
> -------------------------------------------------
>
>                 Key: LUCENE-2082
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2082
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael Busch
>            Priority: Minor
>              Labels: gsoc2013
>             Fix For: 4.4
>
>
> A while ago I had an idea about how to improve the merge performance
> for posting lists. This is currently by far the most expensive part of
> segment merging due to all the VInt de-/encoding. Not sure if an idea
> for improving this was already mentioned in the past?
> So the basic idea is it to perform a raw copy of as much posting data
> as possible. The reason why this is difficult is that we have to
> remove deleted documents. But often the fraction of deleted docs in a
> segment is rather low (<10%?), so it's likely that there are quite
> long consecutive sections without any deletions.
> To find these sections we could use the skip lists. Basically at any
> point during the merge we would find the skip entry before the next
> deleted doc. All entries to this point can be copied without
> de-/encoding of the VInts. Then for the section that has deleted docs
> we perform the "normal" way of merging to remove the deletes. Then we
> check again with the skip lists if we can raw copy the next section.
> To make this work there are a few different necessary changes:
> 1) Currently the multilevel skiplist reader/writer can only deal with 
> fixed-size
> skips (16 on the lowest level). It would be an easy change to allow
> variable-size skips, but then the MultiLevelSkipListReader can't
> return numSkippedDocs anymore, which SegmentTermDocs needs -> change 2)
> 2) Store the last docID in which a term occurred in the term
> dictionary. This would also be beneficial for other use cases. By
> doing that the SegmentTermDocs#next(), #read() and #skipTo() know when
> the end of the postinglist is reached. Currently they have to track
> the df, which is why after a skip it's important to take the
> numSkippedDocs into account.
> 3) Change the merging algorithm according to my description above. It's
> important to create a new skiplist entry at the beginning of every
> block that is copied in raw mode, because its next skip entry's values
> are deltas from the beginning of the block. Also the very first posting, and
> that one only, needs to be decoded/encoded to make sure that the
> payload length is explicitly written (i.e. must not depend on the
> previous length). Also such a skip entry has to be created at the
> beginning of each source segment's posting list. With change 2) we don't
> have to worry about the positions of the skip entries. And having a few
> extra skip entries in merged segments won't hurt much.
> If a segment has no deletions at all this will avoid any
> decoding/encoding of VInts (best case). I think it will also work
> great for segments with a rather low amount of deletions. We should
> probably then have a threshold: if the number of deletes exceeds this
> threshold we should fall back to old style merging.
> I haven't implemented any of this, so there might be complications I
> haven't thought about. Please let me know if you can think of reasons
> why this wouldn't work or if you think more changes are necessary.
> I will probably not have time to work on this soon, but I wanted to
> open this issue to not forget about it :). Anyone should feel free to
> take this!
> Btw: I think the flex-indexing branch would be a great place to try this
> out as a new codec. This would also be good to figure out what APIs
> are needed to make merging fully flexible as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2082) Performance improvement for merging posting lists

Reply via email to