[ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676914#comment-13676914 ]
Aleksandra Wozniak commented on LUCENE-2082: -------------------------------------------- Michael, thank you for your response. It really seems like a lot of work but hopefully I could do at least a part of it. After digging a little bit in the MultiLevelSkipListReader/Writer implementation, I have a doubt about the first point of the above description: "Currently the multilevel skiplist reader/writer can only deal with fixed-size skips (16 on the lowest level). It would be an easy change to allow variable-size skips" Does "variable-size skips" mean that on each skip level we want to allow for different intervals between skip entries? For example, does a skip list like this would be correct? posting list: doc0-doc1-doc2-doc3-...-doc29 (30 entries) skip level 0: doc3-doc5-doc10-doc12-doc16-doc21-doc27 (intervals range from 2 to 6) skip level 1: doc5-doc27 (2nd and 7th position of the previous level) > Performance improvement for merging posting lists > ------------------------------------------------- > > Key: LUCENE-2082 > URL: https://issues.apache.org/jira/browse/LUCENE-2082 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index > Reporter: Michael Busch > Priority: Minor > Labels: gsoc2013 > Fix For: 4.4 > > > A while ago I had an idea about how to improve the merge performance > for posting lists. This is currently by far the most expensive part of > segment merging due to all the VInt de-/encoding. Not sure if an idea > for improving this was already mentioned in the past? > So the basic idea is it to perform a raw copy of as much posting data > as possible. The reason why this is difficult is that we have to > remove deleted documents. But often the fraction of deleted docs in a > segment is rather low (<10%?), so it's likely that there are quite > long consecutive sections without any deletions. > To find these sections we could use the skip lists. Basically at any > point during the merge we would find the skip entry before the next > deleted doc. All entries to this point can be copied without > de-/encoding of the VInts. Then for the section that has deleted docs > we perform the "normal" way of merging to remove the deletes. Then we > check again with the skip lists if we can raw copy the next section. > To make this work there are a few different necessary changes: > 1) Currently the multilevel skiplist reader/writer can only deal with > fixed-size > skips (16 on the lowest level). It would be an easy change to allow > variable-size skips, but then the MultiLevelSkipListReader can't > return numSkippedDocs anymore, which SegmentTermDocs needs -> change 2) > 2) Store the last docID in which a term occurred in the term > dictionary. This would also be beneficial for other use cases. By > doing that the SegmentTermDocs#next(), #read() and #skipTo() know when > the end of the postinglist is reached. Currently they have to track > the df, which is why after a skip it's important to take the > numSkippedDocs into account. > 3) Change the merging algorithm according to my description above. It's > important to create a new skiplist entry at the beginning of every > block that is copied in raw mode, because its next skip entry's values > are deltas from the beginning of the block. Also the very first posting, and > that one only, needs to be decoded/encoded to make sure that the > payload length is explicitly written (i.e. must not depend on the > previous length). Also such a skip entry has to be created at the > beginning of each source segment's posting list. With change 2) we don't > have to worry about the positions of the skip entries. And having a few > extra skip entries in merged segments won't hurt much. > If a segment has no deletions at all this will avoid any > decoding/encoding of VInts (best case). I think it will also work > great for segments with a rather low amount of deletions. We should > probably then have a threshold: if the number of deletes exceeds this > threshold we should fall back to old style merging. > I haven't implemented any of this, so there might be complications I > haven't thought about. Please let me know if you can think of reasons > why this wouldn't work or if you think more changes are necessary. > I will probably not have time to work on this soon, but I wanted to > open this issue to not forget about it :). Anyone should feel free to > take this! > Btw: I think the flex-indexing branch would be a great place to try this > out as a new codec. This would also be good to figure out what APIs > are needed to make merging fully flexible as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org