[jira] [Commented] (LUCENE-2962) Skip data should be inlined into the postings lists

Han Jiang (JIRA) Sun, 07 Apr 2013 08:13:17 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624937#comment-13624937
 ]


Han Jiang commented on LUCENE-2962:
-----------------------------------

Hi, here is my understanding of this issue (after discussing it with Mike). 
I hope it is an accurate summary:

Extra costs in the current implementation:
1. Skip data for both docs (.doc) and positions (.pos) is gathered in the 
same blob (in .doc). Non-proximity queries therefore spend time decoding 
position skip data they never use.
2. In MultiLevelSkipListReader, each skip level has its own input stream, yet 
all of them jump around inside the same file, alongside docIn, which loads the 
docIDs/freqs. If the skip data were interleaved with the postings, those jumps 
might be less frequent and friendlier to the IO cache (the skip data stays 
"private" to the file, as the issue description puts it).
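
To make point 1 concrete, here is a minimal sketch (not Lucene code). It 
assumes a simplified single-level skip layout in which every entry is a run of 
vInt deltas: docDelta and docFPDelta, plus one posFPDelta in the combined blob. 
The class name SkipEntrySketch, the helper names and the sample numbers are all 
made up for illustration. It only counts how many vInts a docs-only consumer 
has to decode to reach a skip target.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration only; the entry layout is an assumption, not Lucene's format.
final class SkipEntrySketch {

    // Writes a non-negative int as a variable-length integer (7 bits per byte).
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Reads a vInt written by writeVInt.
    static int readVInt(ByteBuffer in) {
        int b = in.get();
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    // Encodes skip entries as deltas; the position pointer is included only if requested.
    static byte[] encode(int[] docs, int[] docFps, int[] posFps, boolean withPositions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < docs.length; i++) {
            writeVInt(out, docs[i] - (i == 0 ? 0 : docs[i - 1]));
            writeVInt(out, docFps[i] - (i == 0 ? 0 : docFps[i - 1]));
            if (withPositions) {
                writeVInt(out, posFps[i] - (i == 0 ? 0 : posFps[i - 1]));
            }
        }
        return out.toByteArray();
    }

    // Finds the last skip point whose doc is < target.
    // Returns {doc, docFP, number of vInts decoded on the way}.
    static int[] skipTo(byte[] blob, int fieldsPerEntry, int target) {
        ByteBuffer in = ByteBuffer.wrap(blob);
        int doc = 0, docFp = 0, decoded = 0;
        while (in.hasRemaining()) {
            int nextDoc = doc + readVInt(in);
            decoded++;
            if (nextDoc >= target) {
                break;
            }
            doc = nextDoc;
            docFp += readVInt(in);
            decoded++;
            // Extra fields (position pointers) are decoded only to step over them.
            for (int i = 2; i < fieldsPerEntry; i++) {
                readVInt(in);
                decoded++;
            }
        }
        return new int[] {doc, docFp, decoded};
    }

    public static void main(String[] args) {
        int[] docs = {128, 256, 384, 512};
        int[] docFps = {100, 210, 330, 470};
        int[] posFps = {1000, 2500, 4100, 6000};
        int[] combined = skipTo(encode(docs, docFps, posFps, true), 3, 400);
        int[] docOnly = skipTo(encode(docs, docFps, posFps, false), 2, 400);
        System.out.println("combined blob: vInts decoded = " + combined[2]);
        System.out.println("doc-only blob: vInts decoded = " + docOnly[2]);
    }
}
```

With these sample values both variants land on the same skip point (doc 384, 
docFP 330), but the combined blob costs 10 vInt decodes against 7 for the 
doc-only blob. The real formats carry more per-entry fields, so treat the 
numbers as an illustration of the effect rather than a measurement.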

To inline skip data into the postings list, a few things need more digging:
1. Buffering: since we can hardly predict which file pointer (FP) a skip entry 
will point to, we might have to buffer the following postings data in memory 
to calculate the FP offset.
2. The file structure written by MultiLevelSkipListWriter (even with the skip 
blob split) is still a little different from the one in the paper in Mike's 
comment, as illustrated by Figure 4 vs. Figure 7 in that paper.
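
The buffering concern in point 1 can be sketched as follows. This is only an 
assumption of how a single-level inlined layout could look, not the layout from 
the paper and not Lucene code: each block is written as [lastDocDelta] 
[blockByteLength] [docDelta...]. Because the skip header precedes the data it 
skips over, the writer has to hold the block in memory until its byte length 
is known. InlineSkipSketch, BLOCK_SIZE and the helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration only; single-level inline skip data over doc IDs.
final class InlineSkipSketch {
    static final int BLOCK_SIZE = 4; // assumed value, tiny so the example is easy to follow

    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static int readVInt(ByteBuffer in) {
        int b = in.get();
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    // Encodes sorted doc IDs as blocks of [lastDocDelta][blockByteLength][docDelta...].
    static byte[] write(int[] docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayOutputStream block = new ByteArrayOutputStream(); // in-memory buffer
        int lastDoc = 0;   // last doc appended to the buffer
        int blockBase = 0; // last doc of the previous block
        int count = 0;
        for (int doc : docs) {
            writeVInt(block, doc - lastDoc);
            lastDoc = doc;
            if (++count == BLOCK_SIZE) {
                flush(out, block, lastDoc - blockBase);
                blockBase = lastDoc;
                count = 0;
            }
        }
        if (count > 0) {
            flush(out, block, lastDoc - blockBase);
        }
        return out.toByteArray();
    }

    // The header can only be written here, once the buffered block's size is known.
    static void flush(ByteArrayOutputStream out, ByteArrayOutputStream block, int lastDocDelta) {
        writeVInt(out, lastDocDelta);
        writeVInt(out, block.size());
        out.write(block.toByteArray(), 0, block.size());
        block.reset();
    }

    // Returns the first doc >= target, or -1 if there is none.
    // decoded[0] is incremented once per posting that was actually decoded.
    static int advance(byte[] postings, int target, int[] decoded) {
        ByteBuffer in = ByteBuffer.wrap(postings);
        int base = 0; // last doc of the previous block
        while (in.hasRemaining()) {
            int blockLast = base + readVInt(in);
            int length = readVInt(in);
            if (blockLast < target) {
                in.position(in.position() + length); // skip the block without decoding it
                base = blockLast;
                continue;
            }
            int doc = base;
            while (true) { // terminates: the block's last doc is >= target
                doc += readVInt(in);
                decoded[0]++;
                if (doc >= target) {
                    return doc;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] postings = write(new int[] {3, 7, 10, 15, 22, 30, 31, 40, 55, 60});
        int[] decoded = {0};
        System.out.println("advance(31) = " + advance(postings, 31, decoded)
            + ", postings decoded = " + decoded[0]);
    }
}
```

A multi-level version would need to buffer more than one block (or patch 
headers after the fact), which is where the memory cost mentioned in point 1 
would start to matter.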
                
> Skip data should be inlined into the postings lists
> ---------------------------------------------------
>
>                 Key: LUCENE-2962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2013
>
> Today, we store all skip data as a separate blob at the end of a given term's 
> postings (if that term occurs in enough docs to warrant skip data).
> But this adds overhead during decoding -- we have to seek to a different 
> place for the initial load, we have to init separate readers, we have to seek 
> again while using the lower levels of the skip data, etc.  Also, we have to 
> fully decode all skip information even if we are not going to use it (eg if I 
> only want docIDs, I still must decode position offset and lastPayloadLength).
> If instead we interleaved skip data into the postings file, we could keep it 
> local, and "private" to each file that needs skipping.  This should make it 
> less costly to init and then use the skip data, which'd be a good perf gain 
> for eg PhraseQuery, AndQuery.
