[
https://issues.apache.org/jira/browse/LUCENE-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935219#comment-13935219
]
Michael McCandless commented on LUCENE-5052:
--------------------------------------------
This patch looks like a great start!
Using BlockTermsDict makes sense; no need to reimplement a terms dictionary
(it's not easy!).
One problem I see is that every term is written as a bitset. This may be OK
for some applications, but for wider usage I think it'd be better if the
postings format wrapped another postings format, used the bitset only when the
docFreq is high enough, and otherwise delegated to the wrapped postings format.
Maybe have a look at PulsingPostingsFormat as an example of how to wrap
postings formats?
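
To make the suggested split concrete, here is a minimal sketch of the per-term
decision in plain Java. It does not use any Lucene API: HybridPostingsSketch,
DelegateWriter, useBitset and writeTerm are hypothetical names made up for
illustration, and the cutoff (docFreq greater than maxDoc/8) is taken from the
issue description rather than from the patch. A bitset over a segment takes
about maxDoc/8 bytes no matter how many docs are set, which is presumably why
that cutoff was proposed.

```java
import java.util.BitSet;

/** Hypothetical sketch only; none of these names are Lucene APIs. */
public class HybridPostingsSketch {

    /** Stand-in for the writer of the wrapped (delegate) postings format. */
    public interface DelegateWriter {
        void write(String term, int[] sortedDocIds);
    }

    /** Assumed cutoff from the issue description: docFreq > maxDoc / 8. */
    public static boolean useBitset(int docFreq, int maxDoc) {
        return docFreq > maxDoc / 8;
    }

    /**
     * Encodes a dense term as a bitset and returns it; hands a sparse term
     * to the delegate and returns null.
     */
    public static BitSet writeTerm(String term, int[] sortedDocIds, int maxDoc,
                                   DelegateWriter delegate) {
        if (useBitset(sortedDocIds.length, maxDoc)) {
            BitSet bits = new BitSet(maxDoc);
            for (int doc : sortedDocIds) {
                bits.set(doc);
            }
            return bits;
        }
        delegate.write(term, sortedDocIds);
        return null;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;
        DelegateWriter delegate = (term, docs) ->
            System.out.println(term + ": delegated, docFreq=" + docs.length);

        // Dense term: every 5th doc, so docFreq = 200 and 200 > 1000 / 8.
        int[] dense = new int[200];
        for (int i = 0; i < dense.length; i++) {
            dense[i] = i * 5;
        }
        BitSet bits = writeTerm("status:active", dense, maxDoc, delegate);
        System.out.println("status:active: bitset, cardinality=" + bits.cardinality());

        // Sparse term: two docs, so it goes to the wrapped format.
        writeTerm("id:42", new int[] {3, 42}, maxDoc, delegate);
    }
}
```

In a real postings format the same check would sit where each term's postings
are written, and the reader would need a per-term flag to know which encoding
to decode. That part is left out here.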
> bitset codec for off heap filters
> ---------------------------------
>
> Key: LUCENE-5052
> URL: https://issues.apache.org/jira/browse/LUCENE-5052
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/codecs
> Reporter: Mikhail Khludnev
> Labels: features
> Fix For: 5.0
>
> Attachments: LUCENE-5052.patch, bitsetcodec.zip, bitsetcodec.zip
>
>
> Colleagues,
> When we filter, we don’t care about any of the scoring factors (norms,
> positions, tf), but filtering should be fast. The obvious way to handle this
> is to decode the postings list and cache it on the heap (CachingWrapperFilter,
> Solr’s DocSet). Both consuming heap and decoding are expensive.
> Let’s write a postings list as a bitset if df is greater than the segment's
> maxdocs/8 (what about skiplists? and overall performance?).
> Besides the codec implementation, the trickiest part to me is designing the
> API for this. How can we let the app know that a term query doesn’t need to be
> cached on the heap, but can be held as an mmapped bitset?
> WDYT?