[
https://issues.apache.org/jira/browse/LUCENE-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935928#action_12935928
]
Trejkaz commented on LUCENE-2506:
---------------------------------
That sounds like it would cost a fair bit of memory if you had hundreds of
millions of documents. The worst thing is that people really do load that many,
all the time, on desktop computers with only a few gigs of RAM. Because
it's a desktop app, "why don't you get another gig of RAM for the cache"
probably won't fly if the requirement suddenly appeared in a new release of our
software. If it were a server app, maybe that would fly... maybe.
But yeah, some variant of this that reads only part of the data from disk,
instead of all of it, might speed things up a bit. A giant IntBuffer over a
memory-mapped file would probably be cached by the OS anyway.
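
To make the memory-mapped idea concrete, here is a minimal sketch using only
java.nio. The class name MappedIntCache, the helper names, and the file layout
(a flat run of 4-byte big-endian ints) are assumptions for illustration and are
not part of the patch. Pages are faulted in on demand, so only the parts that
are actually read need to occupy RAM.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not from LUCENE-2506: views a file of ints through a memory mapping.
class MappedIntCache {

    /** Writes the values as consecutive 4-byte big-endian ints (assumed file layout). */
    static void writeInts(Path file, int[] values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        buf.asIntBuffer().put(values);
        Files.write(file, buf.array());
    }

    /** Maps the whole file read-only and returns it as an IntBuffer. */
    static IntBuffer mapInts(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping cannot exceed Integer.MAX_VALUE bytes, so a truly
            // giant file would need several mappings. The mapping stays valid
            // after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("ints", ".bin");
        file.toFile().deleteOnExit();
        writeInts(file, new int[] {7, 11, 13});
        IntBuffer ints = mapInts(file);
        // Absolute get: reads index 1 without touching the rest of the file.
        System.out.println(ints.get(1));
    }
}
```

Whether the OS keeps the hot pages cached depends on the platform and on
memory pressure, so this would need measuring on the desktop machines in
question.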
> A Stateful Filter That Works Across Index Segments
> --------------------------------------------------
>
> Key: LUCENE-2506
> URL: https://issues.apache.org/jira/browse/LUCENE-2506
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 3.0.2
> Reporter: Karthick Sankarachary
> Attachments: LUCENE-2506.patch
>
>
> By design, Lucene's Filter abstraction is applied once for every segment in
> the index during searching. In particular, the reader provided to its
> #getDocIdSet method does not represent the whole underlying index. In other
> words, if the index has more than one segment the given reader only
> represents a single segment. As a result, a filter defined this way cannot
> permit or prohibit documents in the search results based on terms that
> reside in segments preceding the current one.
> To address this limitation, we introduce here a StatefulFilter which
> specifically builds on the Filter class so as to make it capable of
> remembering terms in segments spanning the whole underlying index. To
> reiterate, the need for making filters stateful stems from the fact that
> some, although not most, filters care about the terms that they may have come
> across in prior segments. The StatefulFilter does so by keeping track of
> terms from prior segments in a cache that is maintained in a
> StatefulTermsEnum instance on a per-thread basis.
> Additionally, to address the case where a filter might want to accept the
> last matching term, we keep track of the TermsEnum#docFreq of the terms in
> the segments filtered thus far. By comparing the sum of such
> TermsEnum#docFreq with that of the top-level reader, we can tell if the
> current segment is the last segment in which the current term appears.
> For this to work correctly, we require the user to explicitly set
> the top-level reader on the StatefulFilter. Knowing what the top-level reader
> is also helps the StatefulFilter to clean up after itself once the search has
> concluded.
> Note that we leave it up to each concrete sub-class of the stateful filter to
> decide what to remember in its state and what not to. In other words, it can
> choose to remember as much or as little from prior segments as it deems
> necessary. In keeping with the TermsEnum interface, which the
> StatefulTermsEnum class extends, the filter must decide which terms to accept
> or not, based on the holistic state of the search.
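
The docFreq bookkeeping in the quoted description can be sketched in plain
Java, without any Lucene types. This is only an illustration of the
comparison, assuming per-segment and top-level docFreq values are supplied by
the caller. The class DocFreqTracker and its method names are hypothetical and
do not come from the attached patch, and the per-thread handling mentioned in
the description is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the patch's StatefulTermsEnum.
class DocFreqTracker {
    // Running sum of per-segment docFreq for each term seen so far.
    private final Map<String, Integer> seen = new HashMap<>();

    /**
     * Adds this segment's docFreq for the term to the running sum and reports
     * whether the sum has reached the top-level reader's docFreq, which means
     * the term cannot appear in any later segment.
     */
    boolean isLastSegmentFor(String term, int segmentDocFreq, int topLevelDocFreq) {
        int total = seen.merge(term, segmentDocFreq, Integer::sum);
        return total >= topLevelDocFreq;
    }

    /** Drops all state, for example once the search has concluded. */
    void clear() {
        seen.clear();
    }

    public static void main(String[] args) {
        DocFreqTracker tracker = new DocFreqTracker();
        // The term occurs in 5 documents overall, spread over three segments.
        System.out.println(tracker.isLastSegmentFor("lucene", 2, 5)); // false
        System.out.println(tracker.isLastSegmentFor("lucene", 1, 5)); // false
        System.out.println(tracker.isLastSegmentFor("lucene", 2, 5)); // true
    }
}
```

A map keyed by term is also where the memory concern above comes from: its
size grows with the number of distinct terms the filter chooses to remember.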