[ https://issues.apache.org/jira/browse/LUCENE-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935973#action_12935973 ]
Michael McCandless commented on LUCENE-2506:
--------------------------------------------

bq. That assumption's gonna break very soon. Very very soon, when IndexWriter learns how to merge non-sequential segments.

Even if we break this assumption on the ootb config we will still have to provide a way to get it back. EG in this case, a merge policy which only selects contiguous segments (like the LogMergePolicy today).

> A Stateful Filter That Works Across Index Segments
> --------------------------------------------------
>
>                 Key: LUCENE-2506
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2506
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.0.2
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2506.patch
>
>
> By design, Lucene's Filter abstraction is applied once for every segment in
> the index during searching. In particular, the reader provided to its
> #getDocIdSet method does not represent the whole underlying index. In other
> words, if the index has more than one segment the given reader only
> represents a single segment. As a result, that definition of the filter
> suffers the limitation of not having the ability to permit/prohibit documents
> in the search results based on the terms that reside in segments that precede
> the current one.
> To address this limitation, we introduce here a StatefulFilter which
> specifically builds on the Filter class so as to make it capable of
> remembering terms in segments spanning the whole underlying index. To
> reiterate, the need for making filters stateful stems from the fact that
> some, although not most, filters care about the terms that they may have come
> across in prior segments. It does so by keeping track of the past terms from
> prior segments in a cache that is maintained in a StatefulTermsEnum instance
> on a per-thread basis.
> Additionally, to address the case where a filter might want to accept the
> last matching term, we keep track of the TermsEnum#docFreq of the terms in
> the segments filtered thus far. By comparing the sum of such
> TermsEnum#docFreq with that of the top-level reader, we can tell if the
> current segment is the last segment in which the current term appears.
> Ideally, for this to work correctly, we require the user to explicitly set
> the top-level reader on the StatefulFilter. Knowing what the top-level reader
> is also helps the StatefulFilter to clean up after itself once the search has
> concluded.
> Note that we leave it up to each concrete sub-class of the stateful filter to
> decide what to remember in its state and what not to. In other words, it can
> choose to remember as much or as little from prior segments as it deems
> necessary. In keeping with the TermsEnum interface, which the
> StatefulTermsEnum class extends, the filter must decide which terms to accept
> or not, based on the holistic state of the search.
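
For readers following the issue description, the docFreq bookkeeping it outlines can be sketched roughly as below. This is plain Java with no Lucene dependency, and it is not taken from the attached patch: the class DocFreqTracker and its method names are hypothetical, and a simple map stands in for the top-level reader's per-term docFreq. It only illustrates the idea of summing per-segment docFreq values per thread and comparing the sum with the index-wide value.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only (hypothetical names, no Lucene types): tracks,
 * per thread, how much of each term's docFreq has been seen in the segments
 * visited so far.
 */
class DocFreqTracker {
    // Per-thread running sum of docFreq for each term seen in earlier segments.
    private final ThreadLocal<Map<String, Integer>> seenSoFar =
            ThreadLocal.withInitial(HashMap::new);

    // docFreq of each term across the whole index; an assumed stand-in for
    // what the top-level reader would report.
    private final Map<String, Integer> topLevelDocFreq;

    DocFreqTracker(Map<String, Integer> topLevelDocFreq) {
        this.topLevelDocFreq = topLevelDocFreq;
    }

    /** True if the term was recorded in a segment visited earlier on this thread. */
    boolean seenInPriorSegment(String term) {
        return seenSoFar.get().containsKey(term);
    }

    /**
     * Adds the term's docFreq in the current segment to the running sum and
     * reports whether the sum has reached the index-wide docFreq, i.e. whether
     * this is the last segment in which the term appears.
     */
    boolean recordAndCheckLast(String term, int segmentDocFreq) {
        int total = seenSoFar.get().merge(term, segmentDocFreq, Integer::sum);
        Integer expected = topLevelDocFreq.get(term);
        return expected != null && total >= expected;
    }

    /** Drops this thread's state once the search has concluded. */
    void reset() {
        seenSoFar.remove();
    }
}
```

In this sketch a caller would invoke recordAndCheckLast once per term per segment, in segment order, and call reset after the search finishes so that state does not leak into the next search on the same thread.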