[
https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874529#action_12874529
]
Michael McCandless commented on LUCENE-2348:
--------------------------------------------
bq. What you describe is precisely the problem. It will deduplicate only over
each segment, not over the text index as one would expect given the name of the
class.
Duh, right! You want dedup to apply to the entire index....
Ugh, so this has been broken since the cutover to per-segment searching (2.9.x).
This is tricky to fix. Somehow DuplicateFilter needs to get hold of the top
reader. It must then run its dup detection against the TermEnum from that top
reader; when asked for a given sub-reader, it must return the slice of the
top reader's bits that covers that sub-reader.
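To make the "slice into the bits" idea concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: each segment is modelled as a String[] of dedup keys, and the names TopLevelDedupSketch, keepFirstOccurrence and sliceForSegment are made up for illustration. It also assumes "keep the first occurrence" as the dedup policy.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: plain Java collections stand in for Lucene readers.
class TopLevelDedupSketch {

    // Walks all segments in index order and sets one bit per distinct key,
    // at the top-level doc ID of that key's first occurrence.
    static BitSet keepFirstOccurrence(List<String[]> segments) {
        BitSet keep = new BitSet();
        Set<String> seen = new HashSet<>();
        int docId = 0;
        for (String[] segment : segments) {
            for (String key : segment) {
                if (seen.add(key)) {
                    keep.set(docId);
                }
                docId++;
            }
        }
        return keep;
    }

    // Segment-local view of the top-level bits: bit i of the result is
    // bit (docBase + i) of the top-level set.
    static BitSet sliceForSegment(BitSet topLevel, int docBase, int maxDoc) {
        return topLevel.get(docBase, docBase + maxDoc);
    }

    public static void main(String[] args) {
        List<String[]> segments = List.of(
            new String[] {"a", "b"},
            new String[] {"b", "c", "a"});
        BitSet top = keepFirstOccurrence(segments);
        int docBase = 0;
        for (String[] segment : segments) {
            System.out.println(sliceForSegment(top, docBase, segment.length));
            docBase += segment.length;
        }
    }
}
```

With these two toy segments the second slice contains only local doc 1 ("c"), because "b" and "a" were already seen in the first segment. Running the same dedup per segment would have kept all three docs of the second segment, which is the bug described in this issue.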
Given a sub-reader, there is currently no way to figure out which parent
reader it belongs to... so I think we'd have to change DuplicateFilter to take
the top reader in its ctor? (But this is sort of messy -- no other core/contrib
filters have this "state" -- they are normally free to be reused across
readers).
The only other [big] change I can think of is if we could change the Filter API
to be more like Scorer, which does first receive the top reader (since it needs
to init measures like idf across all segments), and then separately steps
through each sub-reader.
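A rough sketch of what such a two-phase shape could look like, again in plain Java with no Lucene types. The interfaces TwoPhaseFilter and SegmentBits are hypothetical, not an existing or proposed Lucene API, and segments are modelled as String[] arrays of dedup keys. The point is only that the per-index state lives in the object returned by the first phase, so the filter itself stays stateless and reusable across readers.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TwoPhaseFilterSketch {

    // Hypothetical contract: phase 1 sees the whole index once.
    interface TwoPhaseFilter {
        SegmentBits prepare(List<String[]> allSegments);
    }

    // Hypothetical contract: phase 2 is asked once per segment ordinal.
    interface SegmentBits {
        BitSet bitsFor(int segmentOrd);
    }

    // Keeps the first occurrence of each key across all segments.
    // Holds no fields, so one instance can serve many indexes.
    static final class DedupFilter implements TwoPhaseFilter {
        @Override
        public SegmentBits prepare(List<String[]> allSegments) {
            Set<String> seen = new HashSet<>();
            List<BitSet> perSegment = new ArrayList<>();
            for (String[] segment : allSegments) {
                BitSet bits = new BitSet(segment.length);
                for (int i = 0; i < segment.length; i++) {
                    if (seen.add(segment[i])) {
                        bits.set(i); // segment-local doc ID
                    }
                }
                perSegment.add(bits);
            }
            return perSegment::get;
        }
    }

    public static void main(String[] args) {
        TwoPhaseFilter filter = new DedupFilter();
        SegmentBits bits = filter.prepare(List.of(
            new String[] {"a", "b"},
            new String[] {"b", "c", "a"}));
        System.out.println(bits.bitsFor(0));
        System.out.println(bits.bitsFor(1));
    }
}
```

Compared with passing the top reader to the ctor, the cost here is an API change for every caller, which is why this would be the bigger change.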
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment
> readers
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-2348
> URL: https://issues.apache.org/jira/browse/LUCENE-2348
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 2.9.2
> Reporter: Trejkaz
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.