[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881010#action_12881010 ]
Karthick Sankarachary commented on LUCENE-2348:
-----------------------------------------------
Hi, All,
Having run into this very issue on our platform, I took a stab at addressing
it by defining what is essentially a stateful type of filter (for details,
please see LUCENE-2506). As I see it, a stateful filter gives filters such as
DuplicateFilter an easy and intuitive way to work seamlessly across the
(potentially many) segments of the index.
In a nutshell, I tweaked DuplicateFilter so that it accepts a given term if
and only if that term is not already in its "memory". For details, please see
the DedupingTermsEnum#accept method in the revised DuplicateFilter class
attached here.
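To illustrate the idea outside of Lucene, here is a minimal sketch of the
"accept only if not already in memory" rule applied across several segments.
It is not the code from the patch: the names StatefulDedupSketch, SeenKeys and
filterSegment are hypothetical, each segment is modeled as a plain array of
keys indexed by segment-local doc ID, and it assumes the first occurrence of a
key is the one to keep.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StatefulDedupSketch {

    /** Hypothetical stand-in for the filter's "memory"; not a Lucene class. */
    static final class SeenKeys {
        private final Set<String> memory = new HashSet<>();

        /** Accepts a key if and only if it is not already in memory. */
        boolean accept(String key) {
            // Set.add returns false when the key was already present.
            return memory.add(key);
        }
    }

    /**
     * Returns the segment-local doc IDs whose keys were not accepted in this
     * or any earlier call made with the same SeenKeys instance.
     */
    static List<Integer> filterSegment(SeenKeys seen, String[] keysByLocalDocId) {
        List<Integer> kept = new ArrayList<>();
        for (int docId = 0; docId < keysByLocalDocId.length; docId++) {
            if (seen.accept(keysByLocalDocId[docId])) {
                kept.add(docId);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // One shared memory, reused for every segment of the same search.
        SeenKeys seen = new SeenKeys();
        String[] segment1 = {"a", "b", "a"};
        String[] segment2 = {"b", "c"};
        System.out.println(filterSegment(seen, segment1)); // prints [0, 1]
        System.out.println(filterSegment(seen, segment2)); // prints [1]
    }
}
```

Because the memory outlives any single call, "b" in the second segment is
rejected even though the per-segment doc IDs start again at 0. The memory
would have to be reset between searches; the sketch leaves that out.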
Note that I took the liberty of incorporating the edge case shown above into
the DuplicateFilter's test case, which is also included in the attached patch.
Regards,
Karthick Sankarachary
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment
> readers
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-2348
> URL: https://issues.apache.org/jira/browse/LUCENE-2348
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 2.9.2
> Reporter: Trejkaz
> Attachments: LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.