[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934214#action_12934214 ]
Trejkaz commented on LUCENE-2348:
---------------------------------
Field collapsing has different semantics from DuplicateFilter's. It's useful
if you want to collapse two hits down to one hit, but it doesn't work if you
are using DuplicateFilter to filter out previous copies of a document (whether
you are working around the issue of Lucene shifting doc IDs when deleting, or
simply want to keep the history in case you need it later). In this situation
you want all but one copy filtered out, whether or not the copy that matches
the query is the one the filter keeps. Initially this might not seem like
removing duplicates, but it really is: you're just removing duplicates based
on the "id" field.
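For concreteness, here is a minimal sketch of that usage, assuming the
contrib/queries DuplicateFilter API (the KM_/PM_ constants are the keep-mode
and processing-mode constants from that contrib; the searchNewestCopies
wrapper and the 10-hit limit are just illustrative). Every copy of a logical
document carries the same "id" value, and the filter is meant to drop all but
the newest copy, whether or not that copy is the one the query matched.

{code:java}
import java.io.IOException;

import org.apache.lucene.search.DuplicateFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class LatestCopyOnly {

    // Runs the query but keeps only the newest physical copy of each logical
    // document, where all copies share the same value in the "id" field.
    static TopDocs searchNewestCopies(IndexSearcher searcher, Query query) throws IOException {
        DuplicateFilter newestOnly = new DuplicateFilter(
                "id",                                    // key field shared by all copies
                DuplicateFilter.KM_USE_LAST_OCCURRENCE,  // keep the most recently added copy
                DuplicateFilter.PM_FULL_VALIDATION);
        return searcher.search(query, newestOnly, 10);
    }
}
{code}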
Similarly, I'm not sure how using a collector would help. There is even a note
in HitCollector saying not to look at the document during collection, because
doing so reduces performance by an order of magnitude or more, and if you have
to look at a field then you have to look at the document. FieldCache was
introduced to avoid this, but in practice it doesn't work once you have tens
of millions of documents in your index, unless you have an extraordinary
amount of RAM allocated to the JVM (and not every application is a server
application!). Even supposing you were willing to take the performance hit, or
had enough RAM to hold the field cache, the collector only receives the ID of
the document that hit; it doesn't provide any of the context you need to see
which other documents had the same value in the field.
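To make that objection concrete, here is a hedged sketch of what a
collector-based de-dup would have to look like against the 2.9 Collector and
FieldCache APIs (the class name and the choice of "id" as the key field are
illustrative). It shows both problems at once: the "id" of every document in
the segment has to be pulled into RAM via FieldCache, and collect() only ever
sees the doc ID of the current hit, so there is no way from inside it to reach
the other documents that share the same "id".

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

public class DedupeByIdCollector extends Collector {

    private final Set<String> seenIds = new HashSet<String>();
    private String[] segmentIds;  // "id" value for every doc in the current segment

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        // Loads the "id" of *every* document in this segment into memory -
        // this is the FieldCache cost described above.
        segmentIds = FieldCache.DEFAULT.getStrings(reader, "id");
    }

    public void collect(int doc) {
        // Only the segment-local doc ID of the current hit arrives here.
        String id = segmentIds[doc];
        if (seenIds.add(id)) {
            // First hit seen for this id; gathering the result is omitted here.
        }
        // There is no handle on non-matching documents with the same id, so
        // "keep only the newest copy" semantics cannot be implemented here.
    }

    public void setScorer(Scorer scorer) {
    }

    public boolean acceptsDocsOutOfOrder() {
        return false;
    }
}
{code}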
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-2348
> URL: https://issues.apache.org/jira/browse/LUCENE-2348
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 2.9.2
> Reporter: Trejkaz
> Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.
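
To spell out the per-segment contract the description refers to, here is an
illustrative sketch (not the actual IndexSearcher code; the show() wrapper is
purely for demonstration): the searcher asks the filter for a DocIdSet once
per segment reader, and the IDs in each returned set must be local to that
segment.

{code:java}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

public class PerSegmentFilterCalls {

    static void show(Filter filter, IndexReader topLevelReader) throws IOException {
        IndexReader[] segments = topLevelReader.getSequentialSubReaders();
        if (segments == null) {
            // Already a single-segment (atomic) reader.
            segments = new IndexReader[] { topLevelReader };
        }
        for (int i = 0; i < segments.length; i++) {
            // Called once per segment; the returned set is expected to hold
            // segment-local doc IDs (0 .. segments[i].maxDoc() - 1).
            DocIdSet segmentLocalDocs = filter.getDocIdSet(segments[i]);
            // A DuplicateFilter that built one set against the whole index
            // cannot line its doc IDs up with any single segment, which is
            // the bug reported here.
        }
    }
}
{code}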