[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934214#action_12934214 ]
Trejkaz commented on LUCENE-2348:
---------------------------------
Field collapsing has different semantics from DuplicateFilter's. It's useful
if you want to collapse two hits down to one hit, but it doesn't work if you
are using DuplicateFilter to filter out previous copies of a document (whether
you are working around the issue of Lucene shifting doc IDs when deleting, or
simply want to keep the history in case you need it later). In this situation
you want all but one copy filtered out, whether or not the copy that matches
the query is the one the filter keeps. Initially this might not seem like
removing duplicates, but it really is: you're just removing duplicates based
on the "id" field.
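For concreteness, here is a minimal sketch of that usage, assuming the
contrib/queries DuplicateFilter API (the KM_/PM_ constants are the keep-mode
and processing-mode constants from that contrib; the searchNewestCopies
wrapper and the 10-hit limit are just illustrative). Every copy of a logical
document carries the same "id" value, and the filter is meant to drop all but
the newest copy, whether or not that copy is the one the query matched.

{code:java}
import java.io.IOException;

import org.apache.lucene.search.DuplicateFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class LatestCopyOnly {

    // Runs the query but keeps only the newest physical copy of each logical
    // document, where all copies share the same value in the "id" field.
    static TopDocs searchNewestCopies(IndexSearcher searcher, Query query) throws IOException {
        DuplicateFilter newestOnly = new DuplicateFilter(
                "id",                                    // key field shared by all copies
                DuplicateFilter.KM_USE_LAST_OCCURRENCE,  // keep the most recently added copy
                DuplicateFilter.PM_FULL_VALIDATION);
        return searcher.search(query, newestOnly, 10);
    }
}
{code}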
Similarly, I'm not sure how using a collector would help. There is even a note
in HitCollector saying not to look at the document during collection, because
doing so reduces performance by an order of magnitude or more, and if you have
to look at a field then you have to look at the document. FieldCache was
introduced to avoid this, but in practice it doesn't work once you have tens
of millions of documents in your index, unless you have an extraordinary
amount of RAM allocated to the JVM (and not every application is a server
application!). Even supposing you were willing to take the performance hit, or
had enough RAM to hold the field cache, the collector only receives the ID of
the document that hit; it doesn't provide any of the context you need to see
which other documents had the same value in the field.
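To make that objection concrete, here is a hedged sketch of what a
collector-based de-dup would have to look like against the 2.9 Collector and
FieldCache APIs (the class name and the choice of "id" as the key field are
illustrative). It shows both problems at once: the "id" of every document in
the segment has to be pulled into RAM via FieldCache, and collect() only ever
sees the doc ID of the current hit, so there is no way from inside it to reach
the other documents that share the same "id".

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

public class DedupeByIdCollector extends Collector {

    private final Set<String> seenIds = new HashSet<String>();
    private String[] segmentIds;  // "id" value for every doc in the current segment

    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        // Loads the "id" of *every* document in this segment into memory -
        // this is the FieldCache cost described above.
        segmentIds = FieldCache.DEFAULT.getStrings(reader, "id");
    }

    public void collect(int doc) {
        // Only the segment-local doc ID of the current hit arrives here.
        String id = segmentIds[doc];
        if (seenIds.add(id)) {
            // First hit seen for this id; gathering the result is omitted here.
        }
        // There is no handle on non-matching documents with the same id, so
        // "keep only the newest copy" semantics cannot be implemented here.
    }

    public void setScorer(Scorer scorer) {
    }

    public boolean acceptsDocsOutOfOrder() {
        return false;
    }
}
{code}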
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-2348
> URL: https://issues.apache.org/jira/browse/LUCENE-2348
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 2.9.2
> Reporter: Trejkaz
> Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.
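
To spell out the per-segment contract the description refers to, here is an
illustrative sketch (not the actual IndexSearcher code; the show() wrapper is
purely for demonstration): the searcher asks the filter for a DocIdSet once
per segment reader, and the IDs in each returned set must be local to that
segment.

{code:java}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

public class PerSegmentFilterCalls {

    static void show(Filter filter, IndexReader topLevelReader) throws IOException {
        IndexReader[] segments = topLevelReader.getSequentialSubReaders();
        if (segments == null) {
            // Already a single-segment (atomic) reader.
            segments = new IndexReader[] { topLevelReader };
        }
        for (int i = 0; i < segments.length; i++) {
            // Called once per segment; the returned set is expected to hold
            // segment-local doc IDs (0 .. segments[i].maxDoc() - 1).
            DocIdSet segmentLocalDocs = filter.getDocIdSet(segments[i]);
            // A DuplicateFilter that built one set against the whole index
            // cannot line its doc IDs up with any single segment, which is
            // the bug reported here.
        }
    }
}
{code}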