[
https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935765#action_12935765
]
Trejkaz commented on LUCENE-2348:
---------------------------------
That is exactly the workaround we performed for our own filters, including our
private copy of a filter which works like DuplicateFilter. All the ones which
need the context now take the reader up-front. The problem now is that we
have to use a different filter instance on each reader. Previously we were
caching them globally, and somewhere in the system we are evidently still
caching them globally, because one time in a million we find the wrong filter
being used on the wrong reader. I am now thinking of making another kind of
context-sensitive filter which can somehow omnisciently know about all readers
open in the entire JVM (e.g. we hook the place where we open the top-level
reader, and push the information about its structure into some global watch).
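To make the "different filter instance on each reader" point concrete, here is a minimal sketch of what such a per-reader cache could look like. The {{PerReaderCache}} name and its generic signature are hypothetical (not from our codebase, not a Lucene API); keying a WeakHashMap on the reader gives each reader its own filter instance and lets entries die when the reader is collected:

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical sketch: one cached value (e.g. a filter) per key (e.g. a reader).
// WeakHashMap drops an entry automatically once its key is garbage-collected,
// so closed readers do not pin their filters in memory.
class PerReaderCache<K, V> {
    private final Map<K, V> cache = new WeakHashMap<K, V>();
    private final Function<K, V> factory;

    PerReaderCache(Function<K, V> factory) {
        this.factory = factory;
    }

    // Return the value bound to this key, building it on first use.
    synchronized V get(K reader) {
        V value = cache.get(reader);
        if (value == null) {
            value = factory.apply(reader); // build an instance bound to this reader
            cache.put(reader, value);
        }
        return value;
    }
}
```

The same reader always gets the same instance back, and two different readers never share one, which is exactly the invariant the global cache was violating.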
I think Robert's comments possibly stem from the misconception that the
duplicate filter somehow works like field collapsing. I wrote a test to
illustrate how it actually behaves, and to make sure I wasn't confused myself
(since he seemed to think I was...)
{code}
import static org.junit.Assert.assertEquals;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DuplicateFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;

public class TestDuplicateFilter {
    IndexReader reader;
    IndexSearcher searcher;

    @Before
    public void setUpSampleData() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc;

        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "a", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // A duplicate of id 1 with different text.
        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "b", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "c", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.close();
        reader = IndexReader.open(dir, true);
        searcher = new IndexSearcher(reader);
    }

    @Test
    public void testHitOnOriginal() throws Exception {
        Filter filter = new DuplicateFilter("id",
            DuplicateFilter.KM_USE_FIRST_OCCURRENCE, DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "a")), filter, 3);
        assertEquals("Expected one hit - matched the original", 1, docs.totalHits);
        assertEquals("Wrong doc hit", 0, docs.scoreDocs[0].doc);
    }

    @Test
    public void testHitOnCopy() throws Exception {
        Filter filter = new DuplicateFilter("id",
            DuplicateFilter.KM_USE_FIRST_OCCURRENCE, DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "b")), filter, 3);
        // Field collapsing would return one hit here, which would be undesirable:
        assertEquals("Expected no hits - matched the copy", 0, docs.totalHits);
    }
}
{code}
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-2348
> URL: https://issues.apache.org/jira/browse/LUCENE-2348
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 2.9.2
> Reporter: Trejkaz
> Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.