[
https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935765#action_12935765
]
Trejkaz commented on LUCENE-2348:
---------------------------------
That is exactly the workaround we performed for our own filters, including our
private copy of a filter which works like DuplicateFilter. All the ones which
need the context now take the reader up-front. The problem now is that we
have to use a different filter instance on each reader. Previously we were
caching them globally, and somewhere in the system we are evidently still
caching them globally, because one time in a million we find the wrong filter
being used on the wrong reader. I am now thinking of making another kind of
context-sensitive filter which can somehow omnisciently know about all readers
open in the entire JVM (e.g. we hook the place where we open the top-level
reader, and push the information about its structure into some global watch).
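To make the "different filter instance on each reader" point concrete, here is a minimal sketch of what such a per-reader cache could look like. The {{PerReaderCache}} name and its generic signature are hypothetical (not from our codebase, not a Lucene API); keying a WeakHashMap on the reader gives each reader its own filter instance and lets entries die when the reader is collected:

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical sketch: one cached value (e.g. a filter) per key (e.g. a reader).
// WeakHashMap drops an entry automatically once its key is garbage-collected,
// so closed readers do not pin their filters in memory.
class PerReaderCache<K, V> {
    private final Map<K, V> cache = new WeakHashMap<K, V>();
    private final Function<K, V> factory;

    PerReaderCache(Function<K, V> factory) {
        this.factory = factory;
    }

    // Return the value bound to this key, building it on first use.
    synchronized V get(K reader) {
        V value = cache.get(reader);
        if (value == null) {
            value = factory.apply(reader); // build an instance bound to this reader
            cache.put(reader, value);
        }
        return value;
    }
}
```

The same reader always gets the same instance back, and two different readers never share one, which is exactly the invariant the global cache was violating.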
I think Robert's comments possibly stem from the misconception that the
duplicate filter somehow works like field collapsing. I wrote a test to
illustrate how it actually behaves, and to make sure I wasn't confused myself
(since he seemed to think I was...)
{code}
import static org.junit.Assert.assertEquals;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DuplicateFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;

public class TestDuplicateFilter {
    IndexReader reader;
    IndexSearcher searcher;

    @Before
    public void setUpSampleData() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc;

        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "a", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // A duplicate of id 1 with different text.
        doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "b", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", "c", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.close();
        reader = IndexReader.open(dir, true);
        searcher = new IndexSearcher(reader);
    }

    @Test
    public void testHitOnOriginal() throws Exception {
        Filter filter = new DuplicateFilter("id",
            DuplicateFilter.KM_USE_FIRST_OCCURRENCE, DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "a")), filter, 3);
        assertEquals("Expected one hit - matched the original", 1, docs.totalHits);
        assertEquals("Wrong doc hit", 0, docs.scoreDocs[0].doc);
    }

    @Test
    public void testHitOnCopy() throws Exception {
        Filter filter = new DuplicateFilter("id",
            DuplicateFilter.KM_USE_FIRST_OCCURRENCE, DuplicateFilter.PM_FULL_VALIDATION);
        TopDocs docs = searcher.search(new TermQuery(new Term("text", "b")), filter, 3);
        // Field collapsing would return one hit here, which would be undesirable:
        assertEquals("Expected no hits - matched the copy", 0, docs.totalHits);
    }
}
{code}
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-2348
> URL: https://issues.apache.org/jira/browse/LUCENE-2348
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 2.9.2
> Reporter: Trejkaz
> Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without
> taking into account that getDocIdSet() will be called once per segment and
> only with each segment's local reader.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.