Hi all.
I notice that Filter.getDocIdSet() is now documented as follows:
    Note: This method will be called once per segment in
    the index during searching. The returned {@link DocIdSet}
    must refer to document IDs for that segment, not for
    the top-level reader.
Looking at Lucene's own DuplicateFilter, doesn't it assume that it
will be called only once?
And a related question: for those of us who want to implement
something *like* DuplicateFilter (as I did before discovering this
new Javadoc), is there a good way to go about it? It seems like
we now need to keep a hash of all terms previously seen so that when
we go over the new term enum we can check which ones have already been
seen. This will dramatically increase memory usage compared to a
single BitSet/OpenBitSet. Is there a better way?
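
To make the concern concrete, here is a minimal plain-Java sketch of the "remember what earlier segments already contained" approach. It uses no Lucene classes: keepFirst() is a hypothetical stand-in for one per-segment getDocIdSet() call, each String[] stands in for the key term of every document in one segment, and java.util.BitSet stands in for the DocIdSet. It only illustrates the bookkeeping, not a real Filter.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration only; no Lucene classes are used. */
public class PerSegmentDedup {

    /**
     * Hypothetical stand-in for one per-segment call: marks the first
     * document carrying each key. Document IDs start at 0 in every
     * segment, so the bits are segment-relative, not top-level.
     */
    static BitSet keepFirst(String[] segmentKeys, Set<String> seen) {
        BitSet bits = new BitSet(segmentKeys.length);
        for (int docId = 0; docId < segmentKeys.length; docId++) {
            // Set.add() returns true only if the key was not seen before,
            // in this segment or in any earlier one.
            if (seen.add(segmentKeys[docId])) {
                bits.set(docId);
            }
        }
        return bits;
    }

    /** Runs keepFirst() over all segments with one shared "seen" set. */
    static List<BitSet> filterAll(String[][] segments) {
        Set<String> seen = new HashSet<>();
        List<BitSet> result = new ArrayList<>();
        for (String[] segment : segments) {
            result.add(keepFirst(segment, seen));
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] segments = { {"a", "b", "a"}, {"b", "c"} };
        System.out.println(filterAll(segments)); // prints [{0, 1}, {1}]
    }
}
```

The "seen" set grows with the number of distinct keys, which is exactly the extra memory cost compared to a single bit set over the whole index.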
Also, I presume this means that Filter is now explicitly not
threadsafe. We weren't keeping any state in them anyway, but now we
will have to, so there is potential for a lot of new bugs if a filter
is somehow used by two queries running at the same time.
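
A small plain-Java sketch of the state problem, again with no Lucene classes: SeenKeys is a hypothetical stand-in for a filter that remembers keys across per-segment calls. Even without real concurrency, reusing one instance for a second search gives wrong results, because the state left over from the first search is still there. Two overlapping searches would interleave their updates to the same set, which is worse still. Creating fresh state per search is one way to avoid it.

```java
import java.util.HashSet;
import java.util.Set;

/** Plain-Java illustration only; no Lucene classes are used. */
public class PerSearchState {

    /** Hypothetical stateful filter: counts keys it has not seen yet. */
    static final class SeenKeys {
        private final Set<String> seen = new HashSet<>();

        int countNew(String[] segmentKeys) {
            int kept = 0;
            for (String key : segmentKeys) {
                if (seen.add(key)) { // true only for a key not seen before
                    kept++;
                }
            }
            return kept;
        }
    }

    /** One "search": visits every segment in turn with the given state. */
    static int search(SeenKeys state, String[][] segments) {
        int kept = 0;
        for (String[] segment : segments) {
            kept += state.countNew(segment);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[][] segments = { {"a", "b", "a"}, {"b", "c"} };

        SeenKeys shared = new SeenKeys();
        System.out.println(search(shared, segments));         // prints 3
        System.out.println(search(shared, segments));         // prints 0 (stale state)
        System.out.println(search(new SeenKeys(), segments)); // prints 3 (fresh state)
    }
}
```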
Daniel
--
Daniel Noll, Senior Developer
Nuix: Forensic and eDiscovery Software
The world's most advanced email data analysis and eDiscovery software
http://nuix.com/