Filters and multiple, per-segment calls to getDocIdSet

Daniel Noll Wed, 24 Mar 2010 21:55:57 -0700

Hi all.

I notice that Filter.getDocIdSet() is now documented as follows:


    Note: This method will be called once per segment in
    the index during searching.  The returned {@link DocIdSet}
    must refer to document IDs for that segment, not for
    the top-level reader.
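
To make the segment-relative requirement concrete, here is a minimal sketch in plain Java. It deliberately uses no Lucene classes: the class name SegmentLocalIds, the method splitBySegment and the segmentSizes array are all hypothetical stand-ins, and the sketch assumes that segments are numbered consecutively so that each segment's first document sits at a running offset (docBase) in the top-level ID space.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class SegmentLocalIds {

    /**
     * Splits a bit set of top-level document IDs into one bit set per
     * segment, where each per-segment bit set uses IDs starting at 0.
     * segmentSizes[i] is the assumed number of documents in segment i.
     */
    static List<BitSet> splitBySegment(BitSet topLevel, int[] segmentSizes) {
        List<BitSet> perSegment = new ArrayList<>();
        int docBase = 0; // top-level ID of the first document in the current segment
        for (int size : segmentSizes) {
            BitSet local = new BitSet(size);
            int end = docBase + size;
            for (int doc = topLevel.nextSetBit(docBase);
                 doc >= 0 && doc < end;
                 doc = topLevel.nextSetBit(doc + 1)) {
                local.set(doc - docBase); // shift into segment-local ID space
            }
            perSegment.add(local);
            docBase = end;
        }
        return perSegment;
    }
}
```

With segments of 3, 2 and 4 documents, top-level ID 7 falls in the third segment and becomes local ID 2 there.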

If I look at Lucene's own DuplicateFilter, isn't it making the
assumption that it will only be called once?

And a related question: for those of us who want to implement
something *like* DuplicateFilter (as I did before discovering
this new Javadoc), is there a good way to go about it?  It seems like
we now need to keep a hash of all terms previously seen so that when
we go over the new term enum we can check which ones have already been
seen.  This will dramatically increase memory usage compared to a
single BitSet/OpenBitSet.  Is there a better way?
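
The "hash of all terms previously seen" idea can be sketched as follows. This is plain Java, not Lucene code: the class CrossSegmentDedup and its method firstOccurrences are hypothetical, and a String array indexed by segment-local document ID stands in for walking the segment's term enum. The sketch also assumes that segments are visited one after another within a single search, and it keeps the first occurrence of each key.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of Lucene.
final class CrossSegmentDedup {

    // Keys seen in any segment processed so far. This set is the extra
    // memory cost compared with a single bit set over the whole index.
    private final Set<String> seenKeys = new HashSet<>();

    /**
     * Called once per segment. segmentKeys[i] is the assumed duplicate key
     * of the document with segment-local ID i. Returns the local IDs whose
     * key has not been seen before, in this segment or an earlier one.
     */
    BitSet firstOccurrences(String[] segmentKeys) {
        BitSet accepted = new BitSet(segmentKeys.length);
        for (int localDoc = 0; localDoc < segmentKeys.length; localDoc++) {
            // Set.add returns true only the first time a key is added.
            if (seenKeys.add(segmentKeys[localDoc])) {
                accepted.set(localDoc);
            }
        }
        return accepted;
    }
}
```

The set grows with the number of distinct keys rather than the number of documents, which is where the extra memory goes.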

Also, I presume this means that Filter is now explicitly not
thread-safe.  We weren't keeping any state in our filters anyway, but
now we will have to, so there is potential for a lot of new bugs if a
filter is somehow used by two queries running at the same time.
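
One way to picture the thread-safety concern is to separate the shared object from the per-search state, as in the sketch below. This is plain Java with hypothetical names (PerSearchDedupFactory, SearchState, newSearch); it assumes the caller has some way to create fresh state at the start of each search, and whether the Filter API offers such a hook is exactly the open question here.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of Lucene.
final class PerSearchDedupFactory {

    /** Mutable state that belongs to exactly one search. */
    static final class SearchState {
        private final Set<String> seenKeys = new HashSet<>();

        /** Returns the segment-local IDs whose key is new to this search. */
        BitSet firstOccurrences(String[] segmentKeys) {
            BitSet accepted = new BitSet(segmentKeys.length);
            for (int localDoc = 0; localDoc < segmentKeys.length; localDoc++) {
                if (seenKeys.add(segmentKeys[localDoc])) {
                    accepted.set(localDoc);
                }
            }
            return accepted;
        }
    }

    /**
     * The factory itself has no mutable fields, so sharing it between
     * concurrent queries is safe; each query gets its own SearchState.
     */
    SearchState newSearch() {
        return new SearchState();
    }
}
```

If two queries instead shared one SearchState, keys seen by the first query would wrongly suppress documents in the second.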

Daniel


-- 
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software
