RE: Bug in TopFieldCollector?

Uwe Schindler Mon, 30 Mar 2009 01:50:11 -0700

You are right, I forget the sorting. And I also think, the most important
thing would be to remove the need for the ctor in the custom sort.


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, March 30, 2009 10:44 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Bug in TopFieldCollector?
> 
> Well, IndexSearcher also sorts its readers biggest to smallest (by
> .numDocs()) for better performance (so that the queues fill up as much
> as possible before hitting reader transitions).
> 
> I think it's the exception, not the rule, for when a custom comparator
> would require the full array of sub-readers up front (vs, "on the fly"
> which it already gets with setNextReader), so I think we should not
> pass it in during construction.
> 
> This really was a rote carryover from the old API which passed in the
> top IndexReader when calling newComparator.
> 
> Mike
> 
> On Mon, Mar 30, 2009 at 4:39 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > Why not call IndexSearcher.getIndexReader().getSequentialSubReaders()
> (see
> > http://hudson.zones.apache.org/hudson/job/Lucene-
> trunk/javadoc/all/org/apache/lucene/index/IndexReader.html).
> > Its public and documented as this:
> >
> >
> >
> > public IndexReader[] getSequentialSubReaders()
> >
> >
> >
> > Expert: returns the sequential sub readers that this reader is logically
> > composed of. For example, IndexSearcher uses this API to drive searching
> by
> > one sub reader at a time. If this reader is not composed of sequential
> child
> > readers, it should return null. If this method returns an empty array,
> that
> > means this reader is a null reader (for example a MultiReader that has
> no
> > sub readers).
> >
> > NOTE: for a MultiSegmentReader, which is obtained by
> open(java.lang.String)
> > when the index has more than one segment, you should not use the sub-
> readers
> > returned by this method to make any changes (setNorm, deleteDocument,
> etc.).
> > Doing so will likely lead to index corruption. Use the parent reader
> > instead.
> >
> >
> >
> > You only have the problem to replicate the code that gathers the
> subreaders
> > of the subreaders itself recursively.
> >
> >
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > ________________________________
> >
> > From: Shai Erera [mailto:ser...@gmail.com]
> > Sent: Monday, March 30, 2009 10:20 AM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Bug in TopFieldCollector?
> >
> >
> >
> > Already did !
> >
> > Another question - I think we somehow broke TopFieldCollector ...
> > Previously, in TopFieldDocCollector, it accepted an IndexReader as a
> > parameter, and now it requires IndexReader[], which is called
> subReaders.
> > Calling the 'fast' search methods with Sort has no problem obtaining
> that
> > IndexReader[] (and sort it), but how would someone use TopFieldCollector
> w/o
> > calling the appropriate Searcher methods?
> >
> > For example, since all the Searcher methods pass in fillFields = true, I
> > wanted to use the method Searcher.search(Query, TopFieldCollector) in
> the
> > test case I wrote, which BTW looks like this:
> >
> >   public void testSortWithoutFillFields() throws Exception {
> >
> >     // There was previously a bug in TopFieldCollector when fillFields
> was
> > set
> >     // to false - the same doc and score was set in ScoreDoc[] array.
> This
> > test
> >     // asserts that if fillFields is false, the documents are set
> properly.
> > It
> >     // does not use Searcher's default search methods (with Sort) since
> all
> > set
> >     // fillFields to true.
> >     Sort sort = new Sort();
> >     int nDocs=10;
> >
> >     TopDocsCollector tdc = new TopFieldCollector(sort, nDocs,
> >         new IndexReader[] { ((IndexSearcher) full).getIndexReader() },
> > false);
> >
> >     full.search(new MatchAllDocsQuery(), tdc);
> >
> >     ScoreDoc[] sd = tdc.topDocs().scoreDocs;
> >     for (int i = 1; i < sd.length; i++) {
> >       assertTrue(sd[i].doc != sd[i - 1].doc);
> >     }
> >   }
> >
> > You'll notice that creating a TopFieldCollector now is much more
> complicated
> > and *ugly*. As a user of IndexSearcher, I can only call getIndexReader()
> > which returns a single IndexReader. I don't have access to
> gatherSubReaders
> > and sortSubReaders. I don't see why I should have access to them. So it
> > forces me to create a dummy array with a single IndexReader.
> >
> > There are two ways I see to solve it:
> > 1. Introduce a getIndexReaders() method on IndexSearcher, which will
> return
> > an array of (sorted?) IndexReader.
> > 2. Introduce a new constructor in TopFieldCollector which accepts a
> single
> > IndexReader and make the other one package-private (for use by
> IndexSearcher
> > only). That constructor can internally create a dummy array of readers,
> but
> > at least it's private to the constructor and not exposed to the rest of
> the
> > world.
> >
> > Otherwise, I think it ruins TopFieldCollector and will make it a lot
> less
> > intuitive to use. At least, people who'd want to move from
> > TopFieldDocCollector to TopFieldCollector, will find it very
> inconvenient
> > and strange.
> >
> > What do you think? I can do that (2) as part of 1575. If (1) is better,
> then
> > I think a different issue should be opened, because we might want to
> return
> > such an array as sorted or something, which makes it less trivial.
> >
> > Shai
> >
> > On Mon, Mar 30, 2009 at 11:07 AM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> > Looks like quite a bug, Shai!  Thanks.  It came in with LUCENE-1483.
> > I would say add test case & fix it under 1575.
> >
> > Mike
> >
> > On Mon, Mar 30, 2009 at 3:50 AM, Shai Erera <ser...@gmail.com> wrote:
> >> Hi
> >>
> >> As I prepared the patch for 1575, I noticed a strange implementation in
> >> TopFieldCollector's topDocs():
> >>
> >>     ScoreDoc[] scoreDocs = new ScoreDoc[queue.size()];
> >>     if (fillFields) {
> >>       for (int i = queue.size() - 1; i >= 0; i--) {
> >>         scoreDocs[i] = queue.fillFields((FieldValueHitQueue.Entry)
> >> queue.pop());
> >>       }
> >>     } else {
> >>       Entry entry = (FieldValueHitQueue.Entry) queue.pop();
> >>       for (int i = queue.size() - 1; i >= 0; i--) {
> >>         scoreDocs[i] = new FieldDoc(entry.docID,
> >>                                     entry.score);
> >>       }
> >>     }
> >>
> >>     return new TopFieldDocs(totalHits, scoreDocs, queue.getFields(),
> >> maxScore);
> >>
> >>
> >> Notice that if fillFields is true, then documents are popped from the
> >> queue.
> >> However if it's false, then the first document is popped out of the
> queue
> >> and used to populate the entire ScoreDoc[]? I believe that's wrong,
> right?
> >> Otherwise, the returned TopFieldDocs.scoreDocs array will include the
> same
> >> document and score?
> >>
> >> I noticed there's no test case for that ... TopFieldCollector's
> >> constructor
> >> is called only from IndexSearcher.search(Weight, Filter, int, Sort,
> >> boolean
> >> /* fillFields */) which is called from IndexSearcher.search(Weight,
> >> Filter,
> >> int, sort) with fillFields always set to true. So perhaps that's why
> this
> >> hasn't showed up?
> >>
> >> BTW, the now deprecated TopFieldDocCollector's topDocs() implementation
> >> looks like this:
> >>
> >>     FieldSortedHitQueue fshq = (FieldSortedHitQueue)hq;
> >>     ScoreDoc[] scoreDocs = new ScoreDoc[fshq.size()];
> >>     for (int i = fshq.size()-1; i >= 0; i--)      // put docs in array
> >>       scoreDocs[i] = fshq.fillFields ((FieldDoc) fshq.pop());
> >>
> >>     return new TopFieldDocs(totalHits, scoreDocs,
> >>                             fshq.getFields(), fshq.getMaxScore());
> >>
> >> It assumes fillFields is always true and always pops elements out of
> the
> >> queue.
> >>
> >> If this is a bug, I can fix it as part of 1575, as I'm touching that
> class
> >> anyway. I can also add a test case ... The fix is very simple BTW, just
> >> move
> >> the line "Entry entry = (FieldValueHitQueue.Entry) queue.pop();" inside
> >> the
> >> for loop.
> >>
> >> Shai
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

RE: Bug in TopFieldCollector?

Reply via email to