Re: Is TopDocCollector's collect() implementation correct?

2009-03-27 Thread Michael McCandless
2009/3/27 Shai Erera ser...@gmail.com:
 I really liked HItCollector and hate to give up the name ... However
 Collector is fine with me either, and it is at least more generic than
 HitCollector ...

 Hitable sounds too aggressive/violent to me :)

Now that you raised this objection, I can't get the image out of my
head, and I agree :)

 BTW, I guess I should create some new searcher API which receives this
 Collector class (is Collector the chosen name?) and deprecate those who
 accept HitCollector?
 Those can also skip the instanceof check, and wrapping of HC to MRHC ...

OK that'd be great.  There are also default score methods in Scorer
that take HitCollector which'll have to be deprecated as well, and
some scorers (eg BooleanScorer) define those methods.

 That also means that I should throw that MRHC wrapper (which rebases doc
 Ids)? If HitCollector is deprecated, then there's no need to keep it.  But
 perhaps we want it there in 2.9 for easier migration? Personally I think
 it's redundant since in 3.0 people will need to change all their collectors
 anyway (since HitCollector will be removed, and every class which extends
 HitCollector will need be modified). What do you think?

I think discard it, assuming the above full-replacement approach
doesn't hit any snags.

 Also, there's no need to deprecate MRHC, since it's only in the trunk - I
 can simply rename it, right?

Exactly :)  A nice freedom... maybe we should never make any releases!
 (kidding)

 Ok I'll go ahead and prepare a patch. We can discuss the name more, at the
 end it will just be a short refactor action in Eclipse, so that shouldn't
 hold us (or me) up.

OK thanks Shai.  Make sure you mark the issue as Fix Version 2.9.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-27 Thread Shai Erera
I created LUCENE-1575 which summarizes what we've agreed to do in this
issue. Please review it. Also, let's move the discussion around the names to
the issue.

I hope I covered everything we agreed on - I had to create the issue twice
as it contains a lot of text and the session expired on me in the middle,
and everything I typed got lost ... damn.

Shai

On Fri, Mar 27, 2009 at 11:51 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 2009/3/27 Shai Erera ser...@gmail.com:
  I really liked HItCollector and hate to give up the name ... However
  Collector is fine with me either, and it is at least more generic than
  HitCollector ...
 
  Hitable sounds too aggressive/violent to me :)

 Now that you raised this objection, I can't get the image out of my
 head, and I agree :)

  BTW, I guess I should create some new searcher API which receives this
  Collector class (is Collector the chosen name?) and deprecate those who
  accept HitCollector?
  Those can also skip the instanceof check, and wrapping of HC to MRHC ...

 OK that'd be great.  There are also default score methods in Scorer
 that take HitCollector which'll have to be deprecated as well, and
 some scorers (eg BooleanScorer) define those methods.

  That also means that I should throw that MRHC wrapper (which rebases doc
  Ids)? If HitCollector is deprecated, then there's no need to keep it.
  But
  perhaps we want it there in 2.9 for easier migration? Personally I think
  it's redundant since in 3.0 people will need to change all their
 collectors
  anyway (since HitCollector will be removed, and every class which extends
  HitCollector will need be modified). What do you think?

 I think discard it, assuming the above full-replacement approach
 doesn't hit any snags.

  Also, there's no need to deprecate MRHC, since it's only in the trunk - I
  can simply rename it, right?

 Exactly :)  A nice freedom... maybe we should never make any releases!
  (kidding)

  Ok I'll go ahead and prepare a patch. We can discuss the name more, at
 the
  end it will just be a short refactor action in Eclipse, so that
 shouldn't
  hold us (or me) up.

 OK thanks Shai.  Make sure you mark the issue as Fix Version 2.9.

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-27 Thread Michael McCandless
On Fri, Mar 27, 2009 at 12:46 PM, Shai Erera ser...@gmail.com wrote:

 I hope I covered everything we agreed on - I had to create the issue twice
 as it contains a lot of text and the session expired on me in the middle,
 and everything I typed got lost ... damn.

Bummer.

Strangely, this also happened to me today.  Maybe something changed on
how aggressively Jira logs you out. But: when this happens, Jira
parrots back the text you had just attempted to submit and gives you a
chance to copy/paste it.  (At least, it do so when I tried to append a
comment and got logged out between when I started and when I
finished... maybe we all just need to type faster!).

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Is TopDocCollector's collect() implementation correct?

2009-03-27 Thread Uwe Schindler
Hi Shai, hi Mike,

You can also use the license of JIRA Client from the contributor's SVN
checkout (if you are committer and so have access to it). With it you can
manage your issues offline (it's a Java app for windows and other systems).

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Friday, March 27, 2009 9:02 PM
 To: java-dev@lucene.apache.org
 Subject: Re: Is TopDocCollector's collect() implementation correct?
 
 On Fri, Mar 27, 2009 at 12:46 PM, Shai Erera ser...@gmail.com wrote:
 
  I hope I covered everything we agreed on - I had to create the issue
 twice
  as it contains a lot of text and the session expired on me in the
 middle,
  and everything I typed got lost ... damn.
 
 Bummer.
 
 Strangely, this also happened to me today.  Maybe something changed on
 how aggressively Jira logs you out. But: when this happens, Jira
 parrots back the text you had just attempted to submit and gives you a
 chance to copy/paste it.  (At least, it do so when I tried to append a
 comment and got logged out between when I started and when I
 finished... maybe we all just need to type faster!).
 
 Mike
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Mark Miller markrmil...@gmail.com wrote:
 bq. I personally don't understand why MRHC was invented in the first place.

 The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments
 to wade through though.

MRHC was created because simply adding setNextReader to HC would break
back compat, because collect(...) is called on the un-rebased doc.  Ie
we need a new class so we can tell that it will handle re-basing the
doc itself.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:

 The difference is for the new code, it's an upcast, which catches any
 errors at compile time, not run time.  The compiler determines that
 the class implements the required interface.

 I still don't understand how a compiler can detect at compilation time that
 a HitCollector instance that is used in method A, and is casted to a
 TopDocsOutput instance by calling method B (from A) is actually ok ... I may
 be missing some Java information here, but I simply don't see how that
 happens at compilation time instead of at runtime ...

I may have lost the context here... but here's what I thought we were
talking about.

If we choose the interface option (adding a ProvidesTopDocsResults
(say) interface), then you would create method
renderResults(ProvidesTopDocResults).

Then, any collector implementing that interface could be safely passed
in, as the upcast is done at compile time not runtime.

 So in fact the internal Lucene code expects only MRHC from a certain point,
 and so even if I wrote a HC and passed it on Searcher, it's still converted
 to MRHC, with an empty setNextReader() method implementation. That's why I
 meant that HC is already deprecated, whether it's marked as deprecated or
 not.

The setNextReader() impl is not empty; it does the re-basing of docID
on behalf of the HC.

 What you say about deprecating HC to me is unnecessary. Simply pull
 setNextReader up with an empty implementation, get rid of all the
 instanceof, casting and wrapping code and you're fine. Nothing is broken.
 All works well and better (instanceof, casting and wrapping have their
 overheads).
 Isn't that right?

I think we need to deprecate HC, in favor of MRHC (or if we can think
of a better name... ResultCollector?).

 Regarding interfaces .. I don't think they're that bad. Perhaps a different
 viewing angle might help. Lucene processes a query and fires events. Today
 it fires an event every time a doc's score has been computed, and recently
 every time it starts to process a different reader. HitCollector is a
 listener implementation on the doc-score event, while MRHC is a listener
 on both.
 To me, interfaces play just nicely here. Assume that you have the following
 interfaces:
 - DocScoreEvent
 - ChangeReaderEvent
 - EndProcessingEvent (thrown when the query has finished processing -
 perhaps to aid collectors to free resources)
 - any other events you foresee?
 The Lucene code receives a HitCollector which listens on all events. In the
 future, Lucene might throw other events, but still get a HitCollector. Those
 methods will check for instanceof, and you as a user will know that if you
 want to catch those events, you pass in a collector implementation which
 does. Those events cannot of course be main-stream events (like
 DocScoreEvent), but new ones, perhaps even experimental.
 Since HitCollector is a concrete class, we can always add new interfaces to
 it in the future with empty implementations?

I agree interfaces clearly have useful properties, but their achilles
heel for Lucene in all but the most trivial needs is the
non-back-compatibility when you want to add a method to the interface.
There have been a number of java-dev discussons on this problem.

So, I think something like this:

  * Deprecate HitCollector in favor of MultiReaderHitCollector (any
better name here?) abstract class.  If you want to make a fully
custom HitCollector, you subclass this calss.

Let's change MRHC's collect to take only an [unbased] docID, and
expose a setScorer(Scorer) method.  Then if the collector needs
score it can call Scorer.score().

  * Subclass that to create an abstract tracks top N results
collector (TopDocsCollector?  TopHitsCollector?)

  * Subclass TopDocsCollector to a final, fast top N sorted by score
collector (exists already: TopScoreDocCollector)

  * Subclass TopDocsCollector to a final, fast top N sorted by field
collector (exists already: TopFieldCollector)

  * Subclass TopDocsCollector to a you provide your own pqueue and we
collect top N according to it collector (does not yet exist --
name?).  This is the way forward for existing subclasses of
TopDocCollector.

Shai do you want to take a first cut at making a patch?  Can you open
an issue?  Thanks.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread DM Smith


On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote:


 think we need to deprecate HC, in favor of MRHC (or if we can think
of a better name... ResultCollector?).


I like your suggestion for the name.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
You're right ... I missed that. My fault :)

On Thu, Mar 26, 2009 at 12:18 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Mark Miller markrmil...@gmail.com wrote:
  bq. I personally don't understand why MRHC was invented in the first
 place.
 
  The evolution of MRHC is in the comments of LUCENE-1483 - a lot of
 comments
  to wade through though.

 MRHC was created because simply adding setNextReader to HC would break
 back compat, because collect(...) is called on the un-rebased doc.  Ie
 we need a new class so we can tell that it will handle re-basing the
 doc itself.

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
I'd say it is a bad name. Raw hit is way far from being result of a search.

If you're already breaking back compat with 3.0 release (by
incrementing java version), maybe its worthy to break it in some more
places, just so ugly names like MRHC and special code paths that check
for n-year-old interfaces won't haunt us for the next century.

On Thu, Mar 26, 2009 at 14:15, DM Smith dmsmith...@gmail.com wrote:

 On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote:

  think we need to deprecate HC, in favor of MRHC (or if we can think
 of a better name... ResultCollector?).

 I like your suggestion for the name.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Earwin Burrfoot ear...@gmail.com wrote:

 I'd say it is a bad name. Raw hit is way far from being result of a search.

First off, from Lucene's standpoint, the docID *is* the result of the
search.  Your application will do further things (load titles, do
higlighting, etc.) with that result.

Second off, since ResultCollector is an abstract base class, it would
be subclassed to concrete versions that do more interesting things
(call Scorer.score(), etc) so as to make up what your application
considers the result.

So I understand your objection, but I still feel ResultCollector is an OK name.

Or do you have an alternative?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera

 I may have lost the context here... but here's what I thought we were
 talking about.

 If we choose the interface option (adding a ProvidesTopDocsResults
 (say) interface), then you would create method
 renderResults(ProvidesTopDocResults).

 Then, any collector implementing that interface could be safely passed
 in, as the upcast is done at compile time not runtime.


Consider this code snippet:

HitCollector hc = condition? new TopDocCollector() : TopFieldDocCollector();
searcher.search(hc);

The problem is that I need a base class for both collectors. If I use the
interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o
casting to HitCollector. If I use HitCollector, then I need to cast it
before passing it into rederResults(). Only when both class have the same
base class which is also a HitCollector, I don't need any casting. I.e.,
after I submit a patch that develops what we've agreed on, the code can look
like this:

TopResultsCollector trc = condition ? new TopScoreDocCollector() : new
TopFieldCollector();
searcher.search(trc);
renderResults(trc);

Here I can pass 'trc' to both methods since it both a HitCollector and a
TopResultsCollector. That's what I was missing in your proposal.

Shai do you want to take a first cut at making a patch?  Can you open
 an issue?  Thanks.


I can certainly do that. I think the summary of the steps make sense. I'll
check if TopScoreDocCollector and TopFieldCollector can also extend that
you provide your own pqueue and we collect top N according to it
collector, passing a null PQ and extending topDocs().

I also would like to propose an additional method to topDocs(), topDocs(int
start, int howMany) which will be more efficient to call in case of paging
through results. The reason is that topDocs() pops everything from the PQ,
then allocates a ScoreDoc[] array of size of number of results and returns a
TopDocs object. You can then choose just the ones you need.
On the other hand, topDocs(start, howMany) would allocate exactly the size
of array needed. E.g., in case someone pages through results 91-100, you
allocate an array of size 10, instead of 100.
It is not a huge improvement, but it does save some allocations, as well as
it's a convenient method.

BTW, I like the name ResultsCollector, as it's just like HitCollector, but
does not commit too much to hits .. i.e., facets aren't hits ... I think?

Shai

On Thu, Mar 26, 2009 at 12:55 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Shai Erera ser...@gmail.com wrote:

  The difference is for the new code, it's an upcast, which catches any
  errors at compile time, not run time.  The compiler determines that
  the class implements the required interface.
 
  I still don't understand how a compiler can detect at compilation time
 that
  a HitCollector instance that is used in method A, and is casted to a
  TopDocsOutput instance by calling method B (from A) is actually ok ... I
 may
  be missing some Java information here, but I simply don't see how that
  happens at compilation time instead of at runtime ...

 I may have lost the context here... but here's what I thought we were
 talking about.

 If we choose the interface option (adding a ProvidesTopDocsResults
 (say) interface), then you would create method
 renderResults(ProvidesTopDocResults).

 Then, any collector implementing that interface could be safely passed
 in, as the upcast is done at compile time not runtime.

  So in fact the internal Lucene code expects only MRHC from a certain
 point,
  and so even if I wrote a HC and passed it on Searcher, it's still
 converted
  to MRHC, with an empty setNextReader() method implementation. That's why
 I
  meant that HC is already deprecated, whether it's marked as deprecated or
  not.

 The setNextReader() impl is not empty; it does the re-basing of docID
 on behalf of the HC.

  What you say about deprecating HC to me is unnecessary. Simply pull
  setNextReader up with an empty implementation, get rid of all the
  instanceof, casting and wrapping code and you're fine. Nothing is broken.
  All works well and better (instanceof, casting and wrapping have their
  overheads).
  Isn't that right?

 I think we need to deprecate HC, in favor of MRHC (or if we can think
 of a better name... ResultCollector?).

  Regarding interfaces .. I don't think they're that bad. Perhaps a
 different
  viewing angle might help. Lucene processes a query and fires events.
 Today
  it fires an event every time a doc's score has been computed, and
 recently
  every time it starts to process a different reader. HitCollector is a
  listener implementation on the doc-score event, while MRHC is a
 listener
  on both.
  To me, interfaces play just nicely here. Assume that you have the
 following
  interfaces:
  - DocScoreEvent
  - ChangeReaderEvent
  - EndProcessingEvent (thrown when the query has finished processing -
  perhaps to aid collectors to free resources)
  - any other events you foresee?
  The Lucene 

Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector.
That is a problem if we want to make TSDC final. Now, TFDC is marked
deprecated, so it will be removed in the future.
I think an easy fix is just to have TFDC extend TopResultsCollector, right?

On Thu, Mar 26, 2009 at 2:52 PM, Shai Erera ser...@gmail.com wrote:

 I may have lost the context here... but here's what I thought we were
 talking about.

 If we choose the interface option (adding a ProvidesTopDocsResults
 (say) interface), then you would create method
 renderResults(ProvidesTopDocResults).

 Then, any collector implementing that interface could be safely passed
 in, as the upcast is done at compile time not runtime.


 Consider this code snippet:

 HitCollector hc = condition? new TopDocCollector() :
 TopFieldDocCollector();
 searcher.search(hc);

 The problem is that I need a base class for both collectors. If I use the
 interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o
 casting to HitCollector. If I use HitCollector, then I need to cast it
 before passing it into rederResults(). Only when both class have the same
 base class which is also a HitCollector, I don't need any casting. I.e.,
 after I submit a patch that develops what we've agreed on, the code can look
 like this:

 TopResultsCollector trc = condition ? new TopScoreDocCollector() : new
 TopFieldCollector();
 searcher.search(trc);
 renderResults(trc);

 Here I can pass 'trc' to both methods since it both a HitCollector and a
 TopResultsCollector. That's what I was missing in your proposal.

 Shai do you want to take a first cut at making a patch?  Can you open
 an issue?  Thanks.


 I can certainly do that. I think the summary of the steps make sense. I'll
 check if TopScoreDocCollector and TopFieldCollector can also extend that
 you provide your own pqueue and we collect top N according to it
 collector, passing a null PQ and extending topDocs().

 I also would like to propose an additional method to topDocs(), topDocs(int
 start, int howMany) which will be more efficient to call in case of paging
 through results. The reason is that topDocs() pops everything from the PQ,
 then allocates a ScoreDoc[] array of size of number of results and returns a
 TopDocs object. You can then choose just the ones you need.
 On the other hand, topDocs(start, howMany) would allocate exactly the size
 of array needed. E.g., in case someone pages through results 91-100, you
 allocate an array of size 10, instead of 100.
 It is not a huge improvement, but it does save some allocations, as well as
 it's a convenient method.

 BTW, I like the name ResultsCollector, as it's just like HitCollector, but
 does not commit too much to hits .. i.e., facets aren't hits ... I think?

 Shai

 On Thu, Mar 26, 2009 at 12:55 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 Shai Erera ser...@gmail.com wrote:

  The difference is for the new code, it's an upcast, which catches any
  errors at compile time, not run time.  The compiler determines that
  the class implements the required interface.
 
  I still don't understand how a compiler can detect at compilation time
 that
  a HitCollector instance that is used in method A, and is casted to a
  TopDocsOutput instance by calling method B (from A) is actually ok ... I
 may
  be missing some Java information here, but I simply don't see how that
  happens at compilation time instead of at runtime ...

 I may have lost the context here... but here's what I thought we were
 talking about.

 If we choose the interface option (adding a ProvidesTopDocsResults
 (say) interface), then you would create method
 renderResults(ProvidesTopDocResults).

 Then, any collector implementing that interface could be safely passed
 in, as the upcast is done at compile time not runtime.

  So in fact the internal Lucene code expects only MRHC from a certain
 point,
  and so even if I wrote a HC and passed it on Searcher, it's still
 converted
  to MRHC, with an empty setNextReader() method implementation. That's why
 I
  meant that HC is already deprecated, whether it's marked as deprecated
 or
  not.

 The setNextReader() impl is not empty; it does the re-basing of docID
 on behalf of the HC.

  What you say about deprecating HC to me is unnecessary. Simply pull
  setNextReader up with an empty implementation, get rid of all the
  instanceof, casting and wrapping code and you're fine. Nothing is
 broken.
  All works well and better (instanceof, casting and wrapping have their
  overheads).
  Isn't that right?

 I think we need to deprecate HC, in favor of MRHC (or if we can think
 of a better name... ResultCollector?).

  Regarding interfaces .. I don't think they're that bad. Perhaps a
 different
  viewing angle might help. Lucene processes a query and fires events.
 Today
  it fires an event every time a doc's score has been computed, and
 recently
  every time it starts to process a different reader. HitCollector is a
  

Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
 BTW, I like the name ResultsCollector, as it's just like HitCollector, but 
 does not commit too much to hits .. i.e., facets aren't hits ... I think?
What this class consumes and what it produces is a totally different
thing. HitCollector always collects 'hits', and then produces whatever
implementor needs.
For example mine collects hits, then collapses 1..N sequential hits
into a 'metahit', calculates facets, sorts, takes top and loads some
fields. And another one simply counts the hits without doing anything
else. But oh, my, I'm not implementing anything like void
collect(Facet f) method.

It's common sense to name consumer interfaces after what they consume,
not what their implementations might do.

 Or do you have an alternative?
HitCollector is absolutely cool with me. Okay, maybe DocCollector, or
DocIdCollector. Since that is exactly what 'all' of its
implementations do.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Marvin Humphrey
On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote:

 do you have an alternative?

Brainstorming

  * Harvester
  * Trawler
  * HitPicker
  * HitGrabber

Marvin Humphrey


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
 On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote:

 do you have an alternative?

 Brainstorming

  * Harvester
  * Trawler
  * HitPicker
  * HitGrabber

 Marvin Humphrey

NitPicker - that absolutely made my day


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
I still think that ResultsCollector does what you describe. It simply
collects results, while the word 'result' is quite *open* and does not
commit to anything ...

How about dropping the word Collector, since it might not collect anything,
and just save the highest score, or compute some facets ..

What about something with a *Listener like: DocIdListener, SearchListener,
MatchListener (it listens on search matches).

On Thu, Mar 26, 2009 at 3:48 PM, Marvin Humphrey mar...@rectangular.comwrote:

 On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote:

  do you have an alternative?

 Brainstorming

  * Harvester
  * Trawler
  * HitPicker
  * HitGrabber

 Marvin Humphrey


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:
 I may have lost the context here... but here's what I thought we were
 talking about.

 If we choose the interface option (adding a ProvidesTopDocsResults
 (say) interface), then you would create method
 renderResults(ProvidesTopDocResults).

 Then, any collector implementing that interface could be safely passed
 in, as the upcast is done at compile time not runtime.

 Consider this code snippet:

 HitCollector hc = condition? new TopDocCollector() : TopFieldDocCollector();
 searcher.search(hc);

 The problem is that I need a base class for both collectors. If I use the
 interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o
 casting to HitCollector. If I use HitCollector, then I need to cast it
 before passing it into rederResults(). Only when both class have the same
 base class which is also a HitCollector, I don't need any casting. I.e.,
 after I submit a patch that develops what we've agreed on, the code can look
 like this:

 TopResultsCollector trc = condition ? new TopScoreDocCollector() : new
 TopFieldCollector();
 searcher.search(trc);
 renderResults(trc);

 Here I can pass 'trc' to both methods since it both a HitCollector and a
 TopResultsCollector. That's what I was missing in your proposal.

OK I agree.  And with our proposed changes (TopResultsCollector), you
can do this.

 Shai do you want to take a first cut at making a patch?  Can you open
 an issue?  Thanks.

 I can certainly do that. I think the summary of the steps make sense. I'll
 check if TopScoreDocCollector and TopFieldCollector can also extend that
 you provide your own pqueue and we collect top N according to it
 collector, passing a null PQ and extending topDocs().

OK, thanks.

 I also would like to propose an additional method to topDocs(), topDocs(int
 start, int howMany) which will be more efficient to call in case of paging
 through results. The reason is that topDocs() pops everything from the PQ,
 then allocates a ScoreDoc[] array of size of number of results and returns a
 TopDocs object. You can then choose just the ones you need.
 On the other hand, topDocs(start, howMany) would allocate exactly the size
 of array needed. E.g., in case someone pages through results 91-100, you
 allocate an array of size 10, instead of 100.
 It is not a huge improvement, but it does save some allocations, as well as
 it's a convenient method.

Though... this is somewhat tricky to implement efficiently when using
pqueue: you pop the worst scoring hit first, then next worst scoring,
etc, into an array (in reverse order).  It would be conceivable to do
a [separate] partial sort of the queue to more efficiently retrieve a
top subset of N to save some time on the extraction.  But my guess is
extraction time is trivial; I don't think we need to optimize it.

That being said, we could make the API like this, but under the hood
simply do what we do today the first time it's called, leaving as a
future optimization to speed it up.

Alternatively we could make a ScoreDoc result(int n) to retrieve each
result one by one... or maybe doc(int n) and score(int n) since some
collectors won't score (but, then we'd need to handle FieldDoc, which
is used to more generically return the sort field values for each
result).

 BTW, I like the name ResultsCollector, as it's just like HitCollector, but
 does not commit too much to hits .. i.e., facets aren't hits ... I think?

I like it too!

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:
 BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector.

Weird.  Probably we could put that back to extending [deprecated]
TopDocCollector?

 That is a problem if we want to make TSDC final. Now, TFDC is marked
 deprecated, so it will be removed in the future.

Right.

 I think an easy fix is just to have TFDC extend TopResultsCollector, right?

Or, back to the way it was pre-1483 (extend TopDocCollector).

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
In 1483 Doug had also suggested:

  * Hitable

I suppose Collector shouldn't really be in the name, since the class
may not actually collect the results (eg if it simply counts).

Mike

Marvin Humphrey mar...@rectangular.com wrote:
 On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote:

 do you have an alternative?

 Brainstorming

  * Harvester
  * Trawler
  * HitPicker
  * HitGrabber

 Marvin Humphrey


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Marvin Humphrey
On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote:
 I still think that ResultsCollector does what you describe. It simply
 collects results, while the word 'result' is quite *open* and does not
 commit to anything ...

Another advantage of ResultsCollector is that the name suggests good
self-documenting subclass names and variable names.  For instance, it's
reasonably clear what a BitSetCollector or a TopDocsCollector might do.
And when there's only one var around, the name collector is an obvious
choice no matter what the class.  This is all possible because there's no
other use of Collector within Lucene.

I just think ResultsCollector is less euphonic and zippy than
HitCollector, so it's worth exploring alternatives.  

 How about dropping the word Collector, since it might not collect anything,
 and just save the highest score, or compute some facets ..

Sure, wiping the slate clean and re-examining HitCollector's actual purpose
and usage to discover new names is a good approach.  Similar thinking went
into Hitable and Harvester.

FWIW, I'd have to disagree that HitCollector doesn't collect anything.  It may
not collect hits per se, but it's definitely iterating over hits (in the
sense of successful matches) and with only rare exceptions, it's collecting
*something*.

 What about something with a *Listener like: DocIdListener, SearchListener,
 MatchListener (it listens on search matches).

Considering how we attach HITCOLLECTORTHINGY onto the matching process is a
novel take and clarifying to see.  However, maybe it's just me, but *Listener
evokes the JavaScript EventListener stuff, which seems radically different.
Also, if I saw a listener variable in scoring loop code, or a
TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it
would be doing what a HitCollector does right now.

Cheers,

Marvin Humphrey


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Michael McCandless
On Thu, Mar 26, 2009 at 5:28 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote:
 I still think that ResultsCollector does what you describe. It simply
 collects results, while the word 'result' is quite *open* and does not
 commit to anything ...

 Another advantage of ResultsCollector is that the name suggests good
 self-documenting subclass names and variable names.  For instance, it's
 reasonably clear what a BitSetCollector or a TopDocsCollector might do.
 And when there's only one var around, the name collector is an obvious
 choice no matter what the class.  This is all possible because there's no
 other use of Collector within Lucene.

I think ResultsCollector (or maybe ResultCollector) is my favorite so far...

But how about simply Collector?  (I realize it's very generic... but
we don't collect anything else in Lucene?).

 What about something with a *Listener like: DocIdListener, SearchListener,
 MatchListener (it listens on search matches).

 Considering how we attach HITCOLLECTORTHINGY onto the matching process is a
 novel take and clarifying to see.  However, maybe it's just me, but *Listener
 evokes the JavaScript EventListener stuff, which seems radically different.
 Also, if I saw a listener variable in scoring loop code, or a
 TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it
 would be doing what a HitCollector does right now.

Yeah I'm not really a fan of Listener either.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Marvin Humphrey
On Thu, Mar 26, 2009 at 06:03:07PM -0400, Michael McCandless wrote:

 But how about simply Collector?  (I realize it's very generic... but
 we don't collect anything else in Lucene?).

+1

Honorable mention to NitPicker, LOL.

Marvin Humphrey


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Earwin Burrfoot
 I think ResultsCollector (or maybe ResultCollector) is my favorite so far...

 But how about simply Collector?  (I realize it's very generic... but
 we don't collect anything else in Lucene?).
That's exactly what I'm using in my app - abstract class Collector
extends HitCollector, that serves as a base for all my custom
collectors :)
So, yeah, I like this name.

 Yeah I'm not really a fan of Listener either.
+1

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-26 Thread Shai Erera
I really liked HItCollector and hate to give up the name ... However
Collector is fine with me either, and it is at least more generic than
HitCollector ...

Hitable sounds too aggressive/violent to me :)

BTW, I guess I should create some new searcher API which receives this
Collector class (is Collector the chosen name?) and deprecate those who
accept HitCollector?
Those can also skip the instanceof check, and wrapping of HC to MRHC ...

That also means that I should throw that MRHC wrapper (which rebases doc
Ids)? If HitCollector is deprecated, then there's no need to keep it. But
perhaps we want it there in 2.9 for easier migration? Personally I think
it's redundant since in 3.0 people will need to change all their collectors
anyway (since HitCollector will be removed, and every class which extends
HitCollector will need be modified). What do you think?

Also, there's no need to deprecate MRHC, since it's only in the trunk - I
can simply rename it, right?

Ok I'll go ahead and prepare a patch. We can discuss the name more, at the
end it will just be a short refactor action in Eclipse, so that shouldn't
hold us (or me) up.

Shai

On Fri, Mar 27, 2009 at 1:24 AM, Earwin Burrfoot ear...@gmail.com wrote:

  I think ResultsCollector (or maybe ResultCollector) is my favorite so
 far...
 
  But how about simply Collector?  (I realize it's very generic... but
  we don't collect anything else in Lucene?).
 That's exactly what I'm using in my app - abstract class Collector
 extends HitCollector, that serves as a base for all my custom
 collectors :)
 So, yeah, I like this name.

  Yeah I'm not really a fan of Listener either.
 +1

 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-25 Thread Michael McCandless
This stuff is surprisingly hard to think about!

 Actually, it was Nadav who first proposed the read interface, to
 solve the there's no common way for reading its output problem.
 With an interface (say TopDocsOutput), then you could have some
 method somewhere:

  renderResults(TopDocsOutput results)

 and then any collector, independent of how it *collects* results,
 could implement TopDocsOutput if appropriate.

 You'd still need to cast the collector to TopDocsOutput, won't you?
 How's that different than the code snippet I showed above?

The difference is for the new code, it's an upcast, which catches any
errors at compile time, not run time.  The compiler determines that
the class implements the required interface.

 The current situation introduces a bug, that's true. However, unless
 something better pops up, shouldn't we just make it final?

But that leaves no way forward for current users subclassing
TopDocCollector (for the freedom of providing your own pqueue).

 May I suggest something else? What if MRHC was actually an
 interface?

I think interface is too dangerous in this case (the future back
compatibility problem).  EG here we are wanting to explore a way to
not pre-compute the score; had we released MRHC as an interface we'd
be in trouble.  (We may still be in trouble, anyway!).

 Would TopDocsCollector subclass HitCollector or
 MultiReaderHitCollector?

 Well ... we've been there as well already :). I don't think there's
 an easy answer here. I guess if MRHC is the better approach, and we
 think all Top***DocCollector would want to have the MRHC
 functionality, then I'd say let's extend MRHC. Otherwise, I don't
 have a good answer. When I started this thread, I only knew of
 HitCollector, so things were simpler at the time.

We have challenging goals here:

  * The collect top N by score collector should be final, use
ScorerDocQueue, specialized to sorting by score/docID: performance
is important.

  * Likewise for the collect top N by sorted field collector, though
it does provide extensibility by letting you make a custom
comparator (FieldComparatorSource).  Ideally this'd allow with and
without computing score (it does not today).

  * A top N by my own pqueue collector (this is what
TopDocCollector/TopScoreDocsCollector allow today, but it has the
bug).

  * Allow fully custom collection, with and without score.

Maybe we should in fact simply deprecate HitCollector (in favor of
MultiReaderHitCollector)?  After all, making your own HitCollector is
an advanced thing; expecting you to properly implement setNextReader
may be fine.

And then we can subclass MultiReaderHitCollector to TopDocsCollector
(which adds the totalHits/topDocs results delivery API).

And then the collect top docs by score, and collect top docs by
fields collectors subclass TopDocsCollector?

Finally, we add a collect top docs according to my own pqueue
class.

Then we wouldn't need an interface; this works because all core
collectors deliver top N results in the end.

All that's missing is a way to NOT compute score if it's not needed.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-25 Thread Shai Erera
Ok so I feel we're coming to an end. Few comments though:

The difference is for the new code, it's an upcast, which catches any
 errors at compile time, not run time.  The compiler determines that
 the class implements the required interface.


I still don't understand how a compiler can detect at compilation time that
a HitCollector instance that is used in method A, and is casted to a
TopDocsOutput instance by calling method B (from A) is actually ok ... I may
be missing some Java information here, but I simply don't see how that
happens at compilation time instead of at runtime ...

But that leaves no way forward for current users subclassing
 TopDocCollector (for the freedom of providing your own pqueue).


TopDocCollector is actually marked deprecated. So nobody should continue
subclassing it, as it will be removed in 3.0 no? Perhaps you mean
TopScoreDocCollector? If yes, then I think nobody should subclass it (like
you write at the end of the email).

Maybe we should in fact simply deprecate HitCollector (in favor of
 MultiReaderHitCollector)?  After all, making your own HitCollector is
 an advanced thing; expecting you to properly implement setNextReader
 may be fine.


I personally don't understand why MRHC was invented in the first place.
What's wrong with adding a setNextReader to HitCollector with an empty
implementation which does nothing? That doesn't break back-compat, and
everyone can still override that method and do whatever they need.

Notice also that Lucene *lies* to its users about the existence of HC and
MRHC. While I'm passing HC to IndexSearcher, I believe I noticed a code
segment like:

MultiReaderHitCollector mrhc;
if (c instanceof MRHC) {
   mrhc = (MRHC) c;
else {
   mrhc = new MRHC(c);
}

So in fact the internal Lucene code expects only MRHC from a certain point,
and so even if I wrote a HC and passed it on Searcher, it's still converted
to MRHC, with an empty setNextReader() method implementation. That's why I
meant that HC is already deprecated, whether it's marked as deprecated or
not.

What you say about deprecating HC to me is unnecessary. Simply pull
setNextReader up with an empty implementation, get rid of all the
instanceof, casting and wrapping code and you're fine. Nothing is broken.
All works well and better (instanceof, casting and wrapping have their
overheads).
Isn't that right?

I'm all for your suggestions, I just think we can achieve the same thing w/
HC. We can also not deprecate TopDocCollector and remove
TopScoreDocCollector (merging implementations so that TopDocCollector is
more efficient), as well as introduce that in between TopDocsOutputCollector
(or a better name) with the topDocs() and getTotalHits() methods ...

Regarding interfaces .. I don't think they're that bad. Perhaps a different
viewing angle might help. Lucene processes a query and fires events. Today
it fires an event every time a doc's score has been computed, and recently
every time it starts to process a different reader. HitCollector is a
listener implementation on the doc-score event, while MRHC is a listener
on both.
To me, interfaces play just nicely here. Assume that you have the following
interfaces:
- DocScoreEvent
- ChangeReaderEvent
- EndProcessingEvent (thrown when the query has finished processing -
perhaps to aid collectors to free resources)
- any other events you foresee?
The Lucene code receives a HitCollector which listens on all events. In the
future, Lucene might throw other events, but still get a HitCollector. Those
methods will check for instanceof, and you as a user will know that if you
want to catch those events, you pass in a collector implementation which
does. Those events cannot of course be main-stream events (like
DocScoreEvent), but new ones, perhaps even experimental.
Since HitCollector is a concrete class, we can always add new interfaces to
it in the future with empty implementations?

Anyway, this is just a thought. I wouldn't want to deviate the discussion
we're having.

So how about what I propose:

   1. Pull setNextReader up to HitCollector, w/ empty implementation (easy
   change).
   2. Get rid of MRHC and change the classes which extend it (easy change).
   3. Get rid of all the instanceof checks (easy change - the compiler will
   warn you right away since MRHC is deleted).
   4. Create a new TopDocOutputCollector which allows you to set PQ and
   implements topDocs() and getTotalHits().
   5. Un-deprecate TopDocCollector, merge its implementation with
   TopScoreDocCollector and make it final. Override topDocs() and getTotalHits
   (if necessary).

Did I miss anything?

Shai

On Wed, Mar 25, 2009 at 3:38 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 This stuff is surprisingly hard to think about!

  Actually, it was Nadav who first proposed the read interface, to
  solve the there's no common way for reading its output problem.
  With an interface (say TopDocsOutput), then you could have some
  method somewhere:
 
   

Re: Is TopDocCollector's collect() implementation correct?

2009-03-25 Thread Shai Erera
Just minor correction - we can't make TopDocCollector final, as it will
break back-compat. So we keep it deprecated, keep TopScoreDocCollector and
make it extend that TopDocOutputCollector.

On Wed, Mar 25, 2009 at 8:55 PM, Shai Erera ser...@gmail.com wrote:

 Ok so I feel we're coming to an end. Few comments though:

 The difference is for the new code, it's an upcast, which catches any
 errors at compile time, not run time.  The compiler determines that
 the class implements the required interface.


 I still don't understand how a compiler can detect at compilation time that
 a HitCollector instance that is used in method A, and is casted to a
 TopDocsOutput instance by calling method B (from A) is actually ok ... I may
 be missing some Java information here, but I simply don't see how that
 happens at compilation time instead of at runtime ...

 But that leaves no way forward for current users subclassing
 TopDocCollector (for the freedom of providing your own pqueue).


 TopDocCollector is actually marked deprecated. So nobody should continue
 subclassing it, as it will be removed in 3.0 no? Perhaps you mean
 TopScoreDocCollector? If yes, then I think nobody should subclass it (like
 you write at the end of the email).

 Maybe we should in fact simply deprecate HitCollector (in favor of
 MultiReaderHitCollector)?  After all, making your own HitCollector is
 an advanced thing; expecting you to properly implement setNextReader
 may be fine.


 I personally don't understand why MRHC was invented in the first place.
 What's wrong with adding a setNextReader to HitCollector with an empty
 implementation which does nothing? That doesn't break back-compat, and
 everyone can still override that method and do whatever they need.

 Notice also that Lucene *lies* to its users about the existence of HC and
 MRHC. While I'm passing HC to IndexSearcher, I believe I noticed a code
 segment like:

 MultiReaderHitCollector mrhc;
 if (c instanceof MRHC) {
mrhc = (MRHC) c;
 else {
mrhc = new MRHC(c);
 }

 So in fact the internal Lucene code expects only MRHC from a certain point,
 and so even if I wrote a HC and passed it on Searcher, it's still converted
 to MRHC, with an empty setNextReader() method implementation. That's why I
 meant that HC is already deprecated, whether it's marked as deprecated or
 not.

 What you say about deprecating HC to me is unnecessary. Simply pull
 setNextReader up with an empty implementation, get rid of all the
 instanceof, casting and wrapping code and you're fine. Nothing is broken.
 All works well and better (instanceof, casting and wrapping have their
 overheads).
 Isn't that right?

 I'm all for your suggestions, I just think we can achieve the same thing w/
 HC. We can also not deprecate TopDocCollector and remove
 TopScoreDocCollector (merging implementations so that TopDocCollector is
 more efficient), as well as introduce that in between TopDocsOutputCollector
 (or a better name) with the topDocs() and getTotalHits() methods ...

 Regarding interfaces .. I don't think they're that bad. Perhaps a different
 viewing angle might help. Lucene processes a query and fires events. Today
 it fires an event every time a doc's score has been computed, and recently
 every time it starts to process a different reader. HitCollector is a
 listener implementation on the doc-score event, while MRHC is a listener
 on both.
 To me, interfaces play just nicely here. Assume that you have the following
 interfaces:
 - DocScoreEvent
 - ChangeReaderEvent
 - EndProcessingEvent (thrown when the query has finished processing -
 perhaps to aid collectors to free resources)
 - any other events you foresee?
 The Lucene code receives a HitCollector which listens on all events. In the
 future, Lucene might throw other events, but still get a HitCollector. Those
 methods will check for instanceof, and you as a user will know that if you
 want to catch those events, you pass in a collector implementation which
 does. Those events cannot of course be main-stream events (like
 DocScoreEvent), but new ones, perhaps even experimental.
 Since HitCollector is a concrete class, we can always add new interfaces to
 it in the future with empty implementations?

 Anyway, this is just a thought. I wouldn't want to deviate the discussion
 we're having.

 So how about what I propose:

1. Pull setNextReader up to HitCollector, w/ empty implementation (easy
change).
2. Get rid of MRHC and change the classes which extend it (easy
change).
3. Get rid of all the instanceof checks (easy change - the compiler
will warn you right away since MRHC is deleted).
4. Create a new TopDocOutputCollector which allows you to set PQ and
implements topDocs() and getTotalHits().
5. Un-deprecate TopDocCollector, merge its implementation with
TopScoreDocCollector and make it final. Override topDocs() and getTotalHits
(if necessary).

 Did I miss anything?

 Shai


 On Wed, Mar 25, 2009 at 3:38 

Re: Is TopDocCollector's collect() implementation correct?

2009-03-25 Thread Mark Miller

bq. I personally don't understand why MRHC was invented in the first place.

The evolution of MRHC is in the comments of LUCENE-1483 - a lot of 
comments to wade through though.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-25 Thread Shai Erera
Ok I didn't phrase it well ... what I meant to say is I don't understand why
you had to invent a new class MRHC instead of just coding that logic into
HC, with an empty implementation like I suggest. I'm not saying this is
something that should have been thought of, but that MRHC in and on itself
does not do anything, besides declaring an abstract method.
And since the Lucene code has moved internally to use MRHC, I think that my
suggestion to pull that method up to HC w/ an empty implementation is a
reasonable one, and it simplifies things a lot.

Because now, like Mike asked somewhere at the beginning of this thread, one
does not know which collector to extend - HC or MRHC. The decisions is even
tougher because I believe MRHC does not solve anything for the regular
TopDocCollector (original) case? Meaning, isn't Lucene already iterating
over the segments one by one? Isn't MRHC good mostly for TopFieldCollector?

Don't get me wrong, I think that MRHC makes sense for someone who wants to
write a HitCollector and take advantage of knowing when a reader changes /
moving to a new segment. But asking everyone to choose between HC and MRHC
is a bit too much IMO. If HC would have had this setNextReader() with empty
implementation, all that someone had to choose from is whether to override
that method or not. Now he needs to choose which HC implementation to
extend, which is at the core of this thread's problems.

On Wed, Mar 25, 2009 at 10:14 PM, Mark Miller markrmil...@gmail.com wrote:

 bq. I personally don't understand why MRHC was invented in the first place.

 The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments
 to wade through though.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-24 Thread Shai Erera
I agree about the unnecessary method call - we should make a collector's
implementation as efficient as possible.

One concern I have about the direction you'd like a HitCollector to evolve
to is that if a collector will need to ask for the document's score instead
of retrieving it, we might face other performance problems. I agree that not
all collectors need it, and therefore there's no need to compute the score
if no one will use it.
But what about cases like collectors chaining, extensions and running w/
several collectors? If each collector will need to request for the
document's score, it might be computed over and over again. Consider a case
for example, of a TopScoreDocCollector, running together w/ another
collector that extracts information about the document, and uses the score
to compute something (easy to write a collector which delegates the
collect() call to each of them). Today, I could just call collect(doc,
score) on each collector. But in the proposed way, I'd call collect(doc) and
then each of them will need to request the score.

Perhaps we can introduce a collect(doc) on HitCollector which does not
accept score, but keep the other collect? I am not sure if that's any
better, because then the Lucene search code would need to decide to which
collect method to call ...

Regarding the TopDocs interface (we should have a better name as TopDocs is
already taken), my point was that just introducing an interface will not
solve the problem. If HitCollector was an interface then yes, that would
have solved the problem, as we could have introduced a new interface
TopDocsCollector that extends HitCollector (the interface) and expose a
topDocs() and getTotalHits() methods. But since HitCollector is an abstract
class, this will not solve the problem.
The reason is that IndexSearcher accepts a HitCollector. So the code I'd
need to write is:

HitCollector hc = new TopScoreDocCollector();
searcher.search(query, hc);
TopDocs td = ((TopDocsCollector) hc).topDocs();

That case looks very strange and is completely not safe, as someone could
change TopScoreDocCollector to not implement the TopDocsCollector interface,
and I'd discover that only at runtime.

Instead, I could have written:
TopDocsCollctor hc = new TopScoreDocCollector();
searcher.search(query, hc);
TopDocs td = hc.topDocs();
If someone changes TSDC to not implement TDC, I'd get a compilation error
and no runtime exceptions will be thrown.

What I also wanted to achieve by introducing a TopDocsCollector abstract
class is the shared implementation of topDocs() and getTotalHits(). All of
the Top***DocCollector extensions can just implement collect() and use the
base implementation of topDocs(). If there's a collector that needs a
different implementation, like TopFieldCollector, it can override it.

Is my purpose better explained now?

Shai

On Mon, Mar 23, 2009 at 8:56 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Shai Erera ser...@gmail.com wrote:

  As a side comment, why not add setNextReader to HitCollector and
  then a getDocId(int doc) method which will do the doc + base
  arithmetic?

 One problem is this breaks back compatibility on any current
 subclasses of HitCollector.

 Another problem is: not all collectors would need to add the base on
 each doc.  EG a collector that puts hits into separate pqueues per
 segment could defer the addition until the end when only the top
 results are pulled out of each pqueue.

 Also, I am concerned about the method call overhead.  This is the
 absolute ultimate hot spot for Lucene and we should worry about
 causing even a single added instruction in this path.

 That said... I would like to [eventually] change the collection API
 along the lines of what Marvin proposed for Matcher in Lucy, here:

  http://markmail.org/message/jxshhiqr6wvq77xu

 Specifically, I think it should be the collector's job to ask for the
 score for this doc, rather than Lucene's job to pre-compute it, so
 that collectors that don't need the score won't waste CPU.  EG, if you
 are sorting by field (and don't present the relevance score) you
 shouldn't compute it.

 Then, we could add other somewhat expensive things you might
 retrieve, such as a way to ask which terms participated in the match
 (discussed today on java-user), and/or all term positions that
 participated (discussed in LUCENE-1522).  EG, a top doc collector
 could choose to call these methods only when the doc was competitive.

  Anyway, I don't want to add topDocs and getTotalHits to
  HitCollector, it will destroy its generic purpose.

 I agree.

  An interface is also problematic, as it just means all of these
  collectors have these methods declared, but they need to implement
  them. An abstract class grants you w/ both.

 I'm confused on this objection -- only collectors that do let you ask
 for the top N set of docs would implement this interface?  (Ie it'd
 only be the TopXXXCollector's that'd implement the interface).  While
 interfaces 

Re: Is TopDocCollector's collect() implementation correct?

2009-03-24 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:

 But what about cases like collectors chaining, extensions and
 running w/ several collectors? If each collector will need to
 request for the document's score, it might be computed over and over
 again. Consider a case for example, of a TopScoreDocCollector,
 running together w/ another collector that extracts information
 about the document, and uses the score to compute something (easy to
 write a collector which delegates the collect() call to each of
 them). Today, I could just call collect(doc, score) on each
 collector. But in the proposed way, I'd call collect(doc) and then
 each of them will need to request the score.

Yeah good point... we'd somehow need such chaining collectors to
simply return the score if getScore() had already been called?

 Perhaps we can introduce a collect(doc) on HitCollector which does not
 accept score, but keep the other collect? I am not sure if that's any
 better, because then the Lucene search code would need to decide to which
 collect method to call ...

Yeah not sure I like that... and it doesn't scale up well (to other
getXXX's that we may want to make optionally available from Scorer).

 Regarding the TopDocs interface (we should have a better name as
 TopDocs is already taken), my point was that just introducing an
 interface will not solve the problem. If HitCollector was an
 interface then yes, that would have solved the problem, as we could
 have introduced a new interface TopDocsCollector that extends
 HitCollector (the interface) and expose a topDocs() and
 getTotalHits() methods. But since HitCollector is an abstract class,
 this will not solve the problem.  The reason is that IndexSearcher
 accepts a HitCollector.

Actually, it was Nadav who first proposed the read interface, to
solve the there's no common way for reading its output problem.
With an interface (say TopDocsOutput), then you could have some method
somewhere:

  renderResults(TopDocsOutput results)

and then any collector, independent of how it *collects* results,
could implement TopDocsOutput if appropriate.

 So the code I'd need to write is:

 HitCollector hc = new TopScoreDocCollector();
 searcher.search(query, hc);
 TopDocs td = ((TopDocsCollector) hc).topDocs();

 That case looks very strange and is completely not safe, as someone
 could change TopScoreDocCollector to not implement the
 TopDocsCollector interface, and I'd discover that only at runtime.

I agree one should rely as little as possible on runtime casting,
though we wouldn't make such a change to TopScoreDocCollector (it
breaks back compat).

 Instead, I could have written:

 TopDocsCollctor hc = new TopScoreDocCollector();
 searcher.search(query, hc);
 TopDocs td = hc.topDocs();

 If someone changes TSDC to not implement TDC, I'd get a compilation
 error and no runtime exceptions will be thrown.

I think we should aim to separate collection, which is how Lucene
delivers hits to the collector you passed in, from results delivery,
which is how your code talks to the collector to read back the
results.

 What I also wanted to achieve by introducing a TopDocsCollector
 abstract class is the shared implementation of topDocs() and
 getTotalHits(). All of the Top***DocCollector extensions can just
 implement collect() and use the base implementation of topDocs(). If
 there's a collector that needs a different implementation, like
 TopFieldCollector, it can override it.

Would TopDocsCollector subclass HitCollector or
MultiReaderHitCollector?

 Is my purpose better explained now?

Well... I understand the problem: we have a real bug.  If you subclass
TopDocColector or TopScoreDocCollector, and don't do your sorting by
score, then collect() does the wrong thing.  So we need to fix that.

We've gone through various options for fixing it, all of which have
various drawbacks (breaks back compat, possible performance hit,
etc.), so we're still trying to converge on the lesser evil.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-24 Thread Marvin Humphrey
On Tue, Mar 24, 2009 at 02:47:07PM +0200, Shai Erera wrote:

 I agree about the unnecessary method call - we should make a collector's
 implementation as efficient as possible.

Maybe it makes sense to just bite the bullet and duplicate the unrolled code?

There's precedent: ScorerDocQueue is not a subclass of PriorityQueue.

 But what about cases like collectors chaining, extensions and running w/
 several collectors? If each collector will need to request for the
 document's score, it might be computed over and over again. Consider a case
 for example, of a TopScoreDocCollector, running together w/ another
 collector that extracts information about the document, and uses the score
 to compute something (easy to write a collector which delegates the
 collect() call to each of them). Today, I could just call collect(doc,
 score) on each collector. But in the proposed way, I'd call collect(doc) and
 then each of them will need to request the score.

In such a case, perhaps it would be possible to supply a trivial Scorer
wrapper subclass which caches a score.  Then you still have the overhead of
the method calls, but not the overhead of calculating the score.

That's not ideal, but I think the case of matching-without-scoring is more
important to optimize for.

 Perhaps we can introduce a collect(doc) on HitCollector which does not
 accept score, but keep the other collect? I am not sure if that's any
 better, because then the Lucene search code would need to decide to which
 collect method to call ...

Also, passing arguments is dirt cheap.  A HitCollector that only cares about
adding doc nums to a BitVector can just ignore the second argument.

 Regarding the TopDocs interface (we should have a better name as TopDocs is
 already taken), 

Winners.

Marvin Humphrey


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-24 Thread Shai Erera

 Would TopDocsCollector subclass HitCollector or
 MultiReaderHitCollector?


Well ... we've been there as well already :). I don't think there's an easy
answer here. I guess if MRHC is the better approach, and we think all
Top***DocCollector would want to have the MRHC functionality, then I'd say
let's extend MRHC. Otherwise, I don't have a good answer. When I started
this thread, I only knew of HitCollector, so things were simpler at the
time.

Well... I understand the problem: we have a real bug.  If you subclass
 TopDocColector or TopScoreDocCollector, and don't do your sorting by
 score, then collect() does the wrong thing.  So we need to fix that.


I actually wanted to avoid fixing it. Hypothetically, if we had such a
TopsDocCollector abstract class, TopScoreDocCollector could have extended
it, and declared final. Its implementation makes a lot of sense if you want
to only provide a top-scoring docs order, which is the most common search
operation. Therefore making it efficient is very important.
The current situation introduces a bug, that's true. However, unless
something better pops up, shouldn't we just make it final? After all, the
class is now called TopScoreDocCollector, which is more aimed at scoring, as
opposed to TopDocCollector, which just declares it collects 'top docs' w/o
committing to score or anything. So why would someone want to extend TSDC to
provide a sort by date for example? The names would not make sense.
And if someone really wants to have a different TSDC implementation, it can
just copy the implementation. There's no real need to extend TSDC. It's a
short, clean and easy to understand implementation.


 Actually, it was Nadav who first proposed the read interface, to
 solve the there's no common way for reading its output problem.
 With an interface (say TopDocsOutput), then you could have some method
 somewhere:

  renderResults(TopDocsOutput results)

 and then any collector, independent of how it *collects* results,
 could implement TopDocsOutput if appropriate.


You'd still need to cast the collector to TopDocsOutput, won't you? How's
that different than the code snippet I showed above?

I think we should aim to separate collection, which is how Lucene
 delivers hits to the collector you passed in, from results delivery,
 which is how your code talks to the collector to read back the
 results.


I thought we already do that .. HitCollector has a simple collect() method,
which is the only one IndexSearcher knows of and cares about. No runtime
code executed in Lucene has to know that the concrete HItCollector class it
deals with has a topDocs() API.
On the other hand, my application does know that, and therefore I can
instantiate a HitCollector which has more methods than just collect(). I
won't use HitCollector, but that base TopDocsOutputCollector (?), pass it in
to Searcher (which knows only of HitCollector) and when it's back, I don't
need to do any casting.
If only HitCollector was an interface  :).

May I suggest something else? What if MRHC was actually an interface? It
doesn't contain any implementation. Just an abstract method. That I think
would have solved anything for us:

   - We'd extend HitCollector with a TopDocsOutputCollector (?) that
   provides topDocs(), getTotalHits() and a protected constructor which accepts
   a PQ.
   - Every implementation which extends MRHC can move to extend HC and
   implement MRHC.
   - Nobody outside the Lucene code should care whether it's a MRHC or HC
   that was instantiated. This is very internal to Lucene. That's why making
   MRHC an interface makes sense - the Lucene code still deals w/ HitCollector,
   and wherever it expected or wanted to utilize MRHC, it can still do that (by
   checking instanceof, as I've seen in some places in the code).
   - Since it's still in the trunk, we shouldn't have any back compat
   issues, right?

The point here is that MRHC is really internal to Lucene, while HitCollector
is not (it's part of the API). Therefore making MRHC an interface should not
make any difference to any search application (it's a property of a
HitCollector implementation), while HitCollector and its sub-classes are.

I have a feeling this might work, but I could be missing something.

Shai

On Tue, Mar 24, 2009 at 7:51 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Shai Erera ser...@gmail.com wrote:

  But what about cases like collectors chaining, extensions and
  running w/ several collectors? If each collector will need to
  request for the document's score, it might be computed over and over
  again. Consider a case for example, of a TopScoreDocCollector,
  running together w/ another collector that extracts information
  about the document, and uses the score to compute something (easy to
  write a collector which delegates the collect() call to each of
  them). Today, I could just call collect(doc, score) on each
  collector. But in the proposed way, I'd call collect(doc) and then
  each 

Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Michael McCandless
 If we're already creating a new TopScoreDocCollector (when was it
 added?  I must have been dozing off while this happened...)

This was LUCENE-1483.

 How about if we introduce an abstract ScoringCollector (about the
 name later) which implements topDocs() and getTotalHits() and there
 will be several implementations of it, such as:
 TopScoreDocCollector, which sorts the documents by their score, in
 descending order only, TopFieldDocCollector - for sorting by fields,
 and additional sort-by collectors.

This sounds good... but the challenge is we also need to get both
HitCollector and MultiReaderHitCollector in there.

HitCollector is the simplest way to create a custom collector.
MultiReaderHitCollector (added with LUCENE-1483) is the more
performant way, since it lets your collector operate per-segment.  All
non-deprecated core collectors in Lucene now subclass
MultiReaderHitCollector.

So would we make separate subclasses for each of them to add
getTotalHits() / topDocs()?  EG TopDocsHitCollector and
TopDocsMultiReaderHitCollector?  It's getting confusing.

Or maybe we just add totalHits() and topDocs() to HitCollector even
though for advanced case (non-top-N-collection) the methods would not
be used?

Or... maybe this is a time when an interface is the lesser evil: we
could make a TopDocs interface that the necessary classes implement?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Shai Erera
ok I missed 1483 completely.

As a side comment, why not add setNextReader to HitCollector and then a
getDocId(int doc) method which will do the doc + base arithmetic? I think
it's very easy for someone to forget to add that (+ base) to doc. You could
then just change TopDocCollector to call getDocId() instead of duplicating
it into TopScoreDocCollector.

Isn't that something you'd want all HitCollector implementations to use? I
consider some extensions of HitCollector we have - we now will probably want
to change them to extend MultiReaderHitCollector, but we'll have to remember
to do that +base arithmatic everywhere, instead of calling getDocId(). I
understand that changing the call to getDocId is the same as adding +
base, from an effort perspective, but I think it's better this way. It does
involve an additional method call, but I wonder how good compilers will
handle that.

Anyway, I don't want to add topDocs and getTotalHits to HitCollector, it
will destroy its generic purpose. An interface is also problematic, as it
just means all of these collectors have these methods declared, but they
need to implement them. An abstract class grants you w/ both.

So in case you agree that the logic of MultiReaderHitCollector can (and
should?) be in HitCollector, we can create an abstract class called
ScoringCollector (or if nobody objects TopDocsCollector) which will
implement these two methods.
In case you disagree, we can have that abstract class extend
MultiReaderHitCollector instead.

I'm in favor of the first option as at least as it looks in the code,
HitCollector is not extended by any class anymore, except TopDocCollector
which is marked as deprecated, and 3 anonymous implementations. So it looks
like HitCollector itself is deprecated as far as the Lucene core code sees
it.

What do you think?

Shai

On Mon, Mar 23, 2009 at 4:43 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

  If we're already creating a new TopScoreDocCollector (when was it
  added?  I must have been dozing off while this happened...)

 This was LUCENE-1483.

  How about if we introduce an abstract ScoringCollector (about the
  name later) which implements topDocs() and getTotalHits() and there
  will be several implementations of it, such as:
  TopScoreDocCollector, which sorts the documents by their score, in
  descending order only, TopFieldDocCollector - for sorting by fields,
  and additional sort-by collectors.

 This sounds good... but the challenge is we also need to get both
 HitCollector and MultiReaderHitCollector in there.

 HitCollector is the simplest way to create a custom collector.
 MultiReaderHitCollector (added with LUCENE-1483) is the more
 performant way, since it lets your collector operate per-segment.  All
 non-deprecated core collectors in Lucene now subclass
 MultiReaderHitCollector.

 So would we make separate subclasses for each of them to add
 getTotalHits() / topDocs()?  EG TopDocsHitCollector and
 TopDocsMultiReaderHitCollector?  It's getting confusing.

 Or maybe we just add totalHits() and topDocs() to HitCollector even
 though for advanced case (non-top-N-collection) the methods would not
 be used?

 Or... maybe this is a time when an interface is the lesser evil: we
 could make a TopDocs interface that the necessary classes implement?

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-23 Thread Michael McCandless
Shai Erera ser...@gmail.com wrote:

 As a side comment, why not add setNextReader to HitCollector and
 then a getDocId(int doc) method which will do the doc + base
 arithmetic?

One problem is this breaks back compatibility on any current
subclasses of HitCollector.

Another problem is: not all collectors would need to add the base on
each doc.  EG a collector that puts hits into separate pqueues per
segment could defer the addition until the end when only the top
results are pulled out of each pqueue.

Also, I am concerned about the method call overhead.  This is the
absolute ultimate hot spot for Lucene and we should worry about
causing even a single added instruction in this path.

That said... I would like to [eventually] change the collection API
along the lines of what Marvin proposed for Matcher in Lucy, here:

  http://markmail.org/message/jxshhiqr6wvq77xu

Specifically, I think it should be the collector's job to ask for the
score for this doc, rather than Lucene's job to pre-compute it, so
that collectors that don't need the score won't waste CPU.  EG, if you
are sorting by field (and don't present the relevance score) you
shouldn't compute it.

Then, we could add other somewhat expensive things you might
retrieve, such as a way to ask which terms participated in the match
(discussed today on java-user), and/or all term positions that
participated (discussed in LUCENE-1522).  EG, a top doc collector
could choose to call these methods only when the doc was competitive.

 Anyway, I don't want to add topDocs and getTotalHits to
 HitCollector, it will destroy its generic purpose.

I agree.

 An interface is also problematic, as it just means all of these
 collectors have these methods declared, but they need to implement
 them. An abstract class grants you w/ both.

I'm confused on this objection -- only collectors that do let you ask
for the top N set of docs would implement this interface?  (Ie it'd
only be the TopXXXCollector's that'd implement the interface).  While
interfaces clearly have the future problem of back-compatibility, this
case may be simple enough to make an exception.

 So it looks like HitCollector itself is deprecated as far as the
 Lucene core code sees it.

I think HitCollector has a purpose, which is to be the simplest way to
make a custom collector.  Ie I think it makes sense to offer a simple
way and a high performance way.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-22 Thread Nadav Har'El
On Sat, Mar 21, 2009, Michael McCandless wrote about Re: Is TopDocCollector's 
collect() implementation correct?:
 
 I think I'd lean towards a third solution: tighten up
 TopScoreDocCollector (make it final, remove ability to change its PQ,
 make things private) and have it focus on high performance collection
 by score.

The problem I see is this: TopDocCollector currently does not implement
any sort of interface for reading its output. This means that if you create
a completely different class implementing, for example, a different
sorting criteria (e.g., sort by date), there will be no base class or
interface that you could use for both of them, to allow changing the sort
criterion easily at run time. On the other hand, with the existing collector,
you can subclass TopDocCollector, and use that as the common base class.

If we're already creating a new TopScoreDocCollector (when was it added?
I must have been dozing off while this happened...) perhaps we can create
a read interface for it (with the getTotalHits() and topDocs() methods), and
have this class implement that interface? Then, indeed, nobody will have
any reason to extend the TopScoreDocCollector class, and it can be final.


-- 
Nadav Har'El|Sunday, Mar 22 2009, 26 Adar 5769
IBM Haifa Research Lab  |-
|Did you sleep well? No, I made a
http://nadav.harel.org.il   |couple of mistakes.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-22 Thread Shai Erera
How about if we introduce an abstract ScoringCollector (about the name
later) which implements topDocs() and getTotalHits() and there will be
several implementations of it, such as: TopScoreDocCollector, which sorts
the documents by their score, in descending order only, TopFieldDocCollector
- for sorting by fields, and additional sort-by collectors.

All these can share the topDocs and getTotalHits methods. Nadav - this is
just like your proposed interface, but I would like to propose an abstract
class which will implement the common functionality. The only non-common
functionality is collect(), and this one will be implemented by subclasses.
That way, all of these can be of the same type, which makes it easier to
write search applications who offer the user to sort results based on other
attributes than just score.
This class can have a protected c'tor which accepts a PQ and nothing more.
It will also make its PQ and totalHits protected.

About the name - TopDocCollector or TopDocsCollector is the perfect name for
this class. But the first one is already taken and the second one will just
confuse users (with the first one). Unless we can decide to make
TopDocCollector abstract n 2.9, instead of just removing it?
Or if you are not happy with ScoringCollector, please provide a better name.

The more I think about it, I realize that my intentions with 1356 were to
make TopDocCollector a superclass for all scoring documents, and sharing its
PQ led to the problem I reported in this thread. Perhaps it was better than
to define that abstract class .. but better later than never.

What do you think?

Shai

On Sun, Mar 22, 2009 at 3:16 PM, Nadav Har'El n...@math.technion.ac.ilwrote:

 On Sat, Mar 21, 2009, Michael McCandless wrote about Re: Is
 TopDocCollector's collect() implementation correct?:
 
  I think I'd lean towards a third solution: tighten up
  TopScoreDocCollector (make it final, remove ability to change its PQ,
  make things private) and have it focus on high performance collection
  by score.

 The problem I see is this: TopDocCollector currently does not implement
 any sort of interface for reading its output. This means that if you create
 a completely different class implementing, for example, a different
 sorting criteria (e.g., sort by date), there will be no base class or
 interface that you could use for both of them, to allow changing the sort
 criterion easily at run time. On the other hand, with the existing
 collector,
 you can subclass TopDocCollector, and use that as the common base class.

 If we're already creating a new TopScoreDocCollector (when was it added?
 I must have been dozing off while this happened...) perhaps we can create
 a read interface for it (with the getTotalHits() and topDocs() methods),
 and
 have this class implement that interface? Then, indeed, nobody will have
 any reason to extend the TopScoreDocCollector class, and it can be final.


 --
 Nadav Har'El|Sunday, Mar 22 2009, 26 Adar
 5769
 IBM Haifa Research Lab
  |-
|Did you sleep well? No, I made a
 http://nadav.harel.org.il   |couple of mistakes.

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Is TopDocCollector's collect() implementation correct?

2009-03-21 Thread Michael McCandless


I think I'd lean towards a third solution: tighten up
TopScoreDocCollector (make it final, remove ability to change its PQ,
make things private) and have it focus on high performance collection
by score.  This is the default collector for Lucene searches so I think
keeping it high performance (not adding additional method calls nor
boolean checks) is important.  You choose this for performance

Then introduce a new hit collector that makes subclassing / changing
the PQ easier. You choose this one for flexibility.

And 2.9 is a great chance to do this since we've deprecated the old
class (TopDocCollector) in favor of TopScoreDocCollector.

Mike

Shai Erera wrote:


Thanks Chris/Hoss (not sure who sent the original reply).

I don't like calling pq.lessThan, as pq.insert and  
pq.insertWithOverflow call it anyway internally and since it would  
add a method call (something that was tried to be avoided in the  
current implementation), I prefer the code I proposed below.


BTW, I introduced 1356, so I take full responsibility on this  
overlooking. The main reason for 1356 was to allow creating  
extensions of TopDocCollector so they can be of the same type, and  
share the topDocs() and totalHIts() implementations.


I can file an issue. Any other comments?

Shai

On Sat, Mar 21, 2009 at 3:48 AM, Chris Hostetter hossman_luc...@fucit.org 
 wrote:


(resending msg from earlier today during @apache mail outage -- i  
didn't get a copy from the list, so i'm assuming no one did)



-- Forwarded message --
Date: Fri, 20 Mar 2009 15:29:13 -0700 (PDT)

: TopDocCollector's (TDC) implementation of collect() seems a bit  
problematic

: to me.

This code isn't an area i'm very familiar with, but your assessment  
seems

correct ... it looks like when LUCENE-1356 introduced the ability to
provide a PriorityQueue to the constructor, the existing  
optimization when

the score was obvoiusly too low was overlooked.

It looks like this same bug got propogated to TopScoreDocCollector
when it was introduced as well.

: Introduce in TDC a private boolean which signals whether the  
default PQ is


: used or not. If it's not used, don't do the 'else if' at all. If  
it is used,

: then the 'else if' is safe. Then code could look like:

my vote would just be to change the = comarison to a hq.lessThan  
call
... but i can understand how your proposal might be more efficient  
-- I'll
let the performance experts fight it out ... but i definitely think  
you

should fil a bug.



-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-20 Thread Chris Hostetter


(resending msg from earlier today during @apache mail outage -- i didn't 
get a copy from the list, so i'm assuming no one did)


-- Forwarded message --
Date: Fri, 20 Mar 2009 15:29:13 -0700 (PDT)

: TopDocCollector's (TDC) implementation of collect() seems a bit problematic
: to me.

This code isn't an area i'm very familiar with, but your assessment seems
correct ... it looks like when LUCENE-1356 introduced the ability to
provide a PriorityQueue to the constructor, the existing optimization when
the score was obvoiusly too low was overlooked.

It looks like this same bug got propogated to TopScoreDocCollector
when it was introduced as well.

: Introduce in TDC a private boolean which signals whether the default PQ is
: used or not. If it's not used, don't do the 'else if' at all. If it is used,
: then the 'else if' is safe. Then code could look like:

my vote would just be to change the = comarison to a hq.lessThan call
... but i can understand how your proposal might be more efficient -- I'll
let the performance experts fight it out ... but i definitely think you
should fil a bug.



-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Is TopDocCollector's collect() implementation correct?

2009-03-20 Thread Shai Erera
Thanks Chris/Hoss (not sure who sent the original reply).

I don't like calling pq.lessThan, as pq.insert and pq.insertWithOverflow
call it anyway internally and since it would add a method call (something
that was tried to be avoided in the current implementation), I prefer the
code I proposed below.

BTW, I introduced 1356, so I take full responsibility on this overlooking.
The main reason for 1356 was to allow creating extensions of TopDocCollector
so they can be of the same type, and share the topDocs() and totalHIts()
implementations.

I can file an issue. Any other comments?

Shai

On Sat, Mar 21, 2009 at 3:48 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 (resending msg from earlier today during @apache mail outage -- i didn't
 get a copy from the list, so i'm assuming no one did)

 -- Forwarded message --
 Date: Fri, 20 Mar 2009 15:29:13 -0700 (PDT)

 : TopDocCollector's (TDC) implementation of collect() seems a bit
 problematic
 : to me.

 This code isn't an area i'm very familiar with, but your assessment seems
 correct ... it looks like when LUCENE-1356 introduced the ability to
 provide a PriorityQueue to the constructor, the existing optimization when
 the score was obvoiusly too low was overlooked.

 It looks like this same bug got propogated to TopScoreDocCollector
 when it was introduced as well.

 : Introduce in TDC a private boolean which signals whether the default PQ
 is
 : used or not. If it's not used, don't do the 'else if' at all. If it is
 used,
 : then the 'else if' is safe. Then code could look like:

 my vote would just be to change the = comarison to a hq.lessThan call
 ... but i can understand how your proposal might be more efficient -- I'll
 let the performance experts fight it out ... but i definitely think you
 should fil a bug.



 -Hoss


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Is TopDocCollector's collect() implementation correct?

2009-03-17 Thread Shai Erera
Hi

TopDocCollector's (TDC) implementation of collect() seems a bit problematic
to me.

  if (reusableSD == null) {
reusableSD = new ScoreDoc(doc, score);
  } else if (score = reusableSD.score) {
// reusableSD holds the last rejected entry, so, if
// this new score is not better than that, there's no
// need to try inserting it
reusableSD.doc = doc;
reusableSD.score = score;
  } else {
return;
  }
  reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);

The first 'if' is safe, as it assumes that if reusableSD is null, then there
is more room in PQ.

The second (else if) is a bit problematic. It assumes that if the given
score is less than the latest item rejected from PQ, then it should not be
added. The problem I see here is if someone extends TDC and provides his own
PQ, which for example, sorts based on the lowest scoring documents, or just
by document IDs. TDC has a protected constructor which allows that. In that
case, comparing the current score to the latest rejected ScoreDoc is wrong.
Also, PQ itself will check that (using lessThan) before attempting to call
the insert itself.

On the other hand, this check saves two method calls (PQ.insert and
PQ.lessThan) in case someone wants to sort by top ranking documents, which
is the default behavior.

Of course we could say to that person who wants to sort by docIDs, to extend
TDC and override collect() itself, but I would like to propose an
alternative, which will allow extending TDC by providing a different PQ,
while still using its collect() method.

Introduce in TDC a private boolean which signals whether the default PQ is
used or not. If it's not used, don't do the 'else if' at all. If it is used,
then the 'else if' is safe. Then code could look like:

  if (reusableSD == null) {
reusableSD = new ScoreDoc(doc, score);
  } else if (useDefaultPQ){
if (score = reusableSD.score) {
  // reusableSD holds the last rejected entry, so, if
  // this new score is not better than that, there's no
  // need to try inserting it
  reusableSD.doc = doc;
  reusableSD.score = score;
} else {
  return;
}
  } else {
 reusableSD.doc = doc;
 reusableSD.score = score;
  }

  reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);

I don't think that adding useDefaultPQ will have any performance
implications, but it does add one extra boolean check. Therefore if you
agree that this should be fixed in TDC, rather than forcing anyone who
provides his own PQ to also override collect() (in the cases I've noted
above), then I can submit a patch.

Shai