[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-10 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.9.patch

bq. Did you add a test case verifying maxScore is correct (so that the
Float.NaN issue would trip the test)?

I added the following tests:
* testSortWithoutScoreTracking - asserts that ScoreDoc.score is set to
Float.NaN as well as maxScore.
* testSortWithScoreNoMaxScoreTracking - asserts that ScoreDoc.score is not
Float.NaN, but maxScore is.
* testSortWithScoreAndMaxScoreTracking - asserts that both ScoreDoc.score and
maxScore are not set to NaN.
* testSortWithScoreAndMaxScoreTrackingNoResults - asserts that in case of a
maxScore tracking collector with 0 results, maxScore is set to Float.NaN,
rather than NEG_INF.
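
The four cases above come down to how maxScore is initialized and reported. Here is a minimal sketch of that convention, assuming a hypothetical helper named MaxScoreTracker (it is not a class from the patch, and the real collectors track much more state):

```java
// Hypothetical helper, not a Lucene class: illustrates the maxScore
// conventions that the tests above assert.
final class MaxScoreTracker {
  private final boolean trackMaxScore;
  private float maxScore = Float.NEGATIVE_INFINITY;
  private int totalHits = 0;

  MaxScoreTracker(boolean trackMaxScore) {
    this.trackMaxScore = trackMaxScore;
  }

  void collect(float score) {
    totalHits++;
    if (trackMaxScore && score > maxScore) {
      maxScore = score;
    }
  }

  // NaN when tracking is off or nothing was collected; never NEG_INF.
  float getMaxScore() {
    if (!trackMaxScore || totalHits == 0) {
      return Float.NaN;
    }
    return maxScore;
  }
}
```

The zero-results case is the one that needs the explicit check: without it, the NEG_INF initial value would leak out to the caller.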

bq. HitCollector isn't deprecated

Somehow when I applied your patch, this change wasn't taken in. Anyway, I
noticed its javadocs also referenced MRHC, so I fixed it also.

Refactoring Lucene collectors (HitCollector and extensions)
---

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch,
LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch,
LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch,
PerfTest.java, sortBench5.py, sortCollate5.py

This issue is a result of a recent discussion we've had on the mailing list.
You can read the thread
[here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will
be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate
those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated
methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0,
when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher
code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members.
These can be used as-are by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a
convenience method when implementing a search application which allows paging
through search results. It will also attempt to improve the memory
allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs()
and getTotalHits() implementations as they are from TopDocsCollector. The
class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class
final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector,
instead of TopScoreDocCollector. Implement topDocs(start, howMany)
* Review other places where HitCollector is used, such as in Scorer,
deprecate those places and use Collector instead.
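
The paging behaviour of topDocs(start, howMany) described in the list above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: PagingSketch is a hypothetical name, and it slices an already-ranked array of doc IDs, whereas the real collector would pop entries from its PriorityQueue.

```java
// Hypothetical stand-in for paging over already-ranked hits; the real
// TopDocsCollector pops from a PriorityQueue instead of slicing an array.
final class PagingSketch {
  // Returns at most howMany doc IDs starting at rank 'start', allocating
  // an array of only the size actually returned.
  static int[] topDocs(int[] rankedDocs, int start, int howMany) {
    if (start < 0 || start >= rankedDocs.length || howMany <= 0) {
      return new int[0];
    }
    // Use long arithmetic so a huge howMany cannot overflow.
    int end = (int) Math.min((long) rankedDocs.length, (long) start + howMany);
    return java.util.Arrays.copyOfRange(rankedDocs, start, end);
  }
}
```

A request that starts past the last hit returns an empty page rather than failing, which is one plausible choice for a search UI that pages blindly.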
Additionally, the following proposal was made w.r.t. decoupling score from
collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call
scorer.score().
If we do this, then we need to review all places in the code where
collect(doc, score) is called, and assert whether Scorer can be passed. Also
this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even
possible?
* I noticed that many (if not all) of the collect() implementations discard
the document if its score is not greater than 0. Doesn't it mean that score
is needed in collect() always?
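
To make the decoupling proposal and the second question concrete, here is a small sketch. All names (SketchScorer, SketchCollector and the two implementations) are hypothetical stand-ins, not the Lucene classes. It shows one collector that never touches the score and another that asks the scorer only because it filters on score > 0.

```java
// Hypothetical mini-API mirroring the proposal; these are not the Lucene classes.
abstract class SketchScorer {
  abstract float score();
}

abstract class SketchCollector {
  abstract void setScorer(SketchScorer scorer);
  abstract void collect(int doc);  // doc is relative to the current reader
}

// Counts hits without ever asking for a score.
final class CountingCollector extends SketchCollector {
  int count = 0;
  void setScorer(SketchScorer scorer) { /* score not needed */ }
  void collect(int doc) { count++; }
}

// Asks for the score only because it needs it, and drops non-positive scores.
final class PositiveScoresCollector extends SketchCollector {
  private SketchScorer scorer;
  int count = 0;
  void setScorer(SketchScorer scorer) { this.scorer = scorer; }
  void collect(int doc) {
    if (scorer.score() > 0.0f) {
      count++;
    }
  }
}
```

Under this split, the score > 0 check becomes an opt-in behaviour of specific collectors instead of a cost every collect() call pays.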
Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but
that was when we thought to call Collector ResultsCollector. Since we decided
(so far) on Collector, I think TopDocsCollector makes sense, because of its
TopDocs output.

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-09 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: (was: LUCENE-1575.9.patch)

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch,
LUCENE-1575.9.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch,
LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py,
sortCollate5.py

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-07 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Michael McCandless updated LUCENE-1575:
---

Attachment: LUCENE-1575.patch

Attached new patch:

* Changed members and methods in TopFieldCollector from protected to
package-private.

* Tweaked javadocs, CHANGES.txt

* Removed some dead code, nocommits

* Re-added TestTimeLimitedCollector

Besides the java ghosts, for which we will close our eyes and hope
they disappear, I think this is ready to go in!

I'll wait a few days and then commit.

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch,
LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch,
PerfTest.java, sortBench5.py, sortCollate5.py

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-07 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Michael McCandless updated LUCENE-1575:
---

Attachment: LUCENE-1575.patch

New patch which just fixes contrib/spatial's cutover to the new API to further
cutover to the new new API.

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch,
LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch,
LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.8.patch

Added JustCompileSearch, JustCompileSearchFunction and JustCompileSearchSpans
that extend/implement all abstract classes/interfaces in o.a.l.s, o.a.l.s.s and
o.a.l.s.f. Those are not unit tests per se; however, if anyone changes the
interfaces/abstract classes in a way that breaks back-compat, we'll know
right away. I think that in general this is something good to have for Lucene
overall, however I only took care of the search.* packages in this patch.
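
As a rough illustration of the "just compile" idea, here is a sketch with made-up names (SketchFilter and JustCompileSketch are hypothetical, not the classes in the patch). The point is that the concrete subclass exists only so the compiler flags any new abstract method added to the base class.

```java
// Hypothetical abstract API standing in for a class under o.a.l.s.
abstract class SketchFilter {
  abstract boolean accept(int doc);
  abstract String name();
}

// "Just compile" holder: never meant to be run as a test. If an abstract
// method is added to SketchFilter, this class stops compiling.
final class JustCompileSketch {
  static final String UNSUPPORTED_MSG =
      "unsupported: used for back-compat testing only";

  static final class JustCompileFilter extends SketchFilter {
    boolean accept(int doc) {
      throw new UnsupportedOperationException(UNSUPPORTED_MSG);
    }
    String name() {
      throw new UnsupportedOperationException(UNSUPPORTED_MSG);
    }
  }
}
```

Throwing from every method makes it obvious that these classes carry no behaviour and should never be used at runtime.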

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch,
LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java,
sortBench5.py, sortCollate5.py

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-04 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1575:
---

Attachment: sortCollate5.py
sortBench5.py

I'm attaching the Python scripts I use to run the tests.  You also need this 
small mod:

{code}
Index: contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
===================================================================
--- contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java   (revision 761709)
+++ contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java   (working copy)
@@ -63,6 +63,9 @@
     super(runData);
   }
 
+  // nocommit
+  static boolean first = true;
+
   public int doLogic() throws Exception {
     int res = 0;
     boolean closeReader = false;
@@ -101,6 +104,11 @@
         } else {
           hits = searcher.search(q, numHits);
         }
+        // nocommit
+        if (first) {
+          System.out.println("NUMHITS=" + hits.totalHits);
+          first = false;
+        }
         //System.out.println("q=" + q + ":" + hits.totalHits + " total hits");
 
         if (withTraverse()) {
{code}

All the python scripts do is write an alg, run it, gather the results, and 
collate in the end.  You run sortBench5.py once on trunk and once in a checkout 
with this patch, each time in the contrib/benchmark directory.  It saves a 
pickle file (results.pk) which sortCollate5.py then loads (you'll have to edit 
the hardwired paths in sortCollate5.py).

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, 
 sortCollate5.py


[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-04 Thread Shai Erera (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.7.patch

- Changed TermScorer.score() method to not call Similarity.decodeNorm. If we 
can change Scorer.similarity to be protected, we can give up getSimilarity() 
call in score(). Also changed TermScorer.score(Collector) to set 'this' as the 
collector's scorer.
- Deprecated TimeLimitedCollector, created new TimeLimitingCollector, renamed 
TestTimeLimitedCollector to TestTimeLimitingCollector and used the new 
TimeLimitingCollector.
- Changed FVHQ to have a static create which returns 
One/MultiComparatorFieldValueHitQueue version.
- Changed TopFieldCollector setNextReader versions to not call pq.size() but 
rather use numHits.
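
For readers unfamiliar with the static create pattern mentioned for FVHQ, here is a minimal sketch. The class names are hypothetical and the real FieldValueHitQueue is considerably more involved. The factory picks a specialized subclass so that the single-sort-field case does not loop over comparators.

```java
// Hypothetical queue types; the patch's FieldValueHitQueue is more involved.
abstract class SketchHitQueue {
  // Picks a specialized implementation so the single-sort-field case
  // avoids looping over comparators.
  static SketchHitQueue create(java.util.List<java.util.Comparator<int[]>> comparators) {
    if (comparators.isEmpty()) {
      throw new IllegalArgumentException("at least one sort field is required");
    }
    return comparators.size() == 1
        ? new OneComparatorQueue(comparators.get(0))
        : new MultiComparatorQueue(comparators);
  }

  abstract int compare(int[] a, int[] b);
}

final class OneComparatorQueue extends SketchHitQueue {
  private final java.util.Comparator<int[]> comparator;
  OneComparatorQueue(java.util.Comparator<int[]> comparator) {
    this.comparator = comparator;
  }
  int compare(int[] a, int[] b) {
    return comparator.compare(a, b);
  }
}

final class MultiComparatorQueue extends SketchHitQueue {
  private final java.util.List<java.util.Comparator<int[]>> comparators;
  MultiComparatorQueue(java.util.List<java.util.Comparator<int[]>> comparators) {
    this.comparators = comparators;
  }
  int compare(int[] a, int[] b) {
    // The first comparator that distinguishes the two entries wins.
    for (java.util.Comparator<int[]> c : comparators) {
      int cmp = c.compare(a, b);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }
}
```

Callers only see the abstract type, so the choice between the two variants stays an internal detail.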


 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, sortBench5.py, sortCollate5.py


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method).
 There

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-04 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1575:
---

Attachment: LUCENE-1575.patch

Attached patch; only differences are:

  * Under contrib/benchmark I made changes so you can specify non-scoring field 
sorting

  * Fixed the rename of TestTimeLimitedCollector -> TestTimeLimitingCollector to
be patch-friendly

OK I ran performance with score tracking disabled during field sorted search:

||query||sort||hits||qps||qpsnew||pctg||
|147|title|   6953|2915.7|4043.3| 38.7%|
|147|doc|   6953|3265.6|4840.1| 48.2%|
|text|title| 157101|  97.0| 128.0| 32.0%|
|text|doc| 157101| 174.3| 273.2| 56.7%|
|1|title| 565452|  44.6|  60.2| 35.0%|
|1|doc| 565452|  49.2|  75.3| 53.0%|
|1 OR 2|title| 784928|  12.6|  14.8| 17.5%|
|1 OR 2|doc| 784928|  13.0|  15.2| 16.9%|
|1 AND 2|title| 333153|  14.8|  17.9| 20.9%|
|1 AND 2|doc| 333153|  15.2|  18.9| 24.3%|

Very nice speedups!  We just have to figure out why the score-tracking variant 
got slower...
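
For anyone reproducing the table, the pctg column appears to be the relative change in queries per second between trunk (qps) and the patch (qpsnew). That reading is an assumption, but it matches the rows above; SpeedupSketch is a hypothetical helper name:

```java
// Assumption: pctg is the relative change in queries per second.
final class SpeedupSketch {
  static double pctChange(double qps, double qpsNew) {
    return (qpsNew / qps - 1.0) * 100.0;
  }
}
```

For example, the first row works out as (4043.3 / 2915.7 - 1) * 100, which rounds to the 38.7% shown.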


 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py


[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1575:
---

Attachment: LUCENE-1575.patch


OK, I attached a new patch with some minor changes:

  * Beefed up javadocs in Collector.java; fixed other javadocs
warnings.  Tweaked CHANGES.txt.

  * Renamed PositiveOnlyScoresCollector ->
PositiveScoresOnlyCollector

And also came across these questions/issues:

  * TopFieldCollector's updateBottom & add methods take score, and are
passed score from the non-scoring collectors, but shouldn't?

  * TermScorer need not override score(HitCollector hc) (super does
the same thing).

  * The changes to TermScorer make me a bit nervous.  EG, the new
InternalScorer: will it hurt performance?  Also this part:
{code}
+// Set the Scorer doc and score before calling collect in case it will be
+// used in collect()
+s.d = doc;
+s.score = score;
+c.collect(doc);  // collect score
{code}
is spooky: I don't like how we worry that one may call scorer.doc() (I
don't like the ambiguity in the API -- we both pass doc and fear you
may call scorer.doc()).  Not sure how to resolve it.

  * Hmm -- we added a new abstract method to
src/java/org/apache/lucene/search/Searcher.java (that accepts
Collector).  Should that method be concrete (and throw UOE), for
back compat?

  * We've also added a method to the Searchable interface, which is
a break in back-compat.  But my feeling is we should allow this
break (but Shai can you add another Note at the top of
CHANGES.txt, calling this out?).
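
The "concrete and throw UOE" option raised above for the new Searcher method could look roughly like this. The types here (SketchSearcher, SketchHitSink, LegacySearcher) are hypothetical stand-ins and do not reproduce the real Searcher or Collector signatures.

```java
// Hypothetical stand-ins; not the real Searcher/Collector signatures.
interface SketchHitSink {
  void collect(int doc);
}

abstract class SketchSearcher {
  // Pre-existing abstract method that old subclasses already implement.
  abstract int search(String query);

  // Newly added method: concrete, so subclasses written against the old
  // class keep compiling, but they fail at call time until they override it.
  int search(String query, SketchHitSink sink) {
    throw new UnsupportedOperationException(
        "search(query, sink) is not implemented by " + getClass().getName());
  }
}

// A subclass written before the new method existed.
final class LegacySearcher extends SketchSearcher {
  int search(String query) {
    return query.length();
  }
}
```

The trade-off is that the break moves from compile time to run time, and only for callers that actually use the new method on an old subclass.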


 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch


[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-02 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.5.patch

Fixed TestFieldNormModifier and TestLengthNormModifier.
All tests pass now (including contrib)

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.patch

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-02 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.6.patch

Changes:
# TimeLimitedCollector, TestTimeLimitedCollector and CHANGES.
# I also fixed a bug in TestTermScorer, which was discovered by the test-tag
task. It has existed since LUCENE-1483 and propagated into HitCollectorWrapper as well:
docBase was set to -1 by default, relying on setNextReader to be called.
However if it's not called (as in TestTermScorer, or if someone called
Scorer.score(Collector)), all document Ids are shifted backwards by 1. The test
had a bug which asserted on the unshifted doc Id, and after I fixed the Ids to
shift, it failed. Anyway, the test now works correctly, as well as HCW.
# I checked all other Collector implementations and changed the default base to
0, except in some test cases, where -1 had a meaning.
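
To illustrate the docBase bug described in item 2, here is a small sketch. RebasingCollectorSketch is a hypothetical class written in the spirit of HitCollectorWrapper, not the actual code. It shows how a default base of -1 shifts every doc ID when setNextReader is never called.

```java
// Hypothetical wrapper in the spirit of HitCollectorWrapper: rebases
// per-reader doc IDs to top-level IDs before recording them.
final class RebasingCollectorSketch {
  private int docBase;
  final java.util.List<Integer> collected = new java.util.ArrayList<Integer>();

  RebasingCollectorSketch(int initialDocBase) {
    this.docBase = initialDocBase;
  }

  void setNextReader(int docBase) {
    this.docBase = docBase;
  }

  void collect(int doc) {
    collected.add(doc + docBase);
  }
}
```

With an initial base of -1 and no setNextReader call, collect(5) records 4; with an initial base of 0 it records 5, which is what a caller going straight through Scorer.score(Collector) would expect.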

All tests (contrib, core and tags) pass.

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.6.patch, LUCENE-1575.patch

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-01 Thread Shai Erera (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.patch

Eventually I decided to include just one patch file (instead of code and test) 
since it was simpler after all. Please be sure to review the following:
# Collector class and documentation.
# New TopDocsCollector class.
# TopFieldCollector refactoring.
# Methods deprecation.
# New TestTopDocsCollector as well as test cases in TestSort.

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Attachments: LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method, which will be used as a 
 convenience method when implementing a search application that allows paging 
 through search results. It will also attempt to improve memory 
 allocation by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany).
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
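
 The topDocs(start, howMany) idea from the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: Hit, TopHits and pageDocs are hypothetical names, java.util.PriorityQueue stands in for Lucene's own PriorityQueue, and the patch itself may do this differently. The point it shows is that only an array of the requested page size is allocated.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Hypothetical paging sketch; not the TopDocsCollector from the patch. */
final class PagingSketch {

    /** Hypothetical hit record (doc ID plus score). */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Keeps the best maxSize hits in a min-heap (lowest score at the head). */
    static final class TopHits {
        private final PriorityQueue<Hit> pq =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        private final int maxSize;
        private int totalHits;

        TopHits(int maxSize) {
            if (maxSize <= 0) {
                throw new IllegalArgumentException("maxSize must be positive");
            }
            this.maxSize = maxSize;
        }

        void collect(int doc, float score) {
            totalHits++;
            if (pq.size() < maxSize) {
                pq.add(new Hit(doc, score));
            } else if (score > pq.peek().score) {
                pq.poll();
                pq.add(new Hit(doc, score));
            }
        }

        int getTotalHits() {
            return totalHits;
        }

        /** Returns one page, best first. Drains the queue, so call it once. */
        Hit[] topDocs(int start, int howMany) {
            int size = pq.size();
            if (start < 0 || start >= size || howMany <= 0) {
                return new Hit[0];
            }
            howMany = Math.min(size - start, howMany);
            Hit[] page = new Hit[howMany]; // only the requested page is allocated
            // Drop the hits that rank after the requested page.
            for (int i = size - start - howMany; i > 0; i--) {
                pq.poll();
            }
            // The heap yields the lowest score first, so fill from the back.
            for (int i = howMany - 1; i >= 0; i--) {
                page[i] = pq.poll();
            }
            return page;
        }
    }

    /** Collects scores[doc] for doc = 0..n-1 and returns the doc IDs of one page. */
    static int[] pageDocs(float[] scores, int maxSize, int start, int howMany) {
        TopHits top = new TopHits(maxSize);
        for (int doc = 0; doc < scores.length; doc++) {
            top.collect(doc, scores[doc]);
        }
        Hit[] page = top.topDocs(start, howMany);
        int[] docs = new int[page.length];
        for (int i = 0; i < page.length; i++) {
            docs[i] = page[i].doc;
        }
        return docs;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 2.0f, 1.0f, 4.0f, 3.0f};
        // Ranking by score is docs 3, 4, 1, 2, 0; the second page of size 2 starts at rank 1.
        System.out.println(Arrays.toString(pageDocs(scores, 5, 1, 2))); // [4, 1]
    }
}
```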
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and check whether a Scorer can be passed. This 
 also raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
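
 The decoupling proposal above, together with the HitCollectorWrapper idea, could look roughly like the sketch below. All types here (ScoreSource, LegacyCollector, SketchCollector, LegacyWrapper) are hypothetical stand-ins invented for the illustration, not the classes from the patch. Throwing when no scorer was set is just one possible answer to the null-Scorer question, shown as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of decoupling score from collect(); not the Lucene API. */
final class SetScorerSketch {

    /** Stand-in for a scorer: reports the score of the current document. */
    interface ScoreSource {
        float score();
    }

    /** Old style: the score is pushed into every collect call. */
    interface LegacyCollector {
        void collect(int doc, float score);
    }

    /** Proposed style: collect receives only a segment-relative doc ID. */
    abstract static class SketchCollector {
        abstract void setScorer(ScoreSource scorer);
        abstract void setNextReader(int docBase);
        abstract void collect(int doc);
    }

    /** Counts hits and never asks for a score. */
    static final class CountingCollector extends SketchCollector {
        int count;

        @Override void setScorer(ScoreSource scorer) { }
        @Override void setNextReader(int docBase) { }
        @Override void collect(int doc) { count++; }
    }

    /** Adapts a legacy collector: rebases the doc ID and pulls the score on demand. */
    static final class LegacyWrapper extends SketchCollector {
        private final LegacyCollector delegate;
        private ScoreSource scorer;
        private int docBase = 0; // 0 rather than -1, so IDs are right even without setNextReader

        LegacyWrapper(LegacyCollector delegate) {
            this.delegate = delegate;
        }

        @Override void setScorer(ScoreSource scorer) { this.scorer = scorer; }
        @Override void setNextReader(int docBase) { this.docBase = docBase; }

        @Override void collect(int doc) {
            if (scorer == null) {
                // Assumption: fail loudly instead of guessing a score.
                throw new IllegalStateException("setScorer was not called");
            }
            delegate.collect(docBase + doc, scorer.score());
        }
    }

    /** Drives a collector over one segment and returns how often score() was called. */
    static int run(SketchCollector collector, int docBase, int[] docs, float[] scores) {
        int[] cursor = {0};
        int[] scoreCalls = {0};
        collector.setScorer(() -> {
            scoreCalls[0]++;
            return scores[cursor[0]];
        });
        collector.setNextReader(docBase);
        for (int i = 0; i < docs.length; i++) {
            cursor[0] = i;
            collector.collect(docs[i]);
        }
        return scoreCalls[0];
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        LegacyWrapper wrapper = new LegacyWrapper((doc, score) -> seen.add(doc + ":" + score));
        int calls = run(wrapper, 10, new int[] {0, 2}, new float[] {1.5f, 0.5f});
        System.out.println(seen + " score() calls: " + calls);
    }
}
```

 In this sketch a collector that ignores scores never triggers a score computation, while the wrapper pays for one score() call per hit, the same as the old collect(doc, score) contract.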
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method).
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-01 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.1.patch

oops :) leftovers from when it extended MultiReaderHitCollector (now called
Collector)

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.patch

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-01 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.2.patch

Thanks Mike. I ran the javadocs task and found other mentions of
MultiReaderHitCollector, and fixed some more javadocs as well. BTW, the javadoc
Ant task outputs many errors on missing files/names, but that's something for
another issue.

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.patch

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-01 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.3.patch

Includes the latest comments from Mike.

Refactoring Lucene collectors (HitCollector and extensions)
---

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.patch
