date:20090404


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695674#action_12695674
 ] 

Shai Erera commented on LUCENE-1575:


Mike, about your comments on the new Searcher and Searchable search(Weight, 
Filter, Collector) methods: I think the best (if not only) option currently is to 
remove them from the interface (comment them out, I mean) with a TODO to add them back in 3.0.

I tried just commenting the method out in Searchable and adding an empty impl in 
Searcher which throws UOE. However, that caused a problem in MultiSearcher, 
ParallelMultiSearcher and RemoteSearchable:
* RemoteSearchable impls Searchable - commenting out the new impl method with a 
TODO for 3.0 will be fine, but
* MS and PMS accept Searchables in their ctor and use them in search(W, F, C), 
which they inherit from Searcher (they MUST override it because Searcher's throws 
UOE). However, they call searchable.search, which accepts just a HC, and we 
can't wrap a Collector with a HC.

Previously, MS and PMS implemented the HC version by always wrapping with a 
MRHC. I think we should just pass in the given HC to the Searchable.search 
method, and rely on its wrapping by a HCW later on. In 3.0 we'll delete it 
entirely and use the Collector implementation.

Do you see any other way?
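
To make the wrapping direction concrete, here is a minimal sketch of the adapter idea. It assumes simplified stand-in types: OldHitCollector, ScoreSource, NewCollector and Wrapper are hypothetical names for illustration, not the actual Lucene classes or the code in the patch. The old-style collector only needs a global doc id and a score, and the new-style one can supply both, so wrapping HC inside a Collector is easy. The reverse loses the per-reader and scorer callbacks, which is why a Collector can't be wrapped by a HC.

```java
import java.util.ArrayList;
import java.util.List;

final class HitCollectorWrapperSketch {

    /** Hypothetical stand-in for the old API: gets a global doc id and its score. */
    interface OldHitCollector {
        void collect(int doc, float score);
    }

    /** Hypothetical stand-in for a scorer positioned on the current document. */
    interface ScoreSource {
        float score();
    }

    /** Hypothetical stand-in for the new API: per-segment doc ids, score on demand. */
    static abstract class NewCollector {
        abstract void setScorer(ScoreSource scorer);
        abstract void setNextReader(int docBase);
        abstract void collect(int doc);
    }

    /** Adapts an old-style collector by re-basing the doc id and pulling the score. */
    static final class Wrapper extends NewCollector {
        private final OldHitCollector delegate;
        private ScoreSource scorer;
        private int docBase;

        Wrapper(OldHitCollector delegate) { this.delegate = delegate; }

        @Override void setScorer(ScoreSource scorer) { this.scorer = scorer; }
        @Override void setNextReader(int docBase) { this.docBase = docBase; }
        @Override void collect(int doc) {
            delegate.collect(docBase + doc, scorer.score());
        }
    }

    /** Drives the wrapper over two hits of a segment whose first global id is 10. */
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        Wrapper wrapper = new Wrapper((doc, score) -> seen.add(doc + ":" + score));
        final float[] current = {0f};
        wrapper.setScorer(() -> current[0]);
        wrapper.setNextReader(10);
        current[0] = 1.5f;
        wrapper.collect(0);
        current[0] = 0.5f;
        wrapper.collect(2);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```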

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
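
 The topDocs(start, howMany) idea from the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: PagedTopDocsSketch and Hit are hypothetical names, java.util.PriorityQueue stands in for Lucene's own PriorityQueue, and collect takes the score directly for brevity. The point is that only an array of the requested page size is allocated.

```java
import java.util.PriorityQueue;

final class PagedTopDocsSketch {

    /** Hypothetical result holder, standing in for a ScoreDoc-like object. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final PriorityQueue<Hit> pq; // min-heap: the weakest kept hit is on top
    private final int maxSize;
    private int totalHits;

    PagedTopDocsSketch(int maxSize) {
        this.maxSize = maxSize;
        this.pq = new PriorityQueue<>(maxSize, (a, b) -> Float.compare(a.score, b.score));
    }

    void collect(int doc, float score) {
        totalHits++;
        if (pq.size() < maxSize) {
            pq.add(new Hit(doc, score));
        } else if (score > pq.peek().score) {
            pq.poll();
            pq.add(new Hit(doc, score));
        }
    }

    int getTotalHits() { return totalHits; }

    /** Returns the hits ranked [start, start + howMany), best first. Drains the queue. */
    Hit[] topDocs(int start, int howMany) {
        int size = pq.size();
        if (start < 0 || start >= size || howMany <= 0) {
            return new Hit[0];
        }
        howMany = Math.min(size - start, howMany);
        Hit[] results = new Hit[howMany]; // only the requested page is allocated
        // The heap yields the weakest hit first: drop everything ranked after the page.
        for (int i = size - start - howMany; i > 0; i--) {
            pq.poll();
        }
        // Fill the page back to front so that results[0] is the best hit of the page.
        for (int i = howMany - 1; i >= 0; i--) {
            results[i] = pq.poll();
        }
        return results;
    }

    public static void main(String[] args) {
        PagedTopDocsSketch collector = new PagedTopDocsSketch(5);
        for (int doc = 0; doc < 6; doc++) {
            collector.collect(doc, doc); // score equals the doc id, for illustration
        }
        for (Hit hit : collector.topDocs(1, 2)) {
            System.out.println(hit.doc + " " + hit.score);
        }
    }
}
```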
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
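
 A small sketch of what the decoupling buys, assuming hypothetical stand-in types (ScoreSource, Collector, FakeScorer are made up for this illustration and are not the Lucene classes): a collector that only counts hits never calls score(), while one that filters on score > 0 pays for a score() call per document.

```java
final class LazyScoreSketch {

    /** Hypothetical stand-in for a scorer positioned on the current document. */
    interface ScoreSource {
        float score();
    }

    /** Hypothetical stand-in for the proposed API: collect receives only a doc id. */
    static abstract class Collector {
        protected ScoreSource scorer;
        void setScorer(ScoreSource scorer) { this.scorer = scorer; }
        abstract void collect(int doc);
    }

    /** Counts every hit and never asks for a score. */
    static final class CountingCollector extends Collector {
        int count;
        @Override void collect(int doc) { count++; }
    }

    /** Keeps only documents with a positive score, so it must call score(). */
    static final class PositiveScoreCollector extends Collector {
        int count;
        @Override void collect(int doc) {
            if (scorer.score() > 0f) {
                count++;
            }
        }
    }

    /** Test double that records how often score() is invoked. */
    static final class FakeScorer implements ScoreSource {
        private final float[] scores;
        int current;
        int calls;
        FakeScorer(float[] scores) { this.scores = scores; }
        @Override public float score() {
            calls++;
            return scores[current];
        }
    }

    /** Feeds every document to the collector and returns the number of score() calls. */
    static int run(Collector collector, FakeScorer scorer) {
        collector.setScorer(scorer);
        for (int doc = 0; doc < scorer.scores.length; doc++) {
            scorer.current = doc;
            collector.collect(doc);
        }
        return scorer.calls;
    }

    public static void main(String[] args) {
        float[] scores = {0f, 2f, 1f};
        System.out.println(run(new CountingCollector(), new FakeScorer(scores)));
        System.out.println(run(new PositiveScoreCollector(), new FakeScorer(scores)));
    }
}
```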
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1575:
---

Attachment: sortCollate5.py
sortBench5.py

I'm attaching the Python scripts I use to run the tests.  You also need this 
small mod:

{code}
Index: contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
===================================================================
--- contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java   (revision 761709)
+++ contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java   (working copy)
@@ -63,6 +63,9 @@
     super(runData);
   }
 
+  // nocommit
+  static boolean first = true;
+
   public int doLogic() throws Exception {
     int res = 0;
     boolean closeReader = false;
@@ -101,6 +104,11 @@
         } else {
           hits = searcher.search(q, numHits);
         }
+        // nocommit
+        if (first) {
+          System.out.println("NUMHITS=" + hits.totalHits);
+          first = false;
+        }
         //System.out.println("q=" + q + ":" + hits.totalHits + " total hits"); 
 
         if (withTraverse()) {
{code}

All the python scripts do is write an alg, run it, gather the results, and 
collate in the end.  You run sortBench5.py once on trunk and once in a checkout 
with this patch, each time in the contrib/benchmark directory.  It saves a 
pickle file (results.pk) which sortCollate5.py then loads (you'll have to edit 
the hardwired paths in sortCollate5.py).


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695700#action_12695700
 ] 

Michael McCandless commented on LUCENE-1575:


{quote}
BTW, I can change FieldValueHitQueue like I changed TopFieldCollector by
introducing a factory create() method which will return a
OneComparaterFieldValueHitQueue and MultiComparatorsFieldValueHitQueue.

Today, FVHQ.lessThan checks the numComparators in each call, which is
redundant.
{quote}

Seems good, unless the extra subclassing (and additions of super.XXX()) is 
somehow causing our performance loss.

bq. Also the class isn't final and I'm not sure if we want to change it.

Yes let's make it final.  We need to eek...



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695701#action_12695701
 ] 

Michael McCandless commented on LUCENE-1575:


{quote}
How problematic is this break in back-compat, given it will be documented in
CHANGES?

* Have search(W, F, C) on Searchable? I don't think it will have such a
great impact as I don't believe too many actually implement Searchable.

* Have search(W, F, C) on Searcher as abstract? I know you offered, Mike, to
create an empty impl which throws UOE, but I'm not sure what's worse: having
a compilation error or UOE at runtime (which can happen at the customer's).
After all, all the search methods call this one eventually, and if you did
extend Searcher (rather than IndexSearcher), you'll get UOE on every search.
{quote}

OK let's add both and call it out in CHANGES.txt?
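
The trade-off in the quote above (compile error vs. UOE at runtime) can be illustrated with a minimal sketch. All names here are hypothetical stand-ins, and a String parameter replaces the real (Weight, Filter, Collector) arguments for brevity.

```java
final class AbstractVsUoeSketch {

    /** Option 1: the new method gets a default body that throws at runtime. */
    static abstract class SearcherWithDefault {
        void search(String collector) {
            throw new UnsupportedOperationException("search(W, F, C) is not implemented");
        }
        /** A convenience method that funnels into the new method, as all searches would. */
        void search() {
            search("default collector");
        }
    }

    /** Option 2: the new method is abstract, so old subclasses stop compiling. */
    static abstract class SearcherWithAbstract {
        abstract void search(String collector);
    }

    /** An existing subclass written before the new method existed: still compiles. */
    static final class LegacySearcher extends SearcherWithDefault {
    }

    // With option 2 the equivalent subclass is rejected by the compiler:
    // static final class BrokenLegacySearcher extends SearcherWithAbstract { }

    /** Returns true if the legacy subclass fails only once a search is actually run. */
    static boolean failsAtRuntime() {
        try {
            new LegacySearcher().search();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsAtRuntime());
    }
}
```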


Re: Future projects

On Fri, Apr 3, 2009 at 5:32 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 meaning in Bobo you'd like to manage your own memory resident
 field caches, and merge them whenever IW has merged a segment?
 Seems like you don't need genealogy for that.

 Agreed, there is no need for full genealogy.

OK

 CSF isn't really designed yet. How come it can't be used with
 Bobo's field caches?

 I guess CSF should be able to support it, makes sense. As long
 as the container is flexible with the encoding (I need to look
 into this more on the Bobo side).

Well as CSF unfolds let's take Bobo's usage into account.

 Lucene's internal field cache usage is now entirely at the
 segment level (ie, Lucene core should never request full field
 cache array at the MultiSegmentReader level). I think Bobo must
 have to do the same, if it handles near realtime updates, to get
 adequate performance.

 Bobo needs to migrate to this model, I don't think we've done
 that yet.

Hmm OK.  That means reopen is very costly?

 EG how come Bobo made its own field cache impl? Just because
 uninversion is too slow?

 It could be integrated once LUCENE-831 is completed. I think the
 current model of a weak reference and the inability to unload if
 needed is a concern.  I don't think it's because of uninversion.

Ahh OK.  We know about that one (for LUCENE-831): the app should have
control over the caching policy.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: Future projects

On Fri, Apr 3, 2009 at 5:42 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 I think the realtime reader'd just store the maxDocID it's allowed to
 search, and we would likely keep using the RAM format now used.

 Sounds pretty good.  Are there any other gotchas in the design?

Yes: the flushing process becomes challenging.  When we flush, we must
forcefully cutover any open readers searching that RAM buffer, to the
now-on-disk segment.  Such swap-out is tricky because there could be
[many threads of] searches in-flight, iterating through postings, etc.
 Which means the RAM buffer would have to become an independent entity
that is not re-used after flushing, but instead sticks around until GC
determines all outstanding readers have switched to the on-disk
segment.
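
A very rough sketch of the "buffer is not re-used after flushing" part, under heavy assumptions: RamBufferSwapSketch and RamBuffer are hypothetical names, documents are plain strings, there is a single writer thread, and the hard part (cutting in-flight readers over to the on-disk segment) is not modeled at all. It only shows that a reader keeps its buffer alive after the writer has moved on to a fresh one.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

final class RamBufferSwapSketch {

    /** Hypothetical in-memory buffer; a reader that holds one keeps it reachable. */
    static final class RamBuffer {
        final List<String> docs = new CopyOnWriteArrayList<>();
    }

    private final AtomicReference<RamBuffer> active =
        new AtomicReference<>(new RamBuffer());

    /** Adds a document to the currently active buffer (single writer assumed). */
    void add(String doc) {
        active.get().docs.add(doc);
    }

    /** A "realtime reader" here is just a reference to the buffer active right now. */
    RamBuffer openReader() {
        return active.get();
    }

    /**
     * Swaps in a fresh buffer and returns the old one, which would then be written
     * to disk. The old buffer is never re-used; it goes away only once no reader
     * references it any more.
     */
    RamBuffer flush() {
        return active.getAndSet(new RamBuffer());
    }

    public static void main(String[] args) {
        RamBufferSwapSketch writer = new RamBufferSwapSketch();
        writer.add("a");
        RamBuffer reader = writer.openReader();
        writer.flush();
        writer.add("b");
        System.out.println(reader.docs + " " + writer.openReader().docs);
    }
}
```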

I would rather not go here unless it's clear the current near
real-time performance is too limiting.  But in my simplistic test the
current performance looks good.

And, if the performance does turn out to be lacking, the next step
(before searching IW's ram buffer) is to flush the new tiny segments
through a RAMDir, first.

Mike


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695711#action_12695711
]

Shai Erera commented on LUCENE-1575:

There are no super.XXX calls. The two FVHQ implementations just implement
lessThan according to whether it's the single-comparator or the multi-comparator
case. This removes the check of numComparators == 1.

On Sat, Apr 4, 2009 at 12:53 PM, Michael McCandless (JIRA)

Refactoring Lucene collectors (HitCollector and extensions)
---

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9

Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch,
LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch,
LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py,
sortCollate5.py

This issue is a result of a recent discussion we've had on the mailing list.
You can read the thread
[here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will
be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate
those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated
methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0,
when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher
code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members.
These can be used as-are by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a
convenience method when implementing a search application which allows paging
through search results. It will also attempt to improve the memory
allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs()
and getTotalHits() implementations as they are from TopDocsCollector. The
class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class
final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector,
instead of TopScoreDocCollector. Implement topDocs(start, howMany)
* Review other places where HitCollector is used, such as in Scorer,
deprecate those places and use Collector instead.
Additionally, the following proposal was made w.r.t. decoupling score from
collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call
scorer.score().
If we do this, then we need to review all places in the code where
collect(doc, score) is called, and assert whether Scorer can be passed. Also
this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even
possible?
* I noticed that many (if not all) of the collect() implementations discard
the document if its score is not greater than 0. Doesn't it mean that score
is needed in collect() always?
Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but
that was when we thought to call Collector ResultsCollector. Since we decided
(so far) on Collector, I think TopDocsCollector makes sense, because of its
TopDocs output.
* Decoupling score from collect().
I will post a patch a bit later, as this is expected to be a very large
patch. I will split it into 2: (1) code patch (2) test cases (moving to use
Collector instead of HitCollector, as well as testing the new topDocs(start,
howMany) method).
There might be even a 3rd patch which handles the setScorer thing in
Collector (maybe even a different issue?)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: Future projects

On Fri, Apr 3, 2009 at 7:11 PM, Michael Busch busch...@gmail.com wrote:

 Yeah me too. I think eventually we want this to be a Codec, but we probably
 don't want to wait until all the flexible indexing work is done.
 So maybe we should just not worry too much about a perfectly integrated API
 yet and release it as experimental API with 2.9 and 3.0, just like we did
 with the initial payloads implementation. Then when we hopefully get
 flexible indexing and codecs into 3.1 we can rework CSF and integrate it as
 a Codec.

+1

LUCENE-1458 shouldn't block 831/1231.

 As I recently mentioned on 1231 I'm looking into changing the Document and
 Field APIs. I've some rough prototype. I think we should also try to get it
 in before 2.9? On the other hand I don't want to block the 2.9 release with
 too much stuff.

That'd be great -- I'd say post the rough prototype and let's iterate?

Mike


Re: Future projects

On Fri, Apr 3, 2009 at 8:01 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 I looked at the IndexWriter code in regards to creating a realtime reader,
 with the many flexible indexing classes I'm unsure of how one would get a
 frozenish IndexInput of the byte slices, given the byte slices are attached
 to different threads?

The realtime reader would have to have sub-readers per thread, and an
aggregate reader that joins them by interleaving the docIDs.  When
flushing we create such a beast, but, it's not general purpose (ie it
does not implement IndexReader API; it only implements enough to write
the postings).
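
To illustrate what "joins them by interleaving the docIDs" could look like, here is a minimal sketch. It assumes each thread's buffer exposes its docIDs as a sorted int array; that layout and the class name InterleavedDocIdsSketch are assumptions for illustration, not IndexWriter's actual internal structures. The aggregate view is then a plain k-way merge.

```java
import java.util.PriorityQueue;

final class InterleavedDocIdsSketch {

    /**
     * Merges per-thread sorted docID arrays into one ascending sequence.
     * Each heap entry is {threadIndex, positionInThatThreadsArray}.
     */
    static int[] merge(int[][] perThreadDocIds) {
        int total = 0;
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> Integer.compare(perThreadDocIds[a[0]][a[1]],
                                      perThreadDocIds[b[0]][b[1]]));
        for (int t = 0; t < perThreadDocIds.length; t++) {
            total += perThreadDocIds[t].length;
            if (perThreadDocIds[t].length > 0) {
                heads.add(new int[] {t, 0});
            }
        }
        int[] merged = new int[total];
        int out = 0;
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            merged[out++] = perThreadDocIds[head[0]][head[1]];
            // Advance this thread's cursor and put it back if it has more docIDs.
            if (++head[1] < perThreadDocIds[head[0]].length) {
                heads.add(head);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        int[][] perThread = {{0, 3, 4}, {1, 5}, {2}};
        System.out.println(java.util.Arrays.toString(merge(perThread)));
    }
}
```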

BTW there are benefits to not reusing the RAM buffer, outside of
faster near real-time search: it would allow flushing to be done in
the BG.  Ie, flush could start, and we'd immediately switch to a new
RAM buffer, thus allowing add/deletes in other threads to run.
Currently they are all blocked (stop the world) during flush, though
it's not clear on a fast IO device (SSD) how big a deal this stop the
world really is to indexing throughput.

But still it's a complex change.

Mike


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695723#action_12695723
 ] 

Shai Erera commented on LUCENE-1575:


bq. OK let's add both and call it out in CHANGES.txt?

Great. So I'll leave them as they are in the latest patch and add a note to 
CHANGES.

bq. Yes let's make it final. We need to eek...

This isn't necessary after all, since the class is now abstract, with a private 
ctor and two private final internal classes, which will be the concrete objects 
returned by create().
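
Roughly, the shape described above looks like the following sketch. It is an illustration under assumptions, not the patch itself: HitQueue, One, Multi and byField are hypothetical names, hits are modeled as int[] arrays of field values, and tie-breaking by doc id is omitted. The choice between one and many comparators is made once in create(), so lessThan no longer checks numComparators on every call.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class FieldValueHitQueueSketch {

    /** Abstract queue with a private ctor; instances come only from create(). */
    static abstract class HitQueue {
        private HitQueue() {}

        abstract boolean lessThan(int[] a, int[] b);

        static HitQueue create(List<Comparator<int[]>> comparators) {
            if (comparators.isEmpty()) {
                throw new IllegalArgumentException("at least one comparator is required");
            }
            return comparators.size() == 1
                ? new One(comparators.get(0))
                : new Multi(comparators);
        }

        /** Single-comparator case: no loop and no count check. */
        private static final class One extends HitQueue {
            private final Comparator<int[]> comparator;
            One(Comparator<int[]> comparator) { this.comparator = comparator; }
            @Override boolean lessThan(int[] a, int[] b) {
                return comparator.compare(a, b) < 0;
            }
        }

        /** Multi-comparator case: the first non-zero comparison decides. */
        private static final class Multi extends HitQueue {
            private final List<Comparator<int[]>> comparators;
            Multi(List<Comparator<int[]>> comparators) { this.comparators = comparators; }
            @Override boolean lessThan(int[] a, int[] b) {
                for (Comparator<int[]> comparator : comparators) {
                    int c = comparator.compare(a, b);
                    if (c != 0) {
                        return c < 0;
                    }
                }
                return false;
            }
        }
    }

    /** Hypothetical helper: compares two hits by the value of field i. */
    static Comparator<int[]> byField(int i) {
        return (a, b) -> Integer.compare(a[i], b[i]);
    }

    public static void main(String[] args) {
        HitQueue queue = HitQueue.create(Arrays.asList(byField(0), byField(1)));
        System.out.println(queue.lessThan(new int[] {1, 5}, new int[] {1, 7}));
    }
}
```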

Before submitting the next patch version, I'd like to verify if super.collect() 
in TFC is the cause of the perf. degradation. We should also perf. test sorting 
w/o score tracking and note if there is any improvement over trunk. I'm 
downloading the latest enwiki xml (20090306), so I hope that sometime tomorrow 
it will finish the download, extract, indexing and search tests.


 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, 
 sortCollate5.py


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
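
 As a rough illustration of the topDocs(start, howMany) idea above, here is a
 minimal self-contained sketch. It does not use Lucene classes: PagingSketch,
 Hit and page are hypothetical stand-ins, and java.util.PriorityQueue replaces
 Lucene's own PriorityQueue. It only shows how a page can be cut out of the
 queue while allocating an array of the requested size.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PagingSketch {
    /** Hypothetical stand-in for a scored hit (not Lucene's ScoreDoc). */
    public static final class Hit {
        public final int doc;
        public final float score;
        public Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Min-heap on score: the weakest collected hit sits at the head.
    private final PriorityQueue<Hit> pq =
        new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
    private final int maxSize;
    private int totalHits;

    public PagingSketch(int maxSize) { this.maxSize = maxSize; }

    public void collect(int doc, float score) {
        totalHits++;
        if (pq.size() < maxSize) {
            pq.add(new Hit(doc, score));
        } else if (score > pq.peek().score) {
            pq.poll();                       // evict the current weakest hit
            pq.add(new Hit(doc, score));
        }
    }

    public int getTotalHits() { return totalHits; }

    /**
     * Returns hits [start, start + howMany) in descending score order.
     * Destructive: the hits of the page and everything ranked below it
     * are removed from the queue.
     */
    public Hit[] page(int start, int howMany) {
        int size = pq.size();
        if (start < 0 || howMany <= 0 || start >= size) return new Hit[0];
        howMany = Math.min(howMany, size - start);
        Hit[] result = new Hit[howMany];     // only the requested size is allocated
        // Discard the weakest hits that rank below the requested page.
        for (int i = size - start - howMany; i > 0; i--) pq.poll();
        // The head is now the weakest hit of the page: fill back to front.
        for (int i = howMany - 1; i >= 0; i--) result[i] = pq.poll();
        return result;
    }
}
```

 With five collected hits, page(1, 2) returns the second and third best hits
 and leaves only the best hit in the queue.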
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
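
 The setScorer proposal above could look roughly like the following sketch. It
 is self-contained and deliberately does not use the real Lucene types: the
 nested Scorer and Collector types, both collectors and the toy search driver
 are assumptions made for illustration. It shows one collector that never asks
 for a score and one that pulls it lazily inside collect().

```java
public class SetScorerSketch {
    /** Hypothetical stand-in for a scorer positioned on the current document. */
    public interface Scorer {
        float score();
    }

    /** Hypothetical stand-in for the proposed Collector base class. */
    public abstract static class Collector {
        public abstract void setScorer(Scorer scorer);
        public abstract void setNextReader(int docBase);
        public abstract void collect(int doc);   // doc is relative to the current reader
    }

    /** Counts hits and never asks for a score. */
    public static final class CountingCollector extends Collector {
        public int count;
        @Override public void setScorer(Scorer scorer) { /* score not needed */ }
        @Override public void setNextReader(int docBase) { }
        @Override public void collect(int doc) { count++; }
    }

    /** Tracks the best-scoring document; pulls the score only inside collect(). */
    public static final class BestHitCollector extends Collector {
        private Scorer scorer;
        private int docBase;
        public int bestDoc = -1;
        public float bestScore = Float.NEGATIVE_INFINITY;
        @Override public void setScorer(Scorer scorer) { this.scorer = scorer; }
        @Override public void setNextReader(int docBase) { this.docBase = docBase; }
        @Override public void collect(int doc) {
            float score = scorer.score();
            if (score > bestScore) {
                bestScore = score;
                bestDoc = docBase + doc;         // rebase to a global doc id
            }
        }
    }

    /** Toy driver: one "segment" whose per-document scores are given by an array. */
    public static void search(float[] scores, int docBase, Collector c) {
        final int[] current = {0};
        c.setScorer(() -> scores[current[0]]);   // scorer reports the current doc's score
        c.setNextReader(docBase);
        for (int doc = 0; doc < scores.length; doc++) {
            current[0] = doc;
            c.collect(doc);
        }
    }
}
```

 In this sketch the driver always calls setScorer before the first collect(),
 which is one possible answer to the "what if Scorer is null" question.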
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
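 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method).
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)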

[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

[
https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695726#action_12695726
]

Michael McCandless commented on LUCENE-1231:

{quote}
Eventually we need more flexibility to utilize the flexible indexing
chain anyway. We need to store which codec to use for a field. Then we
could also just make a new codec for column-stride fields and maybe
then we do not have to introduce a new Field API.
{quote}

By creating a custom indexing chain you could actually write CSF,
today.

But the lack of extensibility of Field needs to be addressed: you need
some way to store something arbitrary and opaque in a field such that
your indexing chain can pick it up and act on it.

And FieldInfos also needs an API to store this opaque thing.

One of the big changes in LUCENE-1458 is to strongly separate
different fields on the read APIs. EG there is a separate FieldsEnum
from TermsEnum, meaning you first seek to the field you want, then
request a TermsEnum from that, which can iterate through the terms
only for that field. It's the codec's job to return the right
TermsEnum for a given field.
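
The two-level read pattern described above (seek to a field first, then
enumerate only that field's terms) might be pictured as follows. This is a
self-contained sketch under my own assumptions: FieldThenTermsSketch, fields()
and terms() are hypothetical names built on java.util collections, not the
FieldsEnum/TermsEnum API from LUCENE-1458.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.TreeMap;
import java.util.TreeSet;

public class FieldThenTermsSketch {
    // field name -> sorted terms of that field
    private final TreeMap<String, TreeSet<String>> postings = new TreeMap<>();

    public void add(String field, String term) {
        postings.computeIfAbsent(field, f -> new TreeSet<>()).add(term);
    }

    /** First level: iterate over field names only, in sorted order. */
    public Iterator<String> fields() {
        return postings.keySet().iterator();
    }

    /** Second level: sorted terms of one field, or an empty iterator if the field is unknown. */
    public Iterator<String> terms(String field) {
        TreeSet<String> terms = postings.get(field);
        if (terms == null) {
            return Collections.<String>emptyIterator();
        }
        return terms.iterator();
    }
}
```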

Not to delay 2.9 further, but... I also wonder if Lucene had
NumericField (say), how it would simplify things here. EG, today, if
I have a field weight that is a float, I'm going to have to set
something to tell the CSF (man the similarity of that to CFS is going
to cause problems!) writer to cast-it-and-save-it-as-float-array to
disk; I'm going to have to tell the TrieRangeUtil to do the same, etc.
It'd be much better if that field stored a float (not String), and if
it defaulted naturally to using these two special indexers...

{quote}
DataIn(Out)put would implement the different read and
write methods, whereas IndexIn(Out)put would only implement methods
like close(), seek(), getFilePointer(), length(), flush(), etc.
{quote}

What is the fastest way in Java to slurp in a bunch of bytes as an
int[], short[], float[], etc? Seems that we need to answer that first
and then work out how to fix our store APIs to enable that. (Maybe
it's IntBuffer wrapping ByteBuffer, instead of an int[]?).
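
For reference, the "IntBuffer wrapping ByteBuffer" option mentioned above can
be sketched with the standard java.nio classes. BulkIntRead and its two methods
are hypothetical names. The sketch makes no claim about which variant is
fastest; it only shows the bulk copy into an int[] next to reading through the
view without creating an int[].

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

public class BulkIntRead {
    /**
     * Copies big-endian bytes into a fresh int[].
     * Trailing bytes beyond a multiple of 4 are ignored by the int view.
     */
    public static int[] toIntArray(byte[] bytes) {
        IntBuffer view = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).asIntBuffer();
        int[] values = new int[view.remaining()];
        view.get(values);          // bulk transfer into the array
        return values;
    }

    /** Reads one int through the view without materializing an int[]. */
    public static int intAt(byte[] bytes, int index) {
        return ByteBuffer.wrap(bytes).asIntBuffer().get(index);
    }
}
```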

{quote}
The danger here compared to the current
payloads API would be that the user might read too few or too many
bytes of a CSF, which would result in an undefined and possibly hard
to debug behavior.
{quote}

I think it's better to have good performance with added risk of
danger than forced handholding always.

{quote}
The SafeAccessor would count for you the number of read bytes and
throw exceptions if you don't consume the number of bytes you should
consume.
{quote}

I generally prefer liberal use of asserts to trip bugs like this,
instead of explicit, strongly divorced code paths / classes / modes
etc., containing real if statements at production runtime.

Column-stride fields (aka per-document Payloads)

Key: LUCENE-1231
URL: https://issues.apache.org/jira/browse/LUCENE-1231
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.0

This new feature has been proposed and discussed here:
http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
Currently it is possible in Lucene to store data as stored fields or as
payloads.
Stored fields provide good performance if you want to load all fields for one
document, because this is a sequential I/O operation.
If you however want to load the data from one field for a large number of
documents, then stored fields perform quite badly, because lots of I/O seeks
might have to be performed.
A better way to do this is using payloads. By creating a special posting list
that has one posting with payload for each document you can simulate a
column-stride field. The performance is significantly better compared to stored
fields, however still not optimal. The reason is that for each document the
freq value, which is in this particular case always 1, has to be decoded, also
one position value, which is always 0, has to be loaded.
As a solution we want to add real column-stride fields to Lucene. A possible
format for the new data structure could look like this (CSD stands for
column-stride data, once we decide for a final name for this feature we can
change this):
CSDList -- FixedLengthList | VariableLengthList, SkipList
FixedLengthList -- Payload^SegSize
VariableLengthList -- DocDelta, PayloadLength?, Payload
Payload -- Byte^PayloadLength
PayloadLength -- VInt
SkipList -- see frq.file
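
To make the grammar above a little more concrete, here is a small
self-contained sketch. CsdSketch and its methods are hypothetical and not part
of any patch on this issue. It assumes that a FixedLengthList is a plain
concatenation of equally sized payloads, so a document's payload can be
addressed by multiplication, and that PayloadLength uses Lucene's usual VInt
layout (7 data bits per byte, low-order bits first, high bit set when more
bytes follow).

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class CsdSketch {
    /** FixedLengthList lookup: the payload of docId starts at docId * payloadLength. */
    public static byte[] fixedPayload(byte[] list, int payloadLength, int docId) {
        int from = docId * payloadLength;
        return Arrays.copyOfRange(list, from, from + payloadLength);
    }

    /** Writes a non-negative int as a VInt: 7 bits per byte, high bit set when more bytes follow. */
    public static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt starting at pos[0] and advances pos[0] past it. */
    public static int readVInt(byte[] in, int[] pos) {
        int b = in[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```

In the fixed-length case no per-document length or skip data is needed, which
is where the potential saving over the payload-based simulation comes from.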
We distinguish here between the fixed length and the variable length cases. To
allow flexibility, Lucene could automatically pick the right

[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter


[ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695742#action_12695742
 ] 

Michael McCandless commented on LUCENE-1584:


I'd like to step back and understand the wider use case / context that's 
driving this need (to know precisely when segments got merged).  EG if we fix 
Lucene's field cache, and Lucene's near real-time search manages CSF's 
efficiently in memory, does that address the use case behind this?

It's possible that we should simply make SegmentInfo(s) public, so that 
MergePolicy/Scheduler can be fully created external to Lucene, and track all 
specifics of why/when merges are happening.  But those APIs have a high surface 
area, and we do make changes over time.

 Callback for intercepting merging segments in IndexWriter
 -

 Key: LUCENE-1584
 URL: https://issues.apache.org/jira/browse/LUCENE-1584
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1584.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 For things like merging field caches or bitsets, it's useful to
 know which segments were merged to create a new segment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695743#action_12695743
 ] 

Michael McCandless commented on LUCENE-1575:


{quote}
There are no super.XXX calls. The two FVHQ implementations just implement
lessThan according to whether it's a single comparator or multi case. This
removes the check of numComparators == 1.
{quote}
Excellent!

bq. Before submitting the next patch version, I'd like to verify if 
super.collect() in TFC is the cause of the perf. degradation

I'll run this and post back.

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695745#action_12695745
 ] 

Michael McCandless commented on LUCENE-1575:



Odd -- inlining super.collect into OCSC, and making OCSC final, did not alter 
the numbers much (I re-ran trunk baseline to confirm it's close to prior trunk 
baseline):

||query||sort||hits||qps||qpsnew||pctg||
|147|score|   6953|3635.8|3650.1|  0.4%|
|147|title|   6953|2915.7|2297.6|-21.2%|
|147|doc|   6953|3265.6|2665.8|-18.4%|
|text|score| 157101| 208.5| 202.9| -2.7%|
|text|title| 157101|  97.0|  85.4|-12.0%|
|text|doc| 157101| 174.3| 125.0|-28.3%|
|1|score| 565452|  58.2|  56.6| -2.7%|
|1|title| 565452|  44.6|  34.6|-22.4%|
|1|doc| 565452|  49.2|  35.2|-28.5%|
|1 OR 2|score| 784928|  14.1|  13.7| -2.8%|
|1 OR 2|title| 784928|  12.6|  11.5| -8.7%|
|1 OR 2|doc| 784928|  13.0|  11.9| -8.5%|
|1 AND 2|score| 333153|  15.6|  15.5| -0.6%|
|1 AND 2|title| 333153|  14.8|  13.7| -7.4%|
|1 AND 2|doc| 333153|  15.2|  14.2| -6.6%|
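
The pctg column appears to be the relative change of qpsnew against qps; the
rows above are consistent with that reading. A minimal sketch, with QpsDelta
and pctChange as hypothetical names:

```java
public class QpsDelta {
    /** Relative change of qpsNew versus qps, in percent (negative means slower). */
    public static double pctChange(double qps, double qpsNew) {
        return (qpsNew - qps) / qps * 100.0;
    }
}
```

For the "147 / title" row, pctChange(2915.7, 2297.6) comes out at roughly
-21.2, matching the table.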


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695746#action_12695746
]

Michael McCandless commented on LUCENE-1575:

bq. We should also perf. test sorting w/o score tracking and note if there is
any improvement over trunk.

Let's wait a bit until we sort things out (eg, w/ current patch, TermScorer
will still compute its score even if I don't need it).

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695750#action_12695750
]

Michael McCandless commented on LUCENE-1575:

Shai can you post your latest patch, where TermScorer itself is passed down to
the collector?

[jira] Created: (LUCENE-1586) add IndexReader.getUniqueTermCount

add IndexReader.getUniqueTermCount
--

 Key: LUCENE-1586
 URL: https://issues.apache.org/jira/browse/LUCENE-1586
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9


Simple API to return number of unique terms (across all fields).  Spinoff from 
here:

http://www.lucidimagination.com/search/document/536b22e017be3e27/term_limit

[jira] Updated: (LUCENE-1586) add IndexReader.getUniqueTermCount


 [ 
https://issues.apache.org/jira/browse/LUCENE-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1586:
---

Attachment: LUCENE-1586.patch

Attached patch.  I plan to commit in a day or two...

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.7.patch

- Changed TermScorer.score() method to not call Similarity.decodeNorm. If we 
can change Scorer.similarity to be protected, we can give up getSimilarity() 
call in score(). Also changed TermScorer.score(Collector) to set 'this' as the 
collector's scorer.
- Deprecated TimeLimitedCollector, created new TimeLimitingCollector, renamed 
TestTimeLimitedCollector to TestTimeLimitingCollector and used the new 
TimeLimitingCollector.
- Changed FVHQ to have a static create which returns 
One/MultiComparatorFieldValueHitQueue version.
- Changed TopFieldCollector setNextReader versions to not call pq.size() but 
rather use numHits.
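
The "static create" change for FVHQ listed above follows a common pattern:
decide once, at construction time, which specialization to use, so the hot
lessThan path carries no numComparators == 1 check. The sketch below is
self-contained and hypothetical. HitQueueSketch, One and Multi are stand-ins
built on java.util.Comparator; they are not the classes from the patch and do
not reproduce its comparison semantics.

```java
import java.util.Comparator;
import java.util.List;

public abstract class HitQueueSketch<T> {
    /** Hypothetical factory: picks the specialization once, at construction time. */
    public static <T> HitQueueSketch<T> create(List<Comparator<T>> comparators) {
        if (comparators.isEmpty()) {
            throw new IllegalArgumentException("need at least one comparator");
        }
        if (comparators.size() == 1) {
            return new One<T>(comparators.get(0));
        }
        return new Multi<T>(comparators);
    }

    public abstract boolean lessThan(T a, T b);

    /** Single-comparator case: no loop, no size check. */
    private static final class One<T> extends HitQueueSketch<T> {
        private final Comparator<T> cmp;
        One(Comparator<T> cmp) { this.cmp = cmp; }
        @Override public boolean lessThan(T a, T b) { return cmp.compare(a, b) < 0; }
    }

    /** Multi-comparator case: the first comparator that disagrees decides. */
    private static final class Multi<T> extends HitQueueSketch<T> {
        private final List<Comparator<T>> cmps;
        Multi(List<Comparator<T>> cmps) { this.cmps = cmps; }
        @Override public boolean lessThan(T a, T b) {
            for (Comparator<T> cmp : cmps) {
                int c = cmp.compare(a, b);
                if (c != 0) return c < 0;
            }
            return false;   // equal under every comparator
        }
    }
}
```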


 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, 
 LUCENE-1575.patch, sortBench5.py, sortCollate5.py


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695782#action_12695782
 ] 

Michael McCandless commented on LUCENE-1575:


OK thanks.  Numbers w/ new patch:

||query||sort||hits||qps||qpsnew||pctg||
|147|score|   6953|3635.8|3704.1|  1.9%|
|147|title|   6953|2915.7|2262.9|-22.4%|
|147|doc|   6953|3265.6|2655.1|-18.7%|
|text|score| 157101| 208.5| 199.9| -4.1%|
|text|title| 157101|  97.0|  87.1|-10.2%|
|text|doc| 157101| 174.3| 134.6|-22.8%|
|1|score| 565452|  58.2|  56.5| -2.9%|
|1|title| 565452|  44.6|  35.3|-20.9%|
|1|doc| 565452|  49.2|  38.0|-22.8%|
|1 OR 2|score| 784928|  14.1|  13.8| -2.1%|
|1 OR 2|title| 784928|  12.6|  11.6| -7.9%|
|1 OR 2|doc| 784928|  13.0|  11.9| -8.5%|
|1 AND 2|score| 333153|  15.6|  15.4| -1.3%|
|1 AND 2|title| 333153|  14.8|  13.7| -7.4%|
|1 AND 2|doc| 333153|  15.2|  14.2| -6.6%|


[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)