[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696010#action_12696010
 ] 

Shai Erera commented on LUCENE-1575:


I wasn't able to run the test on a 64-bit JRE. Here are the results on 32-bit 
JREs:

||OS||JRE||Trunk||Patch||%tg||
|XP|IBM 1.5|573|571|{color:green}0.34%{color}|
|XP|1.6.07 (32 bit)|752|804|{color:red}-6.4%{color}|
|SRV 2003|IBM 1.5|530/469|536/493|{color:green}1%{color}/{color:red}-4.86%{color}|
|SRV 2003|1.6.07 (32 bit)|858|699|{color:green}22.7%{color}|

I ran each configuration twice; only the SRV 2003 / IBM 1.5 case differed 
between the two runs, so both results are listed for it. Note also that, unlike 
Mike's results, the SRV 2003 / JRE 1.6 run showed a 22.7% improvement with the 
patched version. I re-ran the SRV 2003 runs a couple of times and the results 
were consistent.

> Refactoring Lucene collectors (HitCollector and extensions)
> ---
>
> Key: LUCENE-1575
> URL: https://issues.apache.org/jira/browse/LUCENE-1575
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
> LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
> LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, 
> LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, 
> sortCollate5.py
>
>
> This issue is a result of a recent discussion we've had on the mailing list. 
> You can read the thread 
> [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
> We have agreed to do the following refactoring:
> * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
> be the base class for all Collector implementations.
> * Deprecate HitCollector in favor of the new Collector.
> * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
> those that accept HitCollector.
> ** Create a final class HitCollectorWrapper, and use it in the deprecated 
> methods in IndexSearcher, wrapping the given HitCollector.
> ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
> when we remove HitCollector.
> ** It will remove any instanceof checks that currently exist in IndexSearcher 
> code.
> * Create a new (abstract) TopDocsCollector, which will:
> ** Leave collect and setNextReader unimplemented.
> ** Introduce protected members PriorityQueue and totalHits.
> ** Introduce a single protected constructor which accepts a PriorityQueue.
> ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
> These can be used as-are by extending classes, as well as be overridden.
> ** Introduce a new topDocs(start, howMany) method which will be used as a 
> convenience method when implementing a search application which allows paging 
> through search results. It will also attempt to improve memory 
> allocation, by allocating a ScoreDoc[] of the requested size only.
> * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
> and getTotalHits() implementations as they are from TopDocsCollector. The 
> class will also be made final.
> * Change TopFieldCollector to extend TopDocsCollector, and make the class 
> final. Also implement topDocs(start, howMany).
> * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
> instead of TopScoreDocCollector. Implement topDocs(start, howMany)
> * Review other places where HitCollector is used, such as in Scorer, 
> deprecate those places and use Collector instead.
> Additionally, the following proposal was made w.r.t. decoupling score from 
> collect():
> * Change collect to accept only a doc Id (unbased).
> * Introduce a setScorer(Scorer) method.
> * If during collect the implementation needs the score, it can call 
> scorer.score().
> If we do this, then we need to review all places in the code where 
> collect(doc, score) is called, and check whether Scorer can be passed. Also 
> this raises a few questions:
> * What if during collect() Scorer is null? (i.e., not set) - is it even 
> possible?
> * I noticed that many (if not all) of the collect() implementations discard 
> the document if its score is not greater than 0. Doesn't it mean that score 
> is needed in collect() always?
> Open issues:
> * The name for Collector
> * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
> that was when we thought to call Collector ResultsCollector. Since we decided 
> (so far) on Collector, I think TopDocsCollector makes sense, because of its 
> TopDocs output.
> * Decoupling score from collect().
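> I will post a patch a bit later, as this is expected to be a very large 
> patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
> Collector instead of HitCollector, as well as testing the new topDocs(start, 
> howMany) method).
> There might be even a 3rd patch which handles the setScorer thing in 
> Collector (maybe even a different issue?)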
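
To make the "decoupling score from collect()" proposal in the issue description more concrete, here is a minimal, self-contained sketch. It is not the patch's code: ScoreSource, SketchCollector, CountingCollector, PositiveScoreCollector and SketchSearch are hypothetical names, and setNextReader is simplified to take only a doc base. It only illustrates the idea that collect() receives a reader-relative doc ID and pulls the score from a previously supplied scorer when it actually needs one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Scorer: exposes only the current document's score. */
interface ScoreSource {
    float score();
}

/** Sketch of the proposed base class; method names follow the issue text, signatures are assumptions. */
abstract class SketchCollector {
    /** Supplied before collect() so implementations can ask for scores lazily. */
    abstract void setScorer(ScoreSource scorer);

    /** Called when the search moves to the next sub-reader; docBase rebases doc IDs. */
    abstract void setNextReader(int docBase);

    /** Receives a doc ID relative to the current reader ("unbased"). */
    abstract void collect(int doc);
}

/** Counts hits and never asks for a score, so no scoring work is forced on it. */
final class CountingCollector extends SketchCollector {
    int totalHits;

    @Override void setScorer(ScoreSource scorer) { /* score is never needed */ }
    @Override void setNextReader(int docBase) { /* doc IDs are never needed */ }
    @Override void collect(int doc) { totalHits++; }
}

/** Keeps global doc IDs whose score is positive, pulling the score on demand. */
final class PositiveScoreCollector extends SketchCollector {
    final List<Integer> docs = new ArrayList<>();
    private ScoreSource scorer;
    private int docBase;

    @Override void setScorer(ScoreSource scorer) { this.scorer = scorer; }
    @Override void setNextReader(int docBase) { this.docBase = docBase; }

    @Override void collect(int doc) {
        // One possible answer to the "what if Scorer is null" question: fail fast.
        if (scorer == null) {
            throw new IllegalStateException("setScorer() must be called before collect()");
        }
        if (scorer.score() > 0f) {
            docs.add(docBase + doc);
        }
    }
}

/** Hypothetical driver that plays the role of the searcher in this sketch. */
final class SketchSearch {
    static void run(SketchCollector collector, int[] docBases, float[][] scoresPerReader) {
        final float[] current = new float[1];
        collector.setScorer(() -> current[0]);
        for (int r = 0; r < docBases.length; r++) {
            collector.setNextReader(docBases[r]);
            for (int doc = 0; doc < scoresPerReader[r].length; doc++) {
                current[0] = scoresPerReader[r][doc];
                collector.collect(doc);
            }
        }
    }
}
```

In this sketch a collector that only counts hits never touches the score, which is the potential saving the proposal is after; whether every caller of collect(doc, score) can supply a scorer is exactly the open question raised in the description.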

[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1575:
---

Attachment: LUCENE-1575.8.patch

Added JustCompileSearch, JustCompileSearchFunction and JustCompileSearchSpans, 
which extend/implement all abstract classes/interfaces in o.a.l.s, o.a.l.s.s and 
o.a.l.s.f. These are not unit tests per se; however, if anyone changes the 
interfaces/abstract classes in a way that breaks back-compat, we'll know 
right away. I think this is good to have for Lucene in general, but I only 
took care of the search.* packages in this patch.
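
As a rough illustration of the "just compile" idea (not the contents of the attached patch), the sketch below uses a hypothetical abstract class SketchFilter in place of a real search-package class. JustCompileSketch and its nested class are likewise made-up names. The point is that the subclass has no behaviour; it exists only so that adding a new abstract method to the base class breaks the build.

```java
/** Hypothetical abstract API standing in for a class from the search package. */
abstract class SketchFilter {
    abstract boolean accept(int doc);
    abstract String describe();
}

/** Holder for compile-only implementations; nothing here is meant to be executed. */
final class JustCompileSketch {
    static final String UNSUPPORTED_MSG =
        "unsupported: used for back-compat checking only";

    private JustCompileSketch() { }

    /**
     * Implements every abstract method of SketchFilter. If a new abstract method
     * is added to SketchFilter, this class stops compiling, which flags the
     * back-compat break without any test having to run.
     */
    static final class JustCompileFilter extends SketchFilter {
        @Override boolean accept(int doc) {
            throw new UnsupportedOperationException(UNSUPPORTED_MSG);
        }

        @Override String describe() {
            throw new UnsupportedOperationException(UNSUPPORTED_MSG);
        }
    }
}
```

Throwing from every method is a design choice in this sketch: it keeps anyone from mistaking the class for a usable implementation.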

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696020#action_12696020
 ] 

Shai Erera commented on LUCENE-1575:


I'm using the latest version, which sorts by that random field (the output 
includes the prints of best, avg. and sum, so I'm sure of that). Also, the 
times I reported are the 'best' times. I launch the JRE as you posted, with 
these args: "-Xms1024M -Xmx1024M -Xbatch -server".

I reran now, and the results are consistent.

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696017#action_12696017
 ] 

Michael McCandless commented on LUCENE-1575:


Mark and Shai, are you guys using the last version of the bench (that sorts by 
random int field)?  Are you using the "best" time for your results?  How are 
you launching the JRE?

bq. BTW, if you look at Mike's table above, it's a black and white thing: the 
1.5 JRE really likes this patch and 1.6 really hates it. Maybe we should not move 
to 1.6 then?

Actually, for my run on Linux, the patch was faster for both 1.5 & 1.6 JREs.

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696084#action_12696084
 ] 

Mark Miller commented on LUCENE-1575:
-

I just used the defaults for cmd line - I can give it another go ensuring 
server and more RAM. I used the latest perf code provided by Mike and the 
latest patch.

I didn't look at the numbers too closely - my plan was to do a quick profile 
with each, but eyeballing runs with each over and over, they were approx the 
same (both best and avg), so I skipped the profiling.


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696101#action_12696101
 ] 

Michael McCandless commented on LUCENE-1575:



I ran 2 more JREs under linux:

||OS||JRE||Trunk||Patch||%tg||
||Linux|1.7.0 ea|333 ms|320 ms|{color:green}3.9%{color}|
||Linux|IBM JRE 1.5.0|401 ms|352 ms|{color:green}12.2%{color}|


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


--

[jira] Created: (LUCENE-1587) RangeQuery equals method does not compare collator property fully

2009-04-06 Thread Mark Platvoet (JIRA)
RangeQuery equals method does not compare collator property fully
-

 Key: LUCENE-1587
 URL: https://issues.apache.org/jira/browse/LUCENE-1587
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4.1
Reporter: Mark Platvoet
Priority: Minor


The equals method in the range query has the collator comparison implemented as:
(this.collator != null && ! this.collator.equals(other.collator))

When _this.collator = null_ and _other.collator = someCollator_  this method 
will incorrectly assume they are equal. 

So adding something like
|| (this.collator == null && other.collator != null)
would fix the problem
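
The asymmetry described above can be reproduced outside Lucene. The sketch below uses a hypothetical CollatorHolder class (not RangeQuery itself) with a nullable java.text.Collator field. It contrasts the check quoted in the report with a null-safe comparison via java.util.Objects.equals, which needs Java 7 or later; on older JREs the extra clause suggested in the report achieves the same thing.

```java
import java.text.Collator;
import java.util.Objects;

/** Hypothetical holder with a nullable collator, standing in for the query class. */
final class CollatorHolder {
    final Collator collator; // may be null

    CollatorHolder(Collator collator) {
        this.collator = collator;
    }

    /** The check quoted in the report; true means "the collators differ". */
    boolean differsAsReported(CollatorHolder other) {
        return this.collator != null && !this.collator.equals(other.collator);
    }

    /** The report's suggested fix: also treat null vs. non-null as different. */
    boolean differsWithSuggestedFix(CollatorHolder other) {
        return (this.collator != null && !this.collator.equals(other.collator))
            || (this.collator == null && other.collator != null);
    }

    /** Equivalent null-safe form using Objects.equals (Java 7+). */
    boolean differsNullSafe(CollatorHolder other) {
        return !Objects.equals(this.collator, other.collator);
    }
}
```

With this.collator == null and a non-null other.collator, differsAsReported returns false (the bug), while the other two methods report a difference.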


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1587) RangeQuery equals method does not compare collator property fully

2009-04-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-1587:
---

Assignee: Mark Miller



[jira] Updated: (LUCENE-1587) RangeQuery equals method does not compare collator property fully

2009-04-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1587:


Fix Version/s: 2.9




[jira] Created: (LUCENE-1588) Update Spatial Lucene sort to use FieldComparatorSource

2009-04-06 Thread patrick o'leary (JIRA)
Update Spatial Lucene sort to use FieldComparatorSource
---

 Key: LUCENE-1588
 URL: https://issues.apache.org/jira/browse/LUCENE-1588
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 2.9
Reporter: patrick o'leary
Assignee: patrick o'leary
Priority: Trivial
 Fix For: 2.9


Update distance sorting to use FieldComparator sorting as opposed to 
SortComparator




[jira] Updated: (LUCENE-1588) Update Spatial Lucene sort to use FieldComparatorSource

2009-04-06 Thread patrick o'leary (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

patrick o'leary updated LUCENE-1588:


Attachment: LUCENE-1588.patch

Deprecate DistanceSortSource and add DistanceFieldComparator.
Updated the test case to use DistanceFieldComparator.

Usage
{code}
// Create a distance sort.
// The radius filter has already performed the distance calculations,
// so pass in the filter to reuse the results.
DistanceFieldComparatorSource dsort =
    new DistanceFieldComparatorSource(dq.distanceFilter);
Sort sort = new Sort(new SortField("foo", dsort, false));

// Perform the search, using the term query, the serial chain filter, and the
// distance sort.
Hits hits = searcher.search(customScore, dq.getFilter(), sort);
{code}

If nobody objects I'll apply this later today

> Update Spatial Lucene sort to use FieldComparatorSource
> ---
>
> Key: LUCENE-1588
> URL: https://issues.apache.org/jira/browse/LUCENE-1588
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/spatial
>Affects Versions: 2.9
>Reporter: patrick o'leary
>Assignee: patrick o'leary
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1588.patch
>
>
> Update distance sorting to use FieldComparator sorting as opposed to 
> SortComparator




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696145#action_12696145
 ] 

Shai Erera commented on LUCENE-1575:


So how do we proceed? It looks like we get inconsistent results, sometimes on 
the same OS and JRE, just on a different machine. Perhaps the test is too synthetic, 
although it does capture the essence of the changes. Mike, can you post your 
Wikipedia index somewhere so I can download it, run your previous queries, and 
compare the results?

> Refactoring Lucene collectors (HitCollector and extensions)
> ---
>
> Key: LUCENE-1575
> URL: https://issues.apache.org/jira/browse/LUCENE-1575
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
> LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
> LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, 
> LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, 
> sortBench5.py, sortCollate5.py
>
>
> This issue is a result of a recent discussion we've had on the mailing list. 
> You can read the thread 
> [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
> We have agreed to do the following refactoring:
> * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
> be the base class for all Collector implementations.
> * Deprecate HitCollector in favor of the new Collector.
> * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
> those that accept HitCollector.
> ** Create a final class HitCollectorWrapper, and use it in the deprecated 
> methods in IndexSearcher, wrapping the given HitCollector.
> ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
> when we remove HitCollector.
> ** It will remove any instanceof checks that currently exist in IndexSearcher 
> code.
> * Create a new (abstract) TopDocsCollector, which will:
> ** Leave collect and setNextReader unimplemented.
> ** Introduce protected members PriorityQueue and totalHits.
> ** Introduce a single protected constructor which accepts a PriorityQueue.
> ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
> These can be used as-are by extending classes, as well as be overridden.
> ** Introduce a new topDocs(start, howMany) method which will be used as a 
> convenience method when implementing a search application which allows paging 
> through search results. It will also attempt to improve the memory 
> allocation, by allocating a ScoreDoc[] of the requested size only.
> * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
> and getTotalHits() implementations as they are from TopDocsCollector. The 
> class will also be made final.
> * Change TopFieldCollector to extend TopDocsCollector, and make the class 
> final. Also implement topDocs(start, howMany).
> * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
> instead of TopScoreDocCollector. Implement topDocs(start, howMany)
> * Review other places where HitCollector is used, such as in Scorer, 
> deprecate those places and use Collector instead.
> Additionally, the following proposal was made w.r.t. decoupling score from 
> collect():
> * Change collect to accept only a doc Id (unbased).
> * Introduce a setScorer(Scorer) method.
> * If during collect the implementation needs the score, it can call 
> scorer.score().
> If we do this, then we need to review all places in the code where 
> collect(doc, score) is called, and assert whether Scorer can be passed. Also 
> this raises a few questions:
> * What if during collect() Scorer is null? (i.e., not set) - is it even 
> possible?
> * I noticed that many (if not all) of the collect() implementations discard 
> the document if its score is not greater than 0. Doesn't it mean that score 
> is needed in collect() always?
> Open issues:
> * The name for Collector
> * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
> that was when we thought to call Collector ResultsCollector. Since we decided 
> (so far) on Collector, I think TopDocsCollector makes sense, because of its 
> TopDocs output.
> * Decoupling score from collect().
> I will post a patch a bit later, as this is expected to be a very large 
> patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
> Collector instead of HitCollector, as well as testing the new topDocs(start, 
> howMany) method).
> There might be even a 3rd patch which handles the setScorer thing in 
> Collector (maybe even a different issue?)
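
The proposal to decouple the score from collect() can be sketched roughly as follows. The types `ScoreSource` and `SketchCollector` are simplified, hypothetical shapes for illustration only; they are not the classes from the patch, and the real signatures may differ:

```java
// Hypothetical, simplified shapes for illustration only; the real Lucene
// classes have more methods and may differ in signature.
interface ScoreSource {
    float score();
}

abstract class SketchCollector {
    // Called before collection starts so implementations can pull scores lazily.
    abstract void setScorer(ScoreSource scorer);

    // Receives only the segment-relative (unbased) doc ID.
    abstract void collect(int doc);
}

// Needs no score at all, so it never calls scorer.score().
final class CountingCollector extends SketchCollector {
    int count;

    @Override void setScorer(ScoreSource scorer) { /* score not needed */ }

    @Override void collect(int doc) { count++; }
}

// Needs the score, so it asks the scorer for it on demand.
final class MaxScoreCollector extends SketchCollector {
    private ScoreSource scorer;
    float maxScore = Float.NEGATIVE_INFINITY;

    @Override void setScorer(ScoreSource scorer) { this.scorer = scorer; }

    @Override void collect(int doc) {
        maxScore = Math.max(maxScore, scorer.score());
    }
}
```

The point of the design is that a collector which ignores scores (such as the counting one here) never pays for computing them, while a score-based collector fetches the score only when it is actually needed.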


[jira] Updated: (LUCENE-1587) RangeQuery equals method does not compare collator property fully

2009-04-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1587:


Attachment: LUCENE-1587.patch




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696160#action_12696160
 ] 

Michael McCandless commented on LUCENE-1575:


bq. So how do we proceed?

The results are definitely highly varying...

It seems like I'm the only one seeing a sizable performance loss with the patch,
and then only with 64-bit JREs (on OS X and Windows Server 2003 x64).

Mark, when you saw no performance loss on 64-bit Linux, was the JRE
64-bit?

If so, then maybe we should simply proceed with the patch as is.
These differences are clearly Java ghosts and there's not much we can
do about that.

The index is a little too large (2.6G) to schlepp around -- instead,
here's the alg I used to create it:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

merge.policy=org.apache.lucene.index.LogDocMergePolicy

docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = false
doc.term.vector = false
doc.add.log.step=1000
max.field.length=2147483647

directory=FSDirectory
autocommit=false
compound=false
ram.flush.mb = 128
doc.maker.forever = false

work.dir=/lucene/work

{ "Rounds"
  ResetSystemErase
  { "BuildIndex"
    - CreateIndex
    { "AddDocs" AddDoc > : *
    - CloseIndex
  }
  NewRound
} : 1

RepSumByPrefRound BuildIndex
{code}



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696178#action_12696178
 ] 

Mark Miller commented on LUCENE-1575:
-

Yes, both were 64-bit versions: OpenJDK 6 and Sun Java 1.5. I appeared to be 
getting the same results with both JVMs, patched or not. I figured I'd try 
a bit of profiling, since I have a 64-bit setup, but it doesn't appear I'd learn 
much. I'm going to try a bit more testing tonight for the heck of it. I've got 
Sun 1.6 and a 32-bit 1.5 I could check with as well.


[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-04-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696184#action_12696184
 ] 

Jason Rutherglen commented on LUCENE-1584:
--

I think it's good to take a step back: "if we fix Lucene's field
cache, and Lucene's near real-time search manages CSFs
efficiently in memory" fixes the use case. Relying on CSF coming
in probably won't help this case if it doesn't make it into
the 2.9 release. I like the callback method because it does not
rely on passing segment infos around and instead uses the
already public IndexReader classes.
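
To make the use case concrete, here is a rough sketch of what an application could do with such a callback. `MergeListener` is a hypothetical interface invented for illustration (IndexWriter does not expose it, and it is not the API in the attached patch), and the sketch assumes the merge preserves document order and drops no deleted docs:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical callback; IndexWriter does not expose this interface.
interface MergeListener {
    void segmentsMerged(List<String> sourceSegments, int[] sourceMaxDocs, String mergedSegment);
}

// Keeps one BitSet per segment and concatenates them when segments merge.
// Assumes the merge preserves document order and drops no deleted docs.
final class BitSetCache implements MergeListener {
    final Map<String, BitSet> perSegment = new HashMap<>();

    @Override
    public void segmentsMerged(List<String> sources, int[] maxDocs, String merged) {
        BitSet combined = new BitSet();
        int base = 0; // first doc ID of the current source segment in the merged segment
        for (int i = 0; i < sources.size(); i++) {
            BitSet bits = perSegment.remove(sources.get(i));
            if (bits != null) {
                for (int d = bits.nextSetBit(0); d >= 0; d = bits.nextSetBit(d + 1)) {
                    combined.set(base + d);
                }
            }
            base += maxDocs[i];
        }
        perSegment.put(merged, combined);
    }
}
```

Knowing which segments went into the merge lets the cache be rebuilt from the existing per-segment data instead of being recomputed from scratch for the new segment.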


> Callback for intercepting merging segments in IndexWriter
> -
>
> Key: LUCENE-1584
> URL: https://issues.apache.org/jira/browse/LUCENE-1584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1584.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> For things like merging field caches or bitsets, it's useful to
> know which segments were merged to create a new segment.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696186#action_12696186
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

bq. So this has no external dependencies, right?

Yes.

{quote}I'd be very interested to compare (benchmark) this approach
vs solely LUCENE-1516.{quote}

Is the .alg using the NearRealtimeReader from LUCENE-1516 our
best measure of realtime performance?

{quote} 
the transactional restriction could/should layer on
top of this performance optimization for near-realtime search?
{quote}

The transactional system should be able to support both methods.
Perhaps a non-locking setting would allow the same RealtimeIndex
class to support both modes of operation?

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-04-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696277#action_12696277
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

We'll need to integrate the RAM-based indexer into IndexWriter
to carry over the deletes to the RAM index while it's copied to
disk. This is similar to IndexWriter.commitMergedDeletes
carrying deletes over at the segment reader level based on a
comparison of the current reader and the cloned reader.
Otherwise there are redundant deletions to the disk index using
IW.deleteDocuments, which can be unnecessarily expensive. To make
this external we would need to do the delete by doc id genealogy. 




Re: Future projects

2009-04-06 Thread Jason Rutherglen
> The realtime reader would have to have sub-readers per thread,
and an aggregate reader that "joins" them by interleaving the
docIDs

Nice (i.e. nice and complex)! Not knowing too much about the
internals, how would the interleaving work? Does each subreader
have a "start" à la Multi*Reader? Or are the doc ids incremented
from a synced place such that no two readers have the same doc
id?
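
For reference, the "start" option could look roughly like the sketch below, where each sub-reader owns a contiguous block of global doc IDs. `DocIdMapper` is a hypothetical helper, not an existing Lucene class, and it shows only the contiguous-offset variant, not true interleaving:

```java
// Maps a global doc ID onto (sub-reader index, local doc ID) using per-reader
// start offsets. Hypothetical helper for illustration only.
final class DocIdMapper {
    // starts[i] is the first global doc ID of sub-reader i;
    // the last entry is the total number of docs.
    private final int[] starts;

    DocIdMapper(int[] maxDocs) {
        starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
    }

    // Index of the sub-reader that owns the given global doc ID.
    int readerIndex(int globalDoc) {
        if (globalDoc < 0 || globalDoc >= starts[starts.length - 1]) {
            throw new IndexOutOfBoundsException("doc " + globalDoc);
        }
        int i = 0;
        // Linear scan; fine for a handful of per-thread sub-readers.
        while (starts[i + 1] <= globalDoc) i++;
        return i;
    }

    // Doc ID relative to the owning sub-reader.
    int localDoc(int globalDoc) {
        return globalDoc - starts[readerIndex(globalDoc)];
    }
}
```

With this layout, no two sub-readers can produce the same global doc ID, but the offsets are only valid for a fixed snapshot of each sub-reader's doc count.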

> BTW there are benefits to not reusing the RAM buffer, outside
of faster near real-time search

Not reusing the RAM buffer means not reusing the pooled byte
arrays after a flush or something else?

> thus allowing add/deletes in other threads to run. Currently
they are all blocked ("stop the world") during flush

SSDs are cool, I can't see management approving of those quite
yet, are there many places piloting Lucene on SSDs that you're
aware of?

From what you've said so far, this is how I understand realtime
RAM buffer readers could work:

There'd be an IndexWriter.getRAMReader method that gathers all
the RAM buffers from the various threads and marks a doc id as the
last one for the overall RAMBufferMultiReader. A new set of
classes, RAMBufferTermEnum, RAMBufferTermDocs, and
RAMBufferTermPositions, would be implemented that can read from
the RAM buffer.

I don't think the current field cache API would like growing
arrays? Something hopefully LUCENE-831 will support.

On Sat, Apr 4, 2009 at 4:46 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Fri, Apr 3, 2009 at 8:01 PM, Jason Rutherglen
>  wrote:
> > I looked at the IndexWriter code in regards to creating a realtime
> reader,
> > with the many flexible indexing classes I'm unsure of how one would get a
> > frozenish IndexInput of the byte slices, given the byte slices are
> attached
> > to different threads?
>
> The realtime reader would have to have sub-readers per thread, and an
> aggregate reader that "joins" them by interleaving the docIDs.  When
> flushing we create such a beast, but, it's not general purpose (ie it
> does not implement IndexReader API; it only implements enough to write
> the postings).
>
> BTW there are benefits to not reusing the RAM buffer, outside of
> faster near real-time search: it would allow flushing to be done in
> the BG.  Ie, flush could start, and we'd immediately switch to a new
> RAM buffer, thus allowing add/deletes in other threads to run.
> Currently they are all blocked ("stop the world") during flush, though
> it's not clear on a fast IO device (SSD) how big a deal this "stop the
> world" really is to indexing throughput.
>
> But still it's a complex change.
>
> Mike


HitCollector#collect(int,float,Collection)

2009-04-06 Thread Karl Wettin
How crazy would it be to refactor HitCollector so it also accepts the
matching queries?


Let's ignore my use case (not sure it makes sense yet; it's related to
finding a threshold between probably interesting and definitely not
interesting results of huge OR-statements, but I really have to try it
out before I can say if it's any good) and just focus on the speed
impact. If I cleared and reused the Collection passed down to the
HitCollector then it shouldn't really slow things down, right? And if
I reused the collections in my TopDocsCollector as low-scoring results
were pushed down then it shouldn't have to be expensive there either. Or?



karl




[jira] Created: (LUCENE-1589) IndexWriter.addIndexesNoOptimize(IndexReader[] readers)

2009-04-06 Thread Jason Rutherglen (JIRA)
IndexWriter.addIndexesNoOptimize(IndexReader[] readers)
---

 Key: LUCENE-1589
 URL: https://issues.apache.org/jira/browse/LUCENE-1589
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9


Similar to IndexWriter.addIndexesNoOptimize(Directory[] dirs)
but for IndexReaders. This will be used to flush cloned ram
indexes to disk for near realtime indexing.




Re: HitCollector#collect(int,float,Collection)

2009-04-06 Thread Shai Erera
Hi Karl,

LUCENE-1575 refactors HitCollector by separating the score from document
collection. So if we were to introduce the type of method you
suggest, it would be through a setQueries(Collection) method.

Maybe you could first verify that your use case makes sense, is efficient etc.,
before we make this change. Adding a setQueries to Collector (the new name of
HC) shouldn't be a problem since we can always add an empty-impl method, not
affecting back-compat. However, I wonder where it would be called from, and
whether it makes sense to create that Collection object and pass it around
while knowing that most collectors will not use it.

Is it something that you could perhaps implement by extending Collector (and
some other classes), and calling setQueries from your extending code? Today,
as far as I remember, only Scorer calls collect() and I'm not sure if Scorer
has the information about the matching queries. Even if it does, extending it
and calling setQueries from the extension seems more reasonable than adding
such a call to every query execution, which also means instantiating a new
Collection for every search (unless we provide an API on
IndexSearcher which allows you to pass such an object).
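
To illustrate the empty-impl idea, here is a rough sketch. `QueryAwareCollector` is a hypothetical class, not the Collector from LUCENE-1575, and queries are represented as plain strings only to keep the example self-contained:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical shape; not the actual Collector class from LUCENE-1575.
abstract class QueryAwareCollector {
    // Empty default so existing subclasses keep compiling and behaving as before.
    void setQueries(Collection<String> matchingQueries) { }

    abstract void collect(int doc);
}

// Records how many queries matched each collected doc.
final class MatchCountCollector extends QueryAwareCollector {
    private Collection<String> current;
    final List<Integer> matchCounts = new ArrayList<>();

    @Override void setQueries(Collection<String> matchingQueries) {
        current = matchingQueries;
    }

    @Override void collect(int doc) {
        // Copy the size, not the collection: the caller clears and reuses it.
        matchCounts.add(current == null ? 0 : current.size());
    }
}
```

Because the caller clears and refills one shared Collection before each collect() call, a collector that wants to keep the matches beyond that call has to copy what it needs.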

What do you think?

On Tue, Apr 7, 2009 at 3:21 AM, Karl Wettin  wrote:

> How crazy would it be to refactor HitCollector so it also accepts the
> matching queries?
>
> Let's ignore my use case (not sure it makes sense yet; it's related to
> finding a threshold between probably interesting and definitely not
> interesting results of huge OR-statements, but I really have to try it out
> before I can say if it's any good) and just focus on the speed impact. If I
> cleared and reused the Collection passed down to the HitCollector then it
> shouldn't really slow things down, right? And if I reused the collections in
> my TopDocsCollector as low-scoring results were pushed down then it shouldn't
> have to be expensive there either. Or?
>
>
>karl


MoreLikeThisQuery term frequency caching

2009-04-06 Thread Richard Marr
Hi all,

I've been exploring MoreLikeThisQuery as part of a recent project and
something that came out of that might be useful to others here.

I found that using MoreLikeThisQuery could be quite slow for my use
case, but that most of the time involved was spent looking up term
frequencies to calculate weightings. Since those term frequencies
usually don't need to be anywhere near real-time I found that caching
them in a hashmap had a very good cost/benefit ratio for my
application, speeding up MLT queries by an order of magnitude.
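
Roughly, the caching looks like the sketch below. This is a simplified illustration rather than the actual modification: `DocFreqCache` and the lookup function are stand-ins for whatever reads frequencies from the index, and a ConcurrentHashMap is used here in place of a plain hashmap so that the cache can be shared between threads:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

// Caches document frequencies per term. The lookup function is a stand-in for
// whatever the application uses to read frequencies from the index.
final class DocFreqCache {
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();
    private final ToIntFunction<String> lookup;

    DocFreqCache(ToIntFunction<String> lookup) {
        this.lookup = lookup;
    }

    // Returns the cached frequency, consulting the index only on a miss.
    int docFreq(String term) {
        return cache.computeIfAbsent(term, t -> lookup.applyAsInt(t));
    }

    // Drop all entries, e.g. after the index has changed significantly.
    void clear() {
        cache.clear();
    }
}
```

The trade-off is staleness: cached frequencies drift from the index as documents are added or removed, so the cache needs to be cleared or rebuilt at whatever interval the application can tolerate.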

My use case was possibly unusual in that I was looking at a limited
vocabulary rather than full English, but in theory other applications
that make use of the MLT class could benefit.

So at this point I have some questions: (1) Have others experienced
similar performance characteristics for MLT code? (2) Am I missing
some fatal flaw in this approach? (3) Are the modifications worth
sharing?

Cheers,

Rich
