[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12444573 ] 

Yonik Seeley commented on LUCENE-697:
-

Comment out line 104 of QueryUtils.java to reproduce this problem:

  scoreDiff=0; // TODO: remove this to get LUCENE-697 failures


> Scorer.skipTo affects sloppyPhrase scoring
> --
>
> Key: LUCENE-697
> URL: http://issues.apache.org/jira/browse/LUCENE-697
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
>
> If you mix skipTo() and next(), you get different scores than what is 
> returned to a hit collector.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()

2006-10-24 Thread Yonik Seeley (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-696?page=all ]

Yonik Seeley resolved LUCENE-696.
-

Fix Version/s: 2.0.1
   Resolution: Fixed
 Assignee: Yonik Seeley

Patch committed after further tests were added.

> Scorer.skipTo() doesn't always work if called before next()
> ---
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Fix For: 2.0.1
>
> Attachments: dismax.patch
>
>
> skipTo() doesn't work for all scorers if called before next().




[jira] Commented: (LUCENE-698) FilteredQuery ignores boost

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-698?page=comments#action_12444570 ] 

Yonik Seeley commented on LUCENE-698:
-

I just committed hashCode() and equals() changes to take boost into account so 
that generic tests in QueryUtils.check(query) can pass.

One should arguably be able to set the boost on any query clause, so I'm 
leaving this open since I think scoring probably ignores the boost too.

> FilteredQuery ignores boost
> ---
>
> Key: LUCENE-698
> URL: http://issues.apache.org/jira/browse/LUCENE-698
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
>
> Filtered query ignores its own boost.




[jira] Created: (LUCENE-698) FilteredQuery ignores boost

2006-10-24 Thread Yonik Seeley (JIRA)
FilteredQuery ignores boost
---

 Key: LUCENE-698
 URL: http://issues.apache.org/jira/browse/LUCENE-698
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yonik Seeley


Filtered query ignores its own boost.




[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12444565 ] 

Yonik Seeley commented on LUCENE-697:
-

Here's the ant output from test code to be checked in shortly.
The test code calls skipTo(), skipTo(), next(), next(), etc 
while checking that the results match the hitcollector version.

[junit] Testcase: testP6(org.apache.lucene.search.TestSimpleExplanations): Caused an ERROR
[junit] ERROR matching docs:
[junit] scorer.more=true doc=1 score=0.7849069
[junit] hitCollector.doc=1 score=0.67974937
[junit]  Scorer=scorer(weight(field:"w3 w2"~2))
[junit]  Query=field:"w3 w2"~2
[junit]  [EMAIL PROTECTED]
[junit] java.lang.RuntimeException: ERROR matching docs:
[junit] scorer.more=true doc=1 score=0.7849069
[junit] hitCollector.doc=1 score=0.67974937
[junit]  Scorer=scorer(weight(field:"w3 w2"~2))
[junit]  Query=field:"w3 w2"~2
[junit]  [EMAIL PROTECTED]
[junit] at org.apache.lucene.search.QueryUtils$2.collect(QueryUtils.java:104)
[junit] at org.apache.lucene.search.Scorer.score(Scorer.java:48)
[junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:116)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:95)
[junit] at org.apache.lucene.search.QueryUtils.checkSkipTo(QueryUtils.java:97)
[junit] at org.apache.lucene.search.QueryUtils.check(QueryUtils.java:75)
[junit] at org.apache.lucene.search.CheckHits.checkHitCollector(CheckHits.java:91)
[junit] at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:90)
[junit] at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:86)
[junit] at org.apache.lucene.search.TestSimpleExplanations.testP6(TestSimpleExplanations.java:87)


[junit] Testcase: testP7(org.apache.lucene.search.TestSimpleExplanations): Caused an ERROR
[junit] ERROR matching docs:
[junit] scorer.more=true doc=1 score=0.7849069
[junit] hitCollector.doc=1 score=0.67974937
[junit]  Scorer=scorer(weight(field:"w3 w2"~3))
[junit]  Query=field:"w3 w2"~3
[junit]  [EMAIL PROTECTED]
[junit] java.lang.RuntimeException: ERROR matching docs:
[junit] scorer.more=true doc=1 score=0.7849069
[junit] hitCollector.doc=1 score=0.67974937
[junit]  Scorer=scorer(weight(field:"w3 w2"~3))
[junit]  Query=field:"w3 w2"~3
[junit]  [EMAIL PROTECTED]
[junit] at org.apache.lucene.search.QueryUtils$2.collect(QueryUtils.java:104)
[junit] at org.apache.lucene.search.Scorer.score(Scorer.java:48)
[junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:116)
[junit] at org.apache.lucene.search.Searcher.search(Searcher.java:95)
[junit] at org.apache.lucene.search.QueryUtils.checkSkipTo(QueryUtils.java:97)
[junit] at org.apache.lucene.search.QueryUtils.check(QueryUtils.java:75)
[junit] at org.apache.lucene.search.CheckHits.checkHitCollector(CheckHits.java:91)
[junit] at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:90)
[junit] at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:86)
[junit] at org.apache.lucene.search.TestSimpleExplanations.testP7(TestSimpleExplanations.java:90)


[junit] Test org.apache.lucene.search.TestSimpleExplanations FAILED
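
A minimal sketch of the kind of check described above, written without Lucene so that it runs on its own. ToyScorer, ArrayScorer and SkipToCheck are hypothetical names, not the classes in QueryUtils, and the scorer walks fixed arrays rather than an index. One scorer is advanced with next() only, standing in for the hit collector path; a second is advanced with skipTo(), skipTo(), next(), next(), and so on; every doc or score disagreement is recorded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the iteration methods of a scorer. */
interface ToyScorer {
    boolean next();              // advance to the next match; false when exhausted
    boolean skipTo(int target);  // advance to the first later match with doc() >= target
    int doc();
    float score();
}

/** Toy scorer over parallel arrays of ascending doc ids and their scores. */
final class ArrayScorer implements ToyScorer {
    private final int[] docs;
    private final float[] scores;
    private int pos = -1;        // not positioned until next() or skipTo() is called

    ArrayScorer(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    public boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (target > doc());
        return true;
    }

    public int doc() { return docs[pos]; }
    public float score() { return scores[pos]; }
}

final class SkipToCheck {
    /**
     * Walks 'reference' with next() only (standing in for the hit collector path)
     * and 'mixed' with skipTo, skipTo, next, next, ... and reports every mismatch.
     */
    static List<String> check(ToyScorer reference, ToyScorer mixed) {
        List<String> errors = new ArrayList<>();
        int lastDoc = -1;
        for (int step = 0; reference.next(); step++) {
            boolean useSkip = (step % 4) < 2;
            boolean more = useSkip ? mixed.skipTo(lastDoc + 1) : mixed.next();
            if (!more) {
                errors.add("mixed scorer ran out before doc " + reference.doc());
                break;
            }
            if (mixed.doc() != reference.doc() || mixed.score() != reference.score()) {
                errors.add("doc=" + mixed.doc() + " score=" + mixed.score()
                        + " but reference doc=" + reference.doc()
                        + " score=" + reference.score());
            }
            lastDoc = mixed.doc();
        }
        return errors;
    }
}
```

Two scorers built from the same arrays produce an empty error list. Giving the second scorer a different score for one document produces one entry, which is the shape of the failures in the ant output above.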

> Scorer.skipTo affects sloppyPhrase scoring
> --
>
> Key: LUCENE-697
> URL: http://issues.apache.org/jira/browse/LUCENE-697
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
>
> If you mix skipTo() and next(), you get different scores than what is 
> returned to a hit collector.




[jira] Created: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

2006-10-24 Thread Yonik Seeley (JIRA)
Scorer.skipTo affects sloppyPhrase scoring
--

 Key: LUCENE-697
 URL: http://issues.apache.org/jira/browse/LUCENE-697
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.0.0
Reporter: Yonik Seeley


If you mix skipTo() and next(), you get different scores than what is returned 
to a hit collector.




[jira] Updated: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()

2006-10-24 Thread Yonik Seeley (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-696?page=all ]

Yonik Seeley updated LUCENE-696:


Attachment: dismax.patch

DisjunctionMaxScorer turned out to be the only scorer I could see with that 
problem.
Here's the patch w/ tests.

> Scorer.skipTo() doesn't always work if called before next()
> ---
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
> Attachments: dismax.patch
>
>
> skipTo() doesn't work for all scorers if called before next().




[jira] Updated: (LUCENE-528) Optimization for IndexWriter.addIndexes()

2006-10-24 Thread Ning Li (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-528?page=all ]

Ning Li updated LUCENE-528:
---

Attachment: AddIndexesNoOptimize.patch

This patch implements addIndexesNoOptimize() following the algorithm described 
earlier.
  - The patch is based on the latest version from trunk.
  - addIndexesNoOptimize() is implemented. The algorithm description is 
    included as a comment, and the code itself is commented.
  - The patch includes a test called TestAddIndexesNoOptimize which covers all 
    the code in addIndexesNoOptimize().
  - maybeMergeSegments() was conservative and checked for more merges only when 
    "upperBound * mergeFactor <= maxMergeDocs". It now checks for more merges 
    when "upperBound < maxMergeDocs".
  - Minor changes in TestIndexWriterMergePolicy to better verify merge 
    invariants.
  - The patch passes all unit tests.

One more comment on the implementation:
  - When we copy un-merged segments from S in step 4, ideally we would simply 
    copy those segments. However, Directory does not support copy yet. In 
    addition, the source may or may not use the compound file format, and 
    likewise the target. So we use mergeSegments() to copy each segment, which 
    may cause the doc count to change because deleted docs are garbage 
    collected. That case is handled properly.
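
To make the difference between the two maybeMergeSegments() conditions concrete, here is a small sketch. It assumes the bound starts at some initial value and is multiplied by mergeFactor once per level, which is an assumption about the loop shape and not a copy of IndexWriter; MergeLevelSketch and its method names are hypothetical. Each method lists the upperBound values for which further merges would be considered.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeLevelSketch {
    /** Levels visited under the older test: upperBound * mergeFactor <= maxMergeDocs. */
    static List<Long> conservativeLevels(long startUpperBound, int mergeFactor, int maxMergeDocs) {
        requireGrowing(startUpperBound, mergeFactor);
        List<Long> levels = new ArrayList<>();
        for (long upperBound = startUpperBound;
             upperBound * mergeFactor <= maxMergeDocs;
             upperBound *= mergeFactor) {
            levels.add(upperBound);
        }
        return levels;
    }

    /** Levels visited under the changed test: upperBound < maxMergeDocs. */
    static List<Long> relaxedLevels(long startUpperBound, int mergeFactor, int maxMergeDocs) {
        requireGrowing(startUpperBound, mergeFactor);
        List<Long> levels = new ArrayList<>();
        for (long upperBound = startUpperBound;
             upperBound < maxMergeDocs;
             upperBound *= mergeFactor) {
            levels.add(upperBound);
        }
        return levels;
    }

    /** The bound must grow on every step, otherwise the loops above would not terminate. */
    private static void requireGrowing(long start, int factor) {
        if (start < 1 || factor < 2) {
            throw new IllegalArgumentException("need start >= 1 and mergeFactor >= 2");
        }
    }
}
```

With a starting bound of 10, a mergeFactor of 10 and maxMergeDocs of 5000, the older condition stops after the levels 10 and 100, while the changed condition also visits 1000.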

> Optimization for IndexWriter.addIndexes()
> -
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: AddIndexes.patch, AddIndexesNoOptimize.patch
>
>
> One big performance problem with IndexWriter.addIndexes() is that it has to 
> optimize the index both before and after adding the segments.  When you have 
> a very large index, to which you are adding batches of small updates, these 
> calls to optimize make using addIndexes() impossible.  It makes parallel 
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on 
> the newly added documents.  It will try to avoid calling mergeSegments until 
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works 
> correctly if people are interested.  I gave it a different name because it 
> has very different performance characteristics which can make querying take 
> longer.




Re: [jira] Commented: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()

2006-10-24 Thread Chris Hostetter

: It would also simplify some scorers if doc() wasn't undefined before
: next() or skipTo() was called, but instead -1.

+1 ... but if we are going to change the API requirements for doc(),
we should clarify the requirements for score() ... with doc(), negative
numbers can easily be used as a marker of "invalid", but the same rule
isn't as easy to apply with the score() method ... perhaps the
documentation for doc() and score() should be...

doc():   Returns the current document number matching the query.
         Returns -1 if neither next() nor skipTo() has been called at
         least once; behavior is undefined if the last call to next()
         or skipTo() returned false.
score(): Returns the score of the current document matching the query.
         The value is undefined if doc() returns -1, or if the last
         call to next() or skipTo() returned false.


...we probably want to make the same API changes to Spans, TermEnum,
and TermDocs as well to be consistent.

-Hoss
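
A small sketch of what the proposed doc() contract buys a caller. SentinelScorer is a hypothetical toy over a sorted array, not a Lucene class. Because doc() is -1 until the scorer is positioned, a helper such as ensureAtOrAfter() needs no separate "first time" flag: -1 is below every valid target, so the first call always falls through to skipTo().

```java
/** Toy scorer whose doc() is -1 until next() or skipTo() has been called (the proposed contract). */
final class SentinelScorer {
    private final int[] docs;    // ascending doc ids
    private int pos = -1;
    private int doc = -1;        // sentinel: not positioned yet

    SentinelScorer(int... docs) { this.docs = docs.clone(); }

    boolean next() {
        if (pos + 1 >= docs.length) {
            pos = docs.length;   // exhausted; doc() is undefined from here on
            return false;
        }
        doc = docs[++pos];
        return true;
    }

    /** Always advances at least once, then keeps going until doc() >= target. */
    boolean skipTo(int target) {
        while (next()) {
            if (doc >= target) return true;
        }
        return false;
    }

    int doc() { return doc; }

    /**
     * Positions the scorer at or after 'target' without any extra state.
     * Must not be called again once it has returned false.
     */
    static boolean ensureAtOrAfter(SentinelScorer s, int target) {
        return s.doc() >= target || s.skipTo(target);
    }
}
```

Without the sentinel, the same helper would need a boolean recording whether the scorer had been advanced yet, which is the extra state mentioned in the quoted comment.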





[jira] Commented: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()

2006-10-24 Thread Paul Elschot (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-696?page=comments#action_12444506 ] 

Paul Elschot commented on LUCENE-696:
-

Repeating a comment just posted at LUCENE-693:

skipTo() as first call on a scorer should work. ReqExclScorer and 
ReqOptSumScorer depend on that for the excluded and optional scorers.


> Scorer.skipTo() doesn't always work if called before next()
> ---
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
>
> skipTo() doesn't work for all scorers if called before next().




[jira] Commented: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-696?page=comments#action_12444500 ] 

Yonik Seeley commented on LUCENE-696:
-

It would also simplify some scorers if doc() wasn't undefined before next() or 
skipTo() was called, but instead -1.
This undefined nature of doc() often requires more state to be kept around 
about the scorers.
Things like TermScorer would just need a change from "int doc" to "int doc=-1"

Is there any scorer that this would impose a burden or cost on?
Thoughts?

> Scorer.skipTo() doesn't always work if called before next()
> ---
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yonik Seeley
>
> skipTo() doesn't work for all scorers if called before next().




[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1296 ] 

Yonik Seeley commented on LUCENE-693:
-

> Could you describe a case in which skipTo() before next() does not work?

I don't recall, but my attempt to speed up ConjunctionScorer flushed them out.
I'll move back to an older version of that to see what failed and put
details here: http://issues.apache.org/jira/browse/LUCENE-696

> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch, conjunction.patch
>
>
> (See also: #LUCENE-443)
> I did some profile testing with the new ConjunctionScorer in 2.1 and 
> discovered a new bottleneck in ConjunctionScorer.sortScorers. The 
> java.util.Arrays.sort method is cloning the Scorers array on every sort, 
> which is quite expensive on large indexes because of the size of the 'norms' 
> array within, and isn't necessary. 
> Here is one possible solution:
>   private void sortScorers() {
>     // squeeze the array down for the sort
>     //if (length != scorers.length) {
>     //  Scorer[] temps = new Scorer[length];
>     //  System.arraycopy(scorers, 0, temps, 0, length);
>     //  scorers = temps;
>     //}
>     insertionSort(scorers, length);
>     // note that this comparator is not consistent with equals!
>     //Arrays.sort(scorers, new Comparator() { // sort the array
>     //    public int compare(Object o1, Object o2) {
>     //      return ((Scorer)o1).doc() - ((Scorer)o2).doc();
>     //    }
>     //  });
>
>     first = 0;
>     last = length - 1;
>   }
>
>   private void insertionSort(Scorer[] scores, int len) {
>     for (int i = 0; i < len; i++) {
>       for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
>         swap(scores, j, j-1);
>       }
>     }
>   }
>
>   private void swap(Object[] x, int a, int b) {
>     Object t = x[a];
>     x[a] = x[b];
>     x[b] = t;
>   }
>  
> The squeezing of the array is no longer needed. 
> We also initialized the Scorers array to 8 (instead of 2) to avoid having to 
> grow the array for common queries, although this probably has less 
> performance impact.
> This change added about 3% to query throughput in my testing.
> Peter
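
For anyone who wants to try the idea outside Lucene, here is a self-contained variant. DocHolder is a hypothetical stand-in for a Scorer that only exposes doc(), and InPlaceSort is likewise an illustrative name. The sort touches only the first len entries and never allocates a copy of the array, which is the property the quoted description is after; it shifts entries rather than swapping them pairwise, a small variation on the code above.

```java
/** Hypothetical stand-in for a scorer: just carries the doc id it is positioned on. */
final class DocHolder {
    private final int doc;
    DocHolder(int doc) { this.doc = doc; }
    int doc() { return doc; }
}

final class InPlaceSort {
    /**
     * Sorts entries [0, len) of 'holders' by ascending doc(), in place.
     * Entries at index len and beyond (unused capacity) are not touched.
     */
    static void insertionSort(DocHolder[] holders, int len) {
        for (int i = 1; i < len; i++) {
            DocHolder current = holders[i];
            int j = i;
            // Shift larger entries one slot right until current's slot is found.
            while (j > 0 && holders[j - 1].doc() > current.doc()) {
                holders[j] = holders[j - 1];
                j--;
            }
            holders[j] = current;
        }
    }
}
```

Insertion sort is quadratic in the worst case, so this is only attractive when the number of clauses is small or the array is already nearly sorted.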




[jira] Created: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()

2006-10-24 Thread Yonik Seeley (JIRA)
Scorer.skipTo() doesn't always work if called before next()
---

 Key: LUCENE-696
 URL: http://issues.apache.org/jira/browse/LUCENE-696
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yonik Seeley


skipTo() doesn't work for all scorers if called before next().




[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Paul Elschot (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1287 ] 

Paul Elschot commented on LUCENE-693:
-

Yonik,

you wrote: 
> but then learned that calling skipTo() before calling next() doesn't always 
> work.

Could you describe a case in which skipTo() before next()  does not work?

skipTo() as first call on a scorer should work. ReqExclScorer and 
ReqOptSumScorer depend on that for the excluded and optional scorers.

Regards,
Paul Elschot






Re: Scorer.skipTo() valid before next()?

2006-10-24 Thread Chris Hostetter
: I got a bit of a surprise trying to re-implement the ConjunctionScorer.
: It turns out that skipTo(0) does not always return the same thing as
: next() on a newly created scorer.  Some scorers give invalid results
: if skipTo() is called before next().

that sounds like a bug to me...

: The javadoc is unclear on the subject, but the javadoc for both
: score() and skipTo() suggest that calling skipTo() first is valid, and
: that seems to make more sense.

i don't see why you would say the javadoc is unclear, the javadoc for
skipTo seems very clear on the subject.  skipTo(0) should be functionally
equivalent to...

do {
  if (!next())
    return false;
} while (0 > doc());
return true;



-Hoss
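
One way to use that equivalence is as a reference when checking a faster skipTo(). The sketch below is not Lucene code: SortedDocs is a hypothetical toy over a sorted array with two skip implementations, the loop quoted above (generalised from 0 to any target) and a binary search. The two should agree on every call, including when the skip is the very first call on a fresh instance.

```java
/** Toy iterator over ascending doc ids with two interchangeable skip implementations. */
final class SortedDocs {
    private final int[] docs;
    private int pos = -1;        // not positioned yet

    SortedDocs(int... docs) { this.docs = docs.clone(); }

    boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    int doc() { return docs[pos]; }

    /** Reference behaviour: the quoted loop, with 'target' in place of 0. */
    boolean skipToByLoop(int target) {
        do {
            if (!next()) return false;
        } while (target > doc());
        return true;
    }

    /** Shortcut: binary search among the entries after pos for the first doc >= target. */
    boolean skipToBySearch(int target) {
        int lo = Math.min(pos + 1, docs.length);
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1; else hi = mid;
        }
        pos = lo;
        return pos < docs.length;
    }
}
```

Note that both versions always move forward by at least one entry, so calling either with a target at or below the current doc behaves like next().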





[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search

2006-10-24 Thread Hoss Man (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_1238 ] 

Hoss Man commented on LUCENE-686:
-

Quick summary of some discussion from the mailing list...

1) i replied to paul's comments in the bug indicating that while there may not 
be any leaks in the core code base, these changes were needed to allow people 
writing custom Directories or custom Scorers to avoid memory leaks.
2) paul suggested that people writing custom code can work around this by 
subclassing/customizing the Directory, all the Scorers, and the IndexSearcher.
3) i suggested that this made the barrier for new custom code rather high, and 
made a poor comparison that got us on a tangent.
4) i argued that since TermDocs had a close method, Scorers needed to call it, 
which meant they needed a close method which was guaranteed to be called.
5) paul argued that TermDocs.close in the core right now isn't needed, and we 
might be better off removing it, and requiring any more complicated custom 
implementations to rely on GC to clean up any resources they have (presumably 
using a finalize method).
6) steven_parkes then raised the point that the fundamental issue is design 
integrity ... we have to agree what the point of TermDocs.close is from an API 
standpoint, and that callers should not have to know what the concrete 
implementation of the callee is to know whether close needs to be called.  
Better documentation on the purpose of the method can lead to better discussion 
about whether it can be removed, or if the current behavior is a bug that needs 
to be fixed.
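
To illustrate the calling convention at stake in points 4 and 6, here is a sketch in which the caller always closes, whatever the implementation behind the cursor does with the call. CountingCursor and CursorConsumer are hypothetical names and are not TermDocs or any Lucene scorer; the open-handle counter exists only so the effect is visible.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical cursor that holds a resource from construction until close(). */
final class CountingCursor implements Closeable {
    static final AtomicInteger OPEN = new AtomicInteger();   // cursors created but not yet closed
    private final int[] docs;
    private int pos = -1;
    private boolean closed;

    CountingCursor(int... docs) {
        this.docs = docs.clone();
        OPEN.incrementAndGet();
    }

    boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    int doc() { return docs[pos]; }

    @Override
    public void close() {
        if (!closed) {           // idempotent: a second close() is harmless
            closed = true;
            OPEN.decrementAndGet();
        }
    }
}

final class CursorConsumer {
    /** Counts at most 'limit' docs; the cursor is closed even when iteration stops early. */
    static int countUpTo(CountingCursor cursor, int limit) {
        try {
            int seen = 0;
            while (seen < limit && cursor.next()) seen++;
            return seen;
        } finally {
            cursor.close();
        }
    }
}
```

The early exit is the interesting case: a consumer that only releases the cursor when next() returns false would leave it open here.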

> Resources not always reclaimed in scorers after each search
> ---
>
> Key: LUCENE-686
> URL: http://issues.apache.org/jira/browse/LUCENE-686
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
> Environment: All
>Reporter: Ning Li
> Attachments: ScorerResourceGC.patch
>
>
> Resources are not always reclaimed in scorers after each search.
> For example, close() is not always called for term docs in TermScorer.
> A test will be attached to show when resources are not reclaimed.




[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Peter Keegan (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1236 ] 

Peter Keegan commented on LUCENE-693:
-

fwiw, my tests were done using 'real world' queries and index. Most queries
have several required clauses. The jvm is 1.6 beta2 with -server. I would be
interested to see results from others, too.

thanks Yonik!

Peter







[jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Yonik Seeley (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-693?page=all ]

Yonik Seeley updated LUCENE-693:


Attachment: conjunction.patch

This version removes the docs[] array and seems to be slightly faster.
Still slower on the synthetic random ConstantScoreQuery tests though.

If anyone else has real-world benchmarks they can try, I'd appreciate the data.





[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1211 ] 

Yonik Seeley commented on LUCENE-693:
-

> Well, I'm seeing a good 7% increase over the trunk version.

Yay!  Now if only I could get my random synthetic tests to show an improvement 
too...
Were you testing with -server?  My -client showed a speedup and -server showed 
a slowdown.

I think the difference is in *which* scorers I'm skipping on, even though I'm 
always skipping to the highest doc yet seen.  Skipping on denser scorers will 
be a waste of time, and if the list is sorted one is more likely to be skipping 
on the sparse scorers.  My code is optimal when the density of the scorers is 
similar.

Think of the case of two sparse scorers and a dense scorer... you really want 
to be skipping on the two sparse scorers until they happen to agree.  Until 
they agree, skipping on the dense scorer is a waste.  My code round robins and 
throws the dense scorer into the mix.
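
To make the sparse-versus-dense point concrete, here is a minimal sketch of a 
round-robin conjunction. It is not the ConjunctionScorer code from the patch: 
PostingList, Leapfrog, intersect and the skips counter are hypothetical names, 
and a scorer is reduced to a sorted array of doc ids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Scorer: a sorted list of doc ids plus a skip counter.
class PostingList {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel: list exhausted
    private final int[] docs;
    private int pos = 0;
    int skips = 0; // number of skipTo() calls made on this list

    PostingList(int... sortedDocs) { this.docs = sortedDocs; }

    int doc() { return pos < docs.length ? docs[pos] : NO_MORE_DOCS; }

    // Advance to the first doc >= target and return it (or the sentinel).
    int skipTo(int target) {
        skips++;
        while (pos < docs.length && docs[pos] < target) pos++;
        return doc();
    }
}

class Leapfrog {
    // Round-robin intersection: each list in turn skips to the highest doc seen so far.
    static List<Integer> intersect(PostingList... lists) {
        List<Integer> hits = new ArrayList<>();
        int target = 0;
        int agreed = 0; // how many consecutive lists currently sit on target
        int i = 0;
        while (true) {
            int doc = lists[i].skipTo(target);
            if (doc == PostingList.NO_MORE_DOCS) return hits; // one list is done, so are we
            if (doc == target) {
                agreed++;
            } else {          // this list jumped past target: it sets the new target
                target = doc;
                agreed = 1;
            }
            if (agreed == lists.length) { // every list is on target: a match
                hits.add(target);
                target++;
                agreed = 0;
            }
            i = (i + 1) % lists.length;
        }
    }

    public static void main(String[] args) {
        int[] all = new int[2001];
        for (int d = 0; d < all.length; d++) all[d] = d;
        PostingList dense = new PostingList(all);
        List<Integer> hits = intersect(
            new PostingList(5, 100, 1000), new PostingList(7, 100, 2000), dense);
        System.out.println(hits + ", skips on dense list: " + dense.skips);
    }
}
```

In this toy run the dense list is asked to skip before the two sparse lists 
have agreed on a candidate, which is the wasted work described above. An 
ordering that tries the sparse lists first would avoid that call.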

The question is, what are the real-world use cases like, and what is important 
to speed up?
I'd argue that the case of all dense scorers, while rarer, is more important 
(sparse scorers will cause the queries to be faster anyway).

> Do the test cases try queries with non-existent terms? 

They will.  I was able to reproduce my earlier bug with the new 
TestScorerPerf.testConjunctions() included in the last patch.


> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch
>
>
> (See also: #LUCENE-443)
> I did some profile testing with the new ConjunctionScorer in 2.1 and 
> discovered a new bottleneck in ConjunctionScorer.sortScorers. The 
> java.util.Arrays.sort method is cloning the Scorers array on every sort, 
> which is quite expensive on large indexes because of the size of the 'norms' 
> array within, and isn't necessary. 
> Here is one possible solution:
>   private void sortScorers() {
>     // squeeze the array down for the sort
>     //if (length != scorers.length) {
>     //  Scorer[] temps = new Scorer[length];
>     //  System.arraycopy(scorers, 0, temps, 0, length);
>     //  scorers = temps;
>     //}
>     insertionSort(scorers, length);
>     // note that this comparator is not consistent with equals!
>     //Arrays.sort(scorers, new Comparator() { // sort the array
>     //    public int compare(Object o1, Object o2) {
>     //      return ((Scorer)o1).doc() - ((Scorer)o2).doc();
>     //    }
>     //  });
>
>     first = 0;
>     last = length - 1;
>   }
>   private void insertionSort(Scorer[] scores, int len) {
>     for (int i = 0; i < len; i++) {
>       for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
>         swap(scores, j, j-1);
>       }
>     }
>   }
>   private void swap(Object[] x, int a, int b) {
>     Object t = x[a];
>     x[a] = x[b];
>     x[b] = t;
>   }
>  
> The squeezing of the array is no longer needed. 
> We also initialized the Scorers array to 8 (instead of 2) to avoid having to 
> grow the array for common queries, although this probably has less 
> performance impact.
> This change added about 3% to query throughput in my testing.
> Peter




[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Peter Keegan (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1208 ] 

Peter Keegan commented on LUCENE-693:
-

Well, I'm seeing a good 7% increase over the trunk version. Conjunction
scorer time is mostly in 'skipto' now, which seems reasonable.

Do the test cases try queries with non-existent terms? My failed query
contained 3 required terms, but one of the terms was misspelled and didn't
exist in the index.

Peter



> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch
>
>



[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444350 ] 

Yonik Seeley commented on LUCENE-695:
-

> One unit test assumed that readBytes() can work if given a null array, if the 
> length requested is 0. Unfortunately,
> System.arraycopy doesn't share this promiscuity, so I had to add another 
> silly if(len>0) test in the readBytes() 
> code.

If "given" a null array?  Is this ever done in Lucene?  Which should be fixed, 
the testcase or the code?


> Improve BufferedIndexInput.readBytes() performance
> --
>
> Key: LUCENE-695
> URL: http://issues.apache.org/jira/browse/LUCENE-695
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 2.0.0
>Reporter: Nadav Har'El
>Priority: Minor
> Attachments: readbytes.patch, readbytes.patch
>
>
> During a profiling session, I discovered that BufferedIndexInput.readBytes(),
> the function which reads a bunch of bytes from an index, is very inefficient
> in many cases. It is efficient for one or two bytes, and also efficient
> for a very large number of bytes (e.g., when the norms are read all at once);
> But for anything in between (e.g., 100 bytes), it is a performance disaster.
> It can easily be improved, though, and below I include a patch to do that.
> The basic problem in the existing code was that if you ask it to read 100
> bytes, readBytes() simply calls readByte() 100 times in a loop, which means
> we check byte after byte if the buffer has another character, instead of just
> checking once how many bytes we have left, and copy them all at once.
> My version, attached below, copies these 100 bytes in bulk if they are
> available (using System.arraycopy), and if less than 100 are available, whatever
> is available gets copied, and then the rest. (as before, when a very large
> number of bytes is requested, it is read directly into the final buffer).
> In my profiling, this fix caused amazing performance
> improvement: previously, BufferedIndexInput.readBytes() took as much as 25%
> of the run time, and after the fix, this was down to 1% of the run time! 
> However, my scenario is *not* the typical Lucene code, but rather a version 
> of Lucene with added payloads, and these payloads average at 100 bytes, where 
> the original readBytes() did worst. I expect that my fix will have less of an 
> impact on "vanilla" Lucene, but it still can have an impact because it is 
> used for things like reading fields. (I am not aware of a standard Lucene 
> benchmark, so I can't provide benchmarks on a more typical case).
> In addition to the change to readBytes(), my attached patch also adds a new
> unit test to BufferedIndexInput (which previously did not have a unit test).
> This test simulates a "file" which contains a predictable series of bytes, and
> then tries to read from it with readByte() and readBytes() with various
> sizes (many thousands of combinations are tried) and see that exactly the
> expected bytes are read. This test is independent of my new readBytes()
> implementation, and can be used to check the old implementation as well.
> By the way, it's interesting that BufferedIndexOutput.writeBytes was already 
> efficient, and wasn't simply a loop of writeByte(). Only the reading code was 
> inefficient. I wonder why this happened.




[jira] Updated: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance

2006-10-24 Thread Nadav Har'El (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-695?page=all ]

Nadav Har'El updated LUCENE-695:


Attachment: readbytes.patch

A fixed patch, which now checks that we don't read past the end of the file. 
This is now checked correctly in all three cases: (1) data already in the 
buffer, (2) a small number of bytes in addition to the buffer, and (3) a large 
number of bytes in addition to the buffer.

Note that the original code (before my patch) did not check length() for a 
large number of bytes, only in refill() (which was only called for a small 
number of bytes). The new code checks in this case as well, so it is more 
correct than it was.
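
As a rough illustration of the three cases, here is a simplified sketch. It is 
not the code in readbytes.patch and not Lucene's BufferedIndexInput: the class 
SketchBufferedInput, the in-memory "file", the tiny BUFFER_SIZE, and the choice 
of BUFFER_SIZE as the small/large threshold are all assumptions made for the 
example.

```java
import java.io.IOException;

// Hypothetical, simplified buffered reader over an in-memory "file".
class SketchBufferedInput {
    private static final int BUFFER_SIZE = 16;  // tiny on purpose, for illustration
    private final byte[] file;                  // stands in for the underlying file
    private final byte[] buffer = new byte[BUFFER_SIZE];
    private long bufferStart = 0;    // file offset of buffer[0]
    private int bufferLength = 0;    // number of valid bytes in buffer
    private int bufferPosition = 0;  // next byte to read from buffer

    SketchBufferedInput(byte[] file) { this.file = file; }

    long length() { return file.length; }

    void readBytes(byte[] b, int offset, int len) throws IOException {
        int available = bufferLength - bufferPosition;
        if (len <= available) {
            // Case 1: everything is already buffered, so one bulk copy suffices.
            if (len > 0) System.arraycopy(buffer, bufferPosition, b, offset, len);
            bufferPosition += len;
            return;
        }
        if (available > 0) { // drain whatever the buffer still holds
            System.arraycopy(buffer, bufferPosition, b, offset, available);
            offset += available;
            len -= available;
            bufferPosition += available;
        }
        if (len < BUFFER_SIZE) {
            // Case 2: small remainder; refill the buffer, then copy from it.
            refill();
            if (bufferLength < len) throw new IOException("read past EOF");
            System.arraycopy(buffer, 0, b, offset, len);
            bufferPosition = len;
        } else {
            // Case 3: large remainder; bypass the buffer, but still check EOF.
            long after = bufferStart + bufferPosition + len;
            if (after > length()) throw new IOException("read past EOF");
            System.arraycopy(file, (int) (bufferStart + bufferPosition), b, offset, len);
            bufferStart = after;
            bufferPosition = 0;
            bufferLength = 0;
        }
    }

    private void refill() throws IOException {
        long start = bufferStart + bufferPosition;
        long end = Math.min(start + BUFFER_SIZE, length());
        int newLength = (int) (end - start);
        if (newLength <= 0) throw new IOException("read past EOF");
        System.arraycopy(file, (int) start, buffer, 0, newLength);
        bufferLength = newLength;
        bufferStart = start;
        bufferPosition = 0;
    }
}
```

The point of the sketch is that the end-of-file check appears on both the 
refill path and the direct path, so a read past the end fails regardless of 
how many bytes were requested.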

The TestCompoundFile test now passes, and I also added to my new 
BufferedIndexInput unit test a third test case, testEOF, which tests that we 
can read up to EOF, but not past it. It checks that small overflows (a 
few bytes) and very large overflows both throw an exception.

I also made another change in this patch which I wish I didn't have to make, to 
account for other unit tests: One unit test assumed that readBytes() can work 
if given a null array, if the length requested is 0. Unfortunately, 
System.arraycopy doesn't share this promiscuity, so I had to add another 
silly if(len>0) test in the readBytes() code.
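
The behaviour in question is easy to check in isolation. In the sketch below 
the class and method names are made up for the example; only System.arraycopy 
is a real API, and it is documented to throw NullPointerException when the 
destination is null, whatever the length.

```java
class ArrayCopyNullDemo {
    // Returns true if System.arraycopy rejects a null destination
    // even when asked to copy zero bytes.
    static boolean zeroLengthCopyToNullThrows() {
        byte[] src = new byte[4];
        try {
            System.arraycopy(src, 0, null, 0, 0);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    // Guarded variant, mirroring the if (len > 0) test described above:
    // a zero-length request never reaches System.arraycopy.
    static void guardedCopy(byte[] src, int srcPos, byte[] dest, int destPos, int len) {
        if (len > 0) {
            System.arraycopy(src, srcPos, dest, destPos, len);
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded copy throws: " + zeroLengthCopyToNullThrows());
        guardedCopy(new byte[4], 0, null, 0, 0); // completes without an exception
    }
}
```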

> Improve BufferedIndexInput.readBytes() performance
> --
>
> Key: LUCENE-695
> URL: http://issues.apache.org/jira/browse/LUCENE-695
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 2.0.0
>Reporter: Nadav Har'El
>Priority: Minor
> Attachments: readbytes.patch, readbytes.patch
>
>



[jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Yonik Seeley (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-693?page=all ]

Yonik Seeley updated LUCENE-693:


Attachment: conjunction.patch

Here is my current patch and test code (which currently seems to be slower with 
this patch).

> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch
>
>



[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444334 ] 

Yonik Seeley commented on LUCENE-693:
-

I'm not sure how it's possible, but my version is *slower* in the performance 
test I came up with.
Very odd... I'm not sure why.

> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch
>
>



[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance

2006-10-24 Thread Nadav Har'El (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444322 ] 

Nadav Har'El commented on LUCENE-695:
-

Sorry, I didn't notice that my fix broke this unit test. Thanks for catching 
that.

What is happening is interesting: this test 
(TestCompoundFile.testReadPastEOF()) tests what happens when you read 40 
bytes beyond the end of file, and expects the appropriate exception to be 
thrown. The old code actually did this for 40 bytes, so it passed this test, 
but the interesting thing is that when you asked for more than a buffer-full of 
bytes, say, 10K, the length() checking code was not there! So the old code was 
broken in this respect, just not for the 40 bytes that were tested.

I'll fix my patch to add this beyond-end-of-file check, and will post the new 
patch ASAP.

> Improve BufferedIndexInput.readBytes() performance
> --
>
> Key: LUCENE-695
> URL: http://issues.apache.org/jira/browse/LUCENE-695
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 2.0.0
>Reporter: Nadav Har'El
>Priority: Minor
> Attachments: readbytes.patch
>
>



[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444320 ] 

Yonik Seeley commented on LUCENE-693:
-

Ah, I see the problem... in the constructor I have
  boolean more = scorers[i].next();
for each scorer... but note that the local "more" is masking the member "more". 
 Doh!
You can just remove "boolean" from "boolean more" in the ConjunctionScorer 
constructor, and I'll try to see why this was never reproduced by any test 
cases in the meantime.
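
For anyone unfamiliar with this kind of bug, here is a hypothetical miniature 
of it. It is not the ConjunctionScorer source: the class names, the boolean[] 
standing in for the results of next(), and the initial value of the member are 
assumptions made for the example.

```java
// Hypothetical miniature of the shadowing bug; not the actual ConjunctionScorer code.
class ShadowDemo {
    static class Buggy {
        boolean more = true; // member that later code consults

        Buggy(boolean[] nextResults) {
            for (int i = 0; i < nextResults.length; i++) {
                // "boolean" declares a new local that hides the member,
                // so the member keeps its initial value.
                boolean more = nextResults[i];
                if (!more) break;
            }
        }
    }

    static class Fixed {
        boolean more = true;

        Fixed(boolean[] nextResults) {
            for (int i = 0; i < nextResults.length; i++) {
                more = nextResults[i]; // assigns the member
                if (!more) break;
            }
        }
    }

    public static void main(String[] args) {
        boolean[] results = {true, false, true}; // second "scorer" has no documents
        System.out.println("buggy.more = " + new Buggy(results).more);
        System.out.println("fixed.more = " + new Fixed(results).more);
    }
}
```

The compiler accepts both versions, since a local variable may legally hide a 
field, which is why the mistake only shows up at run time.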

> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch
>
>



[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444319 ] 

Yonik Seeley commented on LUCENE-693:
-

Thanks for trying it out, Peter.
Odd that it could fail after passing all the Lucene unit tests... I assume this 
was the Lucene trunk you were trying?
So the query was just a boolean query with three required term queries?

> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch
>
>



[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup

2006-10-24 Thread Peter Keegan (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444317 ] 

Peter Keegan commented on LUCENE-693:
-

Yonik,

I tried out your patch, but it causes an exception on some boolean queries.
This one occurred on a boolean query with 3 required terms:

java.lang.ArrayIndexOutOfBoundsException: 2147483647
    at org.apache.lucene.search.TermScorer.score(TermScorer.java:129)
    at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:97)
    at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:186)
    at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:318)
    at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:282)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
    at org.apache.lucene.search.Searcher.search(Searcher.java:116)
    at org.apache.lucene.search.Searcher.search(Searcher.java:95)

It looks like the doc id has the sentinel value (Integer.MAX_VALUE).
Note: one of the terms had no occurrences in the index.
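
A small sketch of how a sentinel doc id can surface as this exception. The 
class, the method names and the norms lookup are hypothetical and only meant 
to mirror the shape of the failure, not the actual TermScorer code.

```java
class SentinelDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel for an exhausted scorer

    // Hypothetical scoring step: looks up a per-document norm by doc id.
    // If doc is the sentinel, the array access fails with ArrayIndexOutOfBoundsException.
    static float score(byte[] norms, int doc) {
        return norms[doc];
    }

    // Variant that reports the real problem instead of an index error.
    static float guardedScore(byte[] norms, int doc) {
        if (doc == NO_MORE_DOCS) {
            throw new IllegalStateException("score() called on an exhausted scorer");
        }
        return norms[doc];
    }

    public static void main(String[] args) {
        byte[] norms = new byte[10];
        try {
            score(norms, NO_MORE_DOCS);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught: " + e);
        }
    }
}
```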

Peter



> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch
>
>
> (See also: #LUCENE-443)
> I did some profile testing with the new ConjuctionScorer in 2.1 and 
> discovered a new bottleneck in ConjunctionScorer.sortScorers. The 
> java.utils.Arrays.sort method is cloning the Scorers array on every sort, 
> which is quite expensive on large indexes because of the size of the 'norms' 
> array within, and isn't necessary. 
> Here is one possible solution:
>   private void sortScorers() {
>     // squeeze the array down for the sort
>     //if (length != scorers.length) {
>     //  Scorer[] temps = new Scorer[length];
>     //  System.arraycopy(scorers, 0, temps, 0, length);
>     //  scorers = temps;
>     //}
>     insertionSort(scorers, length);
>     // note that this comparator is not consistent with equals!
>     //Arrays.sort(scorers, new Comparator() { // sort the array
>     //    public int compare(Object o1, Object o2) {
>     //      return ((Scorer)o1).doc() - ((Scorer)o2).doc();
>     //    }
>     //  });
>
>     first = 0;
>     last = length - 1;
>   }
>   // sort the first len scorers in place by their current doc id
>   private void insertionSort(Scorer[] scores, int len) {
>     for (int i = 0; i < len; i++) {
>       for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
>         swap(scores, j, j-1);
>       }
>     }
>   }
>   private void swap(Object[] x, int a, int b) {
>     Object t = x[a];
>     x[a] = x[b];
>     x[b] = t;
>   }
>  
> The squeezing of the array is no longer needed. 
> We also initialized the Scorers array to 8 (instead of 2) to avoid having to 
> grow the array for common queries, although this probably has less 
> performance impact.
> This change added about 3% to query throughput in my testing.
> Peter




[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance

2006-10-24 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444316 ] 

Yonik Seeley commented on LUCENE-695:
-

> I wonder why this happened.

readBytes on less than a buffer size probably only happens with binary (or 
compressed) fields, relatively new additions to Lucene, so it probably didn't 
have much of a real-world impact.   I think it is important to fix though, as 
more things may be byte-oriented in the future.

After applying the patch, at least one unit test fails:

[junit] Testcase: testReadPastEOF(org.apache.lucene.index.TestCompoundFile): FAILED
[junit] Block read past end of file
[junit] junit.framework.AssertionFailedError: Block read past end of file
[junit] at org.apache.lucene.index.TestCompoundFile.testReadPastEOF(TestCompoundFile.java:616)


> Improve BufferedIndexInput.readBytes() performance
> --
>
> Key: LUCENE-695
> URL: http://issues.apache.org/jira/browse/LUCENE-695
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 2.0.0
>Reporter: Nadav Har'El
>Priority: Minor
> Attachments: readbytes.patch
>
>
> During a profiling session, I discovered that BufferedIndexInput.readBytes(),
> the function which reads a bunch of bytes from an index, is very inefficient
> in many cases. It is efficient for one or two bytes, and also efficient
> for a very large number of bytes (e.g., when the norms are read all at once);
> But for anything in between (e.g., 100 bytes), it is a performance disaster.
> It can easily be improved, though, and below I include a patch to do that.
> The basic problem in the existing code was that if you ask it to read 100
> bytes, readBytes() simply calls readByte() 100 times in a loop, which means
> we check byte after byte if the buffer has another character, instead of just
> checking once how many bytes we have left, and copy them all at once.
> My version, attached below, copies these 100 bytes in bulk (using
> System.arraycopy) if they are available in the buffer; if fewer than 100 are
> available, whatever is available gets copied first, and then the rest. (As
> before, when a very large number of bytes is requested, it is read directly
> into the final buffer.)
> In my profiling, this fix caused amazing performance
> improvement: previously, BufferedIndexInput.readBytes() took as much as 25%
> of the run time, and after the fix, this was down to 1% of the run time! 
> However, my scenario is *not* the typical Lucene code, but rather a version 
> of Lucene with added payloads, and these payloads average at 100 bytes, where 
> the original readBytes() did worst. I expect that my fix will have less of an 
> impact on "vanilla" Lucene, but it still can have an impact because it is 
> used for things like reading fields. (I am not aware of a standard Lucene 
> benchmark, so I can't provide benchmarks on a more typical case).
> In addition to the change to readBytes(), my attached patch also adds a new
> unit test to BufferedIndexInput (which previously did not have a unit test).
> This test simulates a "file" which contains a predictable series of bytes, and
> then tries to read from it with readByte() and readBytes() with various
> sizes (many thousands of combinations are tried) and checks that exactly the
> expected bytes are read. This test is independent of my new readBytes()
> implementation, and can be used to check the old implementation as well.
> By the way, it's interesting that BufferedIndexOutput.writeBytes was already 
> efficient, and wasn't simply a loop of writeByte(). Only the reading code was 
> inefficient. I wonder why this happened.




[jira] Commented: (LUCENE-551) Make Lucene - Java 1.9.1 Available in Maven2 repository in iBibilio.org

2006-10-24 Thread Marcel Reutegger (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-551?page=comments#action_12444300 ] 

Marcel Reutegger commented on LUCENE-551:
-

Are there any plans to also publish the new release to the Maven 1 repository 
on ibiblio.org? We at Jackrabbit still use Maven 1.0.2 as our build tool.

> Make Lucene - Java 1.9.1 Available in Maven2 repository in iBibilio.org
> ---
>
> Key: LUCENE-551
> URL: http://issues.apache.org/jira/browse/LUCENE-551
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 1.9
>Reporter: Stephen Duncan Jr
>
> Please upload 1.9.1 release to iBiblio so that Maven users can easily use the 
> latest release.  Currently 1.4.3 is the most recently available version: 
> http://www.ibiblio.org/maven2/lucene/lucene/
> Please read the following FAQ for more information: 
> http://maven.apache.org/project-faq.html




[jira] Updated: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance

2006-10-24 Thread Nadav Har'El (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-695?page=all ]

Nadav Har'El updated LUCENE-695:


Attachment: readbytes.patch

The patch, which includes the change to BufferedIndexInput.readBytes(), and a 
new unit test for that class.

> Improve BufferedIndexInput.readBytes() performance
> --
>
> Key: LUCENE-695
> URL: http://issues.apache.org/jira/browse/LUCENE-695
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 2.0.0
>Reporter: Nadav Har'El
>Priority: Minor
> Attachments: readbytes.patch
>
>
> During a profiling session, I discovered that BufferedIndexInput.readBytes(),
> the function which reads a bunch of bytes from an index, is very inefficient
> in many cases. It is efficient for one or two bytes, and also efficient
> for a very large number of bytes (e.g., when the norms are read all at once);
> But for anything in between (e.g., 100 bytes), it is a performance disaster.
> It can easily be improved, though, and below I include a patch to do that.
> The basic problem in the existing code was that if you ask it to read 100
> bytes, readBytes() simply calls readByte() 100 times in a loop, which means
> we check byte after byte if the buffer has another character, instead of just
> checking once how many bytes we have left, and copy them all at once.
> My version, attached below, copies these 100 bytes in bulk (using
> System.arraycopy) if they are available in the buffer; if fewer than 100 are
> available, whatever is available gets copied first, and then the rest. (As
> before, when a very large number of bytes is requested, it is read directly
> into the final buffer.)
> In my profiling, this fix caused amazing performance
> improvement: previously, BufferedIndexInput.readBytes() took as much as 25%
> of the run time, and after the fix, this was down to 1% of the run time! 
> However, my scenario is *not* the typical Lucene code, but rather a version 
> of Lucene with added payloads, and these payloads average at 100 bytes, where 
> the original readBytes() did worst. I expect that my fix will have less of an 
> impact on "vanilla" Lucene, but it still can have an impact because it is 
> used for things like reading fields. (I am not aware of a standard Lucene 
> benchmark, so I can't provide benchmarks on a more typical case).
> In addition to the change to readBytes(), my attached patch also adds a new
> unit test to BufferedIndexInput (which previously did not have a unit test).
> This test simulates a "file" which contains a predictable series of bytes, and
> then tries to read from it with readByte() and readBytes() with various
> sizes (many thousands of combinations are tried) and checks that exactly the
> expected bytes are read. This test is independent of my new readBytes()
> implementation, and can be used to check the old implementation as well.
> By the way, it's interesting that BufferedIndexOutput.writeBytes was already 
> efficient, and wasn't simply a loop of writeByte(). Only the reading code was 
> inefficient. I wonder why this happened.




[jira] Created: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance

2006-10-24 Thread Nadav Har'El (JIRA)
Improve BufferedIndexInput.readBytes() performance
--

 Key: LUCENE-695
 URL: http://issues.apache.org/jira/browse/LUCENE-695
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.0.0
Reporter: Nadav Har'El
Priority: Minor


During a profiling session, I discovered that BufferedIndexInput.readBytes(),
the function which reads a bunch of bytes from an index, is very inefficient
in many cases. It is efficient for one or two bytes, and also efficient
for a very large number of bytes (e.g., when the norms are read all at once);
But for anything in between (e.g., 100 bytes), it is a performance disaster.
It can easily be improved, though, and below I include a patch to do that.

The basic problem in the existing code was that if you ask it to read 100
bytes, readBytes() simply calls readByte() 100 times in a loop, which means
we check byte after byte if the buffer has another character, instead of just
checking once how many bytes we have left, and copy them all at once.

My version, attached below, copies these 100 bytes in bulk (using
System.arraycopy) if they are available in the buffer; if fewer than 100 are
available, whatever is available gets copied first, and then the rest. (As
before, when a very large number of bytes is requested, it is read directly
into the final buffer.)
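
For readers without the patch at hand, the idea can be sketched roughly as
follows. This is not the attached patch and not Lucene's actual class: the
names BulkReadSketch, BufferedInput, PatternInput and the readInternal() hook
are assumptions made up for illustration, and the buffer handling is
simplified. PatternInput mimics the "predictable file" idea used by the unit
test: the byte at position p is (byte) p.

```java
import java.io.IOException;

/** Illustrative only: a simplified buffered reader, not Lucene's BufferedIndexInput. */
final class BulkReadSketch {

    /** Hypothetical base class; readInternal() stands in for the real file read. */
    abstract static class BufferedInput {
        private final byte[] buffer;
        private int bufferLength = 0;   // valid bytes currently in the buffer
        private int bufferPosition = 0; // next unread byte in the buffer

        BufferedInput(int bufferSize) {
            buffer = new byte[bufferSize];
        }

        /** Reads up to len bytes from the source; returns the count, or -1 at end of file. */
        protected abstract int readInternal(byte[] b, int offset, int len) throws IOException;

        private void refill() throws IOException {
            int n = readInternal(buffer, 0, buffer.length);
            if (n <= 0) throw new IOException("read past EOF");
            bufferLength = n;
            bufferPosition = 0;
        }

        public byte readByte() throws IOException {
            if (bufferPosition >= bufferLength) refill();
            return buffer[bufferPosition++];
        }

        public void readBytes(byte[] b, int offset, int len) throws IOException {
            while (len > 0) {
                int available = bufferLength - bufferPosition;
                if (available == 0) {
                    if (len >= buffer.length) {
                        // Large request: bypass the buffer and read straight into b.
                        int n = readInternal(b, offset, len);
                        if (n <= 0) throw new IOException("read past EOF");
                        offset += n;
                        len -= n;
                        continue;
                    }
                    refill();
                    available = bufferLength;
                }
                // One size check and one bulk copy instead of one readByte() per byte.
                int n = Math.min(available, len);
                System.arraycopy(buffer, bufferPosition, b, offset, n);
                bufferPosition += n;
                offset += n;
                len -= n;
            }
        }
    }

    /** Fake "file" whose byte at position p is (byte) p, so reads are easy to verify. */
    static final class PatternInput extends BufferedInput {
        private final long length;
        private long position = 0;

        PatternInput(long length, int bufferSize) {
            super(bufferSize);
            this.length = length;
        }

        @Override
        protected int readInternal(byte[] b, int offset, int len) {
            if (position >= length) return -1;
            int n = (int) Math.min(len, length - position);
            for (int i = 0; i < n; i++) b[offset + i] = (byte) (position + i);
            position += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        PatternInput in = new PatternInput(1000, 16);
        byte first = in.readByte();           // fills the 16-byte buffer
        byte[] out = new byte[100];
        in.readBytes(out, 0, out.length);     // drains the buffer, then reads directly
        boolean ok = first == 0;
        for (int i = 0; i < out.length; i++) ok &= out[i] == (byte) (i + 1);
        System.out.println(ok ? "all bytes as expected" : "mismatch");
    }
}
```

The sketch keeps the three cases described above: a bulk copy when the buffer
holds enough bytes, a partial copy followed by the remainder, and a direct read
when the remaining request is at least a full buffer.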

In my profiling, this fix caused amazing performance
improvement: previously, BufferedIndexInput.readBytes() took as much as 25%
of the run time, and after the fix, this was down to 1% of the run time! 
However, my scenario is *not* the typical Lucene code, but rather a version of 
Lucene with added payloads, and these payloads average at 100 bytes, where the 
original readBytes() did worst. I expect that my fix will have less of an 
impact on "vanilla" Lucene, but it still can have an impact because it is used 
for things like reading fields. (I am not aware of a standard Lucene benchmark, 
so I can't provide benchmarks on a more typical case).

In addition to the change to readBytes(), my attached patch also adds a new
unit test to BufferedIndexInput (which previously did not have a unit test).
This test simulates a "file" which contains a predictable series of bytes, and
then tries to read from it with readByte() and readBytes() with various
sizes (many thousands of combinations are tried) and checks that exactly the
expected bytes are read. This test is independent of my new readBytes()
implementation, and can be used to check the old implementation as well.

By the way, it's interesting that BufferedIndexOutput.writeBytes was already 
efficient, and wasn't simply a loop of writeByte(). Only the reading code was 
inefficient. I wonder why this happened.




[jira] Created: (LUCENE-694) Query parser doesn't warn about unmatched ')'

2006-10-24 Thread Eric Jain (JIRA)
Query parser doesn't warn about unmatched ')'
-

 Key: LUCENE-694
 URL: http://issues.apache.org/jira/browse/LUCENE-694
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 2.0.0
Reporter: Eric Jain
Priority: Minor


If there is an unmatched '(', as in protein 'foo( bar', the query parser 
reports an error. But if you search for 'foo) bar', everything after the 
unmatched ')' seems to be ignored!
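
Until the parser itself reports this, a caller could reject such input before
handing it to the query parser. The following is only a sketch of such a
caller-side check, not part of Lucene: ParenCheckSketch and
hasBalancedParens() are hypothetical names, and the handling of backslash
escapes and double-quoted phrases is an assumption about what the caller wants
to ignore.

```java
/** Illustrative only: a caller-side pre-check, not part of Lucene's query parser. */
final class ParenCheckSketch {

    /** Returns true if every '(' and ')' outside quotes and escapes is matched. */
    static boolean hasBalancedParens(String query) {
        int depth = 0;
        boolean inQuotes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\') {
                i++;                 // skip the escaped character
            } else if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && c == '(') {
                depth++;
            } else if (!inQuotes && c == ')') {
                if (--depth < 0) {
                    return false;    // a ')' with no '(' before it
                }
            }
        }
        return depth == 0;           // any '(' left open is also an error
    }

    public static void main(String[] args) {
        String[] queries = {"foo( bar", "foo) bar", "(foo bar)"};
        for (String q : queries) {
            System.out.println(q + " -> " + hasBalancedParens(q));
        }
    }
}
```

With a check like this, both 'foo( bar' and 'foo) bar' would be rejected
consistently instead of only the first.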
