[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
[ http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12444573 ]

Yonik Seeley commented on LUCENE-697:
-------------------------------------

Comment out line 104 of QueryUtils.java to reproduce this problem:

scoreDiff=0; // TODO: remove this to get LUCENE-697 failures

> Scorer.skipTo affects sloppyPhrase scoring
> ------------------------------------------
>
> Key: LUCENE-697
> URL: http://issues.apache.org/jira/browse/LUCENE-697
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
>
> If you mix skipTo() and next(), you get different scores than what is
> returned to a hit collector.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()
[ http://issues.apache.org/jira/browse/LUCENE-696?page=all ]

Yonik Seeley resolved LUCENE-696.
---------------------------------

    Fix Version/s: 2.0.1
       Resolution: Fixed
         Assignee: Yonik Seeley

Patch committed after further tests were added.

> Scorer.skipTo() doesn't always work if called before next()
> -----------------------------------------------------------
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Fix For: 2.0.1
>
> Attachments: dismax.patch
>
> skipTo() doesn't work for all scorers if called before next().
[jira] Commented: (LUCENE-698) FilteredQuery ignores boost
[ http://issues.apache.org/jira/browse/LUCENE-698?page=comments#action_12444570 ]

Yonik Seeley commented on LUCENE-698:
-------------------------------------

I just committed hashCode() and equals() changes to take boost into account so that generic tests in QueryUtils.check(query) can pass. One should arguably be able to set the boost on any query clause, so I'm leaving this open since I think scoring probably ignores the boost too.

> FilteredQuery ignores boost
> ---------------------------
>
> Key: LUCENE-698
> URL: http://issues.apache.org/jira/browse/LUCENE-698
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
>
> FilteredQuery ignores its own boost.
[jira] Created: (LUCENE-698) FilteredQuery ignores boost
FilteredQuery ignores boost
---------------------------

Key: LUCENE-698
URL: http://issues.apache.org/jira/browse/LUCENE-698
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yonik Seeley

FilteredQuery ignores its own boost.
[jira] Commented: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
[ http://issues.apache.org/jira/browse/LUCENE-697?page=comments#action_12444565 ]

Yonik Seeley commented on LUCENE-697:
-------------------------------------

Here's the ant output from test code to be checked in shortly. The test code calls skipTo(), skipTo(), next(), next(), etc while checking that the results match the hitcollector version.

[junit] Testcase: testP6(org.apache.lucene.search.TestSimpleExplanations): Caused an ERROR
[junit] ERROR matching docs:
[junit]   scorer.more=true doc=1 score=0.7849069
[junit]   hitCollector.doc=1 score=0.67974937
[junit]   Scorer=scorer(weight(field:"w3 w2"~2))
[junit]   Query=field:"w3 w2"~2
[junit]   [EMAIL PROTECTED]
[junit] java.lang.RuntimeException: ERROR matching docs:
[junit]   scorer.more=true doc=1 score=0.7849069
[junit]   hitCollector.doc=1 score=0.67974937
[junit]   Scorer=scorer(weight(field:"w3 w2"~2))
[junit]   Query=field:"w3 w2"~2
[junit]   [EMAIL PROTECTED]
[junit]   at org.apache.lucene.search.QueryUtils$2.collect(QueryUtils.java:104)
[junit]   at org.apache.lucene.search.Scorer.score(Scorer.java:48)
[junit]   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
[junit]   at org.apache.lucene.search.Searcher.search(Searcher.java:116)
[junit]   at org.apache.lucene.search.Searcher.search(Searcher.java:95)
[junit]   at org.apache.lucene.search.QueryUtils.checkSkipTo(QueryUtils.java:97)
[junit]   at org.apache.lucene.search.QueryUtils.check(QueryUtils.java:75)
[junit]   at org.apache.lucene.search.CheckHits.checkHitCollector(CheckHits.java:91)
[junit]   at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:90)
[junit]   at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:86)
[junit]   at org.apache.lucene.search.TestSimpleExplanations.testP6(TestSimpleExplanations.java:87)

[junit] Testcase: testP7(org.apache.lucene.search.TestSimpleExplanations): Caused an ERROR
[junit] ERROR matching docs:
[junit]   scorer.more=true doc=1 score=0.7849069
[junit]   hitCollector.doc=1 score=0.67974937
[junit]   Scorer=scorer(weight(field:"w3 w2"~3))
[junit]   Query=field:"w3 w2"~3
[junit]   [EMAIL PROTECTED]
[junit] java.lang.RuntimeException: ERROR matching docs:
[junit]   scorer.more=true doc=1 score=0.7849069
[junit]   hitCollector.doc=1 score=0.67974937
[junit]   Scorer=scorer(weight(field:"w3 w2"~3))
[junit]   Query=field:"w3 w2"~3
[junit]   [EMAIL PROTECTED]
[junit]   at org.apache.lucene.search.QueryUtils$2.collect(QueryUtils.java:104)
[junit]   at org.apache.lucene.search.Scorer.score(Scorer.java:48)
[junit]   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
[junit]   at org.apache.lucene.search.Searcher.search(Searcher.java:116)
[junit]   at org.apache.lucene.search.Searcher.search(Searcher.java:95)
[junit]   at org.apache.lucene.search.QueryUtils.checkSkipTo(QueryUtils.java:97)
[junit]   at org.apache.lucene.search.QueryUtils.check(QueryUtils.java:75)
[junit]   at org.apache.lucene.search.CheckHits.checkHitCollector(CheckHits.java:91)
[junit]   at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:90)
[junit]   at org.apache.lucene.search.TestExplanations.qtest(TestExplanations.java:86)
[junit]   at org.apache.lucene.search.TestSimpleExplanations.testP7(TestSimpleExplanations.java:90)

[junit] Test org.apache.lucene.search.TestSimpleExplanations FAILED

> Scorer.skipTo affects sloppyPhrase scoring
> ------------------------------------------
>
> Key: LUCENE-697
> URL: http://issues.apache.org/jira/browse/LUCENE-697
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
>
> If you mix skipTo() and next(), you get different scores than what is
> returned to a hit collector.
[jira] Created: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
Scorer.skipTo affects sloppyPhrase scoring
------------------------------------------

Key: LUCENE-697
URL: http://issues.apache.org/jira/browse/LUCENE-697
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.0.0
Reporter: Yonik Seeley

If you mix skipTo() and next(), you get different scores than what is returned to a hit collector.
[jira] Updated: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()
[ http://issues.apache.org/jira/browse/LUCENE-696?page=all ]

Yonik Seeley updated LUCENE-696:
--------------------------------

    Attachment: dismax.patch

DisjunctionMaxScorer turned out to be the only scorer I could see with that problem. Here's the patch w/ tests.

> Scorer.skipTo() doesn't always work if called before next()
> -----------------------------------------------------------
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
>
> Attachments: dismax.patch
>
> skipTo() doesn't work for all scorers if called before next().
[jira] Updated: (LUCENE-528) Optimization for IndexWriter.addIndexes()
[ http://issues.apache.org/jira/browse/LUCENE-528?page=all ]

Ning Li updated LUCENE-528:
---------------------------

    Attachment: AddIndexesNoOptimize.patch

This patch implements addIndexesNoOptimize() following the algorithm described earlier.
- The patch is based on the latest version from trunk.
- addIndexesNoOptimize() is implemented. The algorithm description is included as a comment and the code is commented.
- The patch includes a test called TestAddIndexesNoOptimize which covers all the code in addIndexesNoOptimize().
- maybeMergeSegments() was conservative and checked for more merges only when "upperBound * mergeFactor <= maxMergeDocs". Change it to check for more merges when "upperBound < maxMergeDocs".
- Minor changes in TestIndexWriterMergePolicy to better verify merge invariants.
- The patch passes all unit tests.

One more comment on the implementation:
- When we copy un-merged segments from S in step 4, ideally we want to simply copy those segments. However, Directory does not support copy yet. In addition, the source may or may not use compound files, and likewise the target. So we use mergeSegments() to copy each segment, which may cause the doc count to change because deleted docs are garbage collected. That case is handled properly.

> Optimization for IndexWriter.addIndexes()
> -----------------------------------------
>
> Key: LUCENE-528
> URL: http://issues.apache.org/jira/browse/LUCENE-528
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Steven Tamm
> Assigned To: Otis Gospodnetic
> Priority: Minor
> Attachments: AddIndexes.patch, AddIndexesNoOptimize.patch
>
> One big performance problem with IndexWriter.addIndexes() is that it has to
> optimize the index both before and after adding the segments. When you have
> a very large index, to which you are adding batches of small updates, these
> calls to optimize make using addIndexes() impossible. It makes parallel
> updates very frustrating.
> Here is an optimized function that helps out by calling mergeSegments only on
> the newly added documents. It will try to avoid calling mergeSegments until
> the end, unless you're adding a lot of documents at once.
> I also have an extensive unit test that verifies that this function works
> correctly if people are interested. I gave it a different name because it
> has very different performance characteristics which can make querying take
> longer.
Re: [jira] Commented: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()
: It would also simplify some scorers if doc() wasn't undefined before
: next() or skipTo() was called, but instead -1.

+1 ... but if we are going to change the API requirements for doc(), we should clarify the requirements of score() ... with doc(), negative numbers can easily be used as a marker of "invalid", but the same rule isn't as easy to apply with the score() method ... perhaps the documentation for doc() and score() should be...

doc(): Returns the current document number matching the query. Returns -1 if neither next() nor skipTo() has been called at least once; behavior is undefined if the last call to next() or skipTo() returned false.

score(): Returns the score of the current document matching the query. The value is undefined if doc() returns -1, or if the last call to next() or skipTo() returned false.

...we probably want to make the same API changes to Spans, TermEnum, and TermDocs as well to be consistent.

-Hoss
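The contract proposed above can be illustrated with a toy cursor. The class below is purely hypothetical (not part of the Lucene API); it shows how initializing the position to -1 makes doc() well-defined before the first advance, with no extra state flag:

```java
// Toy illustration of the proposed contract: doc() returns -1 before the
// first call to next(), instead of being undefined. Hypothetical class for
// illustration only -- not part of the Lucene API.
class DocCursor {
    private final int[] docs;
    private int pos = -1;            // -1 == "no current document yet"

    DocCursor(int[] docs) { this.docs = docs; }

    // Current doc id, or -1 until next() has been called at least once.
    // (The contract leaves doc() undefined after next() returns false;
    // returning -1 there too is just defensive.)
    public int doc() {
        return (pos < 0 || pos >= docs.length) ? -1 : docs[pos];
    }

    // Advance to the next doc; false when exhausted.
    public boolean next() { return ++pos < docs.length; }
}
```

This mirrors Yonik's point that something like TermScorer would only need its field changed from "int doc" to "int doc = -1".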
[jira] Commented: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()
[ http://issues.apache.org/jira/browse/LUCENE-696?page=comments#action_12444506 ]

Paul Elschot commented on LUCENE-696:
-------------------------------------

Repeating a comment just posted at LUCENE-693: skipTo() as first call on a scorer should work. ReqExclScorer and ReqOptSumScorer depend on that for the excluded and optional scorers.

> Scorer.skipTo() doesn't always work if called before next()
> -----------------------------------------------------------
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
>
> skipTo() doesn't work for all scorers if called before next().
[jira] Commented: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()
[ http://issues.apache.org/jira/browse/LUCENE-696?page=comments#action_12444500 ]

Yonik Seeley commented on LUCENE-696:
-------------------------------------

It would also simplify some scorers if doc() wasn't undefined before next() or skipTo() was called, but instead -1. This undefined nature of doc() often requires more state to be kept around about the scorers. Things like TermScorer would just need a change from "int doc" to "int doc=-1".

Is there any scorer that this would impose a burden or cost on? Thoughts?

> Scorer.skipTo() doesn't always work if called before next()
> -----------------------------------------------------------
>
> Key: LUCENE-696
> URL: http://issues.apache.org/jira/browse/LUCENE-696
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Yonik Seeley
>
> skipTo() doesn't work for all scorers if called before next().
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1296 ]

Yonik Seeley commented on LUCENE-693:
-------------------------------------

> Could you describe a case in which skipTo() before next() does not work?

I don't recall, but my attempt to speed up ConjunctionScorer flushed them out. I'll move back to an older version of that to see what failed and put details here: http://issues.apache.org/jira/browse/LUCENE-696

> ConjunctionScorer - more tuneup
> -------------------------------
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
> Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch, conjunction.patch
>
> (See also: #LUCENE-443)
> I did some profile testing with the new ConjunctionScorer in 2.1 and
> discovered a new bottleneck in ConjunctionScorer.sortScorers. The
> java.util.Arrays.sort method is cloning the Scorers array on every sort,
> which is quite expensive on large indexes because of the size of the 'norms'
> array within, and isn't necessary.
> Here is one possible solution:
>
> private void sortScorers() {
>   // squeeze the array down for the sort
>   //if (length != scorers.length) {
>   //  Scorer[] temps = new Scorer[length];
>   //  System.arraycopy(scorers, 0, temps, 0, length);
>   //  scorers = temps;
>   //}
>   insertionSort(scorers, length);
>   // note that this comparator is not consistent with equals!
>   //Arrays.sort(scorers, new Comparator() {   // sort the array
>   //  public int compare(Object o1, Object o2) {
>   //    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
>   //  }
>   //});
>   first = 0;
>   last = length - 1;
> }
>
> private void insertionSort(Scorer[] scores, int len) {
>   for (int i = 0; i < len; i++) {
>     for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
>       swap(scores, j, j-1);
>     }
>   }
>   return;
> }
>
> private void swap(Object[] x, int a, int b) {
>   Object t = x[a];
>   x[a] = x[b];
>   x[b] = t;
> }
>
> The squeezing of the array is no longer needed.
> We also initialized the Scorers array to 8 (instead of 2) to avoid having to
> grow the array for common queries, although this probably has less
> performance impact.
> This change added about 3% to query throughput in my testing.
> Peter
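The insertion sort quoted in the issue description can be exercised standalone. The sketch below substitutes a hypothetical ScorerStub for a real Scorer and sorts a prefix of the array in place by doc(), which is the point of the patch: java.util.Arrays.sort(Object[]) clones the array before sorting, and the in-place sort avoids that copy.

```java
// Standalone sketch of the proposed in-place insertion sort: sorts
// scorers[0..len) ascending by doc() without the array clone performed by
// java.util.Arrays.sort(Object[]). ScorerStub is a hypothetical stand-in
// for a real Scorer, used only for illustration.
class ScorerStub {
    private final int doc;
    ScorerStub(int doc) { this.doc = doc; }
    int doc() { return doc; }
}

class ScorerSort {
    // In-place insertion sort of the first len entries by doc().
    static void insertionSort(ScorerStub[] scorers, int len) {
        for (int i = 1; i < len; i++) {
            for (int j = i; j > 0 && scorers[j - 1].doc() > scorers[j].doc(); j--) {
                ScorerStub t = scorers[j];     // swap adjacent out-of-order pair
                scorers[j] = scorers[j - 1];
                scorers[j - 1] = t;
            }
        }
    }
}
```

Insertion sort is a reasonable choice here: the scorer array is tiny (a handful of clauses), usually nearly sorted after each advance, and the sort allocates nothing.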
[jira] Created: (LUCENE-696) Scorer.skipTo() doesn't always work if called before next()
Scorer.skipTo() doesn't always work if called before next()
-----------------------------------------------------------

Key: LUCENE-696
URL: http://issues.apache.org/jira/browse/LUCENE-696
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Yonik Seeley

skipTo() doesn't work for all scorers if called before next().
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1287 ]

Paul Elschot commented on LUCENE-693:
-------------------------------------

Yonik, you wrote:
> but then learned that calling skipTo() before calling next() doesn't always
> work.

Could you describe a case in which skipTo() before next() does not work? skipTo() as first call on a scorer should work. ReqExclScorer and ReqOptSumScorer depend on that for the excluded and optional scorers.

Regards, Paul Elschot

> ConjunctionScorer - more tuneup
> -------------------------------
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
> Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch, conjunction.patch
Re: Scorer.skipTo() valid before next()?
: I got a bit of a surprise trying to re-implement the ConjunctionScorer.
: It turns out that skipTo(0) does not always return the same thing as
: next() on a newly created scorer. Some scorers give invalid results
: if skipTo() is called before next().

that sounds like a bug to me...

: The javadoc is unclear on the subject, but the javadoc for both
: score() and skipTo() suggest that calling skipTo() first is valid, and
: that seems to make more sense.

i don't see why you would say the javadoc is unclear; the javadoc for skipTo seems very clear on the subject. skipTo(0) should be functionally equivalent to...

do {
  if (!next()) return false;
} while (0 > doc());
return true;

-Hoss
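The equivalence quoted above can be written down directly as a default skipTo() implemented purely in terms of next(). The classes below are minimal hypothetical illustrations, not Lucene's actual Scorer API; they only demonstrate that skipTo() defined this way is automatically valid as the very first call on a scorer:

```java
// Sketch of the javadoc equivalence: skipTo(target) behaves like calling
// next() until doc() >= target. Hypothetical minimal classes for
// illustration -- not Lucene's actual Scorer API.
abstract class SimpleScorer {
    public abstract boolean next();   // advance; false when exhausted
    public abstract int doc();        // current doc id

    // Default skipTo() written purely in terms of next(). Because it starts
    // by calling next(), it is well-defined on a freshly created scorer.
    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (target > doc());
        return true;
    }
}

// Toy scorer over a sorted array of doc ids.
class ArrayScorer extends SimpleScorer {
    private final int[] docs;
    private int pos = -1;
    ArrayScorer(int[] docs) { this.docs = docs; }
    public boolean next() { return ++pos < docs.length; }
    public int doc() { return docs[pos]; }
}
```

With this definition, skipTo(0) on a new scorer lands on the first matching doc, exactly as a first call to next() would.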
[jira] Commented: (LUCENE-686) Resources not always reclaimed in scorers after each search
[ http://issues.apache.org/jira/browse/LUCENE-686?page=comments#action_1238 ]

Hoss Man commented on LUCENE-686:
---------------------------------

Quick summary of some discussion from the mailing list...

1) i replied to paul's comments in the bug indicating that while there may not be any leaks in the core code base, these changes were needed to allow people writing custom Directories or custom Scorers to avoid memory leaks.

2) paul suggested that people writing custom code can work around this by subclassing/customizing the Directory, and all the Scorers, and the IndexSearcher

3) i suggested that made the barrier for new custom code rather high, and made a poor comparison that got us on a tangent.

4) i argued that since TermDocs had a close method, Scorers needed to call it, which meant they needed a close method which was guaranteed to be called.

5) paul argued that TermDocs.close in the core right now isn't needed, and we might be better off removing it, and requiring any more complicated custom implementations to rely on GC to clean up any resources they have (presumably using a finalize method)

6) steven_parkes then raised the point that the fundamental issue is design integrity ... we have to agree what the point of TermDocs.close is from an API standpoint, and that callers should not have to know what the concrete implementation of the callee is to know whether close needs to be called. Better documentation on the purpose of the method can lead to better discussion about whether it can be removed, or if the current behavior is a bug that needs to be fixed.

> Resources not always reclaimed in scorers after each search
> -----------------------------------------------------------
>
> Key: LUCENE-686
> URL: http://issues.apache.org/jira/browse/LUCENE-686
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Environment: All
> Reporter: Ning Li
> Attachments: ScorerResourceGC.patch
>
> Resources are not always reclaimed in scorers after each search.
> For example, close() is not always called for term docs in TermScorer.
> A test will be attached to show when resources are not reclaimed.
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1236 ]

Peter Keegan commented on LUCENE-693:
-------------------------------------

fwiw, my tests were done using 'real world' queries and index. Most queries have several required clauses. The jvm is 1.6 beta2 with -server. I would be interested to see results from others, too.

thanks Yonik!
Peter

> ConjunctionScorer - more tuneup
> -------------------------------
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
> Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch, conjunction.patch
[jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=all ]

Yonik Seeley updated LUCENE-693:
--------------------------------

    Attachment: conjunction.patch

This version removes the docs[] array and seems to be slightly faster. Still slower on the synthetic random ConstantScoreQuery tests though. If anyone else has real-world benchmarks they can try, I'd appreciate the data.

> ConjunctionScorer - more tuneup
> -------------------------------
>
> Key: LUCENE-693
> URL: http://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
> Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch, conjunction.patch
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1211 ] Yonik Seeley commented on LUCENE-693: - > Well, I'm seeing a good 7% increase over the trunk version. Yay! Now only if I could get my random synthetic tests to show an improvement too... Were you testing with -server? My -client showed a speedup and -server showed a slowdown. I think the difference is on *which* scorers I'm skipping on, even though I'm always skipping to the highest doc yet seen. Skipping on denser scorers will be a waste of time, and if the list is sorted one is more likely to be skipping on the sparse scorers. My code is optimal when the density of the scorers is similar. Think of the case of two sparse scorers and a dense scorer... you really want to be skipping on the two sparse scorers until they happen to agree. Until they agree, skipping on the dense scorer is a waste. My code round robins and throws the dense scorer into the mix. The question is, what are the real world usecases like, and what is important to speed up. I'd argue that the case of all dense scorers, while more rare, is more important (sparse scorers will cause the queries to be faster anyway). > Do the test cases try queries with non-existent terms? They will I was able to reproduce by earlier bug with the new TestScorerPerf.testConjunctions() included in the last patch. > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: http://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index >Reporter: Peter Keegan > Attachments: conjunction.patch, conjunction.patch > > > (See also: #LUCENE-443) > I did some profile testing with the new ConjuctionScorer in 2.1 and > discovered a new bottleneck in ConjunctionScorer.sortScorers. 
> The java.util.Arrays.sort method is cloning the Scorers array on every
> sort, which is quite expensive on large indexes because of the size of the
> 'norms' array within, and isn't necessary.
> Here is one possible solution:
>
>   private void sortScorers() {
>     // squeeze the array down for the sort
>     //if (length != scorers.length) {
>     //  Scorer[] temps = new Scorer[length];
>     //  System.arraycopy(scorers, 0, temps, 0, length);
>     //  scorers = temps;
>     //}
>     insertionSort(scorers, length);
>     // note that this comparator is not consistent with equals!
>     //Arrays.sort(scorers, new Comparator() {  // sort the array
>     //  public int compare(Object o1, Object o2) {
>     //    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
>     //  }
>     //});
>     first = 0;
>     last = length - 1;
>   }
>
>   private void insertionSort(Scorer[] scores, int len) {
>     for (int i = 0; i < len; i++) {
>       for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
>         swap(scores, j, j-1);
>       }
>     }
>     return;
>   }
>
>   private void swap(Object[] x, int a, int b) {
>     Object t = x[a];
>     x[a] = x[b];
>     x[b] = t;
>   }
>
> The squeezing of the array is no longer needed.
> We also initialized the Scorers array to 8 (instead of 2) to avoid having
> to grow the array for common queries, although this probably has less
> performance impact.
> This change added about 3% to query throughput in my testing.
> Peter
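The sparse-versus-dense skipping trade-off Yonik describes can be illustrated with a minimal conjunction loop. Everything here (the ListScorer class, nextMatch(), the NO_MORE sentinel) is a hypothetical simplification for illustration, not Lucene's actual Scorer API:

```java
// A minimal sketch of the conjunction "leapfrog" discussed above: keep
// skipping the lagging scorers to the current candidate doc until all of
// them agree on the same doc id.
public class LeapfrogSketch {
    static final int NO_MORE = Integer.MAX_VALUE;  // sentinel: scorer exhausted

    // A "scorer" over a sorted array of doc ids (stand-in for a real Scorer).
    static class ListScorer {
        private final int[] docs;
        private int pos = 0;
        ListScorer(int... docs) { this.docs = docs; }
        // Advance to the first doc >= target (linear scan for simplicity;
        // a real implementation would skip ahead more cleverly).
        int skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length ? docs[pos] : NO_MORE;
        }
    }

    // Find the first doc >= target that all scorers contain.
    static int nextMatch(ListScorer[] scorers, int target) {
        int doc = target;
        int agreed = 0;
        while (agreed < scorers.length) {
            for (ListScorer s : scorers) {
                int d = s.skipTo(doc);
                if (d == NO_MORE) return NO_MORE;  // one term exhausted: done
                if (d > doc) {
                    doc = d;       // new candidate; start counting agreement over
                    agreed = 1;
                } else if (++agreed == scorers.length) {
                    return doc;    // every scorer is on the same doc
                }
            }
        }
        return doc;
    }
}
```

On two sparse lists the loop leapfrogs cheaply between them; a dense scorer thrown into the round-robin mostly answers `d == doc`, which is exactly the wasted skipping described in the comment above.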
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_1208 ] Peter Keegan commented on LUCENE-693: - Well, I'm seeing a good 7% increase over the trunk version. Conjunction scorer time is mostly in 'skipto' now, which seems reasonable. Do the test cases try queries with non-existent terms? My failed query contained 3 required terms, but one of the terms was misspelled and didn't exist in the index. Peter > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: http://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search > Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index > Reporter: Peter Keegan > Attachments: conjunction.patch, conjunction.patch
[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444350 ] Yonik Seeley commented on LUCENE-695: - > One unit test assumed that readBytes() can work if given a null array, if the > length requested is 0. Unfortunately, > System.arraycopy doesn't share this promiscuity, so I had to add another > silly if(len>0) test in the readBytes() > code. If "given" a null array? Is this ever done in Lucene? Which should be fixed, the test case or the code? > Improve BufferedIndexInput.readBytes() performance > -- > > Key: LUCENE-695 > URL: http://issues.apache.org/jira/browse/LUCENE-695 > Project: Lucene - Java > Issue Type: Improvement > Components: Store > Affects Versions: 2.0.0 > Reporter: Nadav Har'El > Priority: Minor > Attachments: readbytes.patch, readbytes.patch > > > During a profiling session, I discovered that BufferedIndexInput.readBytes(), > the function which reads a bunch of bytes from an index, is very inefficient > in many cases. It is efficient for one or two bytes, and also efficient > for a very large number of bytes (e.g., when the norms are read all at once); > but for anything in between (e.g., 100 bytes), it is a performance disaster. > It can easily be improved, though, and below I include a patch to do that. > The basic problem in the existing code was that if you ask it to read 100 > bytes, readBytes() simply calls readByte() 100 times in a loop, which means > we check byte after byte whether the buffer has another character, instead of > just checking once how many bytes we have left and copying them all at once. > My version, attached below, copies these 100 bytes if they are available in > bulk (using System.arraycopy), and if fewer than 100 are available, whatever > is available gets copied, and then the rest. (As before, when a very large > number of bytes is requested, it is read directly into the final buffer.) 
> In my profiling, this fix caused an amazing performance > improvement: previously, BufferedIndexInput.readBytes() took as much as 25% > of the run time, and after the fix, this was down to 1% of the run time! > However, my scenario is *not* the typical Lucene code, but rather a version > of Lucene with added payloads, and these payloads average at 100 bytes, where > the original readBytes() did worst. I expect that my fix will have less of an > impact on "vanilla" Lucene, but it still can have an impact because it is > used for things like reading fields. (I am not aware of a standard Lucene > benchmark, so I can't provide benchmarks on a more typical case.) > In addition to the change to readBytes(), my attached patch also adds a new > unit test for BufferedIndexInput (which previously did not have a unit test). > This test simulates a "file" which contains a predictable series of bytes, and > then tries to read from it with readByte() and readBytes() with various > sizes (many thousands of combinations are tried) and checks that exactly the > expected bytes are read. This test is independent of my new readBytes() > implementation, and can be used to check the old implementation as well. > By the way, it's interesting that BufferedIndexOutput.writeBytes was already > efficient, and wasn't simply a loop of writeByte(). Only the reading code was > inefficient. I wonder why this happened.
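The byte-at-a-time versus bulk-copy difference the patch targets can be sketched as follows. This is a simplified in-memory analogue (the class name and the 16-byte buffer size are invented for the example), not the actual BufferedIndexInput code from the patch:

```java
import java.io.EOFException;

// Simplified sketch of the bulk-read idea: copy whatever is already
// buffered with one System.arraycopy instead of looping over readByte().
public class BulkReadSketch {
    private final byte[] file;          // the whole "file", held in memory
    private final byte[] buffer = new byte[16];
    private int bufferStart = 0;        // file offset of buffer[0]
    private int bufferLength = 0;       // valid bytes currently in buffer
    private int bufferPosition = 0;     // next unread byte in buffer

    BulkReadSketch(byte[] file) { this.file = file; }

    // Reload the buffer from the current file position; throw at EOF,
    // mirroring the beyond-end-of-file check discussed in the thread.
    private void refill() throws EOFException {
        bufferStart += bufferPosition;
        bufferPosition = 0;
        bufferLength = Math.min(buffer.length, file.length - bufferStart);
        if (bufferLength <= 0) throw new EOFException("read past EOF");
        System.arraycopy(file, bufferStart, buffer, 0, bufferLength);
    }

    // Bulk read: each iteration copies as many bytes as are buffered,
    // refilling only when the buffer runs dry.
    void readBytes(byte[] b, int offset, int len) throws EOFException {
        while (len > 0) {
            int available = bufferLength - bufferPosition;
            if (available <= 0) { refill(); continue; }
            int chunk = Math.min(len, available);
            System.arraycopy(buffer, bufferPosition, b, offset, chunk);
            bufferPosition += chunk;
            offset += chunk;
            len -= chunk;
        }
    }
}
```

A 100-byte request thus costs a handful of arraycopy calls instead of 100 per-byte bounds checks, which is where the profiled 25%-to-1% improvement comes from.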
[jira] Updated: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=all ] Nadav Har'El updated LUCENE-695: Attachment: readbytes.patch A fixed patch, which now checks that we don't read past end of file. This is now checked correctly in all three cases (1. data already in the buffer, 2. small number of bytes in addition to the buffer, 3. large number of bytes in addition to the buffer). Note that the original code (before my patch) did not check length() for a large number of bytes, only in refill() (which was only called for a small number of bytes). This code now checks in this case as well, so it is more correct than it was. The TestCompoundFile test now passes, and I also added to my new BufferedIndexInput unit test a third test case, testEOF, which tests that we can read up to EOF, but not past it. This test checks that small overflows (a few bytes) and very large overflows both throw an exception. I also made another change in this patch which I wish I didn't have to make, to account for other unit tests: one unit test assumed that readBytes() can work if given a null array, if the length requested is 0. Unfortunately, System.arraycopy doesn't share this promiscuity, so I had to add another silly if(len>0) test in the readBytes() code. > Improve BufferedIndexInput.readBytes() performance > -- > > Key: LUCENE-695 > URL: http://issues.apache.org/jira/browse/LUCENE-695 > Project: Lucene - Java > Issue Type: Improvement > Components: Store > Affects Versions: 2.0.0 > Reporter: Nadav Har'El > Priority: Minor > Attachments: readbytes.patch, readbytes.patch
[jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=all ] Yonik Seeley updated LUCENE-693: Attachment: conjunction.patch Here is my current patch and test code (which currently seems to be slower with this patch). > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: http://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search > Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index > Reporter: Peter Keegan > Attachments: conjunction.patch, conjunction.patch
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444334 ] Yonik Seeley commented on LUCENE-693: - I'm not sure how it's possible, but my version is *slower* in the performance test I came up with. Very odd... I'm not sure why. > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: http://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search > Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index > Reporter: Peter Keegan > Attachments: conjunction.patch
[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444322 ] Nadav Har'El commented on LUCENE-695: - Sorry, I didn't notice that my fix broke this unit test. Thanks for catching that. What is happening is interesting: this test (TestCompoundFile.testReadPastEOF()) is testing what happens when you read 40 bytes beyond the end of file, and expects the appropriate exception to be thrown. The old code actually did this for 40 bytes, so it passed this test, but the interesting thing is that when you asked for more than a buffer-full of bytes, say, 10K, the length() checking code was not there! So the old code was broken in this respect, just not for the 40 bytes which were tested. I'll fix my patch to add this beyond-end-of-file check, and will post the new patch ASAP. > Improve BufferedIndexInput.readBytes() performance > -- > > Key: LUCENE-695 > URL: http://issues.apache.org/jira/browse/LUCENE-695 > Project: Lucene - Java > Issue Type: Improvement > Components: Store > Affects Versions: 2.0.0 > Reporter: Nadav Har'El > Priority: Minor > Attachments: readbytes.patch
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444320 ] Yonik Seeley commented on LUCENE-693: - Ah, I see the problem... in the constructor I have boolean more = scorers[i].next(); for each scorer... but note that the local "more" is masking the member "more". Doh! You can just remove "boolean" from "boolean more" in the ConjunctionScorer constructor, and I'll try to see why this was never reproduced by any test cases in the meantime. > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: http://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search > Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index > Reporter: Peter Keegan > Attachments: conjunction.patch
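The masking bug described above is the classic local-variable-shadows-field mistake; a minimal sketch (hypothetical class and method names, not the real ConjunctionScorer):

```java
// Minimal illustration of the shadowing bug: declaring "boolean more = ..."
// inside a method creates a local that masks the field of the same name,
// so the field silently keeps its previous value.
public class ShadowSketch {
    boolean more = true;   // field tracking "are there more docs?"

    void initBuggy() {
        boolean more = advance();  // BUG: local masks the field; result lost
    }

    void initFixed() {
        more = advance();          // assigns the field as intended
    }

    boolean advance() { return false; }  // e.g. the scorer is exhausted
}
```

With the buggy form, an exhausted scorer still reports `more == true`, which is why the sentinel doc id later leaks into score() instead of terminating the loop.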
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444319 ] Yonik Seeley commented on LUCENE-693: - Thanks for trying it out Peter. Odd it could fail after passing all the Lucene unit tests... I assume this was the lucene trunk you were trying? So the query was just a boolean query with three required term queries? > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: http://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search > Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index > Reporter: Peter Keegan > Attachments: conjunction.patch
[jira] Commented: (LUCENE-693) ConjunctionScorer - more tuneup
[ http://issues.apache.org/jira/browse/LUCENE-693?page=comments#action_12444317 ] Peter Keegan commented on LUCENE-693: - Yonik, I tried out your patch, but it causes an exception on some boolean queries. This one occurred on a boolean query with 3 required terms:
java.lang.ArrayIndexOutOfBoundsException: 2147483647
        at org.apache.lucene.search.TermScorer.score(TermScorer.java:129)
        at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:97)
        at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:186)
        at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:318)
        at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:282)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
        at org.apache.lucene.search.Searcher.search(Searcher.java:116)
        at org.apache.lucene.search.Searcher.search(Searcher.java:95)
It looks like the doc id has the sentinel value (Integer.MAX_VALUE). Note: one of the terms had no occurrences in the index. Peter > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: http://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search > Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index > Reporter: Peter Keegan > Attachments: conjunction.patch
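The Integer.MAX_VALUE in the trace above is the exhaustion sentinel; a sketch of the convention (hypothetical simplified API, not Lucene's actual Scorer classes):

```java
// Sketch of the sentinel convention behind the exception above: an
// exhausted scorer parks doc() at Integer.MAX_VALUE, so any code that
// uses doc() as an array index without first checking for exhaustion
// fails with ArrayIndexOutOfBoundsException.
public class SentinelSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // A term with no occurrences in the index: exhausted immediately.
    static class EmptyTermScorer {
        private int doc = -1;
        boolean next() { doc = NO_MORE_DOCS; return false; }
        int doc() { return doc; }
    }

    // Safe driver: only trusts doc() while next() reports more docs,
    // so the sentinel is never used to index the norms array.
    static int countMatches(EmptyTermScorer s, float[] norms) {
        int n = 0;
        while (s.next()) {
            float norm = norms[s.doc()];  // only reached for real doc ids
            n++;
        }
        return n;
    }
}
```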
[jira] Commented: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=comments#action_12444316 ] Yonik Seeley commented on LUCENE-695: - > I wonder why this happened. readBytes on less than a buffer size probably only happens with binary (or compressed) fields, relatively new additions to Lucene, so it probably didn't have much of a real-world impact. I think it is important to fix though, as more things may be byte-oriented in the future. After applying the patch, at least one unit test fails: [junit] Testcase: testReadPastEOF(org.apache.lucene.index.TestCompoundFile): FAILED [junit] Block read past end of file [junit] junit.framework.AssertionFailedError: Block read past end of file [junit] at org.apache.lucene.index.TestCompoundFile.testReadPastEOF(Test CompoundFile.java:616) > Improve BufferedIndexInput.readBytes() performance > -- > > Key: LUCENE-695 > URL: http://issues.apache.org/jira/browse/LUCENE-695 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Affects Versions: 2.0.0 >Reporter: Nadav Har'El >Priority: Minor > Attachments: readbytes.patch > > > During a profiling session, I discovered that BufferedIndexInput.readBytes(), > the function which reads a bunch of bytes from an index, is very inefficient > in many cases. It is efficient for one or two bytes, and also efficient > for a very large number of bytes (e.g., when the norms are read all at once); > But for anything in between (e.g., 100 bytes), it is a performance disaster. > It can easily be improved, though, and below I include a patch to do that. > The basic problem in the existing code was that if you ask it to read 100 > bytes, readBytes() simply calls readByte() 100 times in a loop, which means > we check byte after byte if the buffer has another character, instead of just > checking once how many bytes we have left, and copy them all at once. 
> My version, attached below, copies these 100 bytes in bulk (using System.arraycopy) if they are available, and if less than 100 are available, whatever is available gets copied, and then the rest. (As before, when a very large number of bytes is requested, it is read directly into the final buffer.)
> In my profiling, this fix caused an amazing performance improvement: previously, BufferedIndexInput.readBytes() took as much as 25% of the run time, and after the fix, this was down to 1% of the run time!
> However, my scenario is *not* the typical Lucene code, but rather a version of Lucene with added payloads, and these payloads average 100 bytes, where the original readBytes() did worst. I expect that my fix will have less of an impact on "vanilla" Lucene, but it can still have an impact because it is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case.)
> In addition to the change to readBytes(), my attached patch also adds a new unit test for BufferedIndexInput (which previously did not have a unit test). This test simulates a "file" which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried) and checks that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well.
> By the way, it's interesting that BufferedIndexOutput.writeBytes was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened.
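The bulk-copy idea described in this issue can be sketched in isolation. The class below is a simplified stand-in for BufferedIndexInput (hypothetical names, not the actual patch): instead of looping over readByte(), it copies whatever is already buffered with a single System.arraycopy, refills the buffer, and repeats. The real patch additionally reads very large requests straight into the destination array, which is omitted here for brevity:

```java
import java.util.Arrays;

// Simplified model of the bulk readBytes() idea (hypothetical stand-in
// for BufferedIndexInput, not the actual Lucene class or patch).
public class BulkReadSketch {
    private final byte[] data;        // stands in for the underlying file
    private final byte[] buffer = new byte[16];
    private int bufferStart = 0;      // file position of buffer[0]
    private int bufferLength = 0;     // valid bytes currently in buffer
    private int bufferPosition = 0;   // next unread byte within buffer

    public BulkReadSketch(byte[] data) { this.data = data; }

    private void refill() {
        bufferStart += bufferPosition;
        bufferLength = Math.min(buffer.length, data.length - bufferStart);
        System.arraycopy(data, bufferStart, buffer, 0, bufferLength);
        bufferPosition = 0;
    }

    // Copies len bytes into b[offset..], one arraycopy per buffered chunk
    // instead of one readByte() call per byte. (No EOF handling, for brevity.)
    public void readBytes(byte[] b, int offset, int len) {
        while (len > 0) {
            int avail = bufferLength - bufferPosition;
            if (avail == 0) { refill(); avail = bufferLength; }
            int chunk = Math.min(avail, len);
            System.arraycopy(buffer, bufferPosition, b, offset, chunk);
            bufferPosition += chunk;
            offset += chunk;
            len -= chunk;
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[100];
        for (int i = 0; i < src.length; i++) src[i] = (byte) i;
        BulkReadSketch in = new BulkReadSketch(src);
        byte[] out = new byte[100];
        in.readBytes(out, 0, 100);   // a 100-byte read spanning several refills
        System.out.println(Arrays.equals(src, out));
    }
}
```

A mid-sized read like this crosses several buffer boundaries, which is exactly the case the issue says the byte-at-a-time loop handled badly.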
[jira] Commented: (LUCENE-551) Make Lucene - Java 1.9.1 Available in Maven2 repository in iBibilio.org
[ http://issues.apache.org/jira/browse/LUCENE-551?page=comments#action_12444300 ] Marcel Reutegger commented on LUCENE-551:
-
Are there any plans to also publish the new release to the Maven 1 repository on ibiblio.org? We at Jackrabbit still use Maven 1.0.2 as our build tool.

> Make Lucene - Java 1.9.1 Available in Maven2 repository in iBibilio.org
> ---
>
> Key: LUCENE-551
> URL: http://issues.apache.org/jira/browse/LUCENE-551
> Project: Lucene - Java
> Issue Type: Task
> Affects Versions: 1.9
> Reporter: Stephen Duncan Jr
>
> Please upload the 1.9.1 release to iBiblio so that Maven users can easily use the latest release. Currently 1.4.3 is the most recently available version: http://www.ibiblio.org/maven2/lucene/lucene/
> Please read the following FAQ for more information: http://maven.apache.org/project-faq.html
[jira] Updated: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ http://issues.apache.org/jira/browse/LUCENE-695?page=all ] Nadav Har'El updated LUCENE-695:
Attachment: readbytes.patch
The patch, which includes the change to BufferedIndexInput.readBytes(), and a new unit test for that class.

> Improve BufferedIndexInput.readBytes() performance
> --
>
> Key: LUCENE-695
> URL: http://issues.apache.org/jira/browse/LUCENE-695
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.0.0
> Reporter: Nadav Har'El
> Priority: Minor
> Attachments: readbytes.patch
>
> During a profiling session, I discovered that BufferedIndexInput.readBytes(), the function which reads a bunch of bytes from an index, is very inefficient in many cases. It is efficient for one or two bytes, and also efficient for a very large number of bytes (e.g., when the norms are read all at once); but for anything in between (e.g., 100 bytes), it is a performance disaster. It can easily be improved, though, and below I include a patch to do that.
> The basic problem in the existing code was that if you ask it to read 100 bytes, readBytes() simply calls readByte() 100 times in a loop, which means we check byte after byte if the buffer has another character, instead of just checking once how many bytes we have left and copying them all at once.
> My version, attached below, copies these 100 bytes in bulk (using System.arraycopy) if they are available, and if less than 100 are available, whatever is available gets copied, and then the rest. (As before, when a very large number of bytes is requested, it is read directly into the final buffer.)
> In my profiling, this fix caused an amazing performance improvement: previously, BufferedIndexInput.readBytes() took as much as 25% of the run time, and after the fix, this was down to 1% of the run time!
> However, my scenario is *not* the typical Lucene code, but rather a version of Lucene with added payloads, and these payloads average 100 bytes, where the original readBytes() did worst. I expect that my fix will have less of an impact on "vanilla" Lucene, but it can still have an impact because it is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case.)
> In addition to the change to readBytes(), my attached patch also adds a new unit test for BufferedIndexInput (which previously did not have a unit test). This test simulates a "file" which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried) and checks that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well.
> By the way, it's interesting that BufferedIndexOutput.writeBytes was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened.
[jira] Created: (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
Improve BufferedIndexInput.readBytes() performance
--

Key: LUCENE-695
URL: http://issues.apache.org/jira/browse/LUCENE-695
Project: Lucene - Java
Issue Type: Improvement
Components: Store
Affects Versions: 2.0.0
Reporter: Nadav Har'El
Priority: Minor

During a profiling session, I discovered that BufferedIndexInput.readBytes(), the function which reads a bunch of bytes from an index, is very inefficient in many cases. It is efficient for one or two bytes, and also efficient for a very large number of bytes (e.g., when the norms are read all at once); but for anything in between (e.g., 100 bytes), it is a performance disaster. It can easily be improved, though, and below I include a patch to do that.

The basic problem in the existing code was that if you ask it to read 100 bytes, readBytes() simply calls readByte() 100 times in a loop, which means we check byte after byte if the buffer has another character, instead of just checking once how many bytes we have left and copying them all at once.

My version, attached below, copies these 100 bytes in bulk (using System.arraycopy) if they are available, and if less than 100 are available, whatever is available gets copied, and then the rest. (As before, when a very large number of bytes is requested, it is read directly into the final buffer.)

In my profiling, this fix caused an amazing performance improvement: previously, BufferedIndexInput.readBytes() took as much as 25% of the run time, and after the fix, this was down to 1% of the run time!

However, my scenario is *not* the typical Lucene code, but rather a version of Lucene with added payloads, and these payloads average 100 bytes, where the original readBytes() did worst. I expect that my fix will have less of an impact on "vanilla" Lucene, but it can still have an impact because it is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case.)

In addition to the change to readBytes(), my attached patch also adds a new unit test for BufferedIndexInput (which previously did not have a unit test). This test simulates a "file" which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried) and checks that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well.

By the way, it's interesting that BufferedIndexOutput.writeBytes was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened.
[jira] Created: (LUCENE-694) Query parser doesn't warn about unmatched ')'
Query parser doesn't warn about unmatched ')'
-

Key: LUCENE-694
URL: http://issues.apache.org/jira/browse/LUCENE-694
Project: Lucene - Java
Issue Type: Bug
Components: QueryParser
Affects Versions: 2.0.0
Reporter: Eric Jain
Priority: Minor

If there is an unmatched '(', as in protein 'foo( bar', the query parser reports an error. But if you search for 'foo) bar', everything after the unmatched ')' seems to be ignored!
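Until the parser itself warns about this, a hypothetical client-side workaround (not part of Lucene's QueryParser) is to check parenthesis balance before parsing and reject queries with a stray ')'. The sketch below ignores quoted phrases and escape sequences for brevity:

```java
// Hypothetical pre-parse check (not part of Lucene): rejects a query string
// with an unmatched ')' so it is not silently truncated by the parser.
// Simplification: does not account for quoted phrases or '\' escapes.
public class ParenCheck {
    public static boolean isBalanced(String q) {
        int depth = 0;
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (c == '(') depth++;
            else if (c == ')' && --depth < 0) return false; // stray ')'
        }
        return depth == 0; // also false for an unmatched '('
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(foo bar)"));  // balanced
        System.out.println(isBalanced("foo) bar"));   // the silently-ignored case
        System.out.println(isBalanced("foo( bar"));   // the case the parser rejects
    }
}
```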