[jira] Resolved: (LUCENE-2274) Catch exceptions in Threads created by JUnit tasks
[ https://issues.apache.org/jira/browse/LUCENE-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2274. --- Resolution: Fixed Committed Revision: 912376 Catch exceptions in Threads created by JUnit tasks -- Key: LUCENE-2274 URL: https://issues.apache.org/jira/browse/LUCENE-2274 Project: Lucene - Java Issue Type: Test Reporter: Uwe Schindler Assignee: Uwe Schindler Priority: Minor Fix For: 3.1 Attachments: LUCENE-2274.patch, LUCENE-2274.patch On Hudson we had several assertion failures in TestRAMDirectory that were never caught by the error reporter in JUnit (as the test itself did not fail). This patch adds a handler for uncaught exceptions to LuceneTestCase(J4) that lets the test fail in tearDown(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
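The pattern this patch describes can be sketched with plain JDK threads. This is an illustrative sketch, not Lucene's actual LuceneTestCase code; the class and method names here are made up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UncaughtCollector {
    // Collects throwables from any thread that dies with an uncaught exception.
    private static final List<Throwable> uncaught =
        Collections.synchronizedList(new ArrayList<Throwable>());

    // Install in setUp(): route uncaught exceptions into the list
    // instead of letting them vanish onto stderr.
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                uncaught.add(e);
            }
        });
    }

    // Call from tearDown(): fail the test if any thread threw.
    public static void checkUncaught() {
        if (!uncaught.isEmpty()) {
            throw new AssertionError("Some threads threw uncaught exceptions: " + uncaught);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        Thread t = new Thread(new Runnable() {
            public void run() { throw new RuntimeException("boom"); }
        });
        t.start();
        t.join(); // the handler has run by the time join() returns
        System.out.println("collected=" + uncaught.size());
    }
}
```

Because join() only returns after the thread has fully terminated, the handler is guaranteed to have recorded the exception before tearDown() inspects the list.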
[jira] Updated: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2190: -- Attachment: LUCENE-2190-2-branch30.patch LUCENE-2190-2-trunk.patch Here are the patches for trunk (without deprecations) and the 3.0 branch. 2.9 will be merged later. Merging from trunk to 3.0 is not possible, as the TestCase was heavily rewritten. CustomScoreQuery (function query) is broken (due to per-segment searching) -- Key: LUCENE-2190 URL: https://issues.apache.org/jira/browse/LUCENE-2190 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9, 2.9.1, 3.0, 3.0.1, 3.1 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2190-2-branch30.patch, LUCENE-2190-2-trunk.patch, LUCENE-2190-2.patch, LUCENE-2190-2.patch, LUCENE-2190.patch Spinoff from here: http://lucene.markmail.org/message/psw2m3adzibaixbq With the cutover to per-segment searching, CustomScoreQuery is not really usable anymore, because the per-doc custom scoring method (customScore) receives a per-segment docID, yet there is no way to figure out which segment you are currently searching. I think to fix this we must also notify the subclass whenever a new segment is switched to. I think if we copy Collector.setNextReader, that would be sufficient. It would by default do nothing in CustomScoreQuery, but a subclass could override.
[jira] Updated: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2190: -- Attachment: LUCENE-2190-2-branch30.patch LUCENE-2190-2-trunk.patch Updated patches without javadocs-warnings / fixed javadocs. In trunk the backwards branch needs to be patched, too (merge from 3.0 branch).
[jira] Updated: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2190: -- Attachment: LUCENE-2190-2-branch29.patch Here is the patch for 2.9.
[jira] Resolved: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2190. --- Resolution: Fixed Assignee: Uwe Schindler (was: Michael McCandless) Lucene Fields: [New, Patch Available] (was: [New]) Committed 3.0 branch revision: 912383, 912389 Committed trunk revision: 912386 Committed 2.9 branch revision: 912390 Thanks Mike for the help!
RE: (LUCENE-1844) Speed up junit tests
Another test-bug that now shows up as a real test failure (and not only on stderr as before, thanks to LUCENE-2274). It happens quite often; I will check the logs on Hudson to see how frequently. The test failure on my Solaris box occurred in the backwards branch of trunk. [junit] Testsuite: org.apache.lucene.store.TestRAMDirectory [junit] Tests run: 4, Failures: 1, Errors: 0, Time elapsed: 0.259 sec [junit] [junit] - Standard Error - [junit] The following exceptions were thrown by threads: [junit] *** Thread: Thread-16978 *** [junit] junit.framework.AssertionFailedError: expected:<84992> but was:<86016> [junit] at junit.framework.Assert.fail(Assert.java:47) [junit] at junit.framework.Assert.failNotEquals(Assert.java:277) [junit] at junit.framework.Assert.assertEquals(Assert.java:64) [junit] at junit.framework.Assert.assertEquals(Assert.java:130) [junit] at junit.framework.Assert.assertEquals(Assert.java:136) [junit] at org.apache.lucene.store.TestRAMDirectory$1.run(TestRAMDirectory.java:129) [junit] - --- [junit] Testcase: testRAMDirectorySize(org.apache.lucene.store.TestRAMDirectory): FAILED [junit] Some threads throwed uncaught exceptions! [junit] junit.framework.AssertionFailedError: Some threads throwed uncaught exceptions! [junit] at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:142) [junit] at org.apache.lucene.store.TestRAMDirectory.tearDown(TestRAMDirectory.java:160) [junit] at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:250) [junit] [junit] [junit] TEST org.apache.lucene.store.TestRAMDirectory FAILED - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de/ eMail: u...@thetaphi.de From: Robert Muir [mailto:rcm...@gmail.com] Sent: Sunday, February 21, 2010 10:53 AM To: java-dev@lucene.apache.org Subject: Re: (LUCENE-1844) Speed up junit tests here is what i was worried about, if we cannot fix, i can revert back to forking. 
This is not reproducible every time: [junit] Testcase: testParallelMultiSort(org.apache.lucene.search.TestSort): Caused an ERROR [junit] java.util.ConcurrentModificationException [junit] java.lang.RuntimeException: java.util.ConcurrentModificationException [junit] at org.apache.lucene.search.ParallelMultiSearcher.foreach(ParallelMultiSearcher.java:216) [junit] at org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:121) [junit] at org.apache.lucene.search.Searcher.search(Searcher.java:49) [junit] at org.apache.lucene.search.TestSort.assertMatches(TestSort.java:965) [junit] at org.apache.lucene.search.TestSort.runMultiSorts(TestSort.java:891) [junit] at org.apache.lucene.search.TestSort.testParallelMultiSort(TestSort.java:629) [junit] at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:208) [junit] Caused by: java.util.ConcurrentModificationException [junit] at java.util.WeakHashMap$HashIterator.nextEntry(WeakHashMap.java:762) [junit] at java.util.WeakHashMap$KeyIterator.next(WeakHashMap.java:795) [junit] at org.apache.lucene.search.FieldCacheImpl.getCacheEntries(FieldCacheImpl.java:75) [junit] at org.apache.lucene.util.FieldCacheSanityChecker.checkSanity(FieldCacheSanityChecker.java:72) [junit] at org.apache.lucene.search.FieldCacheImpl$Cache.printNewInsanity(FieldCacheImpl.java:205) [junit] at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:194) [junit] at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:357) [junit] at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:373) [junit] at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183) [junit] at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:357) [junit] at org.apache.lucene.search.FieldComparator$IntComparator.setNextReader(FieldComparator.java:438) [junit] at 
org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:95) [junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:207) [junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:197) [junit] at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:175) [junit] at org.apache.lucene.search.MultiSearcher$MultiSearcherCallableWithSort.call(MultiSearcher.java:420) [junit] at org.apache.lucene.search.MultiSearcher$MultiSearcherCallableWithSort.call(MultiSearcher.java:394) [junit] at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303
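The root cause in the trace above is iterating a WeakHashMap while it is structurally modified (here by a concurrent searcher thread). The fail-fast behavior can be reproduced deterministically with plain JDK code, no Lucene involved; this single-threaded sketch modifies the map from inside its own iteration loop:

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.WeakHashMap;

public class CmeDemo {
    public static void main(String[] args) {
        Map<String, Integer> cache = new WeakHashMap<String, Integer>();
        cache.put("a", 1);
        cache.put("b", 2);
        try {
            for (String key : cache.keySet()) {
                // Structural modification during iteration: the fail-fast
                // iterator throws on its next step, as in the trace above.
                cache.put("c", 3);
            }
            System.out.println("no CME");
        } catch (ConcurrentModificationException e) {
            System.out.println("ConcurrentModificationException");
        }
    }
}
```

In the multi-threaded Lucene case the exception is only sometimes triggered, because it depends on whether another thread happens to insert into the FieldCache while the sanity checker is iterating.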
RE: (LUCENE-1844) Speed up junit tests
I fixed the backwards test and removed the assertion there; merging that removal back had been forgotten. The reason why this test fails: TestRAMDir creates 10 threads that start to add files to a RAMDir and add content to these files. The RAMFile updates its own size and also updates the size of the enclosing RAMDir (using AtomicLong). The problem is that all threads do this in parallel: another thread may have added content to a file and already updated the parent's AtomicLong, but not yet the file's own size. Because updating the local size and the RAMDir size is no longer atomic as seen from another thread, the assertion fails. In previous versions of RAMDir both updates were one atomic operation, synchronized on the directory. For speed reasons this was removed, so writing to RAMFiles no longer locks the parent directory. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de/ eMail: u...@thetaphi.de From: Robert Muir [mailto:rcm...@gmail.com] Sent: Sunday, February 21, 2010 8:44 PM To: java-dev@lucene.apache.org Subject: Re: (LUCENE-1844) Speed up junit tests Mike removed this assertion in LUCENE-2095, so this only happens in the backwards tests. On Sun, Feb 21, 2010 at 2:26 PM, Uwe Schindler u...@thetaphi.de wrote: Another test-bug that now shows up as a real test failure (and not only on stderr as before, thanks to LUCENE-2274). Happens quite often; will check logs on Hudson. The test failure on my Solaris box occurred in the backwards branch of trunk. 
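The race Uwe describes can be made concrete with a deterministic sketch. The names below are illustrative, not the real RAMDirectory code; the point is that the two size updates are separate steps, so a snapshot taken between them sees an inconsistent state:

```java
import java.util.concurrent.atomic.AtomicLong;

public class NonAtomicSizeDemo {
    // Models the directory-wide size and one file's own size.
    static final AtomicLong dirSize = new AtomicLong();
    static long fileSize = 0;

    public static void main(String[] args) {
        // A writer thread performs step 1 of appending 1024 bytes:
        dirSize.addAndGet(1024);

        // Before step 2 runs, another thread snapshots both counters --
        // exactly the window in which the test's assertion fired:
        long observedDir = dirSize.get();
        long observedFile = fileSize;

        // Step 2 (updating the file's own size) happens too late:
        fileSize += 1024;

        // The invariant "directory size == sum of file sizes" did not hold
        // at the moment of observation.
        System.out.println(observedDir == observedFile);
    }
}
```

The old code avoided this by doing both updates inside one synchronized block on the directory, which the speed-up removed.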
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271-maybe-as-separate-collector.patch After applying Mike's patch (with modified asserts to correctly detect NaN), I updated my patch of the delegating, -inf/NaN-aware TopScoreDocCollector. Maybe we should add it as a separate collector for function queries in 3.1, maybe with correct NaN ordering? Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector -- Key: LUCENE-2271 URL: https://issues.apache.org/jira/browse/LUCENE-2271 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.9 Reporter: Uwe Schindler Priority: Minor Fix For: 3.1 Attachments: LUCENE-2271-bench.patch, LUCENE-2271-maybe-as-separate-collector.patch, LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch This is a follow-up to LUCENE-2270, where a part of this problem was fixed (boost = 0 leading to NaN scores, which is also unintuitive), but in general, function queries in Solr can create these invalid scores easily. In previous versions of Lucene these scores ordered correctly (except NaN, which mixes up results), but invalid document ids (like Integer.MAX_VALUE) were never returned. The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to work, this sentinel must be smaller than all possible values, which is not the case: a score of -inf is equal to the sentinel, so the document is not inserted into the HQ as not competitive; but the HQ is not yet full, so the sentinel values stay in the HQ and the result contains the Integer.MAX_VALUE docs. 
This problem is solvable (and only affects the ordered collector) by changing the exit condition to: {code} if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) { // Since docs are returned in-order (i.e., increasing doc Id), a document // with equal score to pqTop.score cannot compete since HitQueue favors // documents with lower doc Ids. Therefore reject those docs too. return; } {code} - The NaN case can be fixed in the same way, but then there is another problem: all comparisons with NaN result in false (none of these is true): x < NaN, x > NaN, NaN == NaN. This leads to the fact that HQ's lessThan always returns false, leading to unexpected ordering in the PQ, and sometimes the sentinel values do not stay at the top of the queue. A later hit then overrides the top of the queue but leaves the incorrect sentinels unchanged - invalid results. This can be fixed in two ways in HQ: Force all sentinels to the top: {code} protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) { if (hitA.doc == Integer.MAX_VALUE) return true; if (hitB.doc == Integer.MAX_VALUE) return false; if (hitA.score == hitB.score) return hitA.doc > hitB.doc; else return hitA.score < hitB.score; } {code} or alternatively have a defined order for NaN (Float.compare sorts them after +inf): {code} protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) { if (hitA.score == hitB.score) return hitA.doc > hitB.doc; else return Float.compare(hitA.score, hitB.score) < 0; } {code} The problem with both solutions is that we now have more comparisons per hit, and the use of sentinels is questionable. I would like to remove the sentinels and use the old pre-2.9 code for comparing, calling PQ.add() when a competitive hit arrives. The order of NaN would be unspecified. To fix the order of NaN, it would be better to replace all score comparisons by Float.compare() [also in FieldComparator]. I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and solved. 
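The NaN pitfall described above is plain Java float semantics and easy to verify with the JDK alone:

```java
public class NaNOrderDemo {
    public static void main(String[] args) {
        float nan = Float.NaN;
        // Every ordinary comparison involving NaN is false, so a priority
        // queue's lessThan() can answer false for both lessThan(a,b) and
        // lessThan(b,a) -- the heap invariant silently breaks.
        System.out.println(1.0f < nan);
        System.out.println(1.0f > nan);
        System.out.println(nan == nan);
        // Float.compare defines a total order: NaN sorts after +Infinity.
        System.out.println(Float.compare(nan, Float.POSITIVE_INFINITY) > 0);
    }
}
```

This is why switching the queue's comparisons to Float.compare() restores a well-defined (if arbitrary) position for NaN scores.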
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Fix Version/s: (was: 3.0.1) (was: 2.9.2)
[VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts - Take #2
Hello folks, I have posted a new release candidate (take #2) for both Lucene Java 2.9.2 and 3.0.1 (which both have the same bug-fix level, functionality, and release announcement), built from revision 912433 of the corresponding branches. Thanks for all your help! Please test them and give your votes until *Thursday morning*, as the scheduled release date for both versions is Friday, Feb 26th, 2010. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned the parallel release with one announcement because of their parallel development and bug-fix level, to emphasize that they are identical except for the deprecation removals and the move to Java 5 in major version 3. Updates since take #1 can be followed in these issues: https://issues.apache.org/jira/browse/LUCENE-2190 (reopened, fixed) https://issues.apache.org/jira/browse/LUCENE-2270 (fixed) https://issues.apache.org/jira/browse/LUCENE-2271 (won't fix for 2.9.2/3.0.1) You can find the artifacts here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/ Maven repo: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/maven/ The changes are here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/changes-2.9.2/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/changes-2.9.2/Contrib-Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/changes-3.0.1/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take2-rev912433/changes-3.0.1/Contrib-Changes.html Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
[jira] Updated: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2190: -- Attachment: LUCENE-2190-2.patch Here is a better solution. It now works like Filter's getDocIdSet: for customizing scores, you override the analogous protected method getCustomScoreProvider(IndexReader) and return a subclass of CustomScoreProvider. The default delegates to the backwards layer. The semantics are now identical to filters: each segment's IndexReader gets its own score provider, like its own DocIdSet in filters. Also fixes the following problems: - rewrite() was incorrectly implemented (now works like BooleanQuery.rewrite()) - equals/hashCode ignored strict
[jira] Updated: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2190: -- Attachment: LUCENE-2190-2.patch Updated patch (I forgot to add an IOException to getCustomScoreProvider, and fixed the test). Will backport after committing to 3.0 and 2.9 (argh).
[jira] Resolved: (LUCENE-2267) Add solr's artifact signing scripts into lucene's build.xml/common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2267. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) Committed revision: 912115 Add solr's artifact signing scripts into lucene's build.xml/common-build.xml Key: LUCENE-2267 URL: https://issues.apache.org/jira/browse/LUCENE-2267 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.2, 3.0.1 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2267.patch, LUCENE-2267.patch Solr has nice artifact signing scripts in its common-build.xml and build.xml. For me as release manager of 3.0, it would have been good to have them also when building Lucene artifacts. I will investigate how to add them to the src artifacts and maven artifacts.
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836182#action_12836182 ] Uwe Schindler commented on LUCENE-2271: --- In my opinion we should fix it using the attached patch and, in the future 3.1, do some refactoring: - no sentinels - define an order for NaN, as NaN breaks the complete ordering of results (because the PQ cannot handle the case that lessThan(a,b) returns false and lessThan(b,a) also returns false when NaN is involved) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector -- Key: LUCENE-2271 URL: https://issues.apache.org/jira/browse/LUCENE-2271 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.9 Reporter: Uwe Schindler Priority: Minor Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch This is a follow-up to LUCENE-2270, where a part of this problem was fixed (boost = 0 leading to NaN scores, which is also un-intuitive), but in general, function queries in Solr can create these invalid scores easily. In previous versions of Lucene these scores ordered correctly (except NaN, which mixes up results), but invalid document ids (like Integer.MAX_VALUE) were never returned. The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to work, this sentinel must be smaller than all possible values, which is not the case: a hit scoring -inf is equal to the sentinel and is not inserted into the HQ, as it is not competitive, but the HQ is not yet full, so the sentinel values stay in the HQ and the result contains the Integer.MAX_VALUE docs.
This problem is solvable (and only affects the Ordered collector) by changing the exit condition to:
{code}
if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
  // Since docs are returned in-order (i.e., increasing doc Id), a document
  // with equal score to pqTop.score cannot compete since HitQueue favors
  // documents with lower doc Ids. Therefore reject those docs too.
  return;
}
{code}
The NaN case can be fixed in the same way, but then has another problem: all comparisons with NaN result in false (none of these is true): x < NaN, x > NaN, NaN == NaN. This leads to the fact that HQ's lessThan always returns false, leading to unexpected ordering in the PQ, and sometimes the sentinel values do not stay at the top of the queue. A later hit then overrides the top of the queue but leaves the incorrect sentinels unchanged - invalid results. This can be fixed in two ways in HQ. Force all sentinels to the top:
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.doc == Integer.MAX_VALUE) return true;
  if (hitB.doc == Integer.MAX_VALUE) return false;
  if (hitA.score == hitB.score) return hitA.doc > hitB.doc;
  else return hitA.score < hitB.score;
}
{code}
or alternatively have a defined order for NaN (Float.compare sorts it after +inf):
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.score == hitB.score) return hitA.doc > hitB.doc;
  else return Float.compare(hitA.score, hitB.score) < 0;
}
{code}
The problem with both solutions is that we now have more comparisons per hit, and the use of sentinels is questionable. I would like to remove the sentinels, use the old pre-2.9 code for comparing, and use PQ.add() when a competitive hit arrives. The order of NaN would be unspecified. To fix the order of NaN, it would be better to replace all score comparisons by Float.compare() [also in FieldComparator]. I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and solved.
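The NaN pathology described in this issue can be seen with a tiny standalone demo (a hypothetical illustration, not Lucene code): every primitive float comparison involving NaN is false, so a lessThan written with < returns false in both directions, while Float.compare imposes a total order that places NaN after +Infinity.

```java
// Standalone demo (not Lucene code) of the NaN behavior described above:
// all primitive comparisons with NaN are false, so a '<'-based lessThan is
// inconsistent, while Float.compare gives a total order with NaN last.
public class NaNOrderDemo {

    // Mirrors the '<'-based comparison pattern from the lessThan examples.
    static boolean lessThan(float a, float b) {
        return a < b;
    }

    public static void main(String[] args) {
        float nan = Float.NaN;
        // None of these comparisons is true:
        System.out.println(1.0f < nan);   // false
        System.out.println(1.0f > nan);   // false
        System.out.println(nan == nan);   // false
        // lessThan is inconsistent: false in both directions, which is the
        // case a priority queue's sift-up/sift-down cannot handle.
        System.out.println(lessThan(1.0f, nan)); // false
        System.out.println(lessThan(nan, 1.0f)); // false
        // Float.compare defines a total order: NaN sorts after +Infinity.
        System.out.println(Float.compare(nan, Float.POSITIVE_INFINITY) > 0); // true
    }
}
```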
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch Here is a simpler patch with the sentinels removed. You can maybe think about a better if-check in the out-of-order collector.
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch Sorry, insertWithOverflow is correct!
[jira] Updated: (LUCENE-1935) Generify PriorityQueue
[ https://issues.apache.org/jira/browse/LUCENE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1935: -- Attachment: HitQueue.jad Just for reference: here is the class generated by javac when overriding lessThan (using HitQueue as an example), decompiled from the resulting class file by JAD. Generify PriorityQueue -- Key: LUCENE-1935 URL: https://issues.apache.org/jira/browse/LUCENE-1935 Project: Lucene - Java Issue Type: Task Components: Other Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.0 Attachments: HitQueue.jad, LUCENE-1935.patch PriorityQueue should use generics like all other Java 5 Collection API classes. This is very simple, but makes the code more readable.
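For contrast, here is a small sketch of what a typed hit queue looks like with generics. It uses java.util.PriorityQueue with an explicit Comparator purely for illustration; Lucene's own org.apache.lucene.util.PriorityQueue uses a lessThan override instead, and the class names here are invented:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustration only (java.util.PriorityQueue, NOT Lucene's PriorityQueue):
// with generics the queue is typed as PriorityQueue<ScoreDoc>, so callers
// need no casts when peeking or polling.
public class TypedHitQueueDemo {

    static final class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Order: lower score is "less"; on equal scores the higher doc id is
    // "less", so it is evicted first and lower doc ids are favored -
    // the same tie-break HitQueue uses.
    static final Comparator<ScoreDoc> HIT_ORDER =
        Comparator.comparingDouble((ScoreDoc d) -> d.score)
                  .thenComparingInt((ScoreDoc d) -> -d.doc);

    // Returns the doc id of the weakest hit among the given (doc, score) pairs.
    static int weakestDoc(int[] docs, float[] scores) {
        PriorityQueue<ScoreDoc> pq = new PriorityQueue<>(HIT_ORDER);
        for (int i = 0; i < docs.length; i++) {
            pq.add(new ScoreDoc(docs[i], scores[i]));
        }
        return pq.peek().doc; // typed access, no cast needed
    }

    public static void main(String[] args) {
        // Doc 3 has the lowest score, so it is the weakest hit.
        System.out.println(weakestDoc(new int[] {1, 2, 3}, new float[] {2f, 2f, 1f}));
    }
}
```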
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch Here is a new impl that has exactly one additional check during the initial collection (while the PQ is not yet full). After the PQ is full, the collector is replaced by the short-cutting one. This impl should even be faster than before, if the additional method call does not count and is removed by the JVM (which it should be, because it is clearly predictable).
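The two-phase idea in this comment can be sketched as follows (all names are invented for illustration and this is not the patch's actual code): the first phase pays one extra "is the queue full yet?" check per hit, and once the queue holds numHits entries the collect path switches to a cheap threshold-only test.

```java
import java.util.PriorityQueue;

// Hypothetical sketch of the two-phase collection idea (invented names,
// not Lucene's API): fill phase adds every hit and checks fullness once
// per hit; after the queue is full, only competitive hits touch the queue.
public class TwoPhaseCollectorDemo {

    private final int numHits;
    private final PriorityQueue<Float> pq; // min-heap of the top scores
    private boolean full = false;

    TwoPhaseCollectorDemo(int numHits) {
        this.numHits = numHits;
        this.pq = new PriorityQueue<>(numHits);
    }

    void collect(float score) {
        if (!full) {
            pq.add(score);                   // extra check only in this phase
            if (pq.size() == numHits) full = true;
            return;
        }
        if (score <= pq.peek()) return;      // short-cut: not competitive
        pq.poll();                           // evict the weakest hit
        pq.add(score);
    }

    float worstCompetitiveScore() { return pq.peek(); }

    public static void main(String[] args) {
        TwoPhaseCollectorDemo c = new TwoPhaseCollectorDemo(2);
        for (float s : new float[] {1f, 5f, 3f, 0.5f}) c.collect(s);
        // Top-2 scores are {5, 3}; the threshold is the weaker of them.
        System.out.println(c.worstCompetitiveScore()); // 3.0
    }
}
```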
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch)
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch A further improved version.
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch A more optimized version with more local variables. This is the version for the benchmark run.
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271-bench.patch Here is a benchmark task made by Grant. Run collector.alg and wait long enough.
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch More improved version, now equal to the prefilled-queue case, as the collector reuses overflowed ScoreDoc instances.
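The "reuses overflowed ScoreDoc instances" idea can be illustrated outside Lucene. Below is a minimal sketch using a plain java.util.PriorityQueue (the class, method, and field names are made up for illustration; Lucene's actual collector and queue classes differ): when a competitive hit evicts the queue's smallest entry, that evicted object is recycled as the next candidate instead of allocating a fresh one per hit.

```java
import java.util.PriorityQueue;

// Sketch of object reuse in a bounded top-K queue: one spare Entry is
// recycled across all non-competitive and evicted hits, so the steady
// state allocates nothing per hit.
public class ReuseDemo {
    static final class Entry {
        float score;
        int doc;
    }

    static PriorityQueue<Entry> collect(float[] scores, int capacity) {
        PriorityQueue<Entry> pq =
            new PriorityQueue<>(capacity, (a, b) -> Float.compare(a.score, b.score));
        Entry spare = new Entry();               // single reusable candidate object
        for (int doc = 0; doc < scores.length; doc++) {
            spare.score = scores[doc];
            spare.doc = doc;
            if (pq.size() < capacity) {
                pq.add(spare);
                spare = new Entry();             // queue kept it; allocate a fresh spare
            } else if (spare.score > pq.peek().score) {
                Entry evicted = pq.poll();       // smallest entry falls out...
                pq.add(spare);
                spare = evicted;                 // ...and is recycled as the next candidate
            }                                    // else: non-competitive, reuse spare as-is
        }
        return pq;
    }

    public static void main(String[] args) {
        PriorityQueue<Entry> pq =
            collect(new float[] {0.5f, 2.0f, 1.0f, 3.0f, 0.1f}, 3);
        System.out.println(pq.peek().score);     // 1.0 (smallest of the top 3)
    }
}
```

The design point is that allocation cost then matches the prefilled-sentinel approach without needing sentinel values at all.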
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch)
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch)
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch)
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch)
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch)
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch)
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2271:
----------------------------------
    Attachment: (was: LUCENE-2271.patch)

Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--------------------------------------------------------------------------------------------------------------
                Key: LUCENE-2271
                URL: https://issues.apache.org/jira/browse/LUCENE-2271
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
   Affects Versions: 2.9
           Reporter: Uwe Schindler
           Priority: Minor
            Fix For: 2.9.2, 3.0.1, 3.1
        Attachments: LUCENE-2271-bench.patch, LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch

This is a follow-up to LUCENE-2270, where part of this problem was fixed (boost = 0 leading to NaN scores, which is also unintuitive), but in general, function queries in Solr can easily produce these invalid scores. In previous versions of Lucene such scores ordered correctly (except NaN, which mixes up results), and invalid document ids (like Integer.MAX_VALUE) were never returned. The problem: TopScoreDocCollector pre-fills the HitQueue with sentinel ScoreDocs that have a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to work, this sentinel must be smaller than all possible values, which is not the case: a hit with a score of -inf compares equal to the sentinel, so it is rejected as not competitive even though the HQ is not yet full; the sentinel values remain in the HQ and the result contains the Integer.MAX_VALUE docs. This problem is solvable (and only affects the ordered collector) by changing the exit condition to:
{code}
if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
  // Since docs are returned in-order (i.e., increasing doc Id), a document
  // with equal score to pqTop.score cannot compete since HitQueue favors
  // documents with lower doc Ids. Therefore reject those docs too.
  return;
}
{code}
- The NaN case can be fixed in the same way, but then has another problem: all comparisons with NaN evaluate to false (none of these is true): x < NaN, x > NaN, NaN == NaN. As a result, HQ's lessThan always returns false, leading to unexpected ordering in the PQ, and sometimes the sentinel values do not stay at the top of the queue. A later hit then overwrites the top of the queue but leaves the incorrect sentinels unchanged - invalid results. This can be fixed in two ways in HQ. Force all sentinels to the top:
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.doc == Integer.MAX_VALUE) return true;
  if (hitB.doc == Integer.MAX_VALUE) return false;
  if (hitA.score == hitB.score)
    return hitA.doc > hitB.doc;
  else
    return hitA.score < hitB.score;
}
{code}
or alternatively give NaN a defined order (Float.compare sorts it after +inf):
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.score == hitB.score)
    return hitA.doc > hitB.doc;
  else
    return Float.compare(hitA.score, hitB.score) < 0;
}
{code}
The problem with both solutions is that we now have more comparisons per hit, and the use of sentinels becomes questionable. I would like to remove the sentinels, go back to the pre-2.9 comparison code, and call PQ.add() only when a competitive hit arrives. The order of NaN would then be unspecified. To fix the order of NaN, it would be better to replace all score comparisons by Float.compare() [also in FieldComparator]. I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and solved.
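The sentinel pre-fill and the fixed exit condition described above can be modeled in a standalone sketch. This is illustrative only: the class and field names mimic the Lucene code quoted in the issue, but it uses java.util.PriorityQueue instead of the real HitQueue/TopScoreDocCollector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class SentinelDemo {
    static final class ScoreDoc {
        final int doc; final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Collect hits (scores indexed by increasing doc id) into a queue of size
    // numHits that was pre-filled with sentinels, using the fixed exit condition.
    static List<Integer> topDocs(float[] scores, int numHits) {
        // Min-heap: lower score first; on equal scores the larger doc id is
        // "less", so the queue favors lower doc ids (as HitQueue does).
        PriorityQueue<ScoreDoc> pq = new PriorityQueue<>(numHits,
            (a, b) -> a.score == b.score ? Integer.compare(b.doc, a.doc)
                                         : Float.compare(a.score, b.score));
        for (int i = 0; i < numHits; i++)  // pre-fill with sentinel ScoreDocs
            pq.offer(new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY));
        for (int doc = 0; doc < scores.length; doc++) {
            ScoreDoc pqTop = pq.peek();
            // Fixed exit condition: a sentinel at the top never rejects a hit.
            if (scores[doc] <= pqTop.score && pqTop.doc != Integer.MAX_VALUE)
                continue;
            pq.poll();
            pq.offer(new ScoreDoc(doc, scores[doc]));
        }
        List<Integer> docs = new ArrayList<>();
        for (ScoreDoc sd : pq) docs.add(sd.doc);
        Collections.sort(docs);
        return docs;
    }

    public static void main(String[] args) {
        // All hits score -inf: without the doc-id check every hit would be
        // rejected and only sentinels (doc = Integer.MAX_VALUE) would remain.
        float[] scores = { Float.NEGATIVE_INFINITY, Float.NEGATIVE_INFINITY };
        System.out.println(topDocs(scores, 2)); // [0, 1]
    }
}
```

With the plain `score <= pqTop.score` condition the same input would return two Integer.MAX_VALUE entries, which is the bug this issue reports.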
[jira] Issue Comment Edited: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836265#action_12836265 ]

Uwe Schindler edited comment on LUCENE-2271 at 2/20/10 9:40 PM:
----------------------------------------------------------------

I did some benchmarks (Java 1.5.0_22, 64bit, Win7, Core2Duo P8700; will do more tomorrow when I set up a large testing environment with 3 separate checkouts containing the three collector versions):
- The latest approach (https://issues.apache.org/jira/secure/attachment/12436458/LUCENE-2271.patch), with no sentinels, using delegation and exchanging the inner collector, was as fast as the original unpatched version.
- The approach with sentinels but fixed HitQueue ordering and extra checks (https://issues.apache.org/jira/secure/attachment/12436329/LUCENE-2271.patch) showed (as expected) a little overhead: the ordered collector was only as fast as the unpatched unordered collector (because of the one extra check) - so I would not use this patch.
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836265#action_12836265 ]

Uwe Schindler commented on LUCENE-2271:
---------------------------------------

I did some benchmarks (will do more tomorrow when I set up a large testing environment with 3 separate checkouts containing the three collector versions):
- The latest approach (https://issues.apache.org/jira/secure/attachment/12436458/LUCENE-2271.patch), with no sentinels, using delegation and exchanging the inner collector, was as fast as the original unpatched version.
- The approach with sentinels but fixed HitQueue ordering and extra checks (https://issues.apache.org/jira/secure/attachment/12436329/LUCENE-2271.patch) showed (as expected) a little overhead: the ordered collector was only as fast as the unpatched unordered collector (because of the one extra check) - so I would not use this patch.
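The two lessThan variants quoted in the issue description can be compared side by side in a standalone sketch (hypothetical class, not Lucene's HitQueue), showing that they agree on ordinary scores and differ only in how NaN is handled:

```java
public class LessThanVariants {
    // Variant 1: primitive comparisons plus explicit sentinel checks.
    static boolean lessThanPrimitive(int docA, float scoreA, int docB, float scoreB) {
        if (docA == Integer.MAX_VALUE) return true;   // force sentinels to the top
        if (docB == Integer.MAX_VALUE) return false;
        if (scoreA == scoreB) return docA > docB;     // ties favor lower doc ids
        return scoreA < scoreB;
    }

    // Variant 2: Float.compare gives NaN a defined place (after +inf).
    static boolean lessThanTotalOrder(int docA, float scoreA, int docB, float scoreB) {
        if (scoreA == scoreB) return docA > docB;
        return Float.compare(scoreA, scoreB) < 0;
    }

    public static void main(String[] args) {
        // Ordinary scores: both variants agree.
        System.out.println(lessThanPrimitive(1, 0.5f, 2, 0.7f));   // true
        System.out.println(lessThanTotalOrder(1, 0.5f, 2, 0.7f));  // true
        // NaN: variant 1 only stays correct because of the sentinel doc-id check;
        // variant 2 ranks NaN above +inf without needing it.
        System.out.println(lessThanPrimitive(Integer.MAX_VALUE, Float.NaN, 2, 1.0f));      // true
        System.out.println(lessThanTotalOrder(1, Float.POSITIVE_INFINITY, 2, Float.NaN));  // true
    }
}
```

This is the trade-off the benchmark above measures: variant 1 pays two extra branch checks per lessThan call, variant 2 pays for Float.compare on every non-tied comparison.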
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2271:
----------------------------------
    Attachment: LUCENE-2271.patch

Fix an issue when numDocs==0.

Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--------------------------------------------------------------------------------------------------------------
                Key: LUCENE-2271
                URL: https://issues.apache.org/jira/browse/LUCENE-2271
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
   Affects Versions: 2.9
           Reporter: Uwe Schindler
           Priority: Minor
            Fix For: 2.9.2, 3.0.1, 3.1
        Attachments: LUCENE-2271-bench.patch, LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
[jira] Created: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--------------------------------------------------------------------------------------------------------------
                Key: LUCENE-2271
                URL: https://issues.apache.org/jira/browse/LUCENE-2271
            Project: Lucene - Java
         Issue Type: Bug
         Components: Search
   Affects Versions: 2.9
           Reporter: Uwe Schindler
            Fix For: 2.9.2, 3.0.1, 3.1
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2271:
----------------------------------
    Attachment: LUCENE-2271.patch

This is a patch that supports NaN and -inf. The cost of the additional checks in HitQueue.lessThan is negligible, as they only occur when a competitive hit is actually inserted into the queue. The check forces all sentinels to the top of the queue, regardless of their score (because NaN != NaN is always true).

Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--------------------------------------------------------------------------------------------------------------
                Key: LUCENE-2271
                URL: https://issues.apache.org/jira/browse/LUCENE-2271
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
   Affects Versions: 2.9
           Reporter: Uwe Schindler
           Priority: Minor
            Fix For: 2.9.2, 3.0.1, 3.1
        Attachments: LUCENE-2271.patch
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2271:
----------------------------------
    Attachment: LUCENE-2271.patch

Sorry, reverted a comment removal.

Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--------------------------------------------------------------------------------------------------------------
                Key: LUCENE-2271
                URL: https://issues.apache.org/jira/browse/LUCENE-2271
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
   Affects Versions: 2.9
           Reporter: Uwe Schindler
           Priority: Minor
            Fix For: 2.9.2, 3.0.1, 3.1
        Attachments: LUCENE-2271.patch, TSDC.patch
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2271:
----------------------------------
    Attachment: (was: LUCENE-2271.patch)

Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--------------------------------------------------------------------------------------------------------------
                Key: LUCENE-2271
                URL: https://issues.apache.org/jira/browse/LUCENE-2271
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
   Affects Versions: 2.9
           Reporter: Uwe Schindler
           Priority: Minor
            Fix For: 2.9.2, 3.0.1, 3.1
        Attachments: LUCENE-2271.patch, TSDC.patch
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2271:
----------------------------------
    Attachment: LUCENE-2271.patch

Patch with test cases for trunk; it should work on the branches, too (after removing @Override). Without the fixes in HitQueue or TSDC the tests fail.

Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--------------------------------------------------------------------------------------------------------------
                Key: LUCENE-2271
                URL: https://issues.apache.org/jira/browse/LUCENE-2271
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
   Affects Versions: 2.9
           Reporter: Uwe Schindler
           Priority: Minor
            Fix For: 2.9.2, 3.0.1, 3.1
        Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835763#action_12835763 ] Uwe Schindler commented on LUCENE-2271: --- The cost to handle NaN is the modified lessThan() in HitQueue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch Improved test that also checks for increasing doc ids when scores are identical. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Reopened: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened LUCENE-2190: --- The fix is invalid: adding setNextReader to CustomScoreQuery makes the Query itself stateful. This breaks when it is used together with e.g. ParallelMultiSearcher. As the package is experimental, I see no problem in changing the method signature of customScore to pass in the affected IndexReader (CustomScorer can do this). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835837#action_12835837 ] Uwe Schindler commented on LUCENE-2190: --- We can preserve backwards compatibility if the default impl with the new reader parameter only delegates to the deprecated old customScore function. I will provide a patch tomorrow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835903#action_12835903 ] Uwe Schindler commented on LUCENE-2190: --- During refactoring I found out that CustomScoreQuery is even more broken: the rewrite() method is wrong. For now it's not really a problem, but when we change to per-segment rewrite (as Mike plans) it breaks. It is even broken if you rewrite against one IndexReader and want to reuse the query later on another IndexReader. It rewrites all its subqueries and returns itself, which is wrong: if one of the subqueries was rewritten, it must return a new clone instance (like BooleanQuery does). Also, hashCode and equals ignore strict. Will provide a patch later. Now everything seems to work correctly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
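The clone-on-rewrite contract mentioned above (return a new instance when a subquery rewrites to something different, instead of mutating yourself) can be sketched with simplified stand-in types; the Query class below is a placeholder for illustration, not Lucene's real API:

```java
// Minimal sketch of the clone-on-rewrite contract: a wrapper query must
// return a NEW instance when its subquery rewrites to something different,
// instead of mutating itself and returning `this`.
public class RewriteSketch {
    static class Query implements Cloneable {
        Query rewrite() { return this; } // default: already in primitive form
        public Query clone() {
            try { return (Query) super.clone(); }
            catch (CloneNotSupportedException e) { throw new AssertionError(e); }
        }
    }

    // A query that rewrites itself into a different primitive form.
    static class ExpandingQuery extends Query {
        Query rewrite() { return new Query(); }
    }

    // Wrapper analogous to CustomScoreQuery wrapping a subquery.
    static class WrapperQuery extends Query {
        Query sub;
        WrapperQuery(Query sub) { this.sub = sub; }
        Query rewrite() {
            Query rewritten = sub.rewrite();
            if (rewritten == sub) return this;      // nothing changed: reuse
            WrapperQuery clone = (WrapperQuery) this.clone();
            clone.sub = rewritten;                  // original stays untouched
            return clone;
        }
    }

    public static void main(String[] args) {
        WrapperQuery q = new WrapperQuery(new ExpandingQuery());
        Query r = q.rewrite();
        System.out.println(r != q);                          // a clone was returned
        System.out.println(q.sub instanceof ExpandingQuery); // original left intact
    }
}
```

Because the original instance is never mutated, the same query object can safely be rewritten against one reader and later reused against another, which is exactly the case the comment says was broken.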
[jira] Updated: (LUCENE-2267) Add solr's artifact signing scripts into lucene's build.xml/common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2267: -- Attachment: LUCENE-2267.patch Patch with a heavily improved version of Solr's macros. I changed: - For security reasons, the password is no longer passed on the command line (you can see it with ps -ef !!!). Also, --passphrase does not work with newer 2.x versions of gpg. The correct way is the same as in Mike McCandless' Python script: pass --passphrase-fd 0 (gpg then reads the passphrase from stdin), piping in the password using the ant task's inputstring attribute. - added the --batch parameter to gpg. Without it, in GUI environments gpg ignores the passed-in password and uses gpg-agent. - no manual signing of every file; it uses the apply ant task, which starts a process for every file in a fileset and also supplies a source -> destfilename mapping (which appends .asc). - added --default-key with a default value of CODE SIGNING KEY; you can override it with -Dgpg.key=YourHexKeyOrEmail. The only problem is that apply does not print the command lines or a file list. You only get a message at the end that 'gpg' was applied to x files, which is fine. Usage: {code} ant sign-artifacts -Dgpg.exe=/path/to/gpg -Dgpg.key=YourHexKeyOrEmail -Dgpg.passphrase=12345 {code} All parameters are optional, defaults are: {code} gpg.exe = gpg gpg.key = CODE SIGNING KEY gpg.passphrase = none; if not given, you are asked for input {code} Add solr's artifact signing scripts into lucene's build.xml/common-build.xml Key: LUCENE-2267 URL: https://issues.apache.org/jira/browse/LUCENE-2267 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.2, 3.0.1 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-2267.patch Solr has nice artifact signing scripts in its common-build.xml and build.xml. For me as release manager of 3.0 it would have been good to have them also when building lucene artifacts. 
I will investigate how to add them to src artifacts and maven artifacts -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2267) Add solr's artifact signing scripts into lucene's build.xml/common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835217#action_12835217 ] Uwe Schindler commented on LUCENE-2267: --- I forgot: the target has no dependencies on maven running before, or on dist-src/bin. You have to run dist-src, dist-bin and generate-maven-artifacts before; else it would simply sign no files - or rather, it would break, because the dist folder does not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2267) Add solr's artifact signing scripts into lucene's build.xml/common-build.xml
[ https://issues.apache.org/jira/browse/LUCENE-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2267: -- Attachment: LUCENE-2267.patch Updated patch that only requires trunk's minimum ANT version 1.7.0. Secure password input is only available if ant >= 1.7.1 and Java 6 are used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2270) queries with zero boosts don't work
[ https://issues.apache.org/jira/browse/LUCENE-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2270: -- Fix Version/s: 3.1 3.0.1 2.9.2 Assignee: Yonik Seeley queries with zero boosts don't work --- Key: LUCENE-2270 URL: https://issues.apache.org/jira/browse/LUCENE-2270 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9 Reporter: Yonik Seeley Assignee: Yonik Seeley Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2270.patch Queries consisting of only zero boosts result in incorrect results. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834695#action_12834695 ] Uwe Schindler commented on LUCENE-2089: --- Hi Robert, I reviewed your latest patch and was a little bit irritated at first, but everything became clear when I also looked into AutomatonTermsEnum and understood what happens. But there is still some code duplication (not really code duplication, but functionality duplication): - If a constant prefix is used, the generated automata consist of the constant prefix + a Levenshtein automaton (using concat). - If you run such an automaton against the term index using the superclass, it will first seek to the prefix term (or some term starting with the prefix); that's ok. As soon as the prefix is no longer valid, AutomatonTermsEnum stops processing (when running such an automaton using the standard AutomatonTermsEnum). - AutomatonFuzzyTermsEnum checks if the term starts with the prefix and, if not, it ENDs (!) the enumeration. The reason this works is that nextString() in the superclass automatically returns a string starting with the prefix, but this was the main fact that irritated me. - The question now is: is this extra prefix check really needed? Running the automaton against the current term in accept() would simply return no match, and nextString() would stop further processing? Or is this because the accept method should not iterate through all distances once the prefix is not matched? Maybe you should add some comments or asserts to AutomatonFuzzyTermsEnum to show what's happening. 
explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. 
This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online
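The "maximum required K edits" computation mentioned in the proposal above can be sketched as follows. The helper name and the exact rounding are assumptions for illustration (using the FuzzyQuery-style relation similarity = 1 - editDistance / termLength, with any constant prefix ignored for simplicity), not the committed Lucene code:

```java
// Sketch: derive the largest Levenshtein distance K that can still satisfy
// a fuzzy similarity threshold, i.e. the largest K with
// 1 - K/termLength >= minSimilarity.
public class FuzzyMaxEdits {
    static int maxEdits(float minSimilarity, int termLength) {
        return (int) Math.floor((1.0 - minSimilarity) * termLength);
    }

    public static void main(String[] args) {
        // With the default threshold 0.5, a 4-letter term allows up to 2 edits:
        System.out.println(maxEdits(0.5f, 4)); // 2
        // Longer terms tolerate more edits at the same threshold:
        System.out.println(maxEdits(0.5f, 9)); // 4
    }
}
```

This is the up-front step of the proposal: once K is known, one Levenshtein DFA per E in 1..K can be built, and lower-E automata swapped in as the priority queue fills.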
[jira] Created: (LUCENE-2267) Add solr's artifact signing scripts into lucene's build.xml/common-build.xml
Add solr's artifact signing scripts into lucene's build.xml/common-build.xml Key: LUCENE-2267 URL: https://issues.apache.org/jira/browse/LUCENE-2267 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.2, 3.0.1 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1 Solr has nice artifact signing scripts in its common-build.xml and build.xml. For me as release manager of 3.0 it would have be good to have them also when building lucene artifacts. I will investigate how to add them to src artifacts and maven artifacts -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2268) Add test to check maven artifacts and their poms
Add test to check maven artifacts and their poms Key: LUCENE-2268 URL: https://issues.apache.org/jira/browse/LUCENE-2268 Project: Lucene - Java Issue Type: Test Reporter: Uwe Schindler As release manager it is hard to find out if the maven artifacts work correctly. It would be good to have an ant task that executes maven with a pom file that requires all contrib/core artifacts (or one for each contrib), downloads the artifacts from the local dist/maven folder, and builds that test project. This would require maven to execute the build script. Also it should pass the ${version} ANT property to this pom.xml. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts
Hi Grant, inline: On Feb 14, 2010, at 6:45 PM, Uwe Schindler wrote: Hallo Folks, I have posted a release candidate for both Lucene Java 2.9.2 and 3.0.1 (which both have the same bug fix level, functionality and release announcement), built from revision 910082 of the corresponding branches. Thanks for all your help! Please test them and give your votes until Thursday morning, as the scheduled release date for both versions is Friday, Feb 19th, 2010. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned the parallel release with one announcement because of their parallel development / bug fix level, to emphasize that they are equal except for deprecation removal and Java 5 since major version 3. Please also read the attached release announcement (Open Document) and send it back corrected if you miss anything or want to improve my bad English :-) You find the artifacts here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/ Still working through this, but: Why are there SHA1 signatures for the 3.0.1 releases but not 2.9.2? I don't think SHA1 is required (in fact, isn't it cracked?) so it may be fine to just remove it. MD5 is cracked, SHA-1 is not (yet). We have the SHA1 checksums since 3.0 (they are generated by 3.0's build.xml since the upgrade to ANT 1.7 because of fixed ant task definitions). All maven artifacts require SHA1, too, so only 2.9's zip/tgz files are missing them. So I could simply generate them manually for 2.9.2. The current 3.0.0 release on apache.org already has SHA1 checksums, so why remove them now? 
=== Proposed Release Announcement === Hello Lucene users, On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2: Both releases fix bugs in the previous versions: 2.9.2 is the last release working with Java 1.4, still providing all deprecated APIs of the Lucene Java 2.x series. 3.0.1 has the same bug fix level but requires Java 5 and is no longer compatible with code using deprecated APIs. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe new API. Users upgrading from 2.9.x can now remove unnecessary casts and add generics to their code, too. Important improvements in these releases include an increased maximum number of unique terms per index segment. They also fix issues with IndexWriter's commit and lost document deletes in near real-time indexing. Also, lots of bugs in Contrib's Analyzers package were fixed. How about: Several bugs in Contrib's Analyzers package were fixed. Also, do these changes imply reindexing is needed? If so, we should say so. I have to go through this, but reindexing is not required, because the bugs were mostly missing clearAttributes() calls leading to StopFilter integer overflows (with Version.LUCENE_30) - and so crashes during indexing. Robert? As always we preserve index compatibility, so we would not change behavior without adding a new Version enum constant. I will change the wording; Robert already sent me some grammar changes and a better overview using bulleted lists. Thanks for reviewing, Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834163#action_12834163 ] Uwe Schindler commented on LUCENE-124: -- bq. I will wait till after the code freeze and commit this in a few days if no one objects. The code freeze only affects the branches. Trunk is only frozen for fixes that should also go into the branches. Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Assignee: Robert Muir Priority: Minor Attachments: LUCENE-124.patch According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist > 0.5. * dist = levenshteinDistance / min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour: http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue - *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003 *** *** 99,105 } final protected float difference() { ! 
return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { --- 99,109 } final protected float difference() { ! if (distance == 1.0) { ! return 1.0f; ! } ! else ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { *** *** 111,117 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD); /** Finds and returns the smallest of three integers --- 115,121 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD)); /** Finds and returns the smallest of three integers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
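The effect of the candidate patch above (an exact match keeps score 1.0, while inexact variants are squeezed into the 0..0.2 band) can be checked with a small stand-alone sketch; the class and method here are simplified stand-ins for FuzzyTermEnum.difference(), with the constants copied from the patch:

```java
// Sketch of the patched FuzzyTermEnum.difference(): an exact match
// (distance 1.0) keeps full weight, everything else is scaled into (0, 0.2].
public class FuzzyBoost {
    static final double FUZZY_THRESHOLD = 0.5;
    static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD));

    static float difference(double distance) {
        if (distance == 1.0) {
            return 1.0f; // exact match: no fuzzy penalty
        }
        return (float) ((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
    }

    public static void main(String[] args) {
        System.out.println(difference(1.0));  // exact match keeps 1.0
        System.out.println(difference(0.75)); // half-way variant: ~0.1
        System.out.println(difference(0.5));  // at the threshold: 0.0
    }
}
```

With the original SCALE_FACTOR of 1.0f / (1.0f - FUZZY_THRESHOLD), a near-exact variant scores close to 1.0, which is why fuzzy matches could outrank exact ones; the extra 0.2f factor caps variants well below an exact match.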
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834676#action_12834676 ] Uwe Schindler commented on LUCENE-2089: --- bq. this is the patch to improve BasicOperations.concatenate. If the left side is a singleton automaton, then it has only one final state with no outgoing transitions. applying epsilon transitions with the NFA concatenation algorithm when the right side is a DFA always produces a resulting DFA, so mark it as such. Strange that the automaton author did not add this? Have you reported upstream? explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. i modified my wildcard benchmark to generate random fuzzy queries. 
* Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
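The first bullet above (computing the maximum required K edits up front from the user's float threshold) can be sketched as plain arithmetic. This assumes FuzzyQuery's usual convention that similarity is roughly 1 - editDistance / termLength; the method name and exact formula here are an illustration, not the committed implementation:

```java
public class MaxEditsSketch {
    // Assuming similarity ~= 1 - editDistance / termLength, the largest
    // edit distance K that can still reach minSimilarity is:
    static int maxEdits(float minSimilarity, int termLength) {
        return (int) ((1.0f - minSimilarity) * termLength);
    }

    public static void main(String[] args) {
        // With the default threshold of 0.5, a 7-character term tolerates at
        // most 3 edits, so precomputed DFAs for K = 1, 2, 3 would cover it
        // without ever falling back to dumb mode.
        System.out.println(maxEdits(0.5f, 7));
    }
}
```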
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2089: -- Affects Version/s: Flex Branch Fix Version/s: Flex Branch explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. 
So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-329: Assignee: (was: Lucene Developers) Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-124: Assignee: (was: Lucene Developers) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Priority: Minor According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist > 0.5. * dist = levenshteinDistance / length of min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour. http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue -

*** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002
--- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003
***************
*** 99,105 ****
  }
  final protected float difference() {
!   return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
  }
  final public boolean endEnum() {
--- 99,109 ----
  }
  final protected float difference() {
!   if (distance == 1.0) {
!     return 1.0f;
!   }
!   else
!     return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
  }
  final public boolean endEnum() {
***************
*** 111,117 ****
  **/
  public static final double FUZZY_THRESHOLD = 0.5;
! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD);

  /** Finds and returns the smallest of three integers
--- 115,121 ----
  **/
  public static final double FUZZY_THRESHOLD = 0.5;
! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD));

  /** Finds and returns the smallest of three integers

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts
As people.apache.org is down, here is an alternate location with the same artifacts: http://alpha.thetaphi.de/lucene-292-301-take1-rev910082/ Happy testing! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday, February 15, 2010 12:46 AM To: gene...@lucene.apache.org; java-dev@lucene.apache.org Subject: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts Hello folks, I have posted a release candidate for both Lucene Java 2.9.2 and 3.0.1 (which both have the same bug fix level, functionality, and release announcement), built from revision 910082 of the corresponding branches. Thanks for all your help! Please test them and give your votes until Thursday morning, as the scheduled release date for both versions is Friday, Feb 19th, 2010. Only votes from Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned the parallel release with a single announcement because of their parallel development and identical bug fix level, to emphasize that they are equal except for deprecation removal and the move to Java 5 as of major version 3.
Please also read the attached release announcement (Open Document) and send corrections back if anything is missing or you want to improve my bad English :-) You find the artifacts here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/ Maven repo: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/maven/ The changes are here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Contrib-Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Contrib-Changes.html Uwe === Proposed Release Announcement === Hello Lucene users, On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2: Both releases fix bugs in the previous versions: 2.9.2 is the last release working with Java 1.4 and still provides all deprecated APIs of the Lucene Java 2.x series. 3.0.1 has the same bug fix level, but requires Java 5 and is no longer compatible with code using deprecated APIs. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe new API. Users upgrading from 2.9.x can now remove unnecessary casts and add generics to their code, too. An important improvement in these releases is an increased maximum number of unique terms in each index segment. They also fix problems with IndexWriter's commit and with lost document deletes in near real-time indexing. Many bugs in Contrib's Analyzers package were also fixed. Additionally, the 3.0.1 release restored some public methods that were lost during deprecation removal.
If you are using Lucene in a web application environment, you will notice that the new Attribute-based TokenStream API now works correctly with different class loaders. Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.2 if you are using 2.9.1 or 2.9.0, and to 3.0.1 if you are using 3.0.0. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: (LUCENE-1844) Speed up junit tests
At least we should check that no core test sets any static defaults without try...finally! Is there any way inside Eclipse or other IDEs to check this? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Sunday, February 14, 2010 11:43 AM To: java-dev@lucene.apache.org Subject: Re: (LUCENE-1844) Speed up junit tests Wow -- this is MUCH faster! I think we should switch... It seems like we use a batchtest for all core tests, then for all back-compat tests, then once per contrib package? Ie, so ant test-core uses one jvm? I think we should simply fix any badly behaved tests (that don't restore statics). It's impressive we already have no test failures when we do this... I guess our tests are already cleaning things up (though also probably not often changing global state, or, changing it in a way that'd lead other tests to fail). Mike On Sat, Feb 13, 2010 at 5:23 PM, Robert Muir rcm...@gmail.com wrote: On Fri, Nov 27, 2009 at 1:27 PM, Michael McCandless luc...@mikemccandless.com wrote: Also one thing I'd love to try is NOT forking the JVM for each test (fork=no in the junit task). I wonder how much time that'd buy... it shaves off a good deal of time on my machine. 'ant test-core': 4 minutes, 39 seconds -> 3 minutes, 3 seconds 'ant test': 11 minutes, 8 seconds -> 7 minutes, 13 seconds however, it makes me a little nervous because i'm not sure all the tests cleanup nicely if they change statics and stuff.
anyway, here's the trivial patch (you don't want fork=no, because it turns off assertions):

Index: common-build.xml
===================================================================
--- common-build.xml (revision 909395)
+++ common-build.xml (working copy)
@@ -398,7 +398,7 @@
     </condition>
     <mkdir dir="@{junit.output.dir}"/>
     <junit printsummary="off" haltonfailure="no" maxmemory="512M"
-           errorProperty="tests.failed" failureProperty="tests.failed">
+           errorProperty="tests.failed" failureProperty="tests.failed" forkmode="perBatch">
       <classpath refid="@{junit.classpath}"/>
       <assertions>
         <enable package="org.apache.lucene"/>

-- Robert Muir rcm...@gmail.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
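The try...finally discipline discussed above (so a shared JVM never leaks a changed static default into the next test) looks like this. GlobalConfig here is a hypothetical stand-in for any Lucene class with a mutable static default, e.g. a maximum clause count:

```java
// Hypothetical class holding a mutable static default.
class GlobalConfig {
    static int maxClauses = 1024;
}

public class StaticRestoreSketch {
    static void testWithSmallLimit() {
        int saved = GlobalConfig.maxClauses;
        try {
            GlobalConfig.maxClauses = 10; // narrow the limit for this test only
            // ... assertions against the small limit would run here ...
        } finally {
            GlobalConfig.maxClauses = saved; // restored even if the test throws
        }
    }

    public static void main(String[] args) {
        testWithSmallLimit();
        System.out.println(GlobalConfig.maxClauses); // the default is back
    }
}
```

Without the finally block, a failing assertion inside the test would leave the shrunken limit in place for every later test in the same JVM, which is exactly the hazard of forkmode=perBatch.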
[jira] Commented: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833551#action_12833551 ] Uwe Schindler commented on LUCENE-1941: --- As there is no real test available (for the whole class except the ctor, as Mark Miller figured out yesterday), I think the attached fix is ok at the moment and I would like to apply it to 2.9, 3.0 and trunk to release the pending 2.9.2 and 3.0.1. If nobody is against it (Erik?), I would like to apply this patch and release the artifacts for PMC vote this afternoon. I will also open a new issue requesting tests :-) MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9, 3.0 Reporter: Erik Hatcher Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-1941.patch, LUCENE-1941.patch In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
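For context on how a minimum over payload scores can come out as 0 with a single payload, here is one plausible failure mode (an assumption; the issue above does not pin down the root cause): a running minimum seeded with 0 can never rise, so the single payload's score is discarded, whereas seeding with positive infinity (or the first value seen) behaves correctly. A minimal sketch, detached from MinPayloadFunction itself:

```java
public class MinPayloadSketch {
    // Buggy shape: a minimum seeded with 0 never increases,
    // so even a single payload score of 2.5 is reported as 0.
    static float minSeededWithZero(float[] payloadScores) {
        float min = 0f;
        for (float p : payloadScores) min = Math.min(min, p);
        return min;
    }

    // Fixed shape: seed with +infinity and handle the no-payload case explicitly.
    static float minSeededWithInfinity(float[] payloadScores) {
        if (payloadScores.length == 0) return 1f; // no payloads seen
        float min = Float.POSITIVE_INFINITY;
        for (float p : payloadScores) min = Math.min(min, p);
        return min;
    }

    public static void main(String[] args) {
        float[] single = { 2.5f };
        System.out.println(minSeededWithZero(single));     // the reported bug shape: 0.0
        System.out.println(minSeededWithInfinity(single)); // correct: 2.5
    }
}
```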
[jira] Created: (LUCENE-2264) Add missing tests for PayloadXxxQuery
Add missing tests for PayloadXxxQuery - Key: LUCENE-2264 URL: https://issues.apache.org/jira/browse/LUCENE-2264 Project: Lucene - Java Issue Type: Test Components: Search Reporter: Uwe Schindler Priority: Minor Fix For: 3.1 This is a followup for LUCENE-1941 and the discussion in IRC. The Payload queries have no real working tests; especially the Min/Max/Avg functions are untested. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833564#action_12833564 ] Uwe Schindler commented on LUCENE-1941: --- I created LUCENE-2264 for the tests. I will no proceed with applying the patches and merging to 2.9/3.0. MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9, 3.0 Reporter: Erik Hatcher Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-1941.patch, LUCENE-1941.patch In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833564#action_12833564 ] Uwe Schindler edited comment on LUCENE-1941 at 2/14/10 12:52 PM: - I created LUCENE-2264 for the tests. I will now proceed with applying the patches and merging to 2.9/3.0. was (Author: thetaphi): I created LUCENE-2264 for the tests. I will no proceed with applying the patches and merging to 2.9/3.0. MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9, 3.0 Reporter: Erik Hatcher Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-1941.patch, LUCENE-1941.patch In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1941: -- Attachment: LUCENE-1941.patch Patch with CHANGES.txt in the new 3.0.1/2.9.2 section of restructured trunk changes. MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9, 3.0 Reporter: Erik Hatcher Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-1941.patch, LUCENE-1941.patch, LUCENE-1941.patch In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: (LUCENE-1844) Speed up junit tests
That looks exciting! Too bad I don't have IntelliJ; maybe we can use that somehow! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Sunday, February 14, 2010 4:52 PM To: java-dev@lucene.apache.org Subject: RE: (LUCENE-1844) Speed up junit tests Hi Uwe, On 02/14/2010 at 5:53 AM, Uwe Schindler wrote: At least we should check that no core test sets any static defaults without try...finally! Is there any way inside Eclipse or other IDEs to check this? IntelliJ has something called structural search and replace (SSR) - it could probably do something like what you want (I've only used it once, so I'm afraid I can't be of much assistance figuring out an appropriate expression): http://www.jetbrains.com/idea/documentation/ssr.html Steve - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-1941. --- Resolution: Fixed Committed trunk revision: 910034 Committed 3.0 branch revision: 910037 Committed 2.9 branch revision: 910038 MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9, 3.0 Reporter: Erik Hatcher Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-1941.patch, LUCENE-1941.patch, LUCENE-1941.patch In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2266) problem with edgengramtokenfilter and highlighter
[ https://issues.apache.org/jira/browse/LUCENE-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833626#action_12833626 ] Uwe Schindler commented on LUCENE-2266: --- As this patch is really simple, I have no problem with quickly putting it into 2.9.2. Robert, as we are in code freeze, I would like to apply it. problem with edgengramtokenfilter and highlighter - Key: LUCENE-2266 URL: https://issues.apache.org/jira/browse/LUCENE-2266 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.1 Reporter: Joe Calderon Assignee: Robert Muir Priority: Minor Attachments: LUCENE-2266.patch, LUCENE-2266.patch i ran into a problem while using the edgengramtokenfilter, it seems to report incorrect offsets when generating tokens, more specifically all the tokens have offset 0 and term length as start and end, this leads to goofy highlighting behavior when creating edge grams for tokens beyond the first one, i created a small patch that takes into account the start of the original token and adds that to the reported start/end offsets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2266) problem with edgengramtokenfilter and highlighter
[ https://issues.apache.org/jira/browse/LUCENE-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2266: -- Fix Version/s: 3.1 3.0.1 2.9.2 problem with edgengramtokenfilter and highlighter - Key: LUCENE-2266 URL: https://issues.apache.org/jira/browse/LUCENE-2266 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.1 Reporter: Joe Calderon Assignee: Robert Muir Priority: Minor Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2266.patch, LUCENE-2266.patch I ran into a problem while using the EdgeNGramTokenFilter: it seems to report incorrect offsets when generating tokens. More specifically, all the tokens have offset 0 and the term length as start and end, which leads to goofy highlighting behavior when creating edge grams for tokens beyond the first one. I created a small patch that takes the start of the original token into account and adds it to the reported start/end offsets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2266) problem with edgengramtokenfilter and highlighter
[ https://issues.apache.org/jira/browse/LUCENE-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2266. --- Resolution: Fixed Committed trunk revision: 910078 Committed 3.0 revision: 910080 Committed 2.9 revision: 910082 Thanks Joe and Robert. Now I can start the PMC votes for Lucene 2.9.2 and 3.0.1! problem with edgengramtokenfilter and highlighter - Key: LUCENE-2266 URL: https://issues.apache.org/jira/browse/LUCENE-2266 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.1 Reporter: Joe Calderon Assignee: Robert Muir Priority: Minor Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2266.patch, LUCENE-2266.patch I ran into a problem while using the EdgeNGramTokenFilter: it seems to report incorrect offsets when generating tokens. More specifically, all the tokens have offset 0 and the term length as start and end, which leads to goofy highlighting behavior when creating edge grams for tokens beyond the first one. I created a small patch that takes the start of the original token into account and adds it to the reported start/end offsets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts
Hello folks, I have posted a release candidate for both Lucene Java 2.9.2 and 3.0.1 (which both have the same bug fix level, functionality and release announcement), built from revision 910082 of the corresponding branches. Thanks for all your help! Please test them and cast your votes by Thursday morning, as the scheduled release date for both versions is Friday, Feb 19th, 2010. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned the parallel release with a single announcement because of their parallel development and equal bug fix level, to emphasize that they are identical except for the deprecation removal and the Java 5 requirement introduced with major version 3. Please also read the attached release announcement (Open Document) and send it back corrected if you miss anything or want to improve my bad English :-) You can find the artifacts here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/ Maven repo: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/maven/ The changes are here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-2.9.2/Contrib-Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-rev910082/changes-3.0.1/Contrib-Changes.html Uwe === Proposed Release Announcement === Hello Lucene users, On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2: Both releases fix bugs in the previous versions; 2.9.2 is the last release working with Java 1.4 and still provides all deprecated APIs of the Lucene Java 2.x series. 
3.0.1 has the same bug fix level, but requires Java 5 and is no longer compatible with code using deprecated APIs. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe new API. Users upgrading from 2.9.x can now remove unnecessary casts and add generics to their code, too. An important improvement in these releases is an increased maximum number of unique terms in each index segment. They also fix problems with IndexWriter’s commit and with lost document deletes in near real-time indexing. Many bugs in Contrib’s Analyzers package were fixed as well. Additionally, the 3.0.1 release restores some public methods that were lost during deprecation removal. If you are using Lucene in a web application environment, you will notice that the new Attribute-based TokenStream API now works correctly with different class loaders. Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.2 if you are using 2.9.1 or 2.9.0, and to 3.0.1 if you are using 3.0.0. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de release-note.odt Description: application/vnd.oasis.opendocument.text - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2144: -- Fix Version/s: 2.9.2 merge back also to 2.9.2 InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs) - Key: LUCENE-2144 URL: https://issues.apache.org/jira/browse/LUCENE-2144 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.9, 2.9.1, 3.0 Reporter: Karl Wettin Assignee: Michael McCandless Priority: Critical Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2144-30.patch, LUCENE-2144.txt This patch contains core changes, so someone else needs to commit it. Due to the incompatible #termDocs(null) behaviour, at least MatchAllDocsQuery, FieldCacheRangeFilter and ValueSourceQuery fail when using II since 2.9. AllTermDocs now has a superclass, AbstractAllTermDocs, which InstantiatedAllTermDocs also extends. Also: * II tests made less likely to pass on future incompatible changes to TermDocs and TermEnum * IITermDocs#skipTo and #next mimic the document positioning behaviour of their SegmentTermDocs counterparts when returning false * II now uses BitVector rather than sets for deleted documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2165) SnowballAnalyzer lacks a constructor that takes a Set of Stop Words
[ https://issues.apache.org/jira/browse/LUCENE-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2165: -- Fix Version/s: 2.9.2 backport SnowballAnalyzer lacks a constructor that takes a Set of Stop Words --- Key: LUCENE-2165 URL: https://issues.apache.org/jira/browse/LUCENE-2165 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.1, 3.0 Reporter: Nick Burch Assignee: Robert Muir Priority: Minor Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2165.patch As discussed on the java-user list, the SnowballAnalyzer has been updated to use a Set of stop words. However, there is no constructor which accepts a Set; there is only the original String[] one. This is an issue, because most of the common sources of stop words (eg StopAnalyzer) have deprecated their String[] stop word lists, and moved over to Sets (eg StopAnalyzer.ENGLISH_STOP_WORDS_SET). So, for now, you either have to use a deprecated field on StopAnalyzer, or manually turn the Set into an array so you can pass it to the SnowballAnalyzer. I would suggest that a constructor is added to SnowballAnalyzer which accepts a Set. Not sure if the old String[] one should be deprecated or not. A sample patch against 2.9.1 to add the constructor is:
--- SnowballAnalyzer.java.orig 2009-12-15 11:14:08.0 +
+++ SnowballAnalyzer.java 2009-12-14 12:58:37.0 +
@@ -67,6 +67,12 @@
     stopSet = StopFilter.makeStopSet(stopWords);
   }
+
+  /** Builds the named analyzer with the given stop words. */
+  public SnowballAnalyzer(Version matchVersion, String name, Set stopWordsSet) {
+    this(matchVersion, name);
+    stopSet = stopWordsSet;
+  }
+
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
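The workaround the reporter describes (turning the Set back into an array) can be shown without any Lucene classes. This sketch elides the analyzer itself; the class and method names here are hypothetical stand-ins for illustration only:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the workaround described above (analyzer elided):
// code holding stop words as a Set, the form constants like
// StopAnalyzer.ENGLISH_STOP_WORDS_SET use, must copy them into a String[]
// to reach the existing array-only constructor. The proposed Set-taking
// constructor would make this copy unnecessary.
public class StopWordsWorkaround {

    // The extra copy that the patch in this issue eliminates.
    public static String[] toArray(Set<String> stopWords) {
        return stopWords.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "an"));
        String[] forOldCtor = toArray(stops);
        System.out.println(forOldCtor.length);
    }
}
```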
[jira] Assigned: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-1941: - Assignee: Uwe Schindler MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9, 3.0 Reporter: Erik Hatcher Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-1941.patch, LUCENE-1941.patch In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Release of 2.9.2 and 3.0.1 in progress - commit freeze
Hi all, the release of 2.9.2 and 3.0.1 is in progress. I merged all CHANGES.txt entries, merged the remaining bugfixes and prepared the version number in both branches. The only missing fix is https://issues.apache.org/jira/browse/LUCENE-1941, which is in progress; I will backport and commit it when the tests are finished. Please do not commit anything to the branches or trunk if it is fix-for 2.9.x or 3.0.x. All other changes can be committed. I will create the final artifacts for the vote tomorrow and plan to release on Friday. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: Release of 2.9.2 and 3.0.1 in progress - commit freeze
Please do not commit anything to the branches and trunk, if it is fix-for 2.9.x or 3.0.x. All other changes can be committed. Of course, such other changes may only be committed to *trunk*, and only if they are not also fix-for 2.9 and 3.0. :-) - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832928#action_12832928 ] Uwe Schindler commented on LUCENE-1941: --- Hi Erik, I want to release 2.9.2 and 3.0.1; is there any problem? I would change this to fix-for 3.1 only; otherwise it should be fix-for both 2.9.2 and 3.0.1. MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9 Reporter: Erik Hatcher Fix For: 3.0.1, 3.1 In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1941: -- Affects Version/s: 3.0 Fix Version/s: 2.9.2 MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9, 3.0 Reporter: Erik Hatcher Fix For: 2.9.2, 3.0.1, 3.1 In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2080) Improve the documentation of Version
[ https://issues.apache.org/jira/browse/LUCENE-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832949#action_12832949 ] Uwe Schindler commented on LUCENE-2080: --- We should add a note in CHANGES.txt in 3.0 and 2.9 branch as this is an API change. Something like: Deprecated Version.LUCENE_CURRENT constant... with the reason phrases from above Improve the documentation of Version Key: LUCENE-2080 URL: https://issues.apache.org/jira/browse/LUCENE-2080 Project: Lucene - Java Issue Type: Task Components: Javadocs Reporter: Robert Muir Assignee: Robert Muir Priority: Trivial Fix For: 2.9.2, 3.0, 3.1 Attachments: LUCENE-2080.patch, LUCENE-2080.patch, LUCENE-2080.patch, LUCENE-2080.patch In my opinion, we should elaborate more on the effects of changing the Version parameter. Particularly, changing this value, even if you recompile your code, likely involves reindexing your data. I do not think this is adequately clear from the current javadocs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders ar
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832951#action_12832951 ] Uwe Schindler commented on LUCENE-2260: --- I'll commit this soon! AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are used (e.g. in the Solr plugins classloader) - Key: LUCENE-2260 URL: https://issues.apache.org/jira/browse/LUCENE-2260 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9.1, 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2260-lucene29.patch, LUCENE-2260.patch, LUCENE-2260.patch When working on the dynamic proxy classes using cglib/javassist, I noticed a problem in the caching code inside AttributeSource: - AttributeSource has a static (!) cache map that holds implementation classes for attributes, to be faster on creating new attributes (reflection cost) - AttributeSource has a static (!) cache map that holds a list of all interfaces implemented by a specific AttributeImpl Also: - VirtualMethod in 3.1 holds a map of implementation distances keyed by subclasses of the deprecated API Both have the problem that this strong reference lives inside Lucene's classloader and so persists as long as Lucene lives. The referenced classes can therefore never be unloaded, which would be fine if everything lived in the same classloader. As soon as the Attribute or implementation class or the subclass of the deprecated API is loaded by a different classloader (e.g. Lucene lives in the bootclasspath of Tomcat, but the Lucene consumer with custom attributes lives in a webapp), they can never be unloaded, because a reference exists. Libraries like CGLIB, Javassist, or the JDK's reflect.Proxy have a similar cache for generated class files. They also manage it with a WeakHashMap. 
The cache will always work perfectly and no class will be evicted without reason, as classes are only unloaded when their classloader goes away, and that only happens on request (e.g. by Tomcat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
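The weak-caching pattern discussed in this issue can be sketched with plain JDK classes. This is a hedged illustration of the general technique, not Lucene's actual AttributeSource code; the class name and the cached computation (collecting implemented interfaces, as the second cache in the issue does) are stand-ins:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.WeakHashMap;

// Sketch of the caching pattern described above (not Lucene's actual code):
// a static cache keyed by Class uses a WeakHashMap, with values behind
// WeakReferences, so a cache entry never pins a class (and therefore its
// defining classloader) in memory.
public class WeakInterfaceCache {

    // Weak keys: once a class is otherwise unreachable, its entry can be
    // collected and the classloader that defined it becomes unloadable.
    private static final Map<Class<?>, WeakReference<List<Class<?>>>> cache =
        Collections.synchronizedMap(
            new WeakHashMap<Class<?>, WeakReference<List<Class<?>>>>());

    // Computes (and caches) all interfaces implemented by clazz and its
    // superclasses, mirroring the "list of all interfaces" cache described.
    public static List<Class<?>> interfacesOf(Class<?> clazz) {
        WeakReference<List<Class<?>>> ref = cache.get(clazz);
        List<Class<?>> cached = (ref == null) ? null : ref.get();
        if (cached != null) return cached;
        Set<Class<?>> found = new LinkedHashSet<>();
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            Collections.addAll(found, c.getInterfaces());
        }
        List<Class<?>> result = new ArrayList<>(found);
        // Values must be weak too: a strong value referencing classes from
        // the key's classloader would defeat the weak keys.
        cache.put(clazz, new WeakReference<List<Class<?>>>(result));
        return result;
    }
}
```

The crucial detail, as the comment notes, is that nothing in the map holds a strong reference back to the key's classloader; otherwise the WeakHashMap's weak keys buy nothing.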
[jira] Resolved: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2260. --- Resolution: Fixed Committed trunk revision: 909360 Committed 2.9 revision: 909361 Committed 3.0 revision: 909365 AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are used (e.g. in the Solr plugins classloader) - Key: LUCENE-2260 URL: https://issues.apache.org/jira/browse/LUCENE-2260 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9.1, 3.0 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2260-lucene29.patch, LUCENE-2260.patch, LUCENE-2260.patch When working on the dynamic proxy classes using cglib/javassist, I noticed a problem in the caching code inside AttributeSource: - AttributeSource has a static (!) cache map that holds implementation classes for attributes, to be faster on creating new attributes (reflection cost) - AttributeSource has a static (!) cache map that holds a list of all interfaces implemented by a specific AttributeImpl Also: - VirtualMethod in 3.1 holds a map of implementation distances keyed by subclasses of the deprecated API Both have the problem that this strong reference lives inside Lucene's classloader and so persists as long as Lucene lives. The referenced classes can therefore never be unloaded, which would be fine if everything lived in the same classloader. As soon as the Attribute or implementation class or the subclass of the deprecated API is loaded by a different classloader (e.g. Lucene lives in the bootclasspath of Tomcat, but the Lucene consumer with custom attributes lives in a webapp), they can never be unloaded, because a reference exists. Libraries like CGLIB, Javassist, or the JDK's reflect.Proxy have a similar cache for generated class files. They also manage it with a WeakHashMap. 
The cache will always work perfectly and no class will be evicted without reason, as classes are only unloaded when their classloader goes away, and that only happens on request (e.g. by Tomcat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: LUCENE-2154-javassist.patch LUCENE-2154-cglib.patch Here is the last CGLIB patch for reference. Now the really cool class, created using JAVASSIST [http://www.javassist.org/]: You have to place the latest javassist.jar (Mozilla/LGPL licensed) in the lib/ folder and apply the patch. It builds the fastest proxy we can think of: it creates a subclass of ProxyAttributeImpl that implements all methods of the interface natively in bytecode using JAVASSIST's bytecode generation tools (a subset of the Java language spec). The micro-benchmark shows no difference between the proxied and native methods, as HotSpot removes the extra method call. With Javassist it would even be possible to create classes that implement our interfaces around simple fields that are set by getters/setters, just like Eclipse's create get/set around a private field. That would be really cool. Or we could create combining attributes on the fly; Michael Busch would be excited. All *Impl classes we currently have would be almost obsolete (except TermAttributeImpl, which is rather complex). We could also create dynamic State classes for capturing state... Nice, but a little bit hackish. Maybe we should first put this into contrib and supply a ConcatenatingTokenStream as a demo impl, along with other Solr TokenStreams that are no longer easy to write with the Attributes API without proxies (Robert listed some). 
Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers --- Key: LUCENE-2154 URL: https://issues.apache.org/jira/browse/LUCENE-2154 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Fix For: Flex Branch Attachments: LUCENE-2154-cglib.patch, LUCENE-2154-javassist.patch, LUCENE-2154.patch, LUCENE-2154.patch The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum levels, for a codec to set custom attrs. But, it's currently broken for Dir/MultiReader, which must somehow share attrs across all the sub-readers. Somehow we must make a single attr source, and tell each sub-reader's enum to use that instead of creating its own. Hopefully Uwe can work some magic here :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: (was: LUCENE-2154-javassist.patch) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers --- Key: LUCENE-2154 URL: https://issues.apache.org/jira/browse/LUCENE-2154 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Fix For: Flex Branch Attachments: LUCENE-2154-cglib.patch, LUCENE-2154-javassist.patch, LUCENE-2154.patch, LUCENE-2154.patch The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum levels, for a codec to set custom attrs. But, it's currently broken for Dir/MultiReader, which must somehow share attrs across all the sub-readers. Somehow we must make a single attr source, and tell each sub-reader's enum to use that instead of creating its own. Hopefully Uwe can work some magic here :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: LUCENE-2154-javassist.patch Better patch without classloader problems. Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers --- Key: LUCENE-2154 URL: https://issues.apache.org/jira/browse/LUCENE-2154 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Fix For: Flex Branch Attachments: LUCENE-2154-cglib.patch, LUCENE-2154-javassist.patch, LUCENE-2154.patch, LUCENE-2154.patch The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum levels, for a codec to set custom attrs. But, it's currently broken for Dir/MultiReader, which must somehow share attrs across all the sub-readers. Somehow we must make a single attr source, and tell each sub-reader's enum to use that instead of creating its own. Hopefully Uwe can work some magic here :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833181#action_12833181 ] Uwe Schindler commented on LUCENE-2261: --- Hi Robert, the patch looks good, all tests pass, nothing to complain about from the MTQ police :-) There is only one thing, unrelated to this issue: it makes no sense to declare IllegalArgumentExceptions in throws clauses, as they are unchecked; I would remove them. configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size - Key: LUCENE-2261 URL: https://issues.apache.org/jira/browse/LUCENE-2261 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: Flex Branch, 3.1 Attachments: LUCENE-2261.patch, LUCENE-2261.patch, LUCENE-2261.patch, LUCENE-2261.patch MultiTermQuery has a TopTermsScoringBooleanRewrite that uses a priority queue to expand the query to the top-N terms. Currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy query to at most only the 50 closest terms. At a glance, it seems one way would be to expose TopTermsScoringBooleanRewrite (it is private right now) and add a ctor to it, so a MultiTermQuery can instantiate one with its own limit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
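The bounded-priority-queue idea behind this rewrite mode can be sketched with stdlib classes. This is an illustrative stand-in, not Lucene's API: the method name, the score map, and the Float scores are all simplifications for the sake of the example:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative sketch of the top-N idea behind TopTermsScoringBooleanRewrite
// (names and types here are simplified stand-ins, not Lucene's API): while
// enumerating candidate terms, keep only the N best-scoring ones in a bounded
// min-heap, so e.g. a fuzzy query expands to at most the 50 closest terms.
public class TopNTerms {

    public static List<String> topN(Map<String, Float> scoredTerms, int n) {
        // Min-heap on score: the root is always the weakest surviving term,
        // so a better incoming term simply replaces it in O(log n).
        PriorityQueue<Map.Entry<String, Float>> pq = new PriorityQueue<>(n,
            new Comparator<Map.Entry<String, Float>>() {
                public int compare(Map.Entry<String, Float> a,
                                   Map.Entry<String, Float> b) {
                    return Float.compare(a.getValue(), b.getValue());
                }
            });
        for (Map.Entry<String, Float> e : scoredTerms.entrySet()) {
            if (pq.size() < n) {
                pq.offer(e);
            } else if (Float.compare(pq.peek().getValue(), e.getValue()) < 0) {
                pq.poll();   // drop the current weakest survivor
                pq.offer(e); // and keep the better term instead
            }
        }
        List<String> best = new ArrayList<>();
        while (!pq.isEmpty()) best.add(pq.poll().getKey());
        return best; // weakest of the survivors first
    }
}
```

The heap never grows past N entries, which is exactly why a configurable N (instead of the hardcoded BooleanQuery.getMaxClauseCount()) bounds both the clause count and the memory used during rewriting.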
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: LUCENE-2154-javassist.patch More cool, less casts, more speed. Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers --- Key: LUCENE-2154 URL: https://issues.apache.org/jira/browse/LUCENE-2154 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Fix For: Flex Branch Attachments: LUCENE-2154-cglib.patch, LUCENE-2154-javassist.patch, LUCENE-2154-javassist.patch, LUCENE-2154.patch, LUCENE-2154.patch The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum levels, for a codec to set custom attrs. But, it's currently broken for Dir/MultiReader, which must somehow share attrs across all the sub-readers. Somehow we must make a single attr source, and tell each sub-reader's enum to use that instead of creating its own. Hopefully Uwe can work some magic here :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: Build failed in Hudson: Lucene-trunk #1091
Really cool: This time we hit the failure during the clover run: [junit] Testsuite: org.apache.lucene.index.TestIndexWriterMergePolicy [junit] Tests run: 6, Failures: 1, Errors: 0, Time elapsed: 28.519 sec [junit] [junit] Testcase: testMaxBufferedDocsChange(org.apache.lucene.index.TestIndexWriterMergePolicy): FAILED [junit] maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62 [junit] junit.framework.AssertionFailedError: maxMergeDocs=2147483647; numSegments=11; upperBound=10; mergeFactor=10; segs=_65:c5950 _5t:c10-_32 _5u:c10-_32 _5v:c10-_32 _5w:c10-_32 _5x:c10-_32 _5y:c10-_32 _5z:c10-_32 _60:c10-_32 _61:c10-_32 _62:c9-_32 _64:c1-_62 [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.checkInvariants(TestIndexWriterMergePolicy.java:234) [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.__CLR2_6_3zf7i0317qu(TestIndexWriterMergePolicy.java:164) [junit] at org.apache.lucene.index.TestIndexWriterMergePolicy.testMaxBufferedDocsChange(TestIndexWriterMergePolicy.java:125) [junit] at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:214) [junit] [junit] [junit] Test org.apache.lucene.index.TestIndexWriterMergePolicy FAILED We also get the exact location of the failure: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1091/clover-report/org/apache/lucene/index/TestIndexWriterMergePolicy.html?line=125#src-125 And you can see which lines are called how often until the failure occurred! 
Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org] Sent: Thursday, February 11, 2010 5:16 AM To: java-dev@lucene.apache.org Subject: Build failed in Hudson: Lucene-trunk #1091 See http://hudson.zones.apache.org/hudson/job/Lucene- trunk/1091/changes Changes: [uschindler] LUCENE-2248: Change core tests to use a global Version constant [uschindler] LUCENE-2258: Remove unneeded synchronization in FuzzyTermEnum -- [...truncated 23095 lines...] [javadoc] Loading source files for package org.apache.lucene.queryParser.standard... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.builders... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.config... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.nodes... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.parser... [javadoc] Loading source files for package org.apache.lucene.queryParser.standard.processors... [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_22 [javadoc] Building tree for all the packages and classes... [javadoc] Building index for all the packages and classes... [javadoc] Building index for all classes... [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene- trunk/ws/trunk/build/docs/api/contrib-queryparser/stylesheet.css... [javadoc] Note: Custom tags that were not seen: @lucene.experimental, @lucene.internal [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene- trunk/ws/trunk/build/contrib/queryparser/lucene-queryparser-2010-02- 11_02-03-57-javadoc.jar [echo] Building regex... 
javadocs: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- trunk/ws/trunk/build/docs/api/contrib-regex [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files for package org.apache.lucene.search.regex... [javadoc] Loading source files for package org.apache.regexp... [javadoc] Constructing Javadoc information... [javadoc] Standard Doclet version 1.5.0_22 [javadoc] Building tree for all the packages and classes... [javadoc] Building index for all the packages and classes... [javadoc] Building index for all classes... [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene- trunk/ws/trunk/build/docs/api/contrib-regex/stylesheet.css... [javadoc] Note: Custom tags that were not seen: @lucene.experimental, @lucene.internal [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene- trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-02-11_02-03-57- javadoc.jar [echo] Building remote... javadocs: [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- trunk/ws/trunk/build/docs/api/contrib-remote [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files for package org.apache.lucene.search... [javadoc] Constructing Javadoc information
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: LUCENE-2154.patch

Here is a first patch for cglib-generated proxy attributes. On IRC we found out yesterday that the proposed idea of sharing the attributes across all Multi*Enums would cause problems: a call to next() on any sub-enum would overwrite the contents of the attributes of the previous sub-enum, which would break TermsEnum (e.g. TermsEnum looks ahead by calling next() on all sub-enums and choosing the lowest term to return - after calling each enum's next(), the attributes of the first enums cannot be restored without captureState & Co., as they were overwritten by the next() call on the last enum).

This patch needs cglib-nodep-2.2.jar placed in the lib folder of the checkout [http://sourceforge.net/projects/cglib/files/cglib2/2.2/cglib-nodep-2.2.jar/download]. It contains a test that shows the usage. The central part is cglib's Enhancer, which creates a dynamic class extending ProxyAttributeImpl (which implements the general AttributeImpl methods by delegating to the delegate) and implementing the requested Attribute interface using a MethodInterceptor.

Please note: this uses no reflection (only during the in-memory class file creation, which runs just once, when the proxy class is loaded). The proxy implements MethodInterceptor and uses the fast MethodProxy class (also generated by cglib for each proxied method), so it can invoke the delegated method directly (without reflection) on the delegate.

The test verifies that everything works and also compares the speed of using a TermAttribute natively and proxied. The proxied version is slower (not because of reflection, but because the MethodInterceptor creates an array of parameters and boxes/unboxes primitive parameters into the Object[]), but in the test case I saw only about 50% more time needed.
The generated classes are cached and reused (like DEFAULT_ATTRIBUTE_FACTORY does). To get maximum speed without external libraries, the code generated by Enhancer could be rewritten natively, using the Apache Harmony java.lang.reflect.Proxy implementation source code as a basis. The hardest part of generating bytecode is the ConstantPool in class files, but as the proxy methods simply delegate and no magic like boxing/unboxing is needed, the generated bytecode is rather simple.

One other use case for these proxies is AppendingTokenStream, which has not been possible since 3.0 without captureState (in the old TS API it was possible, because you could reuse the same Token instance even across the appended streams). In the new TS API, the appending stream must have a view on the attributes of the currently consumed sub-stream.

Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers --- Key: LUCENE-2154 URL: https://issues.apache.org/jira/browse/LUCENE-2154 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Fix For: Flex Branch Attachments: LUCENE-2154.patch

The flex API allows extensibility at the Fields/Terms/Docs/PositionsEnum levels, for a codec to set custom attrs. But, it's currently broken for Dir/MultiReader, which must somehow share attrs across all the sub-readers. Somehow we must make a single attr source, and tell each sub-reader's enum to use that instead of creating its own. Hopefully Uwe can work some magic here :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
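The delegation idea behind these generated proxies can be sketched with the JDK's own java.lang.reflect.Proxy (mentioned above as an alternative basis). This is a minimal illustration, not Lucene's actual code: TermAttr and TermAttrImpl are hypothetical stand-ins for an attribute interface and its implementation, and the reflective dispatch in the handler is exactly the cost that cglib's generated MethodProxy avoids.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Hypothetical stand-in for an attribute interface like TermAttribute.
interface TermAttr {
    String term();
    void setTerm(String t);
}

// Delegate implementation holding the actual state.
class TermAttrImpl implements TermAttr {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String t) { term = t; }
}

public class ProxyAttrDemo {
    // Creates a view that forwards every interface call to the delegate.
    // java.lang.reflect.Proxy dispatches reflectively; cglib's MethodProxy
    // avoids that reflective cost, but the delegation idea is the same.
    static TermAttr proxyFor(TermAttr delegate) {
        InvocationHandler h = (proxy, method, args) -> method.invoke(delegate, args);
        return (TermAttr) Proxy.newProxyInstance(
                TermAttr.class.getClassLoader(), new Class<?>[] { TermAttr.class }, h);
    }

    public static void main(String[] args) {
        TermAttrImpl delegate = new TermAttrImpl();
        TermAttr view = proxyFor(delegate);
        view.setTerm("lucene");              // write goes through to the delegate
        System.out.println(delegate.term()); // prints "lucene"
    }
}
```

Such a proxy gives the appending or multi-enum consumer a live view on whichever delegate is current, without copying state via captureState.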
[jira] Updated: (LUCENE-2154) Need a clean way for Dir/MultiReader to merge the AttributeSources of the sub-readers
[ https://issues.apache.org/jira/browse/LUCENE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2154: -- Attachment: LUCENE-2154.patch

I had some more fun: made ProxyAttributeSource non-final and added a class-name policy that also includes the corresponding interface (to make stack traces on errors nicer). Here is the example output:

{noformat}
[junit] DEBUG: Created class org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TermAttribute$$EnhancerByCGLIB$$6100bdf9 for attribute org.apache.lucene.analysis.tokenattributes.TermAttribute
[junit] DEBUG: Created class org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TypeAttribute$$EnhancerByCGLIB$$6f89c3ff for attribute org.apache.lucene.analysis.tokenattributes.TypeAttribute
[junit] DEBUG: Created class org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$FlagsAttribute$$EnhancerByCGLIB$$4668733c for attribute org.apache.lucene.analysis.tokenattributes.FlagsAttribute
[junit] Time taken using org.apache.lucene.analysis.tokenattributes.TermAttributeImpl:
[junit] 1476.090658 ms for 1000 iterations
[junit] Time taken using org.apache.lucene.util.ProxyAttributeSource$ProxyAttributeImpl$$TermAttribute$$EnhancerByCGLIB$$6100bdf9:
[junit] 1881.295734 ms for 1000 iterations
{noformat}
[jira] Created: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are used (e.g. in the Solr plugins classloader) - Key: LUCENE-2260 URL: https://issues.apache.org/jira/browse/LUCENE-2260 Project: Lucene - Java Issue Type: Bug Affects Versions: 3.0, 2.9.1 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9.2, 3.0.1, 3.1

While working on the dynamic proxy classes using cglib/Javassist, I recognized a problem in the caching code inside AttributeSource:
- AttributeSource has a static (!) cache map that holds the implementation classes for attributes, to be faster when creating new attributes (saving reflection cost)
- AttributeSource has a static (!) cache map that holds a list of all interfaces implemented by a specific AttributeImpl

Also:
- VirtualMethod in 3.1 holds a map of implementation distances keyed by subclasses of the deprecated API

Both have the problem that these strong references live inside Lucene's classloader and therefore persist as long as Lucene lives. The referenced classes can never be unloaded, which would be fine if everything lived in the same classloader. But as soon as the Attribute or implementation class, or the subclass of the deprecated API, is loaded by a different classloader (e.g. Lucene lives in the boot classpath of Tomcat, but the Lucene consumer with custom attributes lives in a webapp), those classes can never be unloaded, because a reference still exists.

Libs like CGLIB, Javassist, or the JDK's reflect.Proxy have a similar cache for generated class files. They manage it with a WeakHashMap. The cache still always works perfectly, and no class is evicted without reason, as classes are only unloaded when their classloader goes away, and that only happens on request (e.g. by Tomcat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
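The fix direction described above (keying the caches weakly, as cglib/Javassist/reflect.Proxy do) can be sketched as follows. This is an illustration under assumptions, not AttributeSource's actual code: the class and method names here are made up, and a real fix also needs a weak value reference, since a strongly held implementation class would pin its classloader and defeat the WeakHashMap key.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative weak cache: the WeakHashMap key (attribute interface class) and
// the WeakReference value (implementation class) both let a foreign
// classloader be collected once the webapp is undeployed.
public final class WeakAttributeCache {
    private static final Map<Class<?>, WeakReference<Class<?>>> CACHE =
            new WeakHashMap<>();

    // Returns the cached implementation class, or null if absent or collected.
    static synchronized Class<?> lookup(Class<?> attInterface) {
        WeakReference<Class<?>> ref = CACHE.get(attInterface);
        return ref == null ? null : ref.get();
    }

    static synchronized void store(Class<?> attInterface, Class<?> implClass) {
        CACHE.put(attInterface, new WeakReference<>(implClass));
    }

    public static void main(String[] args) {
        store(CharSequence.class, String.class);
        System.out.println(lookup(CharSequence.class) == String.class); // true
    }
}
```

A caller that just created or loaded the implementation class still holds a strong reference to it, so a lookup immediately after store cannot observe a cleared reference.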
[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2260: -- Attachment: LUCENE-2260.patch

Attached patch. I will commit this in a day and also merge it to 2.9 and 3.0 (without VirtualMethod), as this is a resource leak. This problem is similar to LUCENE-2182.
[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2260: -- Attachment: LUCENE-2260.patch

Improved patch; now all class references are weak. The assumption that the WeakReference inside addAttributeImpl is never null holds, because the calling code keeps a strong reference to the implementing class.
[jira] Updated: (LUCENE-2260) AttributeSource holds strong reference to class instances and prevents unloading e.g. in Solr if webapplication reload and custom attributes in separate classloaders are
[ https://issues.apache.org/jira/browse/LUCENE-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2260: -- Attachment: LUCENE-2260-lucene29.patch

Patch for the 2.9 branch (without Java 5 generics).
[jira] Commented: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832607#action_12832607 ] Uwe Schindler commented on LUCENE-2261: ---

The patch looks good. A few remarks because of Serializable:
- The readResolve method must go onto the singleton constant, which should also throw UOE when modified
- equals/hashCode is needed for the rewrite mode, else FuzzyQuery & Co. would no longer compare equal

It could be solved like the AutoRewrite mode with its unmodifiable constant. I know: queries are a pain because of Serializable. +1 on adding a param to the FuzzyQuery ctor.

configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size - Key: LUCENE-2261 URL: https://issues.apache.org/jira/browse/LUCENE-2261 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Priority: Minor Fix For: Flex Branch, 3.1 Attachments: LUCENE-2261.patch

MultiTermQuery has a TopTermsScoringBooleanRewrite that uses a priority queue to expand the query to the top-N terms. Currently N is hardcoded at BooleanQuery.getMaxClauseCount(), but it would be nice to be able to set this for top-N MultiTermQueries: e.g. expand a fuzzy query to at most only the 50 closest terms. At a glance, one way would be to expose TopTermsScoringBooleanRewrite (it is private right now) and add a ctor to it, so a MultiTermQuery can instantiate one with its own limit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
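The readResolve pattern requested for the singleton constant can be sketched like this. The class name is hypothetical (Lucene's real rewrite-mode constants live on MultiTermQuery); the point is that deserialization hands back the canonical instance, so identity comparisons in FuzzyQuery & Co. keep working.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

// Hypothetical serializable rewrite-mode constant. readResolve maps any
// deserialized copy back onto the canonical instance, so == comparisons
// (and equals/hashCode built on them) survive serialization.
public class RewriteModeConstant implements Serializable {
    public static final RewriteModeConstant INSTANCE = new RewriteModeConstant();
    private RewriteModeConstant() {}

    private Object readResolve() throws ObjectStreamException {
        return INSTANCE;
    }

    // Serializes and deserializes the constant, reporting whether the
    // round-tripped object is the very same singleton.
    static boolean roundTripIsSame() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(INSTANCE);
            Object back = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray())).readObject();
            return back == INSTANCE;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripIsSame()); // true
    }
}
```

Without readResolve, each deserialization would create a fresh instance and the constant would no longer compare equal by identity.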
[jira] Commented: (LUCENE-2261) configurable MultiTermQuery TopTermsScoringBooleanRewrite pq size
[ https://issues.apache.org/jira/browse/LUCENE-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832683#action_12832683 ] Uwe Schindler commented on LUCENE-2261: ---

Looks good at first glance. I have not tried the patch yet; will do soon.
[jira] Commented: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831901#action_12831901 ] Uwe Schindler commented on LUCENE-2111: ---

Mike: I reviewed this EmptyTermsEnum in MTQ. I would leave it in, but simply make EmptyTermsEnum a singleton (which is perfectly fine, because it is stateless). Returning null here buys no performance in MTQs; it only makes the code in MTQ#rewrite and MTQWF#getDocIdSet ugly. The biggest problem with returning null is the backwards layer, which must then be fixed (because it checks whether getTermsEnum returns null and falls back to the FilteredTermEnum from trunk). If you really want null, getTermsEnum should by default (if not overridden) throw UOE, and the rewrite code should catch that UOE and only then delegate to the backwards layer.

Wrapup flexible indexing Key: LUCENE-2111 URL: https://issues.apache.org/jira/browse/LUCENE-2111 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: Flex Branch Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111.patch, LUCENE-2111_fuzzy.patch

Spinoff from LUCENE-1458. The flex branch is in fairly good shape -- all tests pass, initial search performance testing looks good, it survived several visits from the Unicode policeman ;) But it still has a number of nocommits, could use some more scrutiny, especially on the emulate-old-API-on-flex-index (and vice versa) code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self-contained fixes. The end is in sight! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
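The UOE-based fallback suggested in the comment above can be sketched as follows. These class and method names are illustrative stand-ins, not MultiTermQuery's real signatures: the new-style method throws UnsupportedOperationException unless overridden, and the rewrite path catches it to delegate to the backwards layer.

```java
// Illustrative stand-in for MultiTermQuery's rewrite dispatch.
abstract class QueryBase {
    // New-style method: the default throws UOE when a subclass does not override it.
    protected Object getTermsEnum() {
        throw new UnsupportedOperationException("getTermsEnum not overridden");
    }

    // Deprecated old-style enum, used by the backwards layer.
    protected Object getLegacyEnum() { return "legacy"; }

    // Rewrite tries the new API first and falls back only on UOE.
    final Object rewrite() {
        try {
            return getTermsEnum();
        } catch (UnsupportedOperationException uoe) {
            return getLegacyEnum();
        }
    }
}

class LegacyQuery extends QueryBase {}              // relies on the fallback

class FlexQuery extends QueryBase {                 // overrides the new API
    @Override protected Object getTermsEnum() { return "flex"; }
}

public class RewriteFallbackDemo {
    public static void main(String[] args) {
        System.out.println(new LegacyQuery().rewrite()); // prints "legacy"
        System.out.println(new FlexQuery().rewrite());   // prints "flex"
    }
}
```

Compared with a null return, the UOE signal keeps the rewrite code free of null checks while still letting old subclasses work unchanged.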
[jira] Created: (LUCENE-2258) Remove synchonized from FuzzyTermEnum#similarity(final String target)
Remove synchonized from FuzzyTermEnum#similarity(final String target) --- Key: LUCENE-2258 URL: https://issues.apache.org/jira/browse/LUCENE-2258 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Uwe Schindler Assignee: Uwe Schindler Priority: Trivial Fix For: 2.9.2, Flex Branch, 3.0.1, 3.1

The similarity method in FuzzyTermEnum is synchronized, which is stupid, because:
- TermEnums follow the iterator pattern and are therefore single-threaded by definition
- The method is private, so nobody could ever create a fake FuzzyTermEnum just to get this method and use it multithreaded
- The method is not static and has no static fields, so instances do not affect each other

The root of this comes from LUCENE-296, but it was never reviewed and simply committed. The argument for making it synchronized is wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2258) Remove synchonized from FuzzyTermEnum#similarity(final String target)
[ https://issues.apache.org/jira/browse/LUCENE-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2258: -- Attachment: LUCENE-2258.patch

Patch.
[jira] Updated: (LUCENE-2111) Wrapup flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2111: -- Attachment: LUCENE-2111-EmptyTermsEnum.patch

Here is the EmptyTermsEnum singleton patch (against flex trunk).