Re: lucene 2.9 sorting algorithm
Hi Mike:

I have been playing with the patch, and I think I have some information that you might like. Let me spend some time gathering more numbers and update in jira.

Thanks

btw: About the conversion on multi-valued fields, I am not sure I get it (sorry for being ignorant):

Say bottom has ords 23, 45, 76, each corresponding to a string. When moving to the next segment, you need to give bottom ords that are comparable to other docs in this new segment, so you would need to find the new ords for the values at 23, 45 and 76, don't you? To find them, assuming the values are s1, s2, s3, you would do a binary search on the new val array and find the index for each of s1, s2, s3. That is 3 binary searches per convert; I am not sure how you can short-circuit it. Are you suggesting we call Comparable on compareBottom until some doc beats it? That would hurt performance a lot though, no?

-John

On Wed, Oct 21, 2009 at 3:11 AM, Michael McCandless luc...@mikemccandless.com wrote:
> On Tue, Oct 20, 2009 at 11:55 AM, John Wang john.w...@gmail.com wrote:
>> the simpler api places less restriction on the type of custom sorting that can be done.
>
> Just to verify: this is not a back-compat break, right? Because, in 2.4, such an interesting custom sort must've been operating at the top-level index reader level, which is easy to carry over to 2.9 (you just rebase the docIDs).
>
> But, of course in moving to 2.9, you would like to also switch your custom sort to be per-segment (for faster reopen/near real-time perf), but the new sort API makes this more difficult because it requires that you are able to compare hits across different segments during the search, not just at the end.
>
> But then I don't understand the difficulty of doing that: if we had a Collector with the MultiPQ approach, at the end during merge, you'd also have to compare results across segments, ie, upgrade your ords to their real values. The MultiPQ approach does this by calling sortValue (returns Comparable) in the end.
>
> Putting performance aside for now... when comparing bottom, you don't actually have to truly invert Comparable -> ord on segment transition. You could, instead, get the Comparable for each and compare, but then note the smallest ord for the current segment that has failed to compete, and short-circuit the compareBottom test by checking against that ord.
>
> That should enable carrying over the custom sort to the single PQ API without needing to invert ord -> value. We'd obviously have to test performance...
>
> Or, we could commit the MultiPQ approach as another sorting collector? I know it's not great having two wildly different sort APIs, but both APIs seem to have their strengths in different cases.
>
> Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
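The short-circuit idea quoted above (compare by value after a segment transition, but remember the smallest ord that failed to compete) could look roughly like the sketch below. It assumes an ascending sort, one String value per hit, and per-segment ords that index into a sorted value array. LazyBottom and every member of it are hypothetical names for illustration, not Lucene's comparator API.

```java
/** Sketch only: hypothetical names, not Lucene's FieldComparator API. */
class LazyBottomSketch {

    static final class LazyBottom {
        private String bottomValue;      // value of the weakest hit currently in the queue
        private String[] segmentValues;  // sorted values of the current segment; index == ord
        private int smallestFailedOrd;   // smallest ord in this segment known not to compete
        int valueComparisons;            // counts the expensive value-level comparisons

        /** Assumes the bottom only ever becomes more competitive, as in a full top-N queue. */
        void setBottom(String value) {
            bottomValue = value;
        }

        /** On segment transition the learned ord bound is discarded; the bottom value is kept. */
        void setNextSegment(String[] sortedValues) {
            segmentValues = sortedValues;
            smallestFailedOrd = Integer.MAX_VALUE;
        }

        /** Returns true if a hit with this ord sorts strictly before the bottom. */
        boolean competes(int ord) {
            if (ord >= smallestFailedOrd) {
                return false;            // short-circuit: an ord at or below this one already lost
            }
            valueComparisons++;
            if (segmentValues[ord].compareTo(bottomValue) < 0) {
                return true;
            }
            smallestFailedOrd = ord;     // remember the new, tighter bound
            return false;
        }
    }

    public static void main(String[] args) {
        LazyBottom bottom = new LazyBottom();
        bottom.setBottom("c");           // carried over from the previous segment as a value
        bottom.setNextSegment(new String[] {"apple", "banana", "cherry", "date"});
        System.out.println(bottom.competes(3));        // compared by value
        System.out.println(bottom.competes(3));        // answered from the learned bound
        System.out.println(bottom.competes(1));        // compared by value
        System.out.println(bottom.valueComparisons);   // prints 2
    }
}
```

The bound stays valid only while the bottom never gets less competitive, which is why the sketch states that assumption on setBottom.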
[jira] Created: (LUCENE-2004) Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, so version detection is incorrect
Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, so version detection is incorrect
---------------------------------------------------------------------------------------------------------------

                 Key: LUCENE-2004
                 URL: https://issues.apache.org/jira/browse/LUCENE-2004
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 2.9
            Reporter: Uwe Schindler
            Assignee: Uwe Schindler
             Fix For: 2.9.1, 3.0

When you compile your own code against the Lucene 2.9 version of the JARs, use the LUCENE_MAIN_VERSION constant, and then run the code against the 3.0 JAR, the constant still contains 2.9. This happens because javac inlines primitives and Strings into the referencing class files if the field is static final and initialized with a compile-time constant (not a method call).

The attached fix addresses this by using an ident(String) function that returns the String itself, which prevents this inlining.

Will apply to 2.9, trunk and the 2.9 BW branch. Now I can also re-enable one test I removed because of this.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
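To illustrate the inlining problem and the kind of fix described in the issue, here is a minimal sketch. The class and field names are made up for the example and are not copied from Lucene's Constants.java; only the idea of routing the literal through an identity method comes from the issue text.

```java
/** Sketch only: class and field names are hypothetical. */
class VersionConstantsSketch {

    // Compile-time constant: javac copies the literal "2.9" into every class that
    // reads this field, so those classes keep the old value after a JAR swap.
    static final String INLINED_VERSION = "2.9";

    // Initialized through a method call, which is not a constant expression,
    // so referencing classes read the field at run time.
    static final String MAIN_VERSION = ident("2.9");

    /** Returns its argument unchanged; its only purpose is to defeat inlining. */
    private static String ident(String s) {
        return s;
    }

    public static void main(String[] args) {
        System.out.println(INLINED_VERSION + " " + MAIN_VERSION);
    }
}
```

Both fields hold the same value inside one compilation unit; the difference only shows up when a separately compiled caller is run against a newer JAR.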
[jira] Updated: (LUCENE-2004) Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, so version detection is incorrect
[ https://issues.apache.org/jira/browse/LUCENE-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2004:
----------------------------------

    Attachment: LUCENE-2004.patch

See also: http://www.javaworld.com/community/node/3400
[jira] Resolved: (LUCENE-2004) Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, so version detection is incorrect
[ https://issues.apache.org/jira/browse/LUCENE-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-2004.
-----------------------------------

    Resolution: Fixed

Fixed.
RE: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
I found a solution for this problem! First the explanation:

The test TestCheckIndex compares the version numbers from Constants with those of the current compilation (ant settings). There are two constants. The first, Constants.LUCENE_MAIN_VERSION, is hard-coded in Constants.java. It had a problem because it was a static final String constant, which javac inlines, so code compiled against that version of the class file always sees the old string even when you replace the JAR.

The second constant, LUCENE_VERSION, contains the version from the manifest; if no manifest is available (no JAR file at all), it contains the LUCENE_MAIN_VERSION constant. The code has some intelligence to also add LUCENE_MAIN_VERSION to this constant (at the end, in [] brackets) if the string from the manifest contains no version. E.g. Hudson compiles Lucene and puts just a date code into the manifest (-Dversion=200910 ANT parameter). LUCENE_VERSION will contain this string, but as 3.0-dev does not appear in it, it is appended as [3.0-dev].

The test checks these versions and tests whether LUCENE_VERSION starts with LUCENE_MAIN_VERSION, which is not the case here. The test works for trunk, because the tests are run without a JAR file (directly against the class files), but not for backwards (as the test is run against lucene-core.jar, which contains the manifest).

The easy fix would be to change Constants.LUCENE_VERSION so that it does not append the string but places it in front of the manifest string, if the manifest string does not start with LUCENE_MAIN_VERSION. We could also fix Hudson, but then the test would fail if somebody uses a strange version string when calling ANT. The first solution is 100% safe. Opinions?
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: uschind...@apache.org [mailto:uschind...@apache.org]
Sent: Thursday, October 22, 2009 9:22 AM
To: java-comm...@lucene.apache.org
Subject: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

Author: uschindler
Date: Thu Oct 22 07:22:28 2009
New Revision: 828334

URL: http://svn.apache.org/viewvc?rev=828334&view=rev
Log:
this test fails on hudson because of the strange version ant parameter with only a date code. test-tag is run against the JAR version, test-core against the class files. The JAR version contains the strange version number in manifest :( Should be somehow fixed. For now, I disable the test.

Modified:
    lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

Modified: lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
URL: http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java?rev=828334&r1=828333&r2=828334&view=diff
==============================================================================
--- lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java (original)
+++ lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java Thu Oct 22 07:22:28 2009
@@ -96,6 +96,8 @@
     assertNotNull(version);
     assertTrue(version.equals(Constants.LUCENE_MAIN_VERSION+"-dev") ||
                version.equals(Constants.LUCENE_MAIN_VERSION));
-    assertTrue(Constants.LUCENE_VERSION.startsWith(version));
+    // TODO: does not work on hudson, because tests are run against a JAR version,
+    // which has a package version like 20091013* not 3.0*:
+    //assertTrue(Constants.LUCENE_VERSION.startsWith(version));
   }
 }
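The proposed fix (put the main version in front of the manifest string unless the manifest string already starts with it) might be sketched as follows. The method name, its parameters and the single-space separator are assumptions made for illustration; the message above does not say how Constants.java actually joins the two parts.

```java
/** Sketch only: hypothetical helper, not the code in Lucene's Constants.java. */
class LuceneVersionSketch {

    /**
     * Builds the full version string so that it always starts with mainVersion.
     * manifestVersion is null when no JAR manifest is available.
     */
    static String buildVersion(String mainVersion, String manifestVersion) {
        if (manifestVersion == null) {
            return mainVersion;                      // running against plain class files
        }
        if (manifestVersion.startsWith(mainVersion)) {
            return manifestVersion;                  // manifest already carries the version
        }
        return mainVersion + " " + manifestVersion;  // separator is an assumption
    }

    public static void main(String[] args) {
        // Date-code-only manifest, as in the Hudson example from the message.
        System.out.println(buildVersion("3.0-dev", "200910"));
        System.out.println(buildVersion("3.0-dev", null));
    }
}
```

With this shape a startsWith check against the main version holds for all three cases, which is the property the disabled assertion relies on.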
Re: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
Putting LUCENE_MAIN_VERSION in front of the string instead of in back seems fine?

Or we could relax the test to simply assert that the expected version appears anywhere as a substring? (ie, .contains instead of .startsWith)

Mike

On Thu, Oct 22, 2009 at 4:13 AM, Uwe Schindler u...@thetaphi.de wrote:
> The easy fix would be to change Constants.LUCENE_VERSION to not append the string, but places it in front of the manifest string, if the manifest string does not start with LUCENE_MAIN_VERSION. We could also fix Hudson, but then test will fail if somebody uses a strange version string when calling ANT. The first solution is 100% secure. Opinions?
RE: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
> Putting LUCENE_MAIN_VERSION in front of the string instead of in back seems fine?

I would prefer this, as it makes it possible to do compareTo() comparisons and so on, which may be used in client code too (not only in tests). OK, client code should not use trunk versions from Hudson, but it would be better.

> Or we could relax the test to simply assert that the expected version appears anywhere as a substring? (ie, .contains instead of .startsWith)

This would only fix this test. I prefer the first.

Uwe
Re: lucene 2.9 sorting algorithm
On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote:

> I have been playing with the patch, and I think I have some information that you might like. Let me spend some time gathering more numbers and update in jira.

Excellent!

> say bottom has ords 23, 45, 76, each corresponding to a string. When moving to the next segment, you need to give bottom ords that are comparable to other docs in this new segment, so you would need to find the new ords for the values at 23, 45 and 76, don't you? To find them, assuming the values are s1, s2, s3, you would do a binary search on the new val array and find the index for each of s1, s2, s3.

It's that inversion (from ord -> Comparable in first seg, and Comparable -> ord in second seg) that I'm trying to avoid (w/ this new proposal).

> That is 3 binary searches per convert; I am not sure how you can short-circuit it. Are you suggesting we call Comparable on compareBottom until some doc beats it?

I'm saying on seg transition you indeed get the Comparable for current bottom, but don't attempt to invert it. Instead, as seg 2 finds a hit, you get that hit's Comparable and compare it to bottom. If it beats bottom, it goes into the queue. If it does not, you use its ord (in seg 2's ord space) to learn a bottom in the ord space of seg 2.

> That would hurt performance a lot though, no?

Yeah I think likely it would, since we're talking about a binary search on transition vs. having to do possibly many upgrade-to-Comparable and compare-Comparables steps to slowly learn the equivalent ord in the new segment.

I was proposing it for cases where inversion is very difficult. But realistically, since you must keep around the full ord -> Comparable mapping for every segment anyway (in order to merge in the end), inversion shouldn't ever actually be difficult -- it'd just be a binary search on presumably in-RAM storage.

Mike
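For comparison, the eager inversion discussed in this message (one binary search per bottom value on each segment transition) can be sketched as below. It assumes each segment keeps a sorted array of its String values with the array index serving as the ord. The class and method names are hypothetical and are not Lucene API; only java.util.Arrays.binarySearch is a real library call.

```java
import java.util.Arrays;

/** Sketch only: hypothetical names, not Lucene's FieldComparator API. */
class OrdRemapSketch {

    /**
     * Returns the cutoff ord for bottomValue in the new segment: hits with
     * ord < cutoff sort strictly before bottomValue, all others do not.
     * This works whether or not bottomValue occurs in the new segment.
     */
    static int cutoffOrd(String[] sortedSegmentValues, String bottomValue) {
        int idx = Arrays.binarySearch(sortedSegmentValues, bottomValue);
        return idx >= 0 ? idx : -idx - 1;   // -idx - 1 is the insertion point
    }

    /** One binary search per value, as in the multi-valued example (ords 23, 45, 76). */
    static int[] remap(String[] prevSegmentValues, int[] bottomOrds, String[] nextSegmentValues) {
        int[] cutoffs = new int[bottomOrds.length];
        for (int i = 0; i < bottomOrds.length; i++) {
            String value = prevSegmentValues[bottomOrds[i]];   // ord -> value in the old segment
            cutoffs[i] = cutoffOrd(nextSegmentValues, value);  // value -> ord in the new segment
        }
        return cutoffs;
    }

    public static void main(String[] args) {
        String[] prev = {"a", "c", "e", "g"};
        String[] next = {"b", "c", "d", "h"};
        // "c" exists in the new segment at ord 1; "g" does not and would be inserted at 3.
        System.out.println(Arrays.toString(remap(prev, new int[] {1, 3}, next)));  // prints [1, 3]
    }
}
```

Using the insertion point as the cutoff avoids a special case for values that are missing from the new segment, at the cost of treating ties as non-competitive.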
Re: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
OK, let's do the first!

Mike

On Thu, Oct 22, 2009 at 5:31 AM, Uwe Schindler u...@thetaphi.de wrote:
> This would only fix this test. I prefer the first.
RE: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
Done!

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Thursday, October 22, 2009 11:39 AM
To: java-dev@lucene.apache.org
Subject: Re: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

OK let's do first!

Mike
[jira] Commented: (LUCENE-1973) Remove deprecated query components
[ https://issues.apache.org/jira/browse/LUCENE-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768661#action_12768661 ]

Uwe Schindler commented on LUCENE-1973:
---------------------------------------

Does anybody want to help?

Remove deprecated query components
----------------------------------

                 Key: LUCENE-1973
                 URL: https://issues.apache.org/jira/browse/LUCENE-1973
             Project: Lucene - Java
          Issue Type: Task
          Components: Search
            Reporter: Uwe Schindler
             Fix For: 3.0

Remove the rest of the deprecated query components.
[jira] Assigned: (LUCENE-2001) wordnet parsing bug
[ https://issues.apache.org/jira/browse/LUCENE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-2001:
---------------------------------------

    Assignee: Grant Ingersoll

wordnet parsing bug
-------------------
Key: LUCENE-2001
URL: https://issues.apache.org/jira/browse/LUCENE-2001
Project: Lucene - Java
Issue Type: Bug
Components: contrib/*
Affects Versions: 2.9
Reporter: Robert Muir
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 2.9.1, 3.0
Attachments: LUCENE-2001.patch, LUCENE-2001_branch.patch, LUCENE-2001_branch.patch

A user reported that wordnet parses the prolog file incorrectly.
Also need to check the wordnet parser in the memory contrib for this problem.
If this is a false alarm, i'm not worried, because the test will be the first unit test wordnet package ever had.

{noformat}
For example, looking up the synsets for the word king, we get:

java SynLookup wnindex king
baron magnate mogul power queen rex scrofula struma tycoon

Here, scrofula and struma are extraneous. This happens because the line parser code in Syns2Index.java interprets the two consecutive single quotes in entry s(114144247,3,'king''s evil',n,1,1) in the wn_s.pl file as termination of the string and separates into king. This entry concerns the synset of the words scrofula and struma, and thus they get inserted in the synset of king. *There are 1382 such entries in wn_s.pl* and more in other WordNet Prolog data-base files, where such use of two consecutive single quotes appears.

We have resolved this by adding a statement in the line parsing portion of Syns2Index.java, as follows:

// parse line
line = line.substring(2);
line = line.replaceAll("\'\'", "`"); // added statement
int comma = line.indexOf(',');
String num = line.substring(0, comma);
... ... etc.

In short we replace '' by ` (a back-quote). Then on recreating the index, we get:

java SynLookup zwnindex king
baron magnate mogul power queen rex tycoon
{noformat}
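As an illustration of the parsing problem in the report above, here is a minimal sketch of reading a single-quoted Prolog atom while treating two consecutive single quotes as one literal apostrophe. It is an alternative to the back-quote replacement described in the report, not the committed fix; the class and method names are made up for this example.

```java
final class PrologQuoteSketch {

    /**
     * Returns the first single-quoted atom on the line, with '' unescaped to ',
     * or null if the line has no (terminated) quoted atom.
     */
    static String firstQuotedAtom(String line) {
        int start = line.indexOf('\'');
        if (start < 0) {
            return null; // no quoted atom on this line
        }
        StringBuilder word = new StringBuilder();
        for (int i = start + 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != '\'') {
                word.append(c);
            } else if (i + 1 < line.length() && line.charAt(i + 1) == '\'') {
                word.append('\''); // doubled quote: literal apostrophe
                i++;               // skip the second quote of the pair
            } else {
                return word.toString(); // lone quote: end of the atom
            }
        }
        return null; // reached end of line without a closing quote
    }

    public static void main(String[] args) {
        System.out.println(firstQuotedAtom("s(114144247,3,'king''s evil',n,1,1)."));
    }
}
```

Scanning for the closing quote this way keeps the entry from the report as one term instead of cutting it off at king, and it preserves the apostrophe rather than substituting another character.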
[jira] Commented: (LUCENE-2001) wordnet parsing bug
[ https://issues.apache.org/jira/browse/LUCENE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768695#action_12768695 ]

Grant Ingersoll commented on LUCENE-2001:
-----------------------------------------

I'll take care of the branch.
[jira] Resolved: (LUCENE-2001) wordnet parsing bug
[ https://issues.apache.org/jira/browse/LUCENE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-2001.
-------------------------------------

    Resolution: Fixed

Committed revision 828728.
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768700#action_12768700 ]

Grant Ingersoll commented on LUCENE-1606:
-----------------------------------------

Why are new features going into 3.0? I was under the impression that 3.0 was just supposed to be cleanup plus Java 1.5

Automaton Query/Filter (scalable regex)
---------------------------------------
Key: LUCENE-1606
URL: https://issues.apache.org/jira/browse/LUCENE-1606
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
Fix For: 3.0
Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch

Attached is a patch for an AutomatonQuery/Filter (name can change if it's not suitable).

Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.

Some use cases I envision:
 1. lexicography/etc on large text corpora
 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments:

     The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
     1. Look at the portion that is OK (did not enter a reject state in the DFA)
     2. Generate the next possible String and seek to that.

the Query simply wraps the filter with ConstantScoreQuery.

I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
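The two-step enumeration described in the issue above (keep the portion of the term that did not hit a reject state, then seek to the next possible string) can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the patch: it uses a hand-written DFA for the pattern (b|c)at instead of the BRICS library, a sorted `TreeSet` in place of the term dictionary, and it only skips past the rejected prefix rather than computing the next string the automaton could actually accept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class DfaSeekSketch {

    /** Toy DFA for (b|c)at; returns -1 for the reject (dead) state. */
    static int step(int state, char c) {
        switch (state) {
            case 0: return (c == 'b' || c == 'c') ? 1 : -1;
            case 1: return c == 'a' ? 2 : -1;
            case 2: return c == 't' ? 3 : -1;
            default: return -1; // state 3 has no outgoing transitions
        }
    }

    static boolean isAccept(int state) {
        return state == 3;
    }

    /** Walks the sorted terms, seeking past every prefix the DFA rejects. */
    static List<String> matchingTerms(TreeSet<String> terms) {
        List<String> hits = new ArrayList<String>();
        String term = terms.isEmpty() ? null : terms.first();
        while (term != null) {
            int state = 0;
            int i = 0;
            while (i < term.length()) {
                int next = step(state, term.charAt(i));
                if (next < 0) {
                    break; // term.charAt(i) leads into the reject state
                }
                state = next;
                i++;
            }
            if (i == term.length()) {
                // Whole term consumed without rejection: match or plain miss.
                if (isAccept(state)) {
                    hits.add(term);
                }
                term = terms.higher(term);
            } else {
                // Every term sharing the prefix term[0..i] is rejected too,
                // so seek to the smallest string sorting after that prefix.
                char bad = term.charAt(i);
                if (bad == Character.MAX_VALUE) {
                    term = terms.higher(term); // cannot increment; step normally
                } else {
                    String seekTo = term.substring(0, i) + (char) (bad + 1);
                    term = terms.ceiling(seekTo);
                }
            }
        }
        return hits;
    }
}
```

For example, with the terms ant, bat, bath, cat, cot, dog and zebra, the term bath is rejected at its fourth character, so the loop seeks to "bati" and lands directly on cat. On a large dictionary this kind of seek is what lets the enumeration skip whole ranges of terms even when the pattern has no constant prefix.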
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768705#action_12768705 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

Grant, I thought it was ok from Uwe's comment:

bq. I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1).

I guess now I am a little confused about what should happen for 3.0 with contrib in general?

No problem moving this to 3.1, let me know!
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768711#action_12768711 ]

Mark Miller commented on LUCENE-2002:
-------------------------------------

I think we need more doc as well - stopfilter is not just tied to standardanalyzer - standardanalyzer just happens to use it. Many analyzers can use a stopfilter and one of the stopfilter's params is to enable or disable this setting.

Add oal.util.Version ctor to QueryParser
----------------------------------------
Key: LUCENE-2002
URL: https://issues.apache.org/jira/browse/LUCENE-2002
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.9, 3.0
Reporter: Uwe Schindler
Assignee: Michael McCandless
Fix For: 2.9.1
Attachments: LUCENE-2002-29.patch

This is a followup of LUCENE-1987:

If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses QueryParser, phrase queries will not work, because the StopFilter enables position increments for stop words, but QueryParser ignores them by default. The user has to explicitly enable them.

This issue would add a ctor taking the Version constant and automatically enable this setting. The same applies to the contrib queryparser. Eventually also StopAnalyzer should add this version ctor.

To be able to remove the default ctor for 3.0 (to remove a possible trap for users of QueryParser), it must be deprecated and the new one also added to 2.9.1.
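To see why the mismatch described in the issue above breaks phrase queries, here is a toy model of token positions with and without position increments for removed stop words. It does not use any Lucene classes; the stop word list, the class name, and the `positions` helper are assumptions for illustration only. In Lucene itself the query-side switch is, as far as I know, `QueryParser.setEnablePositionIncrements(true)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PositionIncrementSketch {

    // Assumed stop word list for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "of"));

    /**
     * Returns "term@position" entries after stop word removal.
     * With keepGaps, a removed stop word still advances the position counter.
     */
    static List<String> positions(String text, boolean keepGaps) {
        List<String> out = new ArrayList<String>();
        int position = -1;
        int pendingGap = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(token)) {
                pendingGap++; // remember the hole left by the stop word
                continue;
            }
            position += 1 + (keepGaps ? pendingGap : 0);
            pendingGap = 0;
            out.add(token + "@" + position);
        }
        return out;
    }

    public static void main(String[] args) {
        // Index side: the stop filter keeps the gap.
        System.out.println(positions("quick the fox", true));
        // Query side: the gap is ignored.
        System.out.println(positions("quick the fox", false));
    }
}
```

In this model the indexed text puts fox two positions after quick, while the parsed phrase expects it one position after, so an exact phrase match fails even though both sides saw the same words. Tying both settings to one Version value, as the issue proposes, keeps the two sides consistent.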
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768717#action_12768717 ]

Grant Ingersoll commented on LUCENE-2002:
-----------------------------------------

{quote}
Unfortunately, JavaCC generates two public ctors for QueryParser (one taking CharStream, another taking QueryParserTokenManager) that I don't know how to override to take a Version param.
{quote}

Those two constructors are bad anyway b/c if anyone calls them, it won't set the Analyzer, etc. Thus, I think, unfortunately, the answer just might be to edit the generated Java file by hand and make them be protected. I've looked through the JavaCC docs and I don't see any other way. Of course, the big down side to this is we now need to do this going forward.
[jira] Created: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
Add LuSql project to Apache Lucene - Contributions wiki page
------------------------------------------------------------
Key: LUCENE-2005
URL: https://issues.apache.org/jira/browse/LUCENE-2005
Project: Lucene - Java
Issue Type: Task
Components: Website
Affects Versions: 2.9
Reporter: Glen Newton

Add [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] to the Apache Lucene - Contributions page [http://lucene.apache.org/java/2_9_0/contributions.html]

I am the author of LuSql.
I can supply any text needed.

Perhaps a new heading is needed to capture Database/JDBC oriented Lucene tools (there are other out there)?
[jira] Updated: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Newton updated LUCENE-2005:
--------------------------------

    Description:
Add [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] to the Apache Lucene - Contributions page [http://lucene.apache.org/java/2_9_0/contributions.html]
I am the author of LuSql.
I can supply any text needed.
Perhaps a new heading is needed to capture Database/JDBC oriented Lucene tools (there are others out there)?

  was:
Add [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] to the Apache Lucene - Contributions page [http://lucene.apache.org/java/2_9_0/contributions.html]
I am the author of LuSql.
I can supply any text needed.
Perhaps a new heading is needed to capture Database/JDBC oriented Lucene tools (there are other out there)?

Add LuSql project to Apache Lucene - Contributions wiki page
------------------------------------------------------------
Key: LUCENE-2005
URL: https://issues.apache.org/jira/browse/LUCENE-2005
Project: Lucene - Java
Issue Type: Task
Components: Website
Affects Versions: 2.9
Reporter: Glen Newton
Original Estimate: 2h
Remaining Estimate: 2h
How to loop through all the entries for a field
I have a field called BookTitle. I want to loop through all the entries without doing a search. I just want to get the list of BookTitle values that are in this field.

I tried IndexReader, but MaxDocs() doesn't work because it returns everything, and I have other fields in there which are a lot bigger.

--
View this message in context: http://www.nabble.com/How-to-loop-through-all-the-entries-for-a-field-tp26012309p26012309.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768727#action_12768727 ]

Robert Muir commented on LUCENE-2005:
-------------------------------------

glen, I know there is an oracle domain index implementation too, so maybe a database category isn't a bad idea. do you know of any others?
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768733#action_12768733 ]

Michael McCandless commented on LUCENE-2002:
--------------------------------------------

bq. Thus, I think, unfortunately, the answer just might be to edit the generated Java file by hand and make them be protected.

OK I'll take that approach, and I guess make a unit test that peeks and confirms these methods are still protected (to catch us in the future).
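A check like the one mentioned in the comment above (confirming by test that hand-edited constructors are still protected after the parser is regenerated) could be done with reflection. The sketch below is only an illustration of that idea: `GeneratedParser` is a hypothetical stand-in for the JavaCC output, not the real QueryParser, and the helper name is made up.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Modifier;

final class ProtectedCtorCheckSketch {

    /** Hypothetical stand-in for a generated parser class. */
    static class GeneratedParser {
        // Constructor that was edited by hand to be protected.
        protected GeneratedParser(CharSequence stream) { }

        // Regular public constructor that users are meant to call.
        public GeneratedParser(String field, Object analyzer) { }
    }

    /** True if the constructor with the given parameter types is declared protected. */
    static boolean isProtectedCtor(Class<?> type, Class<?>... paramTypes) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor(paramTypes);
            return Modifier.isProtected(ctor.getModifiers());
        } catch (NoSuchMethodException e) {
            // A missing constructor also means the generated code has changed.
            throw new IllegalStateException("constructor not found", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isProtectedCtor(GeneratedParser.class, CharSequence.class));
        System.out.println(isProtectedCtor(GeneratedParser.class, String.class, Object.class));
    }
}
```

A unit test built this way would fail as soon as someone regenerates the parser and forgets to reapply the manual edit, which is the "catch us in the future" purpose stated in the comment.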
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768738#action_12768738 ]

Michael McCandless commented on LUCENE-2002:
--------------------------------------------

bq. Many analyzers can use a stopfilter and one of the stopfilters params is to enable or disable this setting.

In fact, I think we may have to un-deprecate StopFilter.get/setEnablePositionIncrementsDefault for this reason? Many analyzers do embed StopFilter without exposing control over this setting, and so the only way (up to and including 2.9) to change the setting is to set the static default with StopFilter. If we remove that then we've taken that control away.

Or, with this issue I could add Version to all contrib analyzers that embed StopFilter. I think I like that solution better (we shouldn't be using static defaults). I'll go forward w/ that shortly unless any objections come up... this'd also take care of analyzers that use StandardTokenizer (ie, we'll control fixing the acronym bug with Version as well).
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768739#action_12768739 ]

Uwe Schindler commented on LUCENE-1606:
---------------------------------------

3.0 is just the switch to 1.5 and generics. So this is a typical java 1.5 issue and can go into 3.0 even if it is a new feature. Contrib is not core and may have own rules.

In my opinion, this would be a nice addition to the regex contrib and should also have been in 2.9, but the underlying library is Java 5 only, so we had to wait until 3.0.
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768740#action_12768740 ]

Robert Muir commented on LUCENE-2002:
-------------------------------------

{quote}
Or, with this issue I could add Version to all contrib analyzers that embed StopFilter. I think I like that solution better (we shouldn't be using static defaults). I'll go forward w/ that shortly unless any objections come up... this'd also take care of analyzers that use StandardTokenizer (ie, we'll control fixing the acronym bug with Version as well).
{quote}

Michael, if you do this, can you mark LUCENE-1373 as resolved? :)
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768743#action_12768743 ]

Glen Newton commented on LUCENE-2005:
-------------------------------------

[DBSight|http://www.dbsight.net/] is a commercial product that does indexing and a lot more.

I was wondering if there is the need for another category: other projects/frameworks that have support for, or are built on, Lucene internally (as opposed to Lucene Tools). Examples:
* [Compass|http://www.compass-project.org/]
* [Hibernate|https://www.hibernate.org/410.html]
* [SOLR|http://lucene.apache.org/solr/]
* others...???

In the FAQ, for [http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F|indexing databases with Lucene], LuSql should also be added (separate JIRA issue?)
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768744#action_12768744 ]

Michael McCandless commented on LUCENE-2002:
--------------------------------------------

bq. Michael, if you do this, can you mark LUCENE-1373 as resolved?

Ahh yes indeed. Is there a corresponding issue about not being able to control stop filter pos incr? Can't keep track of all these issues anymore!
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768745#action_12768745 ] Robert Muir commented on LUCENE-2005: Glen, I think it would be good to bring the contributions page completely up to speed. Maybe for this issue we stick with database integration, though, for simplicity? :) bq. In the FAQ, for [indexing databases with Lucene|http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F], LuSql should also be added (separate JIRA issue?) I think you can just register on the wiki and edit this yourself?
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768747#action_12768747 ] Robert Muir commented on LUCENE-2002: bq. Ahh yes indeed. Is there a corresponding issue about not being able to control stop filter pos incr? Can't keep track of all these issues anymore! Michael, what about LUCENE-1258?
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768746#action_12768746 ] Michael McCandless commented on LUCENE-2005: LuSql looks great! It'd be wonderful to have it available under contrib. I think a contrib/database would make sense? bq. In the FAQ, for [indexing databases with Lucene|http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F], LuSql should also be added (separate JIRA issue?) +1 But that's the wiki -- you can just go edit it (create an account if you don't already have one) and add it in (no need for a JIRA issue).
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768749#action_12768749 ] Robert Muir commented on LUCENE-2005: bq. LuSql looks great! It'd be wonderful to have it available under contrib. I think a contrib/database would make sense? Michael, actually this issue was just to add it to the contributions links on the website. But if Glen wants to incorporate it into contrib, I think that would be even better... it has really nice documentation, etc.
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768752#action_12768752 ] Michael McCandless commented on LUCENE-2002: bq. Michael, what about LUCENE-1258? Oh yeah, and look who opened that one :) I'll go resolve as a dup of this one.
[jira] Resolved: (LUCENE-1258) Increment position by default in StopFilter QueryParser - PhraseQuery
[ https://issues.apache.org/jira/browse/LUCENE-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1258. Resolution: Duplicate Dup of LUCENE-2002. Increment position by default in StopFilter QueryParser - PhraseQuery Key: LUCENE-1258 URL: https://issues.apache.org/jira/browse/LUCENE-1258 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.9 Reporter: Michael McCandless Priority: Minor Fix For: 3.0 Spinoff from here: https://issues.apache.org/jira/browse/LUCENE-1095 I think for 3.0 we should change the default so that: * By default, StopFilter increments the positionIncrement whenever it skips stop words. Add option to revert back to old way. This is just toggling the boolean default. * By default, when QueryParser adds terms to a PhraseQuery it should include the position reported by the analyzer. Add option to revert back to old way. I'm just opening this now, marking as 3.0 fix, to remind us all to actually fix it for 3.0.
contrib and lucene 3.0
Hi, What is the consensus on new features for contrib for Lucene 3.0? I know that for core, it's mostly a Java 5 upgrade and deprecation removal. I want to make sure LUCENE-1606 is set to the right version, but I figured it's really not just about that specific issue; I would like to know the plans in general. Thanks, Robert -- Robert Muir rcm...@gmail.com
[jira] Updated: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Kay updated LUCENE-1257: Attachment: LUCENE-1257_contrib_benchmark.patch Port to Java5 - Key: LUCENE-1257 URL: https://issues.apache.org/jira/browse/LUCENE-1257 Project: Lucene - Java Issue Type: Improvement Components: Analysis, Examples, Index, Other, Query/Scoring, QueryParser, Search, Store, Term Vectors Affects Versions: 3.0 Reporter: Cédric Champeau Assignee: Uwe Schindler Priority: Minor Fix For: 3.0 Attachments: instantiated_fieldable.patch, LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, LUCENE-1257-CompoundFileReaderWriter.patch, LUCENE-1257-ConcurrentMergeScheduler.patch, LUCENE-1257-DirectoryReader.patch, LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, LUCENE-1257-IndexDeleter.patch, LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, LUCENE-1257_BooleanFilter_Generics.patch, LUCENE-1257_contrib_benchmark.patch, LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, LUCENE-1257_more_unnecessary_casts.patch, LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, LUCENE-1257_o.a.l.store.patch, 
LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, LUCENE-1257_o_a_l_search_spans.patch, LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, lucene1257surround1.patch, shinglematrixfilter_generified.patch For my needs I've updated Lucene so that it uses Java 5 constructs. I know Java 5 migration had been planned for 2.1 at some point in the past, but I don't know when it is planned now. This patch against the trunk includes: - most obvious generics usage (there are tons of usages of sets, ... Those which are commonly used have been generified) - PriorityQueue generification - replacement of indexed for loops with for-each constructs - removal of unnecessary unboxing The code is, in my opinion, much more readable with those features (you actually *know* what is stored in collections when reading the code, without needing to look up field definitions every time) and it simplifies many algorithms. Note that this patch also includes an interface for the Query class. This has been done for my company's needs for building custom Query classes which add some behaviour to the base Lucene queries. It prevents multiple unnecessary casts. I know this introduction is not wanted by the team, but it really makes our developments easier to maintain. If you don't want to use this, replace all /Queriable/ calls with standard /Query/.
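As a rough illustration of the kind of change listed above (generics plus for-each replacing raw collections, explicit iterators and casts), here is a minimal before/after sketch. `TermLengths` and its methods are hypothetical and are not taken from any of the attached patches.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: TermLengths is a hypothetical class, not part of the patch.
class TermLengths {
    /** Pre-Java 5 style: raw collection, explicit Iterator, cast on every read. */
    @SuppressWarnings("rawtypes")
    static int totalLengthRaw(List terms) {
        int total = 0;
        for (Iterator it = terms.iterator(); it.hasNext();) {
            String term = (String) it.next();
            total += term.length();
        }
        return total;
    }

    /** Java 5 style: the element type is declared, so the cast and the Iterator disappear. */
    static int totalLengthGeneric(List<String> terms) {
        int total = 0;
        for (String term : terms) {
            total += term.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<String>();
        terms.add("lucene");
        terms.add("java");
        System.out.println(totalLengthRaw(terms) + " " + totalLengthGeneric(terms));
    }
}
```

Both methods compute the same value; the generic version simply moves the type check from a runtime cast to the compiler.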
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768795#action_12768795 ] Yonik Seeley commented on LUCENE-2005: bq. Michael, actually this issue was just to add it to the contributions links on the website. Right... and I think we really shouldn't try to pull more and more projects into lucene contrib, making it this huge uber project - that just makes it harder and harder to change core.
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768810#action_12768810 ] Robert Muir commented on LUCENE-2005: bq. Right... and I think we really shouldn't try to pull more and more projects into lucene contrib, making it this huge uber project - that just makes it harder and harder to change core. I guess we can agree to disagree on this one... But I do think that this issue is just about adding hyperlinks to the website.
[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page
[ https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768815#action_12768815 ] Glen Newton commented on LUCENE-2005: Yes, it was just concerned with adding hyperlinks to this page. I have just added LuSql to the [FAQ|http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F]. I would prefer keeping LuSql out of contrib, as it makes my life easier (I think??), and allows me to release independently of the Lucene release schedule. :-)
[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768827#action_12768827 ] Mark Miller commented on LUCENE-2003: Umm - it's hard to emulate the positions stuff from PhraseQuery with a SpanQuery. A limitation I hadn't really thought much of. Should be doc'd. One - uh - sloppy fix - is to count up all of the extra positions and add that to the slop. I.e., if the positions for a PhraseQuery are 0, 1, 3 (stop word removed at 2), you would add 1 to the slop. 0, 1, 3, 5 - add 2 to the slop. I think that keeps a fairly good approximation. Haven't thought about how that would work with MultiPhraseQuery yet. Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on --- Key: LUCENE-2003 URL: https://issues.apache.org/jira/browse/LUCENE-2003 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1, 3.0 This is a followup on LUCENE-1987: If you set in HighlighterTest the constant static final Version TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test testSimpleQueryScorerPhraseHighlighting fails. Please note that currently (before LUCENE-2002 is fixed), you must also set the QueryParser to respect posIncr.
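The "count up all of the extra positions" idea from the comment above can be written down as plain arithmetic. This is a minimal sketch, assuming the phrase positions are given in ascending order; `ExtraSlop` and `skippedPositions` are hypothetical names and are not taken from any patch on this issue.

```java
// Hypothetical helper, not taken from any patch on this issue.
class ExtraSlop {
    /** Counts the positions skipped between consecutive phrase terms (positions ascending). */
    static int skippedPositions(int[] positions) {
        int extra = 0;
        for (int i = 1; i < positions.length; i++) {
            // Adjacent terms differ by exactly 1; anything beyond that is a removed token.
            extra += positions[i] - positions[i - 1] - 1;
        }
        return extra;
    }

    public static void main(String[] args) {
        System.out.println(skippedPositions(new int[] {0, 1, 3}));    // one stop word removed
        System.out.println(skippedPositions(new int[] {0, 1, 3, 5})); // two stop words removed
    }
}
```

For the two examples in the comment this yields 1 and 2, the amounts proposed there as additions to the slop.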
[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768829#action_12768829 ] Mark Miller commented on LUCENE-2003: Well no crap - MultiPhraseQuery already does that. Someone else contrib'd that. Guess they are ahead of me - would have saved some though to look at it :)
[jira] Issue Comment Edited: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768829#action_12768829 ] Mark Miller edited comment on LUCENE-2003 at 10/22/09 7:40 PM: Well no crap - MultiPhraseQuery already does that. Someone else contrib'd that. Guess they are ahead of me - would have saved some thought to look at it :) was (Author: markrmil...@gmail.com): Well no crap - MultiPhraseQuery already does that. Someone else contrib'd that. Guess they are ahead of me - would have saved some though to look at it :)
[jira] Updated: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2003: Attachment: LUCENE-2003.patch Here is a patch showing essentially what I mean.
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768838#action_12768838 ] Grant Ingersoll commented on LUCENE-2002: bq. OK I'll take that approach, and I guess make a unit test that peeks & confirms these methods are still protected (to catch us in the future). We may want to see if it can be automated in the ANT task so that we don't have to remember to do it by hand each time.
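A test that "peeks" at method visibility can be done with plain reflection. The following is only a sketch under stated assumptions: `GeneratedParserStandIn`, `parseClause` and `reInit` are made-up stand-ins for whichever class and methods the real test would inspect, and `VisibilityCheck` is a hypothetical helper.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Stand-in for the class under test; the class and method names are hypothetical.
class GeneratedParserStandIn {
    protected String parseClause(String field) {
        return field;
    }

    public void reInit() { }
}

class VisibilityCheck {
    /** True if cls itself declares a method with this name and parameter types and it is protected. */
    static boolean isProtected(Class<?> cls, String name, Class<?>... params) {
        try {
            Method m = cls.getDeclaredMethod(name, params);
            return Modifier.isProtected(m.getModifiers());
        } catch (NoSuchMethodException e) {
            // A missing method also counts as a failed check.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isProtected(GeneratedParserStandIn.class, "parseClause", String.class));
        System.out.println(isProtected(GeneratedParserStandIn.class, "reInit"));
    }
}
```

A unit test built this way would fail as soon as a regenerated source file changed one of the checked methods back to a different visibility.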
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768840#action_12768840 ] Michael McCandless commented on LUCENE-2002: bq. We may want to see if it can be automated in the ANT task so that we don't have to remember to do it by hand each time. That would be fabulous but is way beyond my ant skills :) Any ant pros out there want to try?
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768845#action_12768845 ] Uwe Schindler commented on LUCENE-2002: Eric Hatcher :-) Maybe the search-replace with regex functionality can do it.
[jira] Resolved: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
[ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1373. Resolution: Duplicate Dup of LUCENE-2002. Most of the contributed Analyzers suffer from invalid recognition of acronyms. -- Key: LUCENE-1373 URL: https://issues.apache.org/jira/browse/LUCENE-1373 Project: Lucene - Java Issue Type: Bug Components: Analysis, contrib/analyzers Affects Versions: 2.3.2 Reporter: Mark Lassau Priority: Minor Attachments: LUCENE-1373.patch LUCENE-1068 describes a bug in StandardTokenizer whereby a string like www.apache.org. would be incorrectly tokenized as an acronym (note the dot at the end). Unfortunately, keeping the backward compatibility of a bug turns out to harm us. StandardTokenizer has a couple of ways to indicate that this bug should be fixed, but unfortunately the default behaviour is still to be buggy. Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :( I refer to: * BrazilianAnalyzer * CzechAnalyzer * DutchAnalyzer * FrenchAnalyzer * GermanAnalyzer * GreekAnalyzer * ThaiAnalyzer
[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768843#action_12768843 ] Yonik Seeley commented on LUCENE-2003: Could you explain this part? {code} + if (inc lastInc) { +slop += inc; + } {code} Seems like that would cause A ??? B ??? C ??? D to only have a slop of 3 (? represents a gap of 1). Couldn't slop just be maxPos-minPos+1-numTokens?
[jira] Issue Comment Edited: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768845#action_12768845 ] Uwe Schindler edited comment on LUCENE-2002 at 10/22/09 7:59 PM: - Eric Hatcher :-) Maybe the search-replace with regex functionality can do it. see: [http://ant.apache.org/manual/OptionalTasks/replaceregexp.html] was (Author: thetaphi): Eric Hatcher :-) Maybe the search-replace with regex functionality can do it. Add oal.util.Version ctor to QueryParser Key: LUCENE-2002 URL: https://issues.apache.org/jira/browse/LUCENE-2002 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1 Attachments: LUCENE-2002-29.patch This is a followup of LUCENE-1987: If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses QueryParser, phrase queries will not work, because the StopFilter enables position Increments for stop words, but QueryParser ignores them per default. The user has to explicitely enable them. This issue would add a ctor taking the Version constant and automatically enable this setting. The same applies to the contrib queryparser. Eventually also StopAnalyzer should add this version ctor. To be able to remove the default ctor for 3.0 (to remove a possible trap for users of QueryParser), it must be deprecated and the new one also added to 2.9.1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768853#action_12768853 ] Mark Miller commented on LUCENE-2003: - Hmm - well now you have me worried - never seen you be wrong. I just tried a test like that and it appeared to work though. Ah - I should have looked closer at the MultiPhraseQuery code - it is wrong - just happens to work. You only need to add to the slop the largest inc, because the SpanQuery slop is the dist allowed between *each* span. So thats why it works - it finds 3 the first time, doesn't add any more for the rest, but 3 is enough. I'll fix. Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on --- Key: LUCENE-2003 URL: https://issues.apache.org/jira/browse/LUCENE-2003 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1, 3.0 Attachments: LUCENE-2003.patch This is a followup on LUCENE-1987: If you set in HighligterTest the constant static final Version TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test testSimpleQueryScorerPhraseHighlighting fails. Please note, that currently (before LUCENE-2002 is fixed), you must also set the QueryParser to respect posIncr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2003: Attachment: LUCENE-2003.patch This should be more correct - add the largest inc to the slop if it's greater than 1. Gotta consider this against your suggestion. Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on --- Key: LUCENE-2003 URL: https://issues.apache.org/jira/browse/LUCENE-2003 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1, 3.0 Attachments: LUCENE-2003.patch, LUCENE-2003.patch This is a followup on LUCENE-1987: If you set in HighligterTest the constant static final Version TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test testSimpleQueryScorerPhraseHighlighting fails. Please note, that currently (before LUCENE-2002 is fixed), you must also set the QueryParser to respect posIncr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Kay updated LUCENE-1257: Attachment: LUCENE-1257_unnnecessary_casts_2.patch Port to Java5 - Key: LUCENE-1257 URL: https://issues.apache.org/jira/browse/LUCENE-1257 Project: Lucene - Java Issue Type: Improvement Components: Analysis, Examples, Index, Other, Query/Scoring, QueryParser, Search, Store, Term Vectors Affects Versions: 3.0 Reporter: Cédric Champeau Assignee: Uwe Schindler Priority: Minor Fix For: 3.0 Attachments: instantiated_fieldable.patch, LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, LUCENE-1257-CompoundFileReaderWriter.patch, LUCENE-1257-ConcurrentMergeScheduler.patch, LUCENE-1257-DirectoryReader.patch, LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, LUCENE-1257-IndexDeleter.patch, LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, LUCENE-1257_BooleanFilter_Generics.patch, LUCENE-1257_contrib_benchmark.patch, LUCENE-1257_contrib_highlighting.patch, LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, LUCENE-1257_more_unnecessary_casts.patch, LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, LUCENE-1257_o.a.l.store.patch, 
LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, LUCENE-1257_o_a_l_search_spans.patch, LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, LUCENE-1257_unnecessary_casts.patch, LUCENE-1257_unnnecessary_casts_2.patch, lucene1257surround1.patch, lucene1257surround1.patch, shinglematrixfilter_generified.patch For my needs I've updated Lucene so that it uses Java 5 constructs. I know Java 5 migration had been planned for 2.1 someday in the past, but don't know when it is planned now. This patch against the trunk includes : - most obvious generics usage (there are tons of usages of sets, ... Those which are commonly used have been generified) - PriorityQueue generification - replacement of indexed for loops with for each constructs - removal of unnececessary unboxing The code is to my opinion much more readable with those features (you actually *know* what is stored in collections reading the code, without the need to lookup for field definitions everytime) and it simplifies many algorithms. Note that this patch also includes an interface for the Query class. This has been done for my company's needs for building custom Query classes which add some behaviour to the base Lucene queries. It prevents multiple unnnecessary casts. I know this introduction is not wanted by the team, but it really makes our developments easier to maintain. If you don't want to use this, replace all /Queriable/ calls with standard /Query/. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768862#action_12768862 ] Mark Miller commented on LUCENE-2003: - Okay - I think this is the way to go - maxPos-minPos+1-numTokens is too much slop because it just has to be the largest posInc - forgot thats how SpanQueries work when I did the orig patch. Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on --- Key: LUCENE-2003 URL: https://issues.apache.org/jira/browse/LUCENE-2003 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1, 3.0 Attachments: LUCENE-2003.patch, LUCENE-2003.patch This is a followup on LUCENE-1987: If you set in HighligterTest the constant static final Version TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test testSimpleQueryScorerPhraseHighlighting fails. Please note, that currently (before LUCENE-2002 is fixed), you must also set the QueryParser to respect posIncr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
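The rule settled on in the comments above (widen the slop by the largest position increment between consecutive phrase terms, and only when that increment is greater than 1) can be sketched in plain Java. This is a minimal sketch and not the attached patch: the class SlopSketch, the method slopFor, and the sample positions are hypothetical, and the positions array only stands in for the term positions a phrase query would report.

```java
final class SlopSketch {

    /**
     * Hypothetical helper: returns the slop to hand to the span query.
     * positions are the term positions of the phrase, in order;
     * baseSlop is the slop the phrase query was created with.
     */
    public static int slopFor(int[] positions, int baseSlop) {
        int largestInc = 0;
        for (int i = 1; i < positions.length; i++) {
            // position increment between two consecutive phrase terms
            int inc = positions[i] - positions[i - 1];
            if (inc > largestInc) {
                largestInc = inc;
            }
        }
        // only widen the slop when there is a real gap (increment > 1)
        return largestInc > 1 ? baseSlop + largestInc : baseSlop;
    }

    public static void main(String[] args) {
        int[] gapped = {0, 4, 8, 12};  // three removed positions between each pair of terms
        int[] adjacent = {0, 1, 2};    // no gaps at all
        System.out.println(slopFor(gapped, 0));
        System.out.println(slopFor(adjacent, 0));
    }
}
```

The largest increment is added once rather than summed over all gaps, which is the behaviour argued for in this thread on the grounds that the span slop applies between each pair of spans.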
[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768863#action_12768863 ] Yonik Seeley commented on LUCENE-2003: -- bq. You only need to add to the slop the largest inc, because the SpanQuery slop is the dist allowed between each span. Learn something new every day :-) Is this javadoc incorrect, or simply ambiguous, or am I reading it wrong: {code} /** Construct a SpanNearQuery. Matches spans matching a span from each * clause, with up to <code>slop</code> total unmatched positions between * them. * When <code>inOrder</code> is true, the spans from each clause * must be * ordered as in <code>clauses</code>. */ public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) { this(clauses, slop, inOrder, true); } {code} The "total" would almost seem to tip the ambiguity toward meaning that it's the total slop between all clauses. Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on --- Key: LUCENE-2003 URL: https://issues.apache.org/jira/browse/LUCENE-2003 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1, 3.0 Attachments: LUCENE-2003.patch, LUCENE-2003.patch This is a followup on LUCENE-1987: If you set in HighligterTest the constant static final Version TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test testSimpleQueryScorerPhraseHighlighting fails. Please note, that currently (before LUCENE-2002 is fixed), you must also set the QueryParser to respect posIncr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2003: -- Assignee: Mark Miller (was: Michael McCandless) OK Mark you get this one :) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on --- Key: LUCENE-2003 URL: https://issues.apache.org/jira/browse/LUCENE-2003 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Mark Miller Fix For: 2.9.1, 3.0 Attachments: LUCENE-2003.patch, LUCENE-2003.patch This is a followup on LUCENE-1987: If you set in HighligterTest the constant static final Version TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test testSimpleQueryScorerPhraseHighlighting fails. Please note, that currently (before LUCENE-2002 is fixed), you must also set the QueryParser to respect posIncr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on
[ https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768866#action_12768866 ] Mark Miller commented on LUCENE-2003: - bq. The total would almost seem to tip the ambiguity toward meaning that it's the total slop between all clauses. Yeah, I think it needs to be changed. "Total" appears just wrong. Perhaps something more along the lines of: Matches spans matching a span from each clause, with up to <code>slop</code> unmatched positions between each of them Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on --- Key: LUCENE-2003 URL: https://issues.apache.org/jira/browse/LUCENE-2003 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Mark Miller Fix For: 2.9.1, 3.0 Attachments: LUCENE-2003.patch, LUCENE-2003.patch This is a followup on LUCENE-1987: If you set in HighligterTest the constant static final Version TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test testSimpleQueryScorerPhraseHighlighting fails. Please note, that currently (before LUCENE-2002 is fixed), you must also set the QueryParser to respect posIncr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768874#action_12768874 ] Michael McCandless commented on LUCENE-2002: bq. Maybe the search-replace with regex functionality can do it. Excellent! That worked like a charm. I'll still leave the unit test in place to catch us if this fails... Add oal.util.Version ctor to QueryParser Key: LUCENE-2002 URL: https://issues.apache.org/jira/browse/LUCENE-2002 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1 Attachments: LUCENE-2002-29.patch This is a followup of LUCENE-1987: If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses QueryParser, phrase queries will not work, because the StopFilter enables position Increments for stop words, but QueryParser ignores them per default. The user has to explicitely enable them. This issue would add a ctor taking the Version constant and automatically enable this setting. The same applies to the contrib queryparser. Eventually also StopAnalyzer should add this version ctor. To be able to remove the default ctor for 3.0 (to remove a possible trap for users of QueryParser), it must be deprecated and the new one also added to 2.9.1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768878#action_12768878 ] Uwe Schindler commented on LUCENE-2002: --- Cool. Did you check the minimum ANT version needed for this? If the current BUILD.txt minimum does not fit, we should update the build and docs. My problem: I couldn't find the minimum version for replaceregexp in the docs. Add oal.util.Version ctor to QueryParser Key: LUCENE-2002 URL: https://issues.apache.org/jira/browse/LUCENE-2002 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1 Attachments: LUCENE-2002-29.patch This is a followup of LUCENE-1987: If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses QueryParser, phrase queries will not work, because the StopFilter enables position Increments for stop words, but QueryParser ignores them per default. The user has to explicitly enable them. This issue would add a ctor taking the Version constant and automatically enable this setting. The same applies to the contrib queryparser. Eventually also StopAnalyzer should add this version ctor. To be able to remove the default ctor for 3.0 (to remove a possible trap for users of QueryParser), it must be deprecated and the new one also added to 2.9.1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768889#action_12768889 ] Michael McCandless commented on LUCENE-2002: I think we are good: I just looked @ 1.6.3's javadocs (we specify ant 1.6.3 in BUILD.txt) and it's got the replaceregexp task. Add oal.util.Version ctor to QueryParser Key: LUCENE-2002 URL: https://issues.apache.org/jira/browse/LUCENE-2002 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1 Attachments: LUCENE-2002-29.patch This is a followup of LUCENE-1987: If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses QueryParser, phrase queries will not work, because the StopFilter enables position Increments for stop words, but QueryParser ignores them per default. The user has to explicitely enable them. This issue would add a ctor taking the Version constant and automatically enable this setting. The same applies to the contrib queryparser. Eventually also StopAnalyzer should add this version ctor. To be able to remove the default ctor for 3.0 (to remove a possible trap for users of QueryParser), it must be deprecated and the new one also added to 2.9.1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2002: --- Attachment: LUCENE-2002-29.patch New patch attached. All tests pass. Changes: * Fixed the patch -> match typo * Fixed build.xml to make 2 autogen'd (by JavaCC) public QueryParser ctors protected, and added unit test to assert this * Added Version matchVersion param to all (I think!) contrib analyzers that instantiate either StandardTokenizer (to manage changing the fix invalid acronym setting across versions), or StopFilter (to manage enable pos incr setting across versions), or, both, and threaded it down to StandardTokenizer and StopFilter I didn't add Version to StopFilter nor StopAnalyzer; I think it's better to up-front require the enablePositionIncrements to their ctors. Add oal.util.Version ctor to QueryParser Key: LUCENE-2002 URL: https://issues.apache.org/jira/browse/LUCENE-2002 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1 Attachments: LUCENE-2002-29.patch, LUCENE-2002-29.patch This is a followup of LUCENE-1987: If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses QueryParser, phrase queries will not work, because the StopFilter enables position Increments for stop words, but QueryParser ignores them per default. The user has to explicitly enable them. This issue would add a ctor taking the Version constant and automatically enable this setting. The same applies to the contrib queryparser. Eventually also StopAnalyzer should add this version ctor. To be able to remove the default ctor for 3.0 (to remove a possible trap for users of QueryParser), it must be deprecated and the new one also added to 2.9.1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
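The build.xml change described in this update (turning the two JavaCC-generated public QueryParser constructors into protected ones with a regex search-and-replace) was done with Ant's replaceregexp task. The same substitution can be sketched in plain Java to show what the pattern has to match. This is only an illustration under assumptions: the class CtorVisibilitySketch is hypothetical, and the parameter type names CharStream and QueryParserTokenManager are assumed to be the ones JavaCC generates for this parser; the pattern in the actual patch may differ.

```java
final class CtorVisibilitySketch {

    /**
     * Rewrites "public QueryParser(<generated type> " to
     * "protected QueryParser(<generated type> " and leaves every
     * other constructor untouched.
     */
    public static String makeGeneratedCtorsProtected(String source) {
        return source.replaceAll(
                "public QueryParser\\((CharStream|QueryParserTokenManager) ",
                "protected QueryParser($1 ");
    }

    public static void main(String[] args) {
        String generated =
                "  public QueryParser(CharStream stream) {\n"
              + "  public QueryParser(QueryParserTokenManager tm) {\n"
              + "  public QueryParser(Version matchVersion, String f, Analyzer a) {\n";
        // The first two lines become protected; the Version ctor stays public.
        System.out.print(makeGeneratedCtorsProtected(generated));
    }
}
```

Anchoring the pattern on the first parameter type is what keeps the hand-written constructors public, which is also why a unit test guarding the visibility is still worth keeping.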
[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768907#action_12768907 ] Uwe Schindler commented on LUCENE-2002: --- Looks good. bq. I didn't add Version to StopFilter nor StopAnalyzer; I think it's better to up-front require the enablePositionIncrements to their ctors. I would add it to StopAnalyzer, StopFilter is not so important (because low-level). But that's my opinion. Add oal.util.Version ctor to QueryParser Key: LUCENE-2002 URL: https://issues.apache.org/jira/browse/LUCENE-2002 Project: Lucene - Java Issue Type: Bug Affects Versions: 2.9, 3.0 Reporter: Uwe Schindler Assignee: Michael McCandless Fix For: 2.9.1 Attachments: LUCENE-2002-29.patch, LUCENE-2002-29.patch This is a followup of LUCENE-1987: If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses QueryParser, phrase queries will not work, because the StopFilter enables position Increments for stop words, but QueryParser ignores them per default. The user has to explicitely enable them. This issue would add a ctor taking the Version constant and automatically enable this setting. The same applies to the contrib queryparser. Eventually also StopAnalyzer should add this version ctor. To be able to remove the default ctor for 3.0 (to remove a possible trap for users of QueryParser), it must be deprecated and the new one also added to 2.9.1. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS
[ https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768916#action_12768916 ] Uwe Schindler commented on LUCENE-1960: --- I still prefer 1, but maybe it's not so good. Else I would implement 2 (even if we need FieldForMerge). Just remove the COMPRESS flag so that nobody can add any compressed fields anymore. 3 is bad, because it needs you to change your code on the change between 2.9 and 3.0 if you had compressed fields. In 2.9 they were automatically uncompressed, in 3.0 not. This would make it impossible to replace the lucene jar (which is currently possible if you remove all deprecated calls in 2.9). Remove deprecated Field.Store.COMPRESS -- Key: LUCENE-1960 URL: https://issues.apache.org/jira/browse/LUCENE-1960 Project: Lucene - Java Issue Type: Task Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.0 Attachments: lucene-1960-1.patch, lucene-1960.patch Also remove FieldForMerge and related code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene 2.9 sorting algorithm
Hey Michael: Would you mind rerunning the test you have with jdk1.5? Also, if you would, change the comparator method to avoid branching for int and string comparators, e.g. return index.order[i.doc] - index.order[j.doc]; Thanks -John On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote: I have been playing with the patch, and I think I have some information that you might like. Let me spend some time and gather some more numbers and update in jira. Excellent! say bottom has ords 23, 45, 76, each corresponding to a string. When moving to the next segment, you need to make bottom to have ords that can be comparable to other docs in this new segment, so you would need to find the new ords for the values in 23,45 and 76, don't you? To find it, assuming the values are s1,s2,s3, you would do a bin. search on the new val array, and find index for s1,s2,s3. It's that inversion (from ord->Comparable in first seg, and Comparable->ord in second seg) that I'm trying to avoid (w/ this new proposal). Which is 3 bin searches per convert, I am not sure how you can short circuit it. Are you suggesting we call Comparable on compareBottom until some doc beats it? I'm saying on seg transition you indeed get the Comparable for current bottom, but, don't attempt to invert it. Instead, as seg 2 finds a hit, you get that hit's Comparables and compare to bottom. If it beats bottom, it goes into the queue. If it does not, you use the ord (in seg 2's ord space) to learn a bottom in the ord space of seg 2. That would hurt performance a lot though, no? Yeah I think likely it would, since we're talking about a binary search on transition VS having to do possibly many upgrade-to-Comparable and compare-Comparables to slowly learn the equivalent ord in the new segment. I was proposing it for cases where inversion is very difficult.
But realistically, since you must keep around the full ord -> Comparable for every segment anyway (in order to merge in the end), inversion shouldn't ever actually be difficult -- it'd just be a binary search on presumably in-RAM storage. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
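The inversion discussed above (going from Comparable back to an ord when the search moves to a new segment) can be sketched as one binary search per bottom value over the new segment's sorted value array. A minimal sketch, assuming each segment keeps its distinct values as a sorted String[]; OrdRemapSketch, remapBottom, and the sample data are hypothetical names for illustration and are not Lucene API.

```java
import java.util.Arrays;

final class OrdRemapSketch {

    /**
     * Maps each bottom value into the ord space of the new segment.
     * newSegmentValues must be sorted; the returned ord is the index of the
     * value, or its insertion point when the value is absent from the segment.
     */
    public static int[] remapBottom(String[] bottomValues, String[] newSegmentValues) {
        int[] ords = new int[bottomValues.length];
        for (int i = 0; i < bottomValues.length; i++) {
            int idx = Arrays.binarySearch(newSegmentValues, bottomValues[i]);
            // binarySearch returns -(insertionPoint) - 1 when the key is missing
            ords[i] = idx >= 0 ? idx : -idx - 1;
        }
        return ords;
    }

    public static void main(String[] args) {
        String[] newSegment = {"apple", "cherry", "fig", "plum"};
        String[] bottom = {"cherry", "grape", "plum"};  // s1, s2, s3 from the previous segment
        System.out.println(Arrays.toString(remapBottom(bottom, newSegment)));  // prints [1, 3, 3]
    }
}
```

In the sample, "grape" is missing from the new segment and lands on the same ord as "plum", so a real comparator would also need to remember that the match was inexact before trusting an ord-only comparison.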
Re: lucene 2.9 sorting algorithm
Why? What might he find? What's with the cryptic request? Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains? I know point 2 certainly doesn't. Cards on the table? John Wang wrote: Hey Michael: Would you mind rerunning the test you have with jdk1.5? Also, if you would, change the comparator method to avoid branching for int and string comparators, e.g. return index.order[i.doc] - index.order[j.doc]; Thanks -John On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote: I have been playing with the patch, and I think I have some information that you might like. Let me spend some time and gather some more numbers and update in jira. Excellent! say bottom has ords 23, 45, 76, each corresponding to a string. When moving to the next segment, you need to make bottom to have ords that can be comparable to other docs in this new segment, so you would need to find the new ords for the values in 23,45 and 76, don't you? To find it, assuming the values are s1,s2,s3, you would do a bin. search on the new val array, and find index for s1,s2,s3. It's that inversion (from ord->Comparable in first seg, and Comparable->ord in second seg) that I'm trying to avoid (w/ this new proposal). Which is 3 bin searches per convert, I am not sure how you can short circuit it. Are you suggesting we call Comparable on compareBottom until some doc beats it? I'm saying on seg transition you indeed get the Comparable for current bottom, but, don't attempt to invert it. Instead, as seg 2 finds a hit, you get that hit's Comparables and compare to bottom. If it beats bottom, it goes into the queue. If it does not, you use the ord (in seg 2's ord space) to learn a bottom in the ord space of seg 2. That would hurt performance a lot though, no?
Yeah I think likely it would, since we're talking about a binary search on transition VS having to do possibly many upgrade-to-Comparable and compare-Comparables to slowly learn the equivalent ord in the new segment. I was proposing it for cases where inversion is very difficult. But realistically, since you must keep around the full ord -> Comparable for every segment anyway (in order to merge in the end), inversion shouldn't ever actually be difficult -- it'd just be a binary search on presumably in-RAM storage. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene 2.9 sorting algorithm
Mark: Please be patient with me. I am seeing a difference and was wondering if Mike would see the same thing. I thought Michael would be willing to because he expressed interest in understanding what the performance discrepancies are. Again, it is only a request. It is perfectly fine if Michael refuses to. But it would be great if Michael speaks for himself. Thanks -John On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller markrmil...@gmail.com wrote: Why? What might he find? What's with the cryptic request? Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains? I know point 2 certainly doesn't. Cards on the table? John Wang wrote: Hey Michael: Would you mind rerunning the test you have with jdk1.5? Also, if you would, change the comparator method to avoid branching for int and string comparators, e.g. return index.order[i.doc] - index.order[j.doc]; Thanks -John On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote: I have been playing with the patch, and I think I have some information that you might like. Let me spend some time and gather some more numbers and update in jira. Excellent! say bottom has ords 23, 45, 76, each corresponding to a string. When moving to the next segment, you need to make bottom to have ords that can be comparable to other docs in this new segment, so you would need to find the new ords for the values in 23,45 and 76, don't you? To find it, assuming the values are s1,s2,s3, you would do a bin. search on the new val array, and find index for s1,s2,s3. It's that inversion (from ord->Comparable in first seg, and Comparable->ord in second seg) that I'm trying to avoid (w/ this new proposal). Which is 3 bin searches per convert, I am not sure how you can short circuit it. Are you suggesting we call Comparable on compareBottom until some doc beats it?
I'm saying on seg transition you indeed get the Comparable for current bottom, but, don't attempt to invert it. Instead, as seg 2 finds a hit, you get that hit's Comparables and compare to bottom. If it beats bottom, it goes into the queue. If it does not, you use the ord (in seg 2's ord space) to learn a bottom in the ord space of seg 2. That would hurt performance I lot though, no? Yeah I think likely it would, since we're talking about a binary search on transition VS having to do possibly many upgrade-to-Comparable and compare-Comparabls to slowly learn the equivalent ord in the new segment. I was proposing it for cases where inversion is very difficult. But realistically, since you must keep around the ful ord - Comparable for every segment anyway (in order to merge in the end), inversion shouldn't ever actually be difficult -- it'd just be a binary search on presumably in-RAM storage. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org mailto:java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org mailto:java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
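The "inversion" discussed above (recover the Comparable for each bottom ord, then binary-search it in the next segment's sorted value array) can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patch or from Lucene: `OrdRemap`, `Remapped`, `remap` and `remapAll` are made-up names, and the segment's values are assumed to be a plain sorted `String[]` in which the array index is the ord.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Lucene class): maps values into a segment's ord space. */
final class OrdRemap {

    /** Position of a value in a segment's ord space, and whether the value exists there. */
    static final class Remapped {
        final int ord;        // exact ord, or the insertion point when the value is absent
        final boolean exact;  // true if the value is present in this segment

        Remapped(int ord, boolean exact) {
            this.ord = ord;
            this.exact = exact;
        }
    }

    /** One binary search per value; sortedTerms is assumed sorted, with index == ord. */
    static Remapped remap(String[] sortedTerms, String value) {
        int idx = Arrays.binarySearch(sortedTerms, value);
        if (idx >= 0) {
            return new Remapped(idx, true);
        }
        // Arrays.binarySearch returns -(insertionPoint) - 1 when the key is absent,
        // so the value sorts strictly before the term at ord insertionPoint.
        return new Remapped(-idx - 1, false);
    }

    /** Multi-valued bottom: k values cost k binary searches per segment transition. */
    static Remapped[] remapAll(String[] sortedTerms, String[] bottomValues) {
        Remapped[] out = new Remapped[bottomValues.length];
        for (int i = 0; i < bottomValues.length; i++) {
            out[i] = remap(sortedTerms, bottomValues[i]);
        }
        return out;
    }
}
```

For a bottom holding three values this is the "3 bin searches per convert" John mentions; the `exact` flag matters because a value that is absent from the new segment can only be placed between two ords, not on one.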
Re: lucene 2.9 sorting algorithm
Mark,

We're not seeing exactly the numbers that Mike is seeing in his tests, running with JDK 1.5 on Intel Macs, so we're trying to eliminate factors of difference. Point 2 does indeed make a difference, we've seen it, and it's only fair: the single-PQ comparator does this branch optimization but the current multi-PQ patch does not, so let's level the playing field. John's on the road with limited net connectivity, but we'll have more numbers to compare over the weekend for sure.

-jake
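To make "point 2" concrete, here is a small sketch contrasting a branching comparison with the subtraction form from John's one-liner (`index.order[i.doc] - index.order[j.doc]`). The class and method names are hypothetical, and nothing here says which form is faster on a given JVM; that is exactly what the thread is trying to measure. One known caveat is that subtraction is only a valid comparison when the difference cannot overflow, which holds for non-negative ords but not for arbitrary int values.

```java
/** Hypothetical illustration of the two comparison styles; not code from the patch. */
final class OrdCompare {

    /** Branching form: always correct, for any pair of ints. */
    static int compareBranching(int a, int b) {
        if (a < b) {
            return -1;
        }
        if (a > b) {
            return 1;
        }
        return 0;
    }

    /**
     * Subtraction form: no branches, but only the sign of the result is meaningful,
     * and it is only correct when a - b cannot overflow (e.g. both are non-negative ords).
     */
    static int compareBySubtraction(int a, int b) {
        return a - b;
    }
}
```

For ords such as 23 and 45 both forms agree in sign. For raw int field values, `compareBySubtraction(Integer.MIN_VALUE, 1)` wraps around to a positive number, so an int comparator would need the branching form (or a widening to long) to stay correct.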
Re: lucene 2.9 sorting algorithm
I am patient :) And I'm not speaking for Mike, I'm speaking for me. I'm wondering what you're seeing. Asking Mike to rerun the tests without giving any further info (you didn't even say that you're seeing something different) is unfair to the rest of us ;) Giving 0 info along with your request just makes 0 sense to me, and I said as much.
Re: lucene 2.9 sorting algorithm
Thanks - that's all I'm asking for: a simple explanation of why you'd ask for a retest with those two things changed. It just seems like holding your cards a little too close to say "please do this" with 0 explanation.

As to point 2, that's fine - I'm sure it helps - I was just saying I didn't buy that it helps by 20-40%. I'm not arguing against doing it, but since the request had no info, the only thing I could assume was that it was supposed to change things.

I was about to run some of these tests myself (if I can find what darn revision to patch), and it's a bit frustrating to see that you guys knew something but were not telling ...
Re: lucene 2.9 sorting algorithm
Mike: I did just post what I saw, feel free to read and comment on it. I am simply trying to work with Michael on this and trying to understand the code. As I have expressed previously, I have seen a significant difference between 1.5 and 1.6. Since Mike has posted some numbers on JDK 1.6, I was hoping to eliminate all variables relating to the index and environment and see if he sees the same thing. I guess I should be more clear in the email.

-John
Re: lucene 2.9 sorting algorithm
For some reason I guess this didn't go through and caused all the confusion.

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|all|100|rand string|10|91.76|108.63|{color:green}18.4%{color}|
|log|all|100|rand string|25|92.39|106.79|{color:green}15.6%{color}|
|log|all|100|rand string|50|91.30|104.02|{color:green}13.9%{color}|
|log|all|100|rand string|500|86.16|63.27|{color:red}-26.6%{color}|
|log|all|100|rand string|1000|76.92|64.85|{color:red}-15.7%{color}|
|log|all|100|country|10|92.42|108.78|{color:green}17.7%{color}|
|log|all|100|country|25|92.60|106.26|{color:green}14.8%{color}|
|log|all|100|country|50|92.64|103.76|{color:green}12.0%{color}|
|log|all|100|country|500|83.92|50.30|{color:red}-40.1%{color}|
|log|all|100|country|1000|74.78|46.59|{color:red}-37.7%{color}|
|log|all|100|rand int|10|114.03|114.85|{color:green}0.7%{color}|
|log|all|100|rand int|25|113.77|112.92|{color:red}-0.7%{color}|
|log|all|100|rand int|50|113.36|109.56|{color:red}-3.4%{color}|
|log|all|100|rand int|500|103.90|66.29|{color:red}-36.2%{color}|
|log|all|100|rand int|1000|89.52|70.67|{color:red}-21.1%{color}|
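For anyone re-running the benchmark and comparing against the table above, the "Pct change" column appears to be the relative change of "QPS new" against "QPS old". That reading is an inference from the rows rather than something stated in the thread, and the class and method names below are made up for illustration.

```java
/** Hypothetical helper for reading the results table; the formula is inferred from its rows. */
final class QpsDelta {

    /** Relative change of qpsNew against qpsOld, in percent. */
    static double pctChange(double qpsOld, double qpsNew) {
        return (qpsNew - qpsOld) / qpsOld * 100.0;
    }
}
```

With the first "rand string" row, `pctChange(91.76, 108.63)` comes out at roughly 18.4, and the Top N = 500 row, `pctChange(86.16, 63.27)`, at roughly -26.6, matching the posted figures to one decimal place.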
Re: lucene 2.9 sorting algorithm
> I guess I should be more clear in the email.

No - if you mentioned the other info before and I missed it, just say: "Mark, you don't know what you're talking about and you missed the info." That's what I'd do. You just caught me at a time when I'm trying to get these tests going myself, and a little frustrated at the lack of info. I'd consider trying Java 6 vs Java 1.5 or something on Linux, but with no reason why I should, it's like ... come on - throw me a bone.
Re: lucene 2.9 sorting algorithm
Mark: There is no reason for me to withhold information. I just want to understand and share my findings. My bad for not being clear. Mike's test is actually very well written, I just followed instructions in the jira and got it running. I think the tests have good coverage and show the symptoms the algorithms would suggest.

-John
Re: lucene 2.9 sorting algorithm
John Wang wrote:

> Mark: There is no reason for me to withhold information. I just want to understand and share my findings.

Right, I didn't mean to accuse you of that ;) Not that you were doing it on purpose. I was just trying to string out more :) Which I've managed to do - in my usual awkward, ending-up-in-an-email-thread way. Success :)

> My bad for not being clear. Mike's test is actually very well written, I just followed instructions in the jira and got it running. I think the tests have good coverage and show the symptoms the algorithms would suggest.

Yeah, I'm not complaining about his tests - I'm just trying to find a version of Lucene that it will patch into cleanly.
Re: lucene 2.9 sorting algorithm
bq. I just followed instructions in the jira and got it running.

Heh - I didn't read down far enough - first comment says 2.9 branch. Thanks ; ) I've been flipping through revisions for a while now, wondering how the heck the revs in the patch match up with trunk.

John Wang wrote:
Mark: There is no reason for me to withhold information. I just want to understand and share my findings. My bad for not being clear. Mike's test is actually very well written, I just followed instructions in the jira and got it running. I think the tests have good coverage and show the symptoms the algorithms would suggest. -John

On Thu, Oct 22, 2009 at 7:42 PM, Mark Miller markrmil...@gmail.com wrote:
Thanks - that's all I'm asking for. A simple explanation of why you'd ask for a retest with those two things changed. Just seems it's holding your cards a little too close to say - please do this with 0 explanation. As to point 2, that's fine - I'm sure it helps - I was just saying I didn't buy it helps by 20-40%. Not arguing against doing it, but since the request had no info, the only thing I could assume was that that was supposed to change things. I was about to run some of these tests myself (if I can find what darn revision to patch), and it's a bit frustrating to see you guys knew something but were not telling ...

Jake Mannix wrote:
Mark, We're not seeing exactly the numbers that Mike is seeing in his tests, running with jdk 1.5 on intel macs, so we're trying to eliminate factors of difference. Point 2 does indeed make a difference, we've seen it, and it's only fair: the single pq comparator does this branch optimization but the current patch multi-pq does not, so let's level the playing field. John's on the road with limited net connectivity, but we'll have some numbers to compare more over the weekend for sure.
-jake

On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller markrmil...@gmail.com wrote:
Why? What might he find? What's with the cryptic request? Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains? I know point 2 certainly doesn't. Cards on the table?

John Wang wrote:
Hey Michael: Would you mind rerunning the test you have with jdk1.5? Also, if you would, change the comparator method to avoid branching for int and string comparators, e.g. return index.order[i.doc] - index.order[j.doc]; Thanks -John

On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless luc...@mikemccandless.com wrote:
On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote:
I have been playing with the patch, and I think I have some information that you might like. Let me spend some time and gather some more numbers and update in jira.

Excellent!

say bottom has ords 23, 45, 76, each corresponding to a string. When moving to the next segment, you need to make bottom to have ords that can be comparable to other docs in this new segment, so you would need to find the new ords for the values in 23,45 and 76, don't you? To find it, assuming the values are s1,s2,s3, you would do a bin. search on the new val array, and find index for s1,s2,s3.

It's that inversion (from ord->Comparable in first seg, and Comparable->ord in second seg) that I'm trying to avoid (w/ this new proposal).

Which is 3 bin searches per convert, I am not sure how you can short circuit it. Are you
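The ord conversion discussed in the quoted exchange above (look up each bottom value in the next segment's sorted value array) could be sketched roughly as follows. This is only an illustration: the class name, method name, and sample arrays are made up for the example and are not taken from the patch. It assumes each segment keeps its distinct values in a sorted String array, so an ord is simply an index into that array.

```java
import java.util.Arrays;

final class OrdRemapSketch {
    /**
     * For each bottom ord, resolves the value in the old segment (ord -> value),
     * then binary-searches the new segment's sorted values (value -> ord).
     * A negative result is what Arrays.binarySearch returns for a missing key,
     * i.e. -(insertionPoint) - 1.
     */
    static int[] remapOrds(String[] oldValues, int[] bottomOrds, String[] newValues) {
        int[] remapped = new int[bottomOrds.length];
        for (int i = 0; i < bottomOrds.length; i++) {
            String value = oldValues[bottomOrds[i]];               // ord -> Comparable
            remapped[i] = Arrays.binarySearch(newValues, value);   // Comparable -> ord
        }
        return remapped;
    }

    public static void main(String[] args) {
        String[] oldValues = {"apple", "cherry", "melon"};          // previous segment
        String[] newValues = {"banana", "cherry", "melon", "pear"}; // next segment
        int[] remapped = remapOrds(oldValues, new int[] {0, 2}, newValues);
        System.out.println(Arrays.toString(remapped));
    }
}
```

One binary search is needed per bottom value, which is the "3 bin searches per convert" cost John mentions for three values.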
Re: lucene 2.9 sorting algorithm
On Thu, Oct 22, 2009 at 10:35 PM, John Wang john.w...@gmail.com wrote: Please be patient with me. I am seeing a difference and was wondering if Mike would see the same thing. Some differences are bound to be seen... with your changes (JVM changes, branch optimizations), are you seeing better average performance with multiPQ? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene 2.9 sorting algorithm
It's hard to read the column format, but if you look up above in the thread from tonight, you can see that yes, for PQ sizes less than 100 elements, multiPQ is better, and only starts to be worse at around 100 for strings, and 50 for ints.

-jake

On Thu, Oct 22, 2009 at 8:06 PM, Yonik Seeley yo...@lucidimagination.com wrote:
On Thu, Oct 22, 2009 at 10:35 PM, John Wang john.w...@gmail.com wrote:
Please be patient with me. I am seeing a difference and was wondering if Mike would see the same thing.

Some differences are bound to be seen... with your changes (JVM changes, branch optimizations), are you seeing better average performance with multiPQ?

-Yonik
http://www.lucidimagination.com
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769039#action_12769039 ]

Mark Miller commented on LUCENE-1997:

Results from John Wang:

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|all|100|rand string|10|91.76|108.63|{color:green}18.4%{color}|
|log|all|100|rand string|25|92.39|106.79|{color:green}15.6%{color}|
|log|all|100|rand string|50|91.30|104.02|{color:green}13.9%{color}|
|log|all|100|rand string|500|86.16|63.27|{color:red}-26.6%{color}|
|log|all|100|rand string|1000|76.92|64.85|{color:red}-15.7%{color}|
|log|all|100|country|10|92.42|108.78|{color:green}17.7%{color}|
|log|all|100|country|25|92.60|106.26|{color:green}14.8%{color}|
|log|all|100|country|50|92.64|103.76|{color:green}12.0%{color}|
|log|all|100|country|500|83.92|50.30|{color:red}-40.1%{color}|
|log|all|100|country|1000|74.78|46.59|{color:red}-37.7%{color}|
|log|all|100|rand int|10|114.03|114.85|{color:green}0.7%{color}|
|log|all|100|rand int|25|113.77|112.92|{color:red}-0.7%{color}|
|log|all|100|rand int|50|113.36|109.56|{color:red}-3.4%{color}|
|log|all|100|rand int|500|103.90|66.29|{color:red}-36.2%{color}|
|log|all|100|rand int|1000|89.52|70.67|{color:red}-21.1%{color}|

Explore performance of multi-PQ vs single-PQ sorting API

Key: LUCENE-1997
URL: https://issues.apache.org/jira/browse/LUCENE-1997
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Attachments: LUCENE-1997.patch, LUCENE-1997.patch

Spinoff from recent lucene 2.9 sorting algorithm thread on java-dev, where a simpler (non-segment-based) comparator API is proposed that gathers results into multiple PQs (one per segment) and then merges them in the end. I started from John's multi-PQ code and worked it into contrib/benchmark so that we could run perf tests.

Then I generified the Python script I use for running search benchmarks (in contrib/benchmark/sortBench.py). The script first creates indexes with 1M docs (based on SortableSingleDocSource, and based on wikipedia, if available). Then it runs various combinations:

* Index with 20 balanced segments vs index with the normal log segment size
* Queries with different numbers of hits (only for wikipedia index)
* Different top N
* Different sorts (by title, for wikipedia, and by random string, random int, and country for the random index)

For each test, 7 search rounds are run and the best QPS is kept. The script runs singlePQ then multiPQ, and records the resulting best QPS for each and produces table (in Jira format) as output.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
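As a rough illustration of the multi-PQ idea named in the issue description (one priority queue per segment, merged at the end), here is a minimal sketch. It is not the code from the patch: the class and method names are hypothetical, plain int arrays stand in for per-document sort keys, and smaller keys are assumed to sort first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class MultiPqSketch {
    /** Keeps the topN smallest keys of each segment in its own queue, then merges. */
    static List<Integer> topN(int[][] segments, int topN) {
        List<Integer> merged = new ArrayList<>();
        for (int[] segment : segments) {
            // Max-heap: the head is the weakest (largest) key currently kept.
            PriorityQueue<Integer> pq = new PriorityQueue<>(Collections.reverseOrder());
            for (int key : segment) {
                if (pq.size() < topN) {
                    pq.add(key);
                } else if (key < pq.peek()) {
                    pq.poll();      // evict the weakest hit of this segment
                    pq.add(key);
                }
            }
            merged.addAll(pq);      // values are only compared across segments here
        }
        Collections.sort(merged);
        return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
    }

    public static void main(String[] args) {
        int[][] segments = {{5, 1, 9}, {3, 7}, {2, 8, 4}};
        System.out.println(topN(segments, 3));
    }
}
```

Within a segment the queue only ever compares keys from that segment, which is why segment-local ords would suffice during collection; real values are needed only in the final merge. The price is that up to topN entries are kept per segment, which fits the pattern in the tables where multi-PQ falls behind as top N grows.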
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769042#action_12769042 ]

Jake Mannix commented on LUCENE-1997:

Hah! Thanks for posting that, Mark! Much easier to read. :) Hey John, can you comment with your hardware specs on this, so it can be recorded for posterity? ;)
Re: lucene 2.9 sorting algorithm
Hi Yonik,

I am, but I don't think I should. Even with branching etc., I shouldn't see that much of a consistent difference. I am traveling with my macbook pro, I wanted to eliminate all variables. It really does not make sense to me...

-John

On Thu, Oct 22, 2009 at 8:06 PM, Yonik Seeley yo...@lucidimagination.com wrote:
On Thu, Oct 22, 2009 at 10:35 PM, John Wang john.w...@gmail.com wrote:
Please be patient with me. I am seeing a difference and was wondering if Mike would see the same thing.

Some differences are bound to be seen... with your changes (JVM changes, branch optimizations), are you seeing better average performance with multiPQ?

-Yonik
http://www.lucidimagination.com
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769045#action_12769045 ]

John Wang commented on LUCENE-1997:

My machine HW spec:

Model Name: MacBook Pro
Model Identifier: MacBookPro3,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2.4 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 4 GB
Bus Speed: 800 MHz
Re: lucene 2.9 sorting algorithm
On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix jake.man...@gmail.com wrote: It's hard to read the column format, but if you look up above in the thread from tonight, you can see that yes, for PQ sizes less than 100 elements, multiPQ is better, and only starts to be worse at around 100 for strings, and 50 for ints. Ah, OK, I had missed John's followup with the numbers. I assume this is for Java5 + optimizations? What does Java6 show? bq. Point 2 does indeed make a difference, we've seen it, and it's only fair: the single pq comparator does this branch optimization but the current patch multi-pq does not, so let's level the playing field. Of course - it's not about leveling the playing field, but finding the best solution for the average case - so everything should be optimized as much as possible. There are probably further optimizations possible in both the single and multi PQ cases. My biggest reservation is that we've gone down the road of telling people to implement a new style of comparators, and told them that the old style comparators would be deleted in the next release (which is where we are). Reversing that will be a bit of a headache/question... the new stuff isn't deprecated, and having *both* isn't desirable, but that's a separate decision to be made apart from performance testing. Is there also an option of using a multiPQ approach with the new style comparators? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: lucene 2.9 sorting algorithm
On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley yo...@lucidimagination.com wrote:
On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix jake.man...@gmail.com wrote:
It's hard to read the column format, but if you look up above in the thread from tonight, you can see that yes, for PQ sizes less than 100 elements, multiPQ is better, and only starts to be worse at around 100 for strings, and 50 for ints.

Ah, OK, I had missed John's followup with the numbers. I assume this is for Java5 + optimizations?

Yeah, this was for Java5 + optimizations.

What does Java6 show?

Java6 on Mac showed close to what Mike posted in his report on the Jira ticket - that single-PQ performs a little better for small pq, and more like 30-40% better for large pq.

My biggest reservation is that we've gone down the road of telling people to implement a new style of comparators, and told them that the old style comparators would be deleted in the next release (which is where we are). Reversing that will be a bit of a headache/question... the new stuff isn't deprecated, and having *both* isn't desirable, but that's a separate decision to be made apart from performance testing.

Well the issue comes down to: if the performance is *basically comparable* between the two approaches, then the new API is much harder for the average user to use, and even for the experienced user, it's not terribly fun, and more importantly: for the user who has already implemented custom sorts on the old API, upgrading is enough trouble that people may decide it's not worth it. It probably *is* worth it, but if you're going to even put that kind of thinking in the user's head, you've got to ask yourself: what's the reasoning for going with a more complex API if you can get equal (slightly better in some cases, slightly worse in others) performance with a simpler API? Yes, as Mike says, the new API is *not* breaking back-compat in a functional sense, but how many users have converted to the new sorting api already?
2.9 has barely just come out, and while it's work for the community as a whole to reconsider the multi-segment sorting api, and work to implement a change at this level, if it's the right thing to do, we shouldn't let the question of which method is deprecated dictate which one *should* be deprecated. Is there also an option of using a multiPQ approach with the new style comparators? For the record: that would be the worst of all worlds, in my view: harder API with only better performance in some cases, and sometimes worse performance. -jake
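To make the API-complexity point concrete, here is a sketch of what a slot-based, per-segment comparator involves when a single queue is shared across segments. It is an illustration under stated assumptions, not Lucene's actual comparator class: IntSlotComparator, setNextSegment, weakestSlot and topN are hypothetical names (compareBottom echoes the method mentioned earlier in the thread), int keys stand in for sort values, n is assumed to be at least 1, and a linear scan replaces the real heap.

```java
import java.util.Arrays;

final class SlotComparatorSketch {

    /** Hypothetical slot-based comparator over int sort keys. */
    static final class IntSlotComparator {
        private final int[] slotValues;  // sort key stored for each queue slot
        private int[] segmentValues;     // sort key per doc in the current segment
        private int bottom;              // sort key of the weakest hit kept so far

        IntSlotComparator(int numSlots) { slotValues = new int[numSlots]; }

        void setNextSegment(int[] values) { segmentValues = values; }
        void copy(int slot, int doc) { slotValues[slot] = segmentValues[doc]; }
        void setBottom(int slot) { bottom = slotValues[slot]; }
        int value(int slot) { return slotValues[slot]; }

        int compare(int slot1, int slot2) {
            return Integer.compare(slotValues[slot1], slotValues[slot2]);
        }

        /** Positive when the doc sorts before the current bottom (it is competitive). */
        int compareBottom(int doc) {
            return Integer.compare(bottom, segmentValues[doc]);
        }
    }

    /** Returns the slot holding the largest (weakest) key among the first n slots. */
    static int weakestSlot(IntSlotComparator cmp, int n) {
        int worst = 0;
        for (int slot = 1; slot < n; slot++) {
            if (cmp.compare(slot, worst) > 0) worst = slot;
        }
        return worst;
    }

    /** One queue for all segments; the comparator is re-pointed at each segment. */
    static int[] topN(int[][] segments, int n) {
        IntSlotComparator cmp = new IntSlotComparator(n);
        int filled = 0;
        int bottomSlot = 0;
        for (int[] segment : segments) {
            cmp.setNextSegment(segment);
            for (int doc = 0; doc < segment.length; doc++) {
                if (filled < n) {
                    cmp.copy(filled++, doc);          // queue not full yet
                } else if (cmp.compareBottom(doc) > 0) {
                    cmp.copy(bottomSlot, doc);        // replace the weakest hit
                } else {
                    continue;                         // not competitive
                }
                if (filled == n) {
                    bottomSlot = weakestSlot(cmp, n);
                    cmp.setBottom(bottomSlot);
                }
            }
        }
        int[] result = new int[filled];
        for (int slot = 0; slot < filled; slot++) result[slot] = cmp.value(slot);
        Arrays.sort(result);
        return result;
    }
}
```

With int keys, switching segments is trivial. For string sorts that compare by ord, the segment switch would also have to translate the stored slots and the bottom into the new segment's ords, which is the conversion step debated earlier in the thread.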
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769051#action_12769051 ]

Mark Miller commented on LUCENE-1997:

Another run: I made the changes to int/string comparator to do the faster compare.

Java 1.5.0_20 Laptop Quad Core - 2.0 Ghz Ubuntu 9.10 Kernel 2.6.31 4 GB RAM

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|1|317925|title|10|87.38|75.42|{color:red}-13.7%{color}|
|log|1|317925|title|25|86.55|74.49|{color:red}-13.9%{color}|
|log|1|317925|title|50|90.49|71.90|{color:red}-20.5%{color}|
|log|1|317925|title|100|88.07|83.08|{color:red}-5.7%{color}|
|log|1|317925|title|500|76.67|54.34|{color:red}-29.1%{color}|
|log|1|317925|title|1000|69.29|38.54|{color:red}-44.4%{color}|
|log|all|100|title|10|109.01|92.78|{color:red}-14.9%{color}|
|log|all|100|title|25|108.30|89.43|{color:red}-17.4%{color}|
|log|all|100|title|50|107.19|85.86|{color:red}-19.9%{color}|
|log|all|100|title|100|94.84|80.25|{color:red}-15.4%{color}|
|log|all|100|title|500|78.84|49.10|{color:red}-37.7%{color}|
|log|all|100|title|1000|72.52|26.90|{color:red}-62.9%{color}|
|log|all|100|rand string|10|115.32|101.53|{color:red}-12.0%{color}|
|log|all|100|rand string|25|115.22|91.82|{color:red}-20.3%{color}|
|log|all|100|rand string|50|114.40|89.70|{color:red}-21.6%{color}|
|log|all|100|rand string|100|91.30|81.04|{color:red}-11.2%{color}|
|log|all|100|rand string|500|76.31|43.94|{color:red}-42.4%{color}|
|log|all|100|rand string|1000|67.33|28.29|{color:red}-58.0%{color}|
|log|all|100|country|10|115.40|101.46|{color:red}-12.1%{color}|
|log|all|100|country|25|115.06|92.15|{color:red}-19.9%{color}|
|log|all|100|country|50|114.03|90.06|{color:red}-21.0%{color}|
|log|all|100|country|100|99.30|80.07|{color:red}-19.4%{color}|
|log|all|100|country|500|75.64|43.44|{color:red}-42.6%{color}|
|log|all|100|country|1000|66.05|27.94|{color:red}-57.7%{color}|
|log|all|100|rand int|10|118.47|109.30|{color:red}-7.7%{color}|
|log|all|100|rand int|25|118.72|99.37|{color:red}-16.3%{color}|
|log|all|100|rand int|50|118.25|95.14|{color:red}-19.5%{color}|
|log|all|100|rand int|100|97.57|83.39|{color:red}-14.5%{color}|
|log|all|100|rand int|500|86.55|46.21|{color:red}-46.6%{color}|
|log|all|100|rand int|1000|78.23|28.94|{color:red}-63.0%{color}|
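For readers wondering what "the faster compare" refers to: John's request earlier in the thread was to replace a branching comparison with a plain subtraction of ords. A small sketch of the two forms follows; the class and method names are hypothetical and this is not the code from the patch. Note that subtraction only gives the right sign when the difference cannot overflow, which holds for non-negative ords but not for arbitrary int values.

```java
final class OrdCompareSketch {
    /** Conventional three-way compare with branches. */
    static int compareBranching(int a, int b) {
        if (a < b) return -1;
        if (a > b) return 1;
        return 0;
    }

    /** Branch-free compare; safe only when a - b cannot overflow (e.g. ords >= 0). */
    static int compareSubtract(int a, int b) {
        return a - b;
    }

    public static void main(String[] args) {
        // Ords are small non-negative ints, so both forms agree in sign.
        System.out.println(Integer.signum(compareSubtract(23, 45)) == compareBranching(23, 45));
        // With arbitrary ints the subtraction can wrap around and flip the sign.
        System.out.println(compareSubtract(Integer.MIN_VALUE, 1) > 0);
    }
}
```

So the subtraction form fits a string comparator that compares ords, while an int comparator over arbitrary field values would need the values to be bounded for it to stay correct.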
Re: lucene 2.9 sorting algorithm
The new API is much harder for the average user to use, and even for the experienced user, it's not terribly fun, and more importantly:

Do we have enough info to support that though? All the cases I have seen on the list, people have figured it out pretty easily - haven't really seen any complaints in that regard (not counting you and John - that is two). The only other complaints I have noticed are those that happened to count on unsupported behavior (eg people counting on no MultiSearcher use).

I think Uwe had some good ideas for exposing an easier API with the new one.
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769053#action_12769053 ]

Yonik Seeley commented on LUCENE-1997:

While Java5 numbers are still important, I'd say that Java6 (-server of course) should be weighted far heavier? That must be what a majority of people are running in production for new systems?
[jira] Issue Comment Edited: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769051#action_12769051 ]

Mark Miller edited comment on LUCENE-1997 at 10/23/09 4:29 AM:

Another run: I made the changes to int/string comparator to do the faster compare.

Java 1.5.0_20 Laptop - 64bit OS - 64bit JVM - 64bit Quad Core - 2.0 Ghz Ubuntu 9.10 Kernel 2.6.31 4 GB RAM

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|1|317925|title|10|87.38|75.42|{color:red}-13.7%{color}|
|log|1|317925|title|25|86.55|74.49|{color:red}-13.9%{color}|
|log|1|317925|title|50|90.49|71.90|{color:red}-20.5%{color}|
|log|1|317925|title|100|88.07|83.08|{color:red}-5.7%{color}|
|log|1|317925|title|500|76.67|54.34|{color:red}-29.1%{color}|
|log|1|317925|title|1000|69.29|38.54|{color:red}-44.4%{color}|
|log|all|100|title|10|109.01|92.78|{color:red}-14.9%{color}|
|log|all|100|title|25|108.30|89.43|{color:red}-17.4%{color}|
|log|all|100|title|50|107.19|85.86|{color:red}-19.9%{color}|
|log|all|100|title|100|94.84|80.25|{color:red}-15.4%{color}|
|log|all|100|title|500|78.84|49.10|{color:red}-37.7%{color}|
|log|all|100|title|1000|72.52|26.90|{color:red}-62.9%{color}|
|log|all|100|rand string|10|115.32|101.53|{color:red}-12.0%{color}|
|log|all|100|rand string|25|115.22|91.82|{color:red}-20.3%{color}|
|log|all|100|rand string|50|114.40|89.70|{color:red}-21.6%{color}|
|log|all|100|rand string|100|91.30|81.04|{color:red}-11.2%{color}|
|log|all|100|rand string|500|76.31|43.94|{color:red}-42.4%{color}|
|log|all|100|rand string|1000|67.33|28.29|{color:red}-58.0%{color}|
|log|all|100|country|10|115.40|101.46|{color:red}-12.1%{color}|
|log|all|100|country|25|115.06|92.15|{color:red}-19.9%{color}|
|log|all|100|country|50|114.03|90.06|{color:red}-21.0%{color}|
|log|all|100|country|100|99.30|80.07|{color:red}-19.4%{color}|
|log|all|100|country|500|75.64|43.44|{color:red}-42.6%{color}|
|log|all|100|country|1000|66.05|27.94|{color:red}-57.7%{color}|
|log|all|100|rand int|10|118.47|109.30|{color:red}-7.7%{color}|
|log|all|100|rand int|25|118.72|99.37|{color:red}-16.3%{color}|
|log|all|100|rand int|50|118.25|95.14|{color:red}-19.5%{color}|
|log|all|100|rand int|100|97.57|83.39|{color:red}-14.5%{color}|
|log|all|100|rand int|500|86.55|46.21|{color:red}-46.6%{color}|
|log|all|100|rand int|1000|78.23|28.94|{color:red}-63.0%{color}|
Re: lucene 2.9 sorting algorithm
On Thu, Oct 22, 2009 at 9:25 PM, Mark Miller markrmil...@gmail.com wrote:
The new API is much harder for the average user to use, and even for the experienced user, it's not terribly fun, and more importantly:

Do we have enough info to support that though? All the cases I have seen on the list, people have figured it out pretty easily - haven't really seen any complaints in that regard (not counting you and John - that is two). The only other complaints I have noticed are those that happened to count on unsupported behavior (eg people counting on no MultiSearcher use)

John and I and TomS all found it complex, and we're all pretty serious users of inner lucene apis. You see *core developers* saying the api seems fine. Have you seen *any users* of the new sorting api say anything positive about it? Do you know of *anyone* who has implemented the new comparator interface at all, let alone *likes* it? 3 negative votes by users, in comparison to *zero* positive votes by users, together with a bunch of core developers saying "yeah, it looks easy, what are you guys complaining about?". Internal apis take a while to percolate out to the user base - we're only the first few running into this, and while the sample size is small, it shouldn't be discounted.

Yes, of course it is possible to migrate to the new APIs - which is what we, as well as many others, were in the process of doing. This is just an example of an API which got more complex in going to 2.9, and unlike the Collector API, it's possible that in this case it wasn't necessary for it to be as complex as it became.

-jake
[jira] Issue Comment Edited: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769055#action_12769055 ]

Mark Miller edited comment on LUCENE-1997 at 10/23/09 4:37 AM:
---
Hey John, did you pull from a wiki dump or use the random index?

*edit* NM - that explains your shortened table - no wiki results - I got it.

was (Author: markrmil...@gmail.com):
Hey John, did you pull from a wiki dump or use the random index?

Explore performance of multi-PQ vs single-PQ sorting API

Key: LUCENE-1997
URL: https://issues.apache.org/jira/browse/LUCENE-1997
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Attachments: LUCENE-1997.patch, LUCENE-1997.patch

Spinoff from the recent "lucene 2.9 sorting algorithm" thread on java-dev, where a simpler (non-segment-based) comparator API is proposed that gathers results into multiple PQs (one per segment) and then merges them at the end.

I started from John's multi-PQ code and worked it into contrib/benchmark so that we could run perf tests. Then I generified the Python script I use for running search benchmarks (in contrib/benchmark/sortBench.py).

The script first creates indexes with 1M docs (based on SortableSingleDocSource, and based on wikipedia, if available). Then it runs various combinations:
* Index with 20 balanced segments vs index with the normal log segment size
* Queries with different numbers of hits (only for wikipedia index)
* Different top N
* Different sorts (by title, for wikipedia, and by random string, random int, and country for the random index)

For each test, 7 search rounds are run and the best QPS is kept. The script runs singlePQ then multiPQ, records the resulting best QPS for each, and produces a table (in Jira format) as output.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
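The multi-PQ idea in the issue description (one queue per segment, merged at the end) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: sort values are plain integers, smaller is better, and ties are broken by segment and doc id. The names `collect_segment` and `top_n_multi_pq` are hypothetical and are not taken from the LUCENE-1997 patch or from Lucene itself.

```python
import heapq

def collect_segment(seg_id, values, n):
    """Keep the n best (smallest) hits of one segment in a bounded heap."""
    heap = []  # max-heap via negated keys; heap[0] is the worst hit currently kept
    for doc_id, value in enumerate(values):
        neg_key = (-value, -seg_id, -doc_id)
        if len(heap) < n:
            heapq.heappush(heap, neg_key)
        elif neg_key > heap[0]:
            # The new hit beats the current worst one, so it replaces it.
            heapq.heapreplace(heap, neg_key)
    # Undo the negation and return (value, segment, doc) in ascending sort order.
    return sorted((-v, -s, -d) for v, s, d in heap)

def top_n_multi_pq(segments, n):
    """One queue per segment, then a single merge of the per-segment results."""
    per_segment = [collect_segment(i, vals, n) for i, vals in enumerate(segments)]
    merged = heapq.merge(*per_segment)
    return [hit for _, hit in zip(range(n), merged)]

segments = [[5, 1, 9], [4, 8, 2], [7, 3, 6]]
print(top_n_multi_pq(segments, 3))
```

Each per-segment queue only ever compares hits from its own segment, which is why the comparator can stay simple; values from different segments meet only in the final merge.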
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769055#action_12769055 ] Mark Miller commented on LUCENE-1997: - Hey John, did you pull from a wiki dump or use the random index?
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769056#action_12769056 ] Jake Mannix commented on LUCENE-1997: - Java6 is standard in production servers, since when? What justified lucene staying java1.4 for so long if this is the case? In my own experience, my last job only moved to java1.5 a year ago, and at my current company, we're still on 1.5, and I've seen that be pretty common, and I'm in the Valley, where things update pretty quickly.
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769058#action_12769058 ] Jake Mannix commented on LUCENE-1997: - I would say that of course weighting more highly linux and solaris should be done over results on macs, because while I love my mac, I've yet to see a production cluster running on MacBook Pros... :)
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769059#action_12769059 ]

Yonik Seeley commented on LUCENE-1997:

bq. Java6 is standard in production servers, since when?

Maybe I'm wrong... it was just a guess. It's just what I've seen most customers deploying new projects on.

bq. What justified lucene staying java1.4 for so long if this is the case?

The decision of what JVM a business should use to deploy their new app is a very different one than what Lucene should require. A minority of users may be justification enough to avoid requiring a new JVM... unless the benefits are really that huge. Lucene does not target the JVM that most people will be deploying on - if that were the case, I have a feeling we'd be switching to Java6 instead of Java5.
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769060#action_12769060 ]

Mark Miller commented on LUCENE-1997:

Same system, Java 1.6.0_15

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|1|317925|title|10|105.46|97.11|{color:red}-7.9%{color}|
|log|1|317925|title|25|109.08|98.34|{color:red}-9.8%{color}|
|log|1|317925|title|50|108.01|93.99|{color:red}-13.0%{color}|
|log|1|317925|title|100|105.79|84.08|{color:red}-20.5%{color}|
|log|1|317925|title|500|91.12|50.28|{color:red}-44.8%{color}|
|log|1|317925|title|1000|80.51|33.59|{color:red}-58.3%{color}|
|log|all|100|title|10|113.89|105.39|{color:red}-7.5%{color}|
|log|all|100|title|25|113.14|102.13|{color:red}-9.7%{color}|
|log|all|100|title|50|111.30|96.51|{color:red}-13.3%{color}|
|log|all|100|title|100|86.77|83.86|{color:red}-3.4%{color}|
|log|all|100|title|500|78.00|42.15|{color:red}-46.0%{color}|
|log|all|100|title|1000|70.50|27.02|{color:red}-61.7%{color}|
|log|all|100|rand string|10|107.78|106.09|{color:red}-1.6%{color}|
|log|all|100|rand string|25|103.09|102.53|{color:red}-0.5%{color}|
|log|all|100|rand string|50|106.42|95.17|{color:red}-10.6%{color}|
|log|all|100|rand string|100|86.28|85.41|{color:red}-1.0%{color}|
|log|all|100|rand string|500|76.69|37.76|{color:red}-50.8%{color}|
|log|all|100|rand string|1000|68.48|22.95|{color:red}-66.5%{color}|
|log|all|100|country|10|103.36|106.79|{color:green}3.3%{color}|
|log|all|100|country|25|103.43|102.69|{color:red}-0.7%{color}|
|log|all|100|country|50|102.93|94.97|{color:red}-7.7%{color}|
|log|all|100|country|100|108.49|85.71|{color:red}-21.0%{color}|
|log|all|100|country|500|80.87|38.23|{color:red}-52.7%{color}|
|log|all|100|country|1000|67.24|22.79|{color:red}-66.1%{color}|
|log|all|100|rand int|10|120.59|112.03|{color:red}-7.1%{color}|
|log|all|100|rand int|25|119.80|107.49|{color:red}-10.3%{color}|
|log|all|100|rand int|50|119.96|98.84|{color:red}-17.6%{color}|
|log|all|100|rand int|100|88.58|89.24|{color:green}0.7%{color}|
|log|all|100|rand int|500|83.50|40.13|{color:red}-51.9%{color}|
|log|all|100|rand int|1000|74.80|23.83|{color:red}-68.1%{color}|
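The Pct change column in Mark's table is consistent with (QPS new - QPS old) / QPS old, and the issue description says the best QPS out of 7 rounds is kept. A minimal Python sketch of that arithmetic follows. It assumes this formula and a red/green rule keyed only on the sign of the change; `best_qps`, `pct_change`, and `jira_row` are hypothetical names and are not claimed to match anything in sortBench.py.

```python
def best_qps(round_qps):
    """Best (highest) queries-per-second figure across the search rounds."""
    return max(round_qps)

def pct_change(qps_old, qps_new):
    """Relative change of the new QPS against the old QPS, in percent."""
    return 100.0 * (qps_new - qps_old) / qps_old

def jira_row(seg_size, query, tot_hits, sort, top_n, qps_old, qps_new):
    """Format one result row in Jira table markup.

    Assumed colour rule: red when the new QPS is lower, green otherwise.
    """
    pct = pct_change(qps_old, qps_new)
    color = "red" if pct < 0 else "green"
    return (f"|{seg_size}|{query}|{tot_hits}|{sort}|{top_n}"
            f"|{qps_old:.2f}|{qps_new:.2f}|{{color:{color}}}{pct:.1f}%{{color}}|")

# Rebuild the first row of the table from its two QPS figures.
print(jira_row("log", 1, 317925, "title", 10, 105.46, 97.11))
```

Keeping only the best round favours the warmed-up, least-disturbed run, so differences of a few percent between rows are probably within noise, while the large drops at top N of 500 and 1000 are not.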
Re: lucene 2.9 sorting algorithm
Jake Mannix wrote:

On Thu, Oct 22, 2009 at 9:25 PM, Mark Miller markrmil...@gmail.com wrote:

the new API is much harder for the average user to use, and even for the experienced user, it's not terribly fun, and more importantly:

Do we have enough info to support that though? In all the cases I have seen on the list, people have figured it out pretty easily - haven't really seen any complaints in that regard (not counting you and John - that is two). The only other complaints I have noticed are from those who happened to count on unsupported behavior (e.g. people counting on no MultiSearcher use)

John and I and TomS all found it complex, and we're all pretty serious users of inner lucene apis. You see *core developers* saying the api seems fine. Have you seen *any users* of the new sorting api say anything positive about it? Do you know of *anyone* who has implemented the new comparator interface at all, let alone *likes* it? That's 3 negative votes by users, in comparison to *zero* positive votes by users, together with a bunch of core developers saying, "yeah it looks easy, what are you guys complaining about?". Internal apis take a while to percolate out to the user base - we're only the first few running into this, and while the sample size is small, it shouldn't be discounted.

Yes, of course it is possible to migrate to the new APIs - which is what we, as well as many others, were in the process of doing. This is just an example of an API which got more complex in going to 2.9, and unlike the Collector API, it's possible that in this case it wasn't necessary for it to be as complex as it became.

-jake

Yes - I've seen a handful of non core devs report back that they upgraded with no complaints on the difficulty. It's in the mailing list archives. The only core dev I've seen say it's easy is Uwe. He's super sharp though, so I wasn't banking my comment on him ;)

--
- Mark
http://www.lucidimagination.com
Hudson build is back to normal: Lucene-trunk #987
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/987/changes
Re: lucene 2.9 sorting algorithm
On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller markrmil...@gmail.com wrote:

Yes - I've seen a handful of non core devs report back that they upgraded with no complaints on the difficulty. It's in the mailing list archives. The only core dev I've seen say it's easy is Uwe. He's super sharp though, so I wasn't banking my comment on him ;)

Upgrade custom sorting? Where has anyone talked about this? 2.9 is great, I like the new apis, they're great in general. It's just this multi-segment sorting we're talking about here.

-jake
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769085#action_12769085 ]

Mark Miller commented on LUCENE-1997:

bq. Java6 is standard in production servers, since when?

bq. Maybe I'm wrong... it was just a guess. It's just what I've seen most customers deploying new projects on.

That's my impression too - Java 1.6 is mainly just a bug fix and performance release and has been out for a while, so it's usually the choice I've seen. Sounds like Uwe thinks it's more buggy though, so who knows if that's a good idea :)
Re: lucene 2.9 sorting algorithm
Hi Yonik:

I have been head deep in this trying to find a good solution for the better part of the past two days. It's been hard because there are so many variables: 1) how optimized the code is in each of the implementations 2) VM differences 3) HW etc.

Also, there are quite a few dimensions this issue is being discussed on:

Algorithm: I think we should NOT jump to the conclusion that my numbers on the multiQ are valid until others reproduce them (which is one of the reasons I asked Mike to run his benchmark again with 1.5). I am gonna try to run it on server machines when I get back to my office next week. Overall, I think the single Q algorithm is better (it does, however, pay a price for some string compares etc.). Its benefit becomes more and more significant as the product of PQ size and segment count increases, which makes complete sense from the algorithm. However, when PQ size is small (which is most of the cases, and then the multiplier on the segment count is also small) the benefit is not as obvious, and sometimes the trade-off for the constant string compare cost may not be worth it (this remains a hypothesis). With Java 1.6, maybe the singleQ approach is a winner in all cases. I will spend more time to find a more definitive answer.

API: The new FieldComparator API is not difficult to understand (especially for Lucene experts such as yourselves), but it is more involved in comparison to the ScoreDocComparator API. I think anyone would agree with that. Furthermore, some custom comparators (examples I have given earlier in this thread) can be difficult to implement while maintaining performance. I understand changing an API is hard, which is why I am trying to raise this as soon as possible, and it could very well be that the current API is fine. Lucene's collector api allows anyone to plug in any sorting algorithm, kinda like what Mike has done with the tests. So it is ok if the API selected does not fit the needs of everyone.

In conclusion, please understand I am not trying to be right on this, just trying to learn and to understand, which I did from reading and trying to understand the code, along with guidance from Mike and Yonik, and I am more than impressed with the thought and code tuning that went into it.

Thanks
-John

On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix jake.man...@gmail.com wrote:

It's hard to read the column format, but if you look up above in the thread from tonight, you can see that yes, for PQ sizes less than 100 elements, multiPQ is better, and only starts to be worse at around 100 for strings, and 50 for ints.

Ah, OK, I had missed John's followup with the numbers. I assume this is for Java5 + optimizations? What does Java6 show?

bq. Point 2 does indeed make a difference, we've seen it, and it's only fair: the single pq comparator does this branch optimization but the current patch multi-pq does not, so let's level the playing field.

Of course - it's not about leveling the playing field, but finding the best solution for the average case - so everything should be optimized as much as possible. There are probably further optimizations possible in both the single and multi PQ cases.

My biggest reservation is that we've gone down the road of telling people to implement a new style of comparators, and told them that the old style comparators would be deleted in the next release (which is where we are). Reversing that will be a bit of a headache/question... the new stuff isn't deprecated, and having *both* isn't desirable, but that's a separate decision to be made apart from performance testing.

Is there also an option of using a multiPQ approach with the new style comparators?

-Yonik
http://www.lucidimagination.com
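John's remark that the single-queue benefit grows with the product of PQ size and segment count can be illustrated by counting queue insertions. The Python sketch below is a toy model, not a benchmark: it assumes uniformly random integer sort values, counts only how many hits enter a bounded queue, and ignores comparator cost, the single-queue conversion work at segment boundaries, and the final multi-queue merge. `count_insertions` and `compare` are hypothetical names.

```python
import heapq
import random

def count_insertions(keys, n):
    """Return how many keys enter a bounded top-n queue (smallest keys win)."""
    heap = []  # negated keys, so heap[0] is the worst key currently kept
    inserted = 0
    for value, doc_id in keys:
        neg = (-value, -doc_id)
        if len(heap) < n:
            heapq.heappush(heap, neg)
            inserted += 1
        elif neg > heap[0]:
            heapq.heapreplace(heap, neg)
            inserted += 1
    return inserted

def compare(num_segments, docs_per_segment, n, seed=0):
    """Insertions for one shared queue vs one fresh queue per segment."""
    rng = random.Random(seed)
    segments, doc_id = [], 0
    for _ in range(num_segments):
        seg = []
        for _ in range(docs_per_segment):
            seg.append((rng.randrange(1_000_000), doc_id))  # doc_id keeps keys unique
            doc_id += 1
        segments.append(seg)
    single = count_insertions([key for seg in segments for key in seg], n)
    multi = sum(count_insertions(seg, n) for seg in segments)
    return single, multi

for top_n in (10, 100, 1000):
    single, multi = compare(num_segments=20, docs_per_segment=5000, n=top_n)
    print(top_n, single, multi)
```

A fresh queue accepts its first n hits unconditionally, so the multi-queue count is at least n times the number of segments, while the shared queue carries its already competitive bottom from one segment into the next. That gap is one plausible reading of why the QPS tables diverge most at top N of 500 and 1000.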
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769088#action_12769088 ] Mark Miller commented on LUCENE-1997: - John, what happened to your topn:100 results?
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769089#action_12769089 ] Yonik Seeley commented on LUCENE-1997: -- There was a bad stretch in Java6... they plopped in a major JVM upgrade (not just bug fixes) and there were bugs. I think that's been behind us for a little while now though. If someone were starting a project today, I'd recommend the latest Java6 JVM.