Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
Hi Mike:
 I have been playing with the patch, and I think I have some information
that you might like.

 Let me spend some time gathering more numbers, and I will update the JIRA
issue.

Thanks

btw:

 About the conversion on multi-valued fields, I am not sure I get it
(sorry for being ignorant):

 Say bottom has ords 23, 45 and 76, each corresponding to a string. When
moving to the next segment, you need to give bottom ords that are comparable
to those of other docs in this new segment, so you would need to find the
new ords for the values behind 23, 45 and 76, wouldn't you? To find them,
assuming the values are s1, s2 and s3, you would do a binary search on the
new value array and find the index for each of s1, s2 and s3. That is 3
binary searches per conversion, and I am not sure how you can short-circuit
it. Are you suggesting we compare Comparables in compareBottom until some
doc beats it? That would hurt performance a lot though, no?
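
The conversion described above can be sketched as follows. This is only an
illustration, not Lucene code: the class OrdRemapSketch, the method remap and
the assumption that the new segment exposes its values as a sorted String[]
(index = ord) are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: maps the values held by "bottom"
// into the ord space of a new segment, one binary search per value.
final class OrdRemapSketch {

  /**
   * @param bottomValues  values of the bottom entry (s1, s2, s3 in the mail)
   * @param segmentValues sorted, distinct values of the new segment;
   *                      the index into this array is the ord (assumption)
   * @return for each value its ord in the new segment, or
   *         (-(insertion point) - 1) if the value does not occur there,
   *         exactly as Arrays.binarySearch reports it
   */
  static int[] remap(String[] bottomValues, String[] segmentValues) {
    int[] newOrds = new int[bottomValues.length];
    for (int i = 0; i < bottomValues.length; i++) {
      // O(log n) per value, so 3 values cost 3 binary searches.
      newOrds[i] = Arrays.binarySearch(segmentValues, bottomValues[i]);
    }
    return newOrds;
  }
}
```

A value that is missing from the new segment yields a negative result that
encodes its insertion point, which is still enough to compare it against the
ords of that segment.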

-John

On Wed, Oct 21, 2009 at 3:11 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Tue, Oct 20, 2009 at 11:55 AM, John Wang john.w...@gmail.com wrote:

  the simpler api places less restriction on the type of custom
  sorting that can be done.

 Just to verify: this is not a back-compat break, right?

 Because, in 2.4, such an interesting custom sort must've been
 operating at the top-level index reader level, which is easy to carry
 over to 2.9 (you just rebase the docIDs).

 But, of course in moving to 2.9, you would like to also switch your
 custom sort to be per-segment (for faster reopen/near real-time perf),
 but the new sort API makes this more difficult because it requires
 that you are able to compare hits across different segments during the
 search, not just at the end.

 But then I don't understand the difficulty of doing that: if we had a
 Collector with the MultiPQ approach, at the end during merge, you'd
 also have to compare results across segments, ie, upgrade your ords to
 their real values.  The MultiPQ approach does this by calling
 sortValue (returns Comparable) in the end.

 Putting performance aside for now... when comparing bottom, you don't
 actually have to truly invert Comparable -> ord on segment
 transition.  You could, instead, get the Comparable for each and
 compare, but then note the smallest ord for the current segment that
 has failed to compete, and short-circuit the compareBottom test by
 checking against that ord. That should enable carrying over the custom
 sort to the single PQ API without needing to invert ord -> value.

 We'd obviously have to test performance...

 Or, we could commit the MultiPQ approach as another sorting collector?
 I know it's not great having two wildly different sort APIs, but both
 APIs seem to have their strengths in different cases.

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Created: (LUCENE-2004) Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, so version detection is incorrect

2009-10-22 Thread Uwe Schindler (JIRA)
Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, 
so version detection is incorrect
---

 Key: LUCENE-2004
 URL: https://issues.apache.org/jira/browse/LUCENE-2004
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9.1, 3.0


When you compile your own code against the Lucene 2.9 JARs, use the
LUCENE_MAIN_VERSION constant, and then run that code against the 3.0 JAR,
the constant still contains 2.9. This happens because javac inlines
primitives and Strings into the class files if they are public static final
and are initialized by a constant expression (not a method call).

The attached patch fixes this by using an ident(String) function that returns
the String itself, which prevents the inlining.
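
A minimal sketch of the technique, assuming nothing about the attached patch
beyond what is described above. ConstantsSketch is a hypothetical stand-in for
the real Constants class, INLINED_VERSION is added only for contrast, and the
values are illustrative.

```java
// Hypothetical stand-in for the Constants class discussed in this issue.
final class ConstantsSketch {

  // Compile-time constant: javac copies the literal "2.9" into every class
  // that reads this field, so swapping the JAR later has no effect there.
  public static final String INLINED_VERSION = "2.9";

  // Initialized through a method call, which is not a constant expression.
  // Callers therefore read the field at run time from whichever JAR is loaded.
  public static final String LUCENE_MAIN_VERSION = ident("2.9");

  // Returns its argument unchanged; it exists only to defeat constant inlining.
  private static String ident(String s) {
    return s.toString();
  }
}
```

Both fields hold the same value at run time; the difference is only visible
in client class files compiled against one JAR and run against another.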

Will apply to 2.9, trunk and the 2.9 BW branch. Now I can also re-enable one
test I removed because of this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-2004) Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, so version detection is incorrect

2009-10-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2004:
--

Attachment: LUCENE-2004.patch

See also: http://www.javaworld.com/community/node/3400




[jira] Resolved: (LUCENE-2004) Constants.LUCENE_MAIN_VERSION is inlined in code compiled against Lucene JAR, so version detection is incorrect

2009-10-22 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-2004.
---

Resolution: Fixed

Fixed.




RE: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

2009-10-22 Thread Uwe Schindler
I found a solution for this problem!

First the explanation:
The test TestCheckIndex compares the version numbers from Constants with the
current compilation (ant settings). There are two constants. The first,
Constants.LUCENE_MAIN_VERSION, is hard-coded into Constants.java. This
constant had a problem: it was a static final String constant, which
is inlined by javac, so code compiled against that version of the class
file will always see the old string even when you replace the JAR.

The second constant, LUCENE_VERSION, contains the version string from the
manifest, and if no manifest is available (no JAR file at all), it contains
the LUCENE_MAIN_VERSION constant. The code has some intelligence to also add
LUCENE_MAIN_VERSION to this constant (at the end, in [] brackets) if the
string from the manifest does not contain the version.

E.g. Hudson compiles Lucene and puts just a date code into the manifest
(-Dversion=200910 ANT parameter). LUCENE_VERSION will contain
this string, but as 3.0-dev does not appear in this string, it is appended
as [3.0-dev].

The test TestCheckIndex checks these versions and tests whether
LUCENE_VERSION starts with LUCENE_MAIN_VERSION, which is not the case here.
The test works for trunk, because the tests are run without a JAR file
(directly against the class files), but not for backwards (as the test is
run against the lucene-core.jar, which contains the manifest).

The easy fix would be to change Constants.LUCENE_VERSION so that it does not
append the string but places it in front of the manifest string, if the
manifest string does not start with LUCENE_MAIN_VERSION. We could also fix
Hudson, but then the test will fail if somebody uses a strange version string
when calling ANT. The first solution is 100% safe.
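
The proposed rule could look roughly like this. It is a sketch, not the
committed change: the class VersionStringSketch, the method luceneVersion and
the single-space separator are assumptions made for illustration.

```java
// Hypothetical sketch of the "put the main version in front" rule.
final class VersionStringSketch {

  /**
   * @param manifestVersion version string read from the JAR manifest,
   *                        or null when running against plain class files
   * @param mainVersion     the hard-coded main version, e.g. "3.0"
   */
  static String luceneVersion(String manifestVersion, String mainVersion) {
    if (manifestVersion == null) {
      // No manifest at all: fall back to the main version.
      return mainVersion;
    }
    if (manifestVersion.startsWith(mainVersion)) {
      // The manifest already leads with the main version: keep it as is.
      return manifestVersion;
    }
    // E.g. a bare date code: prefix the main version so that startsWith()
    // and compareTo() on the result behave as expected.
    return mainVersion + " " + manifestVersion;
  }
}
```

With this rule the result starts with the main version in all three cases,
which is the property the disabled assertion checks.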

Opinions?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: uschind...@apache.org [mailto:uschind...@apache.org]
 Sent: Thursday, October 22, 2009 9:22 AM
 To: java-comm...@lucene.apache.org
 Subject: svn commit: r828334 -
 /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
 
 Author: uschindler
 Date: Thu Oct 22 07:22:28 2009
 New Revision: 828334
 
 URL: http://svn.apache.org/viewvc?rev=828334&view=rev
 Log:
 this test fails on hudson because of the strange version ant parameter
 with only a date code. test-tag is run against the JAR version, test-core
 against the class files. The JAR version contains the strange version
 number in manifest :(
 Should be somehow fixed. For now, I disable the test.
 
 Modified:
 lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
 
 Modified:
 lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
 URL:
 http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java?rev=828334&r1=828333&r2=828334&view=diff
 ==
 
 --- lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java (original)
 +++ lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java Thu Oct 22 07:22:28 2009
 @@ -96,6 +96,8 @@
       assertNotNull(version);
       assertTrue(version.equals(Constants.LUCENE_MAIN_VERSION+"-dev") ||
                  version.equals(Constants.LUCENE_MAIN_VERSION));
 -    assertTrue(Constants.LUCENE_VERSION.startsWith(version));
 +    // TODO: does not work on hudson, because tests are run against a JAR version,
 +    // which has a package version like 20091013* not 3.0*:
 +    //assertTrue(Constants.LUCENE_VERSION.startsWith(version));
     }
   }
 






Re: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

2009-10-22 Thread Michael McCandless
Putting the LUCENE_VERSION in front of the string instead of in back seems fine?

Or we could relax the test to simply assert that the expected version
appears anywhere as a substring?  (ie, .contains instead of
.startsWith)

Mike




RE: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

2009-10-22 Thread Uwe Schindler
 Putting the LUCENE_VERSION in front of the string instead of in back seems
 fine?

I would prefer this, as it makes it possible to do compareTo() comparisons
and so on, which may be used in client code, too (not only test). OK, client
code should not use trunk versions from Hudson, but it would be better.

 Or we could relax the test to simply assert that the expected version
 appears anywhere as a substring?  (ie, .contains instead of
 .startsWith)

This would only fix this test. I prefer the first.

Uwe


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Michael McCandless
On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote:

  I have been playing with the patch, and I think I have some information
 that you might like.
  Let me spend some time gathering more numbers, and I will update the JIRA
 issue.

Excellent!

  Say bottom has ords 23, 45 and 76, each corresponding to a string. When
 moving to the next segment, you need to give bottom ords that are comparable
 to those of other docs in this new segment, so you would need to find the
 new ords for the values behind 23, 45 and 76, wouldn't you? To find them,
 assuming the values are s1, s2 and s3, you would do a binary search on the
 new value array and find the index for each of s1, s2 and s3.

It's that inversion (from ord -> Comparable in first seg, and
Comparable -> ord in second seg) that I'm trying to avoid (w/ this new
proposal).

 That is 3 binary searches per conversion, and I am not sure how you can
 short-circuit it. Are you suggesting we compare Comparables in
 compareBottom until some doc beats it?

I'm saying on seg transition you indeed get the Comparable for current
bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
hit, you get that hit's Comparables and compare to bottom.  If it
beats bottom, it goes into the queue.  If it does not, you use the ord
(in seg 2's ord space) to learn a bottom in the ord space of seg 2.
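
As a rough illustration of this idea (not Lucene code), the sketch below
assumes an ascending sort on String values, a hit that competes only when its
value is strictly smaller than bottom, and a per-segment sorted String[] that
maps ord to value. For simplicity bottom stays fixed while the segment is
scanned. The class name LearnedBottomSketch and all its members are
hypothetical.

```java
// Hypothetical sketch: competes hits of one segment against a bottom value
// carried over from the previous segment, without inverting it to an ord.
final class LearnedBottomSketch {
  private final String bottomValue;      // Comparable of the carried-over bottom
  private final String[] segmentValues;  // ord -> value for this segment (sorted)

  // Smallest ord seen so far in this segment that failed to beat bottom.
  private int smallestLosingOrd = Integer.MAX_VALUE;

  // Counts the expensive value comparisons, to show the short-circuit at work.
  int valueCompares = 0;

  LearnedBottomSketch(String bottomValue, String[] segmentValues) {
    this.bottomValue = bottomValue;
    this.segmentValues = segmentValues;
  }

  /** Returns true if the hit with this ord beats the carried-over bottom. */
  boolean competes(int ord) {
    if (ord >= smallestLosingOrd) {
      // Ords are order-preserving within a segment, so anything at or above
      // a known loser loses as well: no value comparison needed.
      return false;
    }
    valueCompares++;
    if (segmentValues[ord].compareTo(bottomValue) < 0) {
      return true;
    }
    smallestLosingOrd = ord;  // learned a tighter bound in this segment's ord space
    return false;
  }
}
```

Each losing hit tightens the learned bound, so later hits are increasingly
rejected by an int comparison alone.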

 That would hurt performance a lot though, no?

Yeah I think likely it would, since we're talking about a binary
search on transition vs. having to do possibly many
upgrade-to-Comparable and compare-Comparables steps to slowly learn the
equivalent ord in the new segment.  I was proposing it for cases where
inversion is very difficult.  But realistically, since you must keep
around the full ord -> Comparable mapping for every segment anyway (in
order to merge in the end), inversion shouldn't ever actually be difficult --
it'd just be a binary search on presumably in-RAM storage.

Mike




Re: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

2009-10-22 Thread Michael McCandless
OK let's do first!

Mike


RE: svn commit: r828334 - /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java

2009-10-22 Thread Uwe Schindler
Done!

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Thursday, October 22, 2009 11:39 AM
 To: java-dev@lucene.apache.org
 Subject: Re: svn commit: r828334 -
 /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/luc
 ene/index/TestCheckIndex.java
 
 OK let's do first!
 
 Mike
 
 On Thu, Oct 22, 2009 at 5:31 AM, Uwe Schindler u...@thetaphi.de wrote:
  Putting the LUCENE_VERSION in front of the string instead of in back
 seems
  fine?
 
  I would prefer this, as it makes it possible to do compareTo()
 comparisons
  and so on, which may be used in client code, too (not only test). OK,
 client
  code should not use trunk versions from Hudson, but it would be better.
 
  Or we could relax the test to simply assert that the expected version
  appears anywhere as a substring?  (ie, .contains instead of
  .startsWith)
 
  This would only fix this test. I prefer the first.
 
  Uwe
 
  Mike
 
  On Thu, Oct 22, 2009 at 4:13 AM, Uwe Schindler u...@thetaphi.de wrote:
   I found a solution for this problem!
  
   First the explaination:
   The test CheckIndexTest compares the version numbers from Constants
 with
  the
   current compilation (ant settings). There are two constants
   Constants.LUCENE_MAIN_VERSION which is hard coded into
 Constants.java.
  This
   version had  a problem, because it was a static final String
 constant,
  which
   is inlined by javac, so that code compiled against that version of
 the
  class
   file will always see the static string even when you replace the JAR.
  
   The second constant LUCENE_VERSION contains the same like in the
  manifest,
   and if no manifest is available (no JAR file at all), it contains the
   LUCENE_MAIN_VERSION constant. The code has some intelligence to add
   LUCENE_MAIN_VERSION also to this constant (but at the end and in []
   brackets), if the string from the manifest contains no version.
  
   E.g. Hudson compiles Lucene and puts just a date code into the
 manifest
   (-Dversion=200910 ANT parameter). LUCENE_MAIN_VERSION will
  contains
   this string, bud as 3.0-dev does not appear in this string, it is
  appended
   as [3.0-dev].
  
   The test CheckIndex checks these version and tests if LUCENE_VERSION
  starts
   with LUCENE_MAIN_VERSION, which is not correct in this case. The test
  works
   for trunk, because the tests are run without JAR file (against the
 class
   files direct), but not for backwards (as the test is run against the
   lucene-core.jar, which contains the manifest).
  
   The easy fix would be to change Constants.LUCENE_VERSION to not
 append
  the
   string, but places it in front of the manifest string, if the
 manifest
   string does not start with LUCENE_MAIN_VERSION. We could also fix
  Hudson,
   but then test will fail if somebody uses a strange version string
 when
   calling ANT. The first solution is 100% secure.
  
   Opinions?
  
   -
   Uwe Schindler
   H.-H.-Meier-Allee 63, D-28213 Bremen
   http://www.thetaphi.de
   eMail: u...@thetaphi.de
  
   -Original Message-
   From: uschind...@apache.org [mailto:uschind...@apache.org]
   Sent: Thursday, October 22, 2009 9:22 AM
   To: java-comm...@lucene.apache.org
   Subject: svn commit: r828334 -
  
 
 /lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/luc
   ene/index/TestCheckIndex.java
  
   Author: uschindler
   Date: Thu Oct 22 07:22:28 2009
   New Revision: 828334
  
   URL: http://svn.apache.org/viewvc?rev=828334view=rev
   Log:
   this test fails on hudson because of the strange version ant
  parameter
   with only a date code. test-tag is run against the JAR version,
 test-
  core
   against the class files. The JAR version contains the strange
 version
   number in manifest :(
   Should be somehow fixed. For now, I disable the test.
  
   Modified:
       lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
  
   Modified: lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java
   URL: http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java?rev=828334&r1=828333&r2=828334&view=diff
   ==
   --- lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java (original)
   +++ lucene/java/branches/lucene_2_9_back_compat_tests/src/test/org/apache/lucene/index/TestCheckIndex.java Thu Oct 22 07:22:28 2009
   @@ -96,6 +96,8 @@
        assertNotNull(version);
        assertTrue(version.equals(Constants.LUCENE_MAIN_VERSION+"-dev") ||
                   version.equals(Constants.LUCENE_MAIN_VERSION));
   -    

[jira] Commented: (LUCENE-1973) Remove deprecated query components

2009-10-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768661#action_12768661
 ] 

Uwe Schindler commented on LUCENE-1973:
---

Does anybody want to help?

 Remove deprecated query components
 --

 Key: LUCENE-1973
 URL: https://issues.apache.org/jira/browse/LUCENE-1973
 Project: Lucene - Java
  Issue Type: Task
  Components: Search
Reporter: Uwe Schindler
 Fix For: 3.0


 Remove the rest of the deprecated query components.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2001) wordnet parsing bug

2009-10-22 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-2001:
---

Assignee: Grant Ingersoll

 wordnet parsing bug
 ---

 Key: LUCENE-2001
 URL: https://issues.apache.org/jira/browse/LUCENE-2001
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 2.9
Reporter: Robert Muir
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-2001.patch, LUCENE-2001_branch.patch, 
 LUCENE-2001_branch.patch


 A user reported that wordnet parses the prolog file incorrectly.
 We also need to check the wordnet parser in the memory contrib for this problem.
 If this is a false alarm, I'm not worried, because the test will be the first 
 unit test the wordnet package has ever had.
 {noformat}
 For example, looking up the synsets for the
 word king, we get:
 java SynLookup wnindex king
 baron
 magnate
 mogul
 power
 queen
 rex
 scrofula
 struma
 tycoon
 Here, scrofula and struma are extraneous. This happens because the line
 parser code in Syns2Index.java interprets the two consecutive single quotes
 in entry s(114144247,3,'king''s evil',n,1,1) in the wn_s.pl file as
 termination of the string and splits off "king". This entry concerns the
 synset of the words scrofula and struma, and thus they get inserted in the
 synset of king. *There are 1382 such entries in wn_s.pl* and more in other
 WordNet Prolog data-base files where two consecutive single quotes
 appear.
 We have resolved this by adding a statement in the line parsing portion of
 Syns2Index.java, as follows:
// parse line
line = line.substring(2);
   * line = line.replaceAll("\'\'", "`"); // added statement*
int comma = line.indexOf(',');
String num = line.substring(0, comma);  ... ... etc.
 In short we replace '' by ` (a back-quote). Then on recreating the
 index, we get:
 java SynLookup zwnindex king
 baron
 magnate
 mogul
 power
 queen
 rex
 tycoon
 {noformat}
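
 The effect of the doubled quote can be reproduced in isolation. The following is a small sketch, not the Syns2Index code itself: the class PrologQuoteSketch and its word() method are hypothetical, and the parsing is reduced to "take the text between the first quote and the next one".

```java
// Sketch only: a simplified stand-in for the line parser, not Syns2Index itself.
class PrologQuoteSketch {
    /**
     * Extracts the quoted word from one s(...) entry of wn_s.pl.
     * Without the fix, the first quote after the opening one ends the word.
     */
    static String word(String line, boolean applyFix) {
        if (applyFix) {
            line = line.replace("''", "`");        // same idea as the added statement
        }
        int open = line.indexOf('\'');             // opening quote of the word
        int close = line.indexOf('\'', open + 1);  // next quote is taken as the end
        return line.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String entry = "s(114144247,3,'king''s evil',n,1,1).";
        System.out.println(word(entry, false)); // king
        System.out.println(word(entry, true));  // king`s evil
    }
}
```

 Without the replacement the entry is read as the word king, which is how scrofula and struma end up in its synset.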




[jira] Commented: (LUCENE-2001) wordnet parsing bug

2009-10-22 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768695#action_12768695
 ] 

Grant Ingersoll commented on LUCENE-2001:
-

I'll take care of the branch.




[jira] Resolved: (LUCENE-2001) wordnet parsing bug

2009-10-22 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-2001.
-

Resolution: Fixed

Committed revision 828728.




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-10-22 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768700#action_12768700
 ] 

Grant Ingersoll commented on LUCENE-1606:
-

Why are new features going into 3.0?  I was under the impression that 3.0 was 
just supposed to be cleanup plus Java 1.5

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.0

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, 
 LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.
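
 The two enumeration steps described above can be sketched without Lucene or BRICS. In this simplified illustration a TreeSet stands in for the term dictionary and a hand-written three-state machine for the compiled regex; the Dfa interface, the ANY_AT automaton and nextCandidate() are assumptions made for the example. The "next possible String" is computed by simply incrementing the rejected character, which is cruder than what a real implementation could derive from the DFA transitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class DfaSeekSketch {
    /** Hypothetical minimal DFA; step() returns -1 once the input can no longer match. */
    interface Dfa {
        int start();
        int step(int state, char c);
        boolean accepts(int state);
    }

    /** Hand-built DFA for the regex ".at" (no constant prefix). */
    static final Dfa ANY_AT = new Dfa() {
        public int start() { return 0; }
        public int step(int state, char c) {
            if (state == 0) return 1;
            if (state == 1 && c == 'a') return 2;
            if (state == 2 && c == 't') return 3;
            return -1;
        }
        public boolean accepts(int state) { return state == 3; }
    };

    /** Smallest string sorting after every string that starts with term[0..dead]. */
    static String nextCandidate(String term, int dead) {
        for (int i = dead; i >= 0; i--) {
            char c = term.charAt(i);
            if (c != Character.MAX_VALUE) {
                return term.substring(0, i) + (char) (c + 1);
            }
        }
        return null; // nothing can sort after this prefix
    }

    /** Enumerates terms, seeking past rejected prefixes; visited records inspected terms. */
    static List<String> matches(NavigableSet<String> terms, Dfa dfa, List<String> visited) {
        List<String> hits = new ArrayList<String>();
        String term = terms.isEmpty() ? null : terms.first();
        while (term != null) {
            visited.add(term);
            int state = dfa.start();
            int dead = -1;
            for (int i = 0; i < term.length() && dead < 0; i++) {
                state = dfa.step(state, term.charAt(i));
                if (state < 0) dead = i;            // step 1: the portion before i was OK
            }
            if (dead < 0) {
                if (dfa.accepts(state)) hits.add(term);
                term = terms.higher(term);          // plain step to the next term
            } else {
                String next = nextCandidate(term, dead);
                term = (next == null) ? null : terms.ceiling(next); // step 2: seek
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<String>(
            Arrays.asList("bat", "bba", "bbb", "bbc", "bzz", "cat", "cow", "dog"));
        List<String> visited = new ArrayList<String>();
        System.out.println(matches(terms, ANY_AT, visited)); // [bat, cat]
        System.out.println(visited); // [bat, bba, bzz, cat, cow, dog]
    }
}
```

 In the sample run the terms bbb and bbc are never inspected, because the rejected prefix bb is skipped with a single seek.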




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-10-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768705#action_12768705
 ] 

Robert Muir commented on LUCENE-1606:
-

Grant, I thought it was ok from Uwe's comment:

bq. I move this to 3.0 (and not 3.1), because it can be released together with 
3.0 (contrib modules do not need to wait until 3.1). 

I guess now I am a little confused about what should happen for 3.0 with 
contrib in general? 
No problem moving this to 3.1, let me know!





[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768711#action_12768711
 ] 

Mark Miller commented on LUCENE-2002:
-

I think we need more doc as well - StopFilter is not just tied to 
StandardAnalyzer - StandardAnalyzer just happens to use it. Many analyzers can 
use a StopFilter, and one of StopFilter's params is to enable or disable this 
setting.

 Add oal.util.Version ctor to QueryParser
 

 Key: LUCENE-2002
 URL: https://issues.apache.org/jira/browse/LUCENE-2002
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9, 3.0
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Fix For: 2.9.1

 Attachments: LUCENE-2002-29.patch


 This is a followup of LUCENE-1987:
 If somebody uses StandardAnalyzer with Version.LUCENE_CURRENT and then uses 
 QueryParser, phrase queries will not work, because the StopFilter enables 
 position increments for stop words, but QueryParser ignores them by default. 
 The user has to explicitly enable them.
 This issue would add a ctor taking the Version constant and automatically 
 enable this setting. The same applies to the contrib queryparser. Eventually 
 also StopAnalyzer should add this version ctor.
 To be able to remove the default ctor for 3.0 (to remove a possible trap for 
 users of QueryParser), it must be deprecated and the new one also added to 
 2.9.1.
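
 Why the mismatch breaks phrase queries can be seen with a toy model. The sketch below does not use the Lucene API: the Token class, the analyze() method and the two-word stop list are assumptions for illustration, and phrase matching is reduced to comparing relative positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PositionIncrementSketch {
    static final Set<String> STOP = new HashSet<String>(Arrays.asList("in", "the"));

    static class Token {
        final String term;
        final int pos;
        Token(String term, int pos) { this.term = term; this.pos = pos; }
    }

    /** Drops stop words; leaves position gaps for them only if increments are enabled. */
    static List<Token> analyze(String text, boolean enablePositionIncrements) {
        List<Token> out = new ArrayList<Token>();
        int pos = -1;
        int skipped = 0;
        for (String w : text.toLowerCase().split("\\s+")) {
            if (STOP.contains(w)) { skipped++; continue; }
            pos += 1 + (enablePositionIncrements ? skipped : 0);
            skipped = 0;
            out.add(new Token(w, pos));
        }
        return out;
    }

    static boolean hasTokenAt(List<Token> doc, String term, int pos) {
        for (Token t : doc) {
            if (t.pos == pos && t.term.equals(term)) return true;
        }
        return false;
    }

    /** True if the phrase tokens occur in doc with the same relative positions. */
    static boolean phraseMatches(List<Token> doc, List<Token> phrase) {
        if (phrase.isEmpty()) return false;
        Token first = phrase.get(0);
        for (Token d : doc) {
            if (!d.term.equals(first.term)) continue;
            int delta = d.pos - first.pos;
            boolean all = true;
            for (Token q : phrase) {
                if (!hasTokenAt(doc, q.term, q.pos + delta)) { all = false; break; }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> indexed = analyze("cat in the hat", true);   // cat@0, hat@3
        System.out.println(phraseMatches(indexed, analyze("cat in the hat", false))); // false
        System.out.println(phraseMatches(indexed, analyze("cat in the hat", true)));  // true
    }
}
```

 The index side keeps the gaps left by the stop words while the query side collapses them, so the same phrase no longer lines up unless both sides use the same setting.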




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768717#action_12768717
 ] 

Grant Ingersoll commented on LUCENE-2002:
-

{quote}Unfortunately, JavaCC generates two public ctors for QueryParser (one 
taking
CharStream, another taking QueryParserTokenManager) that I don't know
how to override to take a Version param.
{quote}

Those two constructors are bad anyway b/c if anyone calls them, they won't set 
the Analyzer, etc.  Thus, I think, unfortunately, the answer just might be to 
edit the generated Java file by hand and make them protected.  I've looked 
through the JavaCC docs and I don't see any other way.  Of course, the big 
downside to this is that we now need to do this going forward. 





[jira] Created: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Glen Newton (JIRA)
Add LuSql project to Apache Lucene - Contributions wiki page
--

 Key: LUCENE-2005
 URL: https://issues.apache.org/jira/browse/LUCENE-2005
 Project: Lucene - Java
  Issue Type: Task
  Components: Website
Affects Versions: 2.9
Reporter: Glen Newton


Add [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] 
to the Apache Lucene - Contributions page 
[http://lucene.apache.org/java/2_9_0/contributions.html]
I am the author of LuSql. I can supply any text needed. 

Perhaps a new heading is needed to capture Database/JDBC oriented Lucene tools 
(there are other out there)?




[jira] Updated: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Glen Newton (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Newton updated LUCENE-2005:


Description: 
Add [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] 
to the Apache Lucene - Contributions page 
[http://lucene.apache.org/java/2_9_0/contributions.html]
I am the author of LuSql. I can supply any text needed. 

Perhaps a new heading is needed to capture Database/JDBC oriented Lucene tools 
(there are others out there)?

  was:
Add [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] 
to the Apache Lucene - Contributions page 
[http://lucene.apache.org/java/2_9_0/contributions.html]
I am the author of LuSql. I can supply any text needed. 

Perhaps a new heading is needed to capture Database/JDBC oriented Lucene tools 
(there are other out there)?


 Add LuSql project to Apache Lucene - Contributions wiki page
 --

 Key: LUCENE-2005
 URL: https://issues.apache.org/jira/browse/LUCENE-2005
 Project: Lucene - Java
  Issue Type: Task
  Components: Website
Affects Versions: 2.9
Reporter: Glen Newton
   Original Estimate: 2h
  Remaining Estimate: 2h

 Add 
 [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] 
 to the Apache Lucene - Contributions page 
 [http://lucene.apache.org/java/2_9_0/contributions.html]
 I am the author of LuSql. I can supply any text needed. 
 Perhaps a new heading is needed to capture Database/JDBC oriented Lucene 
 tools (there are others out there)?




How to loop through all the entries for a field

2009-10-22 Thread adviner

I have a field called BookTitle.  I want to loop through all the entries
without doing a search.  I just want to get the list of BookTitles that are
in this field.

I tried IndexReader, but MaxDocs() doesn't work because it returns everything,
and I have other fields in there which are a lot bigger.  


-- 
View this message in context: 
http://www.nabble.com/How-to-loop-through-all-the-entries-for-a-field-tp26012309p26012309.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768727#action_12768727
 ] 

Robert Muir commented on LUCENE-2005:
-

glen, I know there is an oracle domain index implementation too, so maybe a 
database category isn't a bad idea.

do you know of any others?




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768733#action_12768733
 ] 

Michael McCandless commented on LUCENE-2002:


bq. Thus, I think, unfortunately, the answer just might be to edit the 
generated Java file by hand and make them be protected.

OK I'll take that approach, and I guess make a unit test that peeks & confirms 
these methods are still protected (to catch us in the future).
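
Such a check could be done with plain reflection. A rough sketch follows, where GeneratedParser is a hypothetical stand-in for the JavaCC-generated class and its parameter types are placeholders (the real constructors take CharStream and QueryParserTokenManager):

```java
import java.io.Reader;
import java.lang.reflect.Constructor;
import java.lang.reflect.Modifier;

class CtorVisibilitySketch {
    /** Hypothetical stand-in for the generated parser after the hand edit. */
    static class GeneratedParser {
        public GeneratedParser(String field) {}                  // supported ctor
        protected GeneratedParser(Reader stream) {}              // hand-edited to protected
        protected GeneratedParser(StringBuilder tokenManager) {} // hand-edited to protected
    }

    /** True if the one-argument ctor for each given parameter type exists and is protected. */
    static boolean areProtected(Class<?> cls, Class<?>... paramTypes) {
        for (Class<?> p : paramTypes) {
            try {
                Constructor<?> c = cls.getDeclaredConstructor(p);
                if (!Modifier.isProtected(c.getModifiers())) {
                    return false; // regenerated file made it public again
                }
            } catch (NoSuchMethodException e) {
                return false;     // ctor signature changed or disappeared
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(areProtected(GeneratedParser.class, Reader.class, StringBuilder.class)); // true
        System.out.println(areProtected(GeneratedParser.class, String.class));                      // false
    }
}
```

A test built this way would fail as soon as the parser is regenerated without repeating the hand edit.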




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768738#action_12768738
 ] 

Michael McCandless commented on LUCENE-2002:


bq. Many analyzers can use a stopfilter and one of the stopfilters params is to 
enable or disable this setting.

In fact, I think we may have to un-deprecate 
StopFilter.get/setEnablePositionIncrementsDefault for this reason?  Many 
analyzers do embed StopFilter without exposing control over this setting, and 
so the only way (up to & including 2.9) to change the setting is to set the 
static default with StopFilter.  If we remove that then we've taken that 
control away.

Or, with this issue I could add Version to all contrib analyzers that embed 
StopFilter.  I think I like that solution better (we shouldn't be using static 
defaults).  I'll go forward w/ that shortly unless any objections come up... 
this'd also take care of analyzers that use StandardTokenizer (ie, we'll 
control fixing the acronym bug with Version as well).




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-10-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768739#action_12768739
 ] 

Uwe Schindler commented on LUCENE-1606:
---

3.0 is just the switch to 1.5 and generics. So this is a typical java 1.5 issue 
and can go into 3.0 even if it is a new feature. Contrib is not core and may 
have its own rules.

In my opinion, this would be a nice addition to the regex contrib and should 
also have been in 2.9, but the underlying library is Java 5 only, so we had to 
wait until 3.0.




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768740#action_12768740
 ] 

Robert Muir commented on LUCENE-2002:
-

{quote}
Or, with this issue I could add Version to all contrib analyzers that embed 
StopFilter. I think I like that solution better (we shouldn't be using static 
defaults). I'll go forward w/ that shortly unless any objections come up... 
this'd also take care of analyzers that use StandardTokenizer (ie, we'll 
control fixing the acronym bug with Version as well).
{quote}

Michael, if you do this, can you mark LUCENE-1373 as resolved? :)




[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Glen Newton (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768743#action_12768743
 ] 

Glen Newton commented on LUCENE-2005:
-

[DBSight|http://www.dbsight.net/] is a commercial product that does indexing 
and a lot more.

I was wondering if there is a need for another category: other 
projects/frameworks that support, or are built on, Lucene internally 
(as opposed to Lucene tools). Examples:
* [Compass|http://www.compass-project.org/]
* [Hibernate|https://www.hibernate.org/410.html]
* [SOLR|http://lucene.apache.org/solr/]
* others...???

In the FAQ, for 
[http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F|indexing
 databases with Lucene], LuSql should also be added (separate JIRA issue?)

 Add LuSql project to Apache Lucene - Contributions wiki page
 --

 Key: LUCENE-2005
 URL: https://issues.apache.org/jira/browse/LUCENE-2005
 Project: Lucene - Java
  Issue Type: Task
  Components: Website
Affects Versions: 2.9
Reporter: Glen Newton
   Original Estimate: 2h
  Remaining Estimate: 2h

 Add 
 [LuSql|http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql] 
 to the Apache Lucene - Contributions page 
 [http://lucene.apache.org/java/2_9_0/contributions.html]
 I am the author of LuSql. I can supply any text needed. 
 Perhaps a new heading is needed to capture Database/JDBC oriented Lucene 
 tools (there are others out there)?




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768744#action_12768744
 ] 

Michael McCandless commented on LUCENE-2002:


bq. Michael, if you do this, can you mark LUCENE-1373 as resolved?

Ahh yes indeed.  Is there a corresponding issue about not being able to control 
stop filter pos incr?  Can't keep track of all these issues anymore!




[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768745#action_12768745
 ] 

Robert Muir commented on LUCENE-2005:
-

Glen, I think it would be good to bring the contributions page completely up 
to speed.

Maybe for this issue, though, we stick with database integration for simplicity? 
:)

bq. In the FAQ, for 
[http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F|indexing
 databases with Lucene], LuSql should also be added (separate JIRA issue?)

I think you can just register on the wiki and edit this yourself?




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768747#action_12768747
 ] 

Robert Muir commented on LUCENE-2002:
-

bq. Ahh yes indeed. Is there a corresponding issue about not being able to 
control stop filter pos incr? Can't keep track of all these issues anymore!
Michael, what about LUCENE-1258?




[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768746#action_12768746
 ] 

Michael McCandless commented on LUCENE-2005:


LuSql looks great!  It'd be wonderful to have it available under contrib.  I 
think a contrib/database would make sense?

bq. In the FAQ, for 
[http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F|indexing
 databases with Lucene], LuSql should also be added (separate JIRA issue?)

+1

But that's the wiki -- you can just go edit it (create an account if you don't 
already have one) and add it in (no need for a JIRA issue).




[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768749#action_12768749
 ] 

Robert Muir commented on LUCENE-2005:
-

bq. LuSql looks great! It'd be wonderful to have it available under contrib. I 
think a contrib/database would make sense? 

Michael, actually this issue was just to add it to the contributions links on 
the website.

But if Glen wants to incorporate it into contrib, I think that would be even 
better... it has really nice documentation, etc.




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768752#action_12768752
 ] 

Michael McCandless commented on LUCENE-2002:


bq.  Michael, what about LUCENE-1258?

Oh yeah, and look who opened that one :)   I'll go resolve as a dup of this one.




[jira] Resolved: (LUCENE-1258) Increment position by default in StopFilter & QueryParser - PhraseQuery

2009-10-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1258.


Resolution: Duplicate

Dup of LUCENE-2002.

 Increment position by default in StopFilter & QueryParser - PhraseQuery
 

 Key: LUCENE-1258
 URL: https://issues.apache.org/jira/browse/LUCENE-1258
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.9
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.0


 Spinoff from here:
   https://issues.apache.org/jira/browse/LUCENE-1095
 I think for 3.0 we should change the default so that:
   * By default, StopFilter increments the positionIncrement whenever
 it skips stop words.  Add option to revert back to old way.  This is
 just toggling the boolean default.
   * By default, when QueryParser adds terms to a PhraseQuery it should
 include the position reported by the analyzer.  Add option to
 revert back to old way.
 I'm just opening this now, marking as 3.0 fix, to remind us all to
 actually fix it for 3.0.
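
 The first bullet above ("just toggling the boolean default") can be illustrated with a small stand-alone sketch. This is not the Lucene StopFilter: the class StopFilterSketch, its Token holder, and the filter method are hypothetical names used only to show how a skipped stop word can be folded into the position increment of the next surviving token, and what the old behaviour looks like when the flag is off.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in; not the Lucene StopFilter API.
class StopFilterSketch {
    // Hypothetical token holder: a term plus its position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Drops stop words. When enablePositionIncrements is true, each skipped stop
    // word adds one to the increment of the next emitted token; when false,
    // every emitted token gets an increment of 1 (the old way).
    static List<Token> filter(List<String> words, Set<String> stopWords,
                              boolean enablePositionIncrements) {
        List<Token> out = new ArrayList<Token>();
        int skipped = 0;
        for (String word : words) {
            if (stopWords.contains(word)) {
                skipped++;
                continue;
            }
            int increment = enablePositionIncrements ? 1 + skipped : 1;
            out.add(new Token(word, increment));
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("the", "and"));
        List<String> words = Arrays.asList("the", "quick", "and", "brown");
        for (Token t : filter(words, stops, true)) {
            System.out.println(t.term + " +" + t.positionIncrement);
        }
    }
}
```

 With the flag on, "quick" and "brown" each carry an increment of 2 in this example; with it off, both carry 1, so the information about the removed words is lost.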




contrib and lucene 3.0

2009-10-22 Thread Robert Muir
Hi,

What is the consensus on new features for contrib for Lucene 3.0? I know
that for core, it's mostly a Java 5 upgrade and deprecation removal.

I want to make sure LUCENE-1606 is set to the right version, but I figured
it's really not just about that specific issue; I would like to know the
plans in general.

Thanks,
Robert

-- 
Robert Muir
rcm...@gmail.com


[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-22 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: LUCENE-1257_contrib_benchmark.patch

 Port to Java5
 -

 Key: LUCENE-1257
 URL: https://issues.apache.org/jira/browse/LUCENE-1257
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis, Examples, Index, Other, Query/Scoring, 
 QueryParser, Search, Store, Term Vectors
Affects Versions: 3.0
Reporter: Cédric Champeau
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.0

 Attachments: instantiated_fieldable.patch, 
 LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
 LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
 LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
 LUCENE-1257-CompoundFileReaderWriter.patch, 
 LUCENE-1257-ConcurrentMergeScheduler.patch, 
 LUCENE-1257-DirectoryReader.patch, 
 LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
 LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
 LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
 LUCENE-1257-IndexDeleter.patch, 
 LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
 LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
 LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
 LUCENE-1257-org_apache_lucene_document.patch, 
 LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
 LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
 LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
 LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
 LUCENE-1257_BooleanFilter_Generics.patch, 
 LUCENE-1257_contrib_benchmark.patch, LUCENE-1257_contrib_highlighting.patch, 
 LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, 
 LUCENE-1257_more_unnecessary_casts.patch, 
 LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
 LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
 LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
 LUCENE-1257_o_a_l_search_spans.patch, 
 LUCENE-1257_org_apache_lucene_index.patch, 
 LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
 LUCENE-1257_unnecessary_casts.patch, lucene1257surround1.patch, 
 lucene1257surround1.patch, shinglematrixfilter_generified.patch


 For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
 Java 5 migration had been planned for 2.1 at some point in the past, but I 
 don't know when it is planned now. This patch against the trunk includes:
 - most obvious generics usage (there are tons of usages of sets, ... Those 
 which are commonly used have been generified)
 - PriorityQueue generification
 - replacement of indexed for loops with for-each constructs
 - removal of unnecessary unboxing
 The code is, in my opinion, much more readable with those features (you 
 actually *know* what is stored in collections when reading the code, without 
 needing to look up field definitions every time) and it simplifies many 
 algorithms.
 Note that this patch also includes an interface for the Query class. This has 
 been done for my company's needs for building custom Query classes which add 
 some behaviour to the base Lucene queries. It prevents multiple unnecessary 
 casts. I know this introduction is not wanted by the team, but it really 
 makes our developments easier to maintain. If you don't want to use this, 
 replace all /Queriable/ calls with standard /Query/.
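
 As a rough illustration of the kind of change listed above (generics, for-each loops, no explicit casts or unboxing), here is a small before/after sketch. It is not taken from any of the attached patches; the class Java5PortSketch and both methods are invented for the example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Invented example; not code from the LUCENE-1257 patches.
class Java5PortSketch {
    // Pre-Java 5 style: raw List, explicit Iterator, cast, and manual unboxing.
    @SuppressWarnings("rawtypes")
    static int sumLegacy(List values) {
        int total = 0;
        for (Iterator it = values.iterator(); it.hasNext();) {
            Integer boxed = (Integer) it.next();
            total += boxed.intValue();
        }
        return total;
    }

    // Java 5 style: the element type is visible in the signature, the for-each
    // loop replaces the Iterator, and auto-unboxing replaces intValue().
    static int sumGeneric(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(sumLegacy(values));
        System.out.println(sumGeneric(values));
    }
}
```

 Both methods compute the same sum; the second one shows the readability point made above, since the reader no longer has to guess what the collection holds.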




[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768795#action_12768795
 ] 

Yonik Seeley commented on LUCENE-2005:
--

bq. Michael, actually this issue was just to add it to the contributions links 
on the website.

Right... and I think we really shouldn't try to pull more and more projects 
into lucene contrib, making it this huge uber project - that just makes it 
harder and harder to change core.





[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768810#action_12768810
 ] 

Robert Muir commented on LUCENE-2005:
-

bq. Right... and I think we really shouldn't try to pull more and more projects 
into lucene contrib, making it this huge uber project - that just makes it 
harder and harder to change core.

I guess we can agree to disagree on this one... 

But I do think that this issue is just about adding hyperlinks to the website.




[jira] Commented: (LUCENE-2005) Add LuSql project to Apache Lucene - Contributions wiki page

2009-10-22 Thread Glen Newton (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768815#action_12768815
 ] 

Glen Newton commented on LUCENE-2005:
-

Yes, it was just concerned with adding hyperlinks to this page.

I have just added LuSql to the 
[FAQ|http://wiki.apache.org/lucene-java/LuceneFAQ#How_can_I_use_Lucene_to_index_a_database.3F]

I would prefer keeping LuSql out of contrib, as it makes my life easier (I 
think??), and allows me to release independent of the Lucene release schedule.

:-)






[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768827#action_12768827
 ] 

Mark Miller commented on LUCENE-2003:
-

Umm - it's hard to emulate the positions stuff from PhraseQuery with a 
SpanQuery. A limitation I hadn't really thought much about. Should be doc'd.

One - uh - sloppy fix - is to count up all of the extra positions and add that 
to the slop.

i.e., if the positions for a PhraseQuery are 0, 1, 3 (stop word removed at 2), 
you would add 1 to the slop. For 0, 1, 3, 5, add 2 to the slop.

I think that keeps a fairly good approximation.

Haven't thought about how that would work with MultiPhraseQuery yet.
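
The slop adjustment described above can be sketched in a few lines. This is only an illustration of the counting rule, assuming the phrase positions are strictly increasing; the class SlopSketch and the method extraSlop are made-up names and are not part of the highlighter or of the attached patch.

```java
// Illustration of the counting rule only; names are hypothetical.
class SlopSketch {
    // Counts the positions skipped between consecutive phrase terms.
    // Assumes positions are strictly increasing.
    static int extraSlop(int[] positions) {
        int extra = 0;
        for (int i = 1; i < positions.length; i++) {
            // Adjacent terms differ by 1; anything beyond that is a gap.
            extra += positions[i] - positions[i - 1] - 1;
        }
        return extra;
    }

    public static void main(String[] args) {
        System.out.println(extraSlop(new int[] {0, 1, 3}));    // 1
        System.out.println(extraSlop(new int[] {0, 1, 3, 5})); // 2
    }
}
```

The two calls reproduce the examples from the comment: one removed stop word adds 1 to the slop, two add 2. Because slop also permits reordering and other gaps, this remains an approximation, as noted above.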

 Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or 
 simplier StopFilter with stopWordsPosIncr mode switched on
 ---

 Key: LUCENE-2003
 URL: https://issues.apache.org/jira/browse/LUCENE-2003
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9, 3.0
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Fix For: 2.9.1, 3.0


 This is a followup on LUCENE-1987:
 If, in HighlighterTest, you change the constant static final Version 
 TEST_VERSION = Version.LUCENE_24 to LUCENE_29 or LUCENE_CURRENT, the test 
 testSimpleQueryScorerPhraseHighlighting fails. Please note that currently 
 (before LUCENE-2002 is fixed), you must also set the QueryParser to respect 
 posIncr.




[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768829#action_12768829
 ] 

Mark Miller commented on LUCENE-2003:
-

Well no crap - MultiPhraseQuery already does that. Someone else contrib'd that. 
Guess they are ahead of me - would have saved some though to look at it :)




[jira] Issue Comment Edited: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768829#action_12768829
 ] 

Mark Miller edited comment on LUCENE-2003 at 10/22/09 7:40 PM:
---

Well no crap - MultiPhraseQuery already does that. Someone else contrib'd that. 
Guess they are ahead of me - would have saved some thought to look at it :)

  was (Author: markrmil...@gmail.com):
Well no crap - MultiPhraseQuery already does that. Someone else contrib'd 
that. Guess they are ahead of me - would have saved some though to look at it :)
  



[jira] Updated: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2003:


Attachment: LUCENE-2003.patch

Here is a patch showing essentially what I mean




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768838#action_12768838
 ] 

Grant Ingersoll commented on LUCENE-2002:
-

bq. OK I'll take that approach, and I guess make a unit test that peeks & 
confirms these methods are still protected (to catch us in the future).

We may want to see if it can be automated in the ANT task so that we don't have 
to remember to do it by hand each time.  
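
A unit test of the kind quoted above could use reflection to peek at method modifiers. The sketch below is only an assumption about how such a check might look: the message does not say which methods are meant, so ToyParser and its method names are stand-ins, and ProtectedCheckSketch is not an existing Lucene test.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical sketch; ToyParser stands in for the real (generated) parser class.
class ProtectedCheckSketch {
    static class ToyParser {
        // Extension point that subclasses are expected to override.
        protected String getFieldQuery(String field, String text) {
            return field + ":" + text;
        }

        public String parse(String text) {
            return getFieldQuery("body", text);
        }
    }

    // True only if the class declares at least one method with this name and
    // every declared overload of it is protected.
    static boolean allOverloadsProtected(Class<?> type, String methodName) {
        boolean found = false;
        for (Method m : type.getDeclaredMethods()) {
            if (!m.getName().equals(methodName)) {
                continue;
            }
            found = true;
            if (!Modifier.isProtected(m.getModifiers())) {
                return false;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(allOverloadsProtected(ToyParser.class, "getFieldQuery"));
        System.out.println(allOverloadsProtected(ToyParser.class, "parse"));
    }
}
```

A check like this would fail as soon as a regenerated parser changed the visibility of a watched method, which is the "catch us in the future" part; automating the fix in the build itself is a separate question.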




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768840#action_12768840
 ] 

Michael McCandless commented on LUCENE-2002:


bq. We may want to see if it can be automated in the ANT task so that we don't 
have to remember to do it by hand each time.

That would be fabulous but is way beyond my ant skills :)  Any ant pros out 
there want to try?




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768845#action_12768845
 ] 

Uwe Schindler commented on LUCENE-2002:
---

Erik Hatcher :-)

Maybe the search-replace with regex functionality can do it.




[jira] Resolved: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

2009-10-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1373.


Resolution: Duplicate

Dup of LUCENE-2002.

 Most of the contributed Analyzers suffer from invalid recognition of acronyms.
 --

 Key: LUCENE-1373
 URL: https://issues.apache.org/jira/browse/LUCENE-1373
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis, contrib/analyzers
Affects Versions: 2.3.2
Reporter: Mark Lassau
Priority: Minor
 Attachments: LUCENE-1373.patch


 LUCENE-1068 describes a bug in StandardTokenizer whereby a string like 
 www.apache.org. would be incorrectly tokenized as an acronym (note the dot 
 at the end).
 Unfortunately, keeping the backward compatibility of a bug turns out to 
 harm us.
 StandardTokenizer has a couple of ways to fix this bug, but 
 unfortunately the default behaviour is still buggy.
 Most of the non-English analyzers provided in lucene-analyzers utilize the 
 StandardTokenizer, and in v2.3.2 not one of these provides a way to get the 
 non-buggy behaviour :(
 I refer to:
 * BrazilianAnalyzer
 * CzechAnalyzer
 * DutchAnalyzer
 * FrenchAnalyzer
 * GermanAnalyzer
 * GreekAnalyzer
 * ThaiAnalyzer




[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768843#action_12768843
 ] 

Yonik Seeley commented on LUCENE-2003:
--

Could you explain this part?
{code}
+  if (inc > lastInc) {
+    slop += inc;
+  }
{code}

Seems like that would cause A ??? B ??? C ??? D to only have a slop of 3 (? 
represents a gap of 1).

Couldn't slop just be maxPos-minPos+1-numTokens?
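
As a worked instance of the formula suggested above, take A ??? B ??? C ??? D with A at position 0, so the four terms sit at positions 0, 4, 8 and 12. The sketch below only does the arithmetic; the class and method names are hypothetical and it assumes the positions are distinct.

```java
// Arithmetic sketch only; names are hypothetical and not part of any patch.
class TotalGapSlopSketch {
    // Unmatched positions between the first and last term: the width of the
    // covered range (maxPos - minPos + 1) minus the positions the terms occupy.
    static int totalGapSlop(int[] positions) {
        int minPos = Integer.MAX_VALUE;
        int maxPos = Integer.MIN_VALUE;
        for (int p : positions) {
            minPos = Math.min(minPos, p);
            maxPos = Math.max(maxPos, p);
        }
        return maxPos - minPos + 1 - positions.length;
    }

    public static void main(String[] args) {
        // A ??? B ??? C ??? D  ->  12 - 0 + 1 - 4 = 9
        System.out.println(totalGapSlop(new int[] {0, 4, 8, 12}));
    }
}
```

So the formula counts all nine holes, whereas a single hole-run between neighbouring terms is only three wide.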


 Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or 
 simplier StopFilter with stopWordsPosIncr mode switched on
 ---

 Key: LUCENE-2003
 URL: https://issues.apache.org/jira/browse/LUCENE-2003
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9, 3.0
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-2003.patch


 This is a followup on LUCENE-1987:
 If you change the constant static final Version TEST_VERSION = 
 Version.LUCENE_24 in HighlighterTest to LUCENE_29 or LUCENE_CURRENT, the test 
 testSimpleQueryScorerPhraseHighlighting fails. Please note that currently 
 (before LUCENE-2002 is fixed), you must also set the QueryParser to respect 
 posIncr.




[jira] Issue Comment Edited: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768845#action_12768845
 ] 

Uwe Schindler edited comment on LUCENE-2002 at 10/22/09 7:59 PM:
-

Erik Hatcher :-)

Maybe the search-replace with regex functionality can do it.
see: [http://ant.apache.org/manual/OptionalTasks/replaceregexp.html]

  was (Author: thetaphi):
Erik Hatcher :-)

Maybe the search-replace with regex functionality can do it.
  



[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768853#action_12768853
 ] 

Mark Miller commented on LUCENE-2003:
-

Hmm - well now you have me worried - never seen you be wrong.

I just tried a test like that and it appeared to work though.

Ah - I should have looked closer at the MultiPhraseQuery code - it is wrong - 
just happens to work.

You only need to add to the slop the largest inc, because the SpanQuery slop is 
the dist allowed between *each* span.

So that's why it works - it finds 3 the first time, doesn't add any more for the 
rest, but 3 is enough. I'll fix.
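
Under the per-pair reading of slop given above, the allowance only has to cover the widest run of holes between two neighbouring terms. The following is a minimal sketch of that calculation, not the patch itself; the names are hypothetical and the off-by-one convention (holes rather than raw increments) is an assumption.

```java
// Sketch of the "largest gap between neighbours" idea; not the actual patch.
class LargestGapSketch {
    // Widest run of unmatched positions between two adjacent terms,
    // i.e. the largest position increment minus one.
    static int largestAdjacentGap(int[] positions) {
        int largest = 0;
        for (int i = 1; i < positions.length; i++) {
            int gap = positions[i] - positions[i - 1] - 1;
            if (gap > largest) {
                largest = gap;
            }
        }
        return largest;
    }

    public static void main(String[] args) {
        // A ??? B ??? C ??? D at positions 0, 4, 8, 12: every gap is 3 wide
        System.out.println(largestAdjacentGap(new int[] {0, 4, 8, 12}));
    }
}
```

The first pair already yields 3 and the later pairs never exceed it, which matches the observation that nothing more is added after the first increment.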




[jira] Updated: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2003:


Attachment: LUCENE-2003.patch

This should be more correct - add the largest inc to the slop if it's greater 
than 1.

Got to consider this against your suggestion.




[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-22 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-1257:


Attachment: LUCENE-1257_unnnecessary_casts_2.patch

 Port to Java5
 -

 Key: LUCENE-1257
 URL: https://issues.apache.org/jira/browse/LUCENE-1257
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis, Examples, Index, Other, Query/Scoring, 
 QueryParser, Search, Store, Term Vectors
Affects Versions: 3.0
Reporter: Cédric Champeau
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.0

 Attachments: instantiated_fieldable.patch, 
 LUCENE-1257-BooleanQuery.patch, LUCENE-1257-BooleanScorer_2.patch, 
 LUCENE-1257-BufferedDeletes_DocumentsWriter.patch, 
 LUCENE-1257-CheckIndex.patch, LUCENE-1257-CloseableThreadLocal.patch, 
 LUCENE-1257-CompoundFileReaderWriter.patch, 
 LUCENE-1257-ConcurrentMergeScheduler.patch, 
 LUCENE-1257-DirectoryReader.patch, 
 LUCENE-1257-DisjunctionMaxQuery-more_type_safety.patch, 
 LUCENE-1257-DocFieldProcessorPerThread.patch, LUCENE-1257-Document.patch, 
 LUCENE-1257-FieldCacheImpl.patch, LUCENE-1257-FieldCacheRangeFilter.patch, 
 LUCENE-1257-IndexDeleter.patch, 
 LUCENE-1257-IndexDeletionPolicy_IndexFileDeleter.patch, LUCENE-1257-iw.patch, 
 LUCENE-1257-MTQWF.patch, LUCENE-1257-NormalizeCharMap.patch, 
 LUCENE-1257-o.a.l.util.patch, LUCENE-1257-org_apache_lucene_document.patch, 
 LUCENE-1257-org_apache_lucene_document.patch, 
 LUCENE-1257-org_apache_lucene_document.patch, LUCENE-1257-SegmentInfos.patch, 
 LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
 LUCENE-1257-StringBuffer.patch, LUCENE-1257-TopDocsCollector.patch, 
 LUCENE-1257-WordListLoader.patch, LUCENE-1257_analysis.patch, 
 LUCENE-1257_BooleanFilter_Generics.patch, 
 LUCENE-1257_contrib_benchmark.patch, LUCENE-1257_contrib_highlighting.patch, 
 LUCENE-1257_javacc_upgrade.patch, LUCENE-1257_messages.patch, 
 LUCENE-1257_more_unnecessary_casts.patch, 
 LUCENE-1257_MultiFieldQueryParser.patch, LUCENE-1257_o.a.l.queryParser.patch, 
 LUCENE-1257_o.a.l.store.patch, LUCENE-1257_o_a_l_index_test.patch, 
 LUCENE-1257_o_a_l_index_test.patch, LUCENE-1257_o_a_l_search.patch, 
 LUCENE-1257_o_a_l_search_spans.patch, 
 LUCENE-1257_org_apache_lucene_index.patch, 
 LUCENE-1257_org_apache_lucene_index.patch, LUCENE-1257_queryParser_jj.patch, 
 LUCENE-1257_unnecessary_casts.patch, LUCENE-1257_unnnecessary_casts_2.patch, 
 lucene1257surround1.patch, lucene1257surround1.patch, 
 shinglematrixfilter_generified.patch


 For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
 Java 5 migration had been planned for 2.1 someday in the past, but don't know 
 when it is planned now. This patch against the trunk includes :
 - most obvious generics usage (there are tons of usages of sets, ... Those 
 which are commonly used have been generified)
 - PriorityQueue generification
 - replacement of indexed for loops with for each constructs
 - removal of unnecessary unboxing
 The code is in my opinion much more readable with those features (you 
 actually *know* what is stored in collections reading the code, without the 
 need to look up field definitions every time) and it simplifies many 
 algorithms.
 Note that this patch also includes an interface for the Query class. This has 
 been done for my company's needs for building custom Query classes which add 
 some behaviour to the base Lucene queries. It prevents multiple unnecessary 
 casts. I know this introduction is not wanted by the team, but it really 
 makes our developments easier to maintain. If you don't want to use this, 
 replace all /Queriable/ calls with standard /Query/.
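
To make the kind of change listed above concrete, here is a small before/after sketch. It is a generic illustration with hypothetical names, not code taken from any of the attached patches.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Generic illustration of a Java 5 port; not taken from the LUCENE-1257 patches.
class Java5PortSketch {
    // Pre-Java 5 style: raw List, explicit Iterator, cast on every read.
    @SuppressWarnings("rawtypes")
    static int totalLengthRaw(List terms) {
        int total = 0;
        for (Iterator it = terms.iterator(); it.hasNext();) {
            String term = (String) it.next();
            total += term.length();
        }
        return total;
    }

    // Java 5 style: the element type is in the signature, no cast is needed,
    // and the for-each loop replaces the explicit Iterator.
    static int totalLengthGeneric(List<String> terms) {
        int total = 0;
        for (String term : terms) {
            total += term.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "java");
        System.out.println(totalLengthRaw(terms));
        System.out.println(totalLengthGeneric(terms));
    }
}
```

Both methods return the same result; the generic version simply moves the type information from a cast at each use into the declaration.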




[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768862#action_12768862
 ] 

Mark Miller commented on LUCENE-2003:
-

Okay - I think this is the way to go - maxPos-minPos+1-numTokens is too much 
slop because it just has to be the largest posInc - I forgot that's how 
SpanQueries work when I did the orig patch.




[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768863#action_12768863
 ] 

Yonik Seeley commented on LUCENE-2003:
--

bq. You only need to add to the slop the largest inc, because the SpanQuery 
slop is the dist allowed between each span.

Learn something new every day :-)

Is this javadoc incorrect, or simply ambiguous, or am I reading it wrong:
{code}
  /** Construct a SpanNearQuery.  Matches spans matching a span from each
   * clause, with up to <code>slop</code> total unmatched positions between
   * them.  * When <code>inOrder</code> is true, the spans from each clause
   * must be * ordered as in <code>clauses</code>. */
  public SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) {
    this(clauses, slop, inOrder, true);
  }
{code}

The total would almost seem to tip the ambiguity toward meaning that it's the 
total slop between all clauses.




[jira] Assigned: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2003:
--

Assignee: Mark Miller  (was: Michael McCandless)

OK Mark you get this one :)




[jira] Commented: (LUCENE-2003) Highlighter has problems when you use StandardAnalyzer with LUCENE_29 or simplier StopFilter with stopWordsPosIncr mode switched on

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768866#action_12768866
 ] 

Mark Miller commented on LUCENE-2003:
-

bq. The total would almost seem to tip the ambiguity toward meaning that it's 
the total slop between all clauses.

Yeah, I think it needs to be changed. Total appears just wrong. Perhaps 
something more along the lines of:

Matches spans matching a span from each clause, with up to <code>slop</code> 
unmatched positions between each of them




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768874#action_12768874
 ] 

Michael McCandless commented on LUCENE-2002:


bq. Maybe the search-replace with regex functionality can do it.

Excellent!  That worked like a charm.  I'll still leave the unit test in place 
to catch us if this fails...
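
The actual build change is an Ant replaceregexp step that is not quoted in this thread. Later in the thread it is described as making two JavaCC-generated public QueryParser ctors protected. The sketch below shows the same kind of regex search-replace in plain Java; the pattern and the ctor parameter types (CharStream, QueryParserTokenManager) are assumptions for illustration, not the contents of build.xml.

```java
// Illustration of a regex search-replace on generated source text.
// The pattern and parameter types are assumptions, not the real build.xml step.
class CtorVisibilitySketch {
    static String makeGeneratedCtorsProtected(String source) {
        return source.replaceAll(
            "public QueryParser\\((CharStream|QueryParserTokenManager) ",
            "protected QueryParser($1 ");
    }

    public static void main(String[] args) {
        String generated =
            "  public QueryParser(CharStream stream) {\n"
          + "  public QueryParser(String f, Analyzer a) {\n";
        // Only the first ctor matches the pattern and becomes protected.
        System.out.print(makeGeneratedCtorsProtected(generated));
    }
}
```

A unit test that asserts the resulting visibility, as mentioned above, guards against the pattern silently no longer matching after a JavaCC upgrade.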




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768878#action_12768878
 ] 

Uwe Schindler commented on LUCENE-2002:
---

Cool. Did you check the minimum ANT version needed for this? If the current 
BUILD.txt minimum does not fit, we should update the build and docs. My problem: 
I couldn't find the minimum version for replaceregexp in the docs.




[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768889#action_12768889
 ] 

Michael McCandless commented on LUCENE-2002:


I think we are good: I just looked @ 1.6.3's javadocs (we specify ant 1.6.3 in 
BUILD.txt) and it's got the replaceregexp task.




[jira] Updated: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2002:
---

Attachment: LUCENE-2002-29.patch

New patch attached.  All tests pass.  Changes:

  * Fixed the patch -> match typo

  * Fixed build.xml to make 2 autogen'd (by JavaCC) public QueryParser
ctors protected, and added unit test to assert this

  * Added Version matchVersion param to all (I think!) contrib
analyzers that instantiate either StandardTokenizer (to manage
changing the fix invalid acronym setting across versions), or
StopFilter (to manage enable pos incr setting across versions),
or, both, and threaded it down to StandardTokenizer & StopFilter

I didn't add Version to StopFilter nor StopAnalyzer; I think it's
better to up-front require the enablePositionIncrements to their
ctors.
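
The idea of threading a matchVersion parameter down to pick version-dependent defaults can be sketched as follows. The enum below is a simplified, hypothetical stand-in for oal.util.Version, and the assumption that position increments are enabled from 2.9 onward is taken from the issue descriptions in this thread, not from the patch itself.

```java
// Simplified, hypothetical stand-in for org.apache.lucene.util.Version.
enum MatchVersion {
    LUCENE_24, LUCENE_29, LUCENE_30;

    boolean onOrAfter(MatchVersion other) {
        return compareTo(other) >= 0;
    }
}

// Sketch of deriving a default from the version the caller asks to match.
class VersionDefaultsSketch {
    // Assumption: stop-word position increments are on by default from 2.9.
    static boolean defaultEnablePositionIncrements(MatchVersion matchVersion) {
        return matchVersion.onOrAfter(MatchVersion.LUCENE_29);
    }

    public static void main(String[] args) {
        for (MatchVersion v : MatchVersion.values()) {
            System.out.println(v + " -> " + defaultEnablePositionIncrements(v));
        }
    }
}
```

Requiring the flag up front in a low-level ctor, as proposed above for StopFilter and StopAnalyzer, is the alternative design: the caller states the behaviour explicitly instead of having it derived from a version.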





[jira] Commented: (LUCENE-2002) Add oal.util.Version ctor to QueryParser

2009-10-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768907#action_12768907
 ] 

Uwe Schindler commented on LUCENE-2002:
---

Looks good.

bq. I didn't add Version to StopFilter nor StopAnalyzer; I think it's better to 
up-front require the enablePositionIncrements to their ctors.

I would add it to StopAnalyzer, StopFilter is not so important (because 
low-level). But that's my opinion.




[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-22 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768916#action_12768916
 ] 

Uwe Schindler commented on LUCENE-1960:
---

I still prefer 1, but maybe it's not so good. Else I would implement 2 (even if 
we need FieldForMerge). Just remove the COMPRESS flag so that nobody can add any 
compressed fields anymore.

3 is bad, because it requires you to change your code when moving from 2.9 
to 3.0 if you had compressed fields. In 2.9 they were automatically 
uncompressed, in 3.0 not. This would make it impossible to simply replace the 
Lucene jar (which is currently possible if you remove all deprecated calls in 2.9).

 Remove deprecated Field.Store.COMPRESS
 --

 Key: LUCENE-1960
 URL: https://issues.apache.org/jira/browse/LUCENE-1960
 Project: Lucene - Java
  Issue Type: Task
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.0

 Attachments: lucene-1960-1.patch, lucene-1960.patch


 Also remove FieldForMerge and related code.




Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
Hey Michael:
   Would you mind rerunning the test you have with jdk1.5?

   Also, if you would, change the comparator method to avoid branching
for the int and string comparators, e.g.


  return index.order[i.doc] - index.order[j.doc];
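
To make the request concrete, here is a minimal sketch of the two comparator
styles. OrdComparators and its method names are hypothetical and are not the
classes in the patch; `order` stands in for a per-segment array mapping doc to
ord, assumed to hold only non-negative ints:

```java
// Hypothetical illustration; not the comparator classes from the patch.
final class OrdComparators {
    // Branching three-way compare.
    static int compareBranching(int[] order, int docI, int docJ) {
        int a = order[docI];
        int b = order[docJ];
        if (a < b) {
            return -1;
        }
        if (a > b) {
            return 1;
        }
        return 0;
    }

    // Subtraction compare: no branches. Correct only when the difference
    // cannot overflow, which holds for non-negative ords.
    static int compareSubtract(int[] order, int docI, int docJ) {
        return order[docI] - order[docJ];
    }
}
```

Both versions agree in sign for ords. For arbitrary int field values the
subtraction can overflow (Integer.MAX_VALUE - (-1) wraps to a negative
number), so the trick is only safe when the value range is bounded.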


Thanks


-John

On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote:

   I have been playing with the patch, and I think I have some information
  that you might like.
   Let me spend some time and gather some more numbers and update in jira.

 Excellent!

   say bottom has ords 23, 45, 76, each corresponding to a string. When
  moving to the next segment, you need to make bottom to have ords that can
  be comparable to other docs in this new segment, so you would need to find
  the new ords for the values in 23, 45 and 76, don't you? To find it,
  assuming the values are s1,s2,s3, you would do a bin. search on the new val
  array, and find index for s1,s2,s3.

 It's that inversion (from ord -> Comparable in first seg, and
 Comparable -> ord in second seg) that I'm trying to avoid (w/ this new
 proposal).

  Which is 3 bin searches per convert, I am not sure
  how you can short circuit it. Are you suggesting we call Comparable on
  compareBottom until some doc beats it?

 I'm saying on seg transition you indeed get the Comparable for current
 bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
 hit, you get that hit's Comparables and compare to bottom.  If it
 beats bottom, it goes into the queue.  If it does not, you use the ord
 (in seg 2's ord space) to learn a bottom in the ord space of seg 2.
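
 A minimal sketch of the lazy approach described above, assuming an ascending
 sort, ord == index into the new segment's sorted term array, and a bottom
 that stays fixed while the segment is scanned (a real collector would update
 bottom whenever a hit enters the queue). LazyBottom and its fields are
 hypothetical names, not part of the patch:

```java
// Hypothetical illustration of learning bottom's position in a new
// segment's ord space without a binary search up front.
final class LazyBottom {
    private final String bottomValue;   // bottom carried over as a Comparable
    private final String[] segValues;   // sorted terms of the new segment
    private int firstOrdNotBeating;     // smallest ord known NOT to beat bottom
    int valueCompares = 0;              // count of Comparable compares performed

    LazyBottom(String bottomValue, String[] segValues) {
        this.bottomValue = bottomValue;
        this.segValues = segValues;
        this.firstOrdNotBeating = segValues.length; // nothing learned yet
    }

    // True if the hit with this ord sorts strictly before bottom.
    boolean beatsBottom(int ord) {
        if (ord >= firstOrdNotBeating) {
            return false;               // decided purely in ord space
        }
        valueCompares++;
        if (segValues[ord].compareTo(bottomValue) < 0) {
            return true;                // caller would insert into the queue
        }
        firstOrdNotBeating = ord;       // tighten the learned bound
        return false;
    }
}
```

 The learned bound stays valid if bottom is later replaced by a better value,
 since a hit that could not beat the old bottom cannot beat a better one. The
 cost is the extra Comparable compares before the bound becomes tight.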

  That would hurt performance a lot though, no?

 Yeah I think likely it would, since we're talking about a binary
 search on transition VS having to do possibly many
 upgrade-to-Comparable and compare-Comparables to slowly learn the
 equivalent ord in the new segment.  I was proposing it for cases where
 inversion is very difficult.  But realistically, since you must keep
 around the full ord -> Comparable for every segment anyway (in order to
 merge in the end), inversion shouldn't ever actually be difficult --
 it'd just be a binary search on presumably in-RAM storage.
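
 The binary-search inversion can be sketched as follows, assuming each
 segment keeps its distinct terms in a sorted in-RAM array with ord == array
 index. OrdInversion and remap are hypothetical names used only here:

```java
import java.util.Arrays;

// Hypothetical illustration of ord -> Comparable -> ord on a segment switch.
final class OrdInversion {
    // oldValues and newValues are the sorted term arrays of the previous and
    // the new segment. Returns, for each bottom ord, the matching ord in the
    // new segment, or the negative (-(insertionPoint) - 1) encoding that
    // Arrays.binarySearch produces when the term is absent.
    static int[] remap(String[] oldValues, String[] newValues, int[] bottomOrds) {
        int[] out = new int[bottomOrds.length];
        for (int k = 0; k < bottomOrds.length; k++) {
            String value = oldValues[bottomOrds[k]];        // ord -> Comparable
            out[k] = Arrays.binarySearch(newValues, value); // Comparable -> ord
        }
        return out;
    }
}
```

 This is one binary search per value held by bottom, so three searches for the
 three-ord example discussed above. A term missing from the new segment comes
 back as a negative insertion-point encoding that the caller has to handle.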

 Mike





Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
Why? What might he find? What's with the cryptic request?

Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains?

I know point 2 certainly doesn't. Cards on the table?

John Wang wrote:
 Hey Michael:

Would you mind rerunning the test you have with jdk1.5?

Also, if you would, change the comparator method to avoid
 branching for the int and string comparators, e.g. 


   return index.order[i.doc] - index.order[j.doc];


 Thanks


 -John






-- 
- Mark

http://www.lucidimagination.com







Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
Mark:
   Please be patient with me. I am seeing a difference and was wondering
if Mike would see the same thing. I thought Michael would be willing to
because he expressed interest in understanding what the performance
discrepancies are.

   Again, it is only a request. It is perfectly fine if Michael refuses
to. But it would be great if Michael speaks for himself.

Thanks

-John

On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller markrmil...@gmail.com wrote:

 Why? What might he find? What's with the cryptic request?

 Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains?

 I know point 2 certainly doesn't. Cards on the table?





Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
Mark,

  We're not seeing exactly the numbers that Mike is seeing in his tests,
running with jdk 1.5 on intel macs, so we're trying to eliminate factors of
difference.

  Point 2 does indeed make a difference, we've seen it, and it's only fair:
the single pq comparator does this branch optimization but the current patch
multi-pq does not, so let's level the playing field.

  John's on the road with limited net connectivity, but we'll have some
numbers to compare more over the weekend for sure.

  -jake

On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller markrmil...@gmail.com wrote:

 Why? What might he find? What's with the cryptic request?

 Why would Java 1.5 perform better than 1.6? It erases 20 and 40% gains?

 I know point 2 certainly doesn't. Cards on the table?





Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
I am patient :) And I'm not speaking for Mike, I'm speaking for me. I'm
wondering what you're seeing. Asking Mike to rerun the tests without
giving any further info (you didn't even say that you're seeing something
different) is unfair to the rest of us ;)

Giving 0 info along with your request just makes 0 sense to me and I
said as much.

John Wang wrote:
 Mark:

Please be patient with me. I am seeing a difference and was
 wondering if Mike would see the same thing. I thought Michael would be
 willing to because he expressed interest in understanding what the
 performance discrepancies are.

Again, it is only a request. It is perfectly fine if Michael
 refuses to. But it would be great if Michael speaks for himself.

 Thanks

 -John


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
Thanks - that's all I'm asking for. A simple explanation of why you'd ask
for a retest with those two things changed. It just seems like holding your
cards a little too close to say "please do this" with 0 explanation.

As to point 2, that's fine - I'm sure it helps - I was just saying I
didn't buy that it helps by 20-40%. Not arguing against doing it, but since
the request had no info, the only thing I could assume was that that was
supposed to change things.

I was about to run some of these tests myself (if I can find what darn
revision to patch), and it's a bit frustrating to see you guys knew
something but were not telling ...

Jake Mannix wrote:
 Mark,
  
   We're not seeing exactly the numbers that Mike is seeing in his tests,
 running with jdk 1.5 on intel macs, so we're trying to eliminate
 factors of difference.

   Point 2 does indeed make a difference, we've seen it, and it's only
 fair: the
 single pq comparator does this branch optimization but the current
 patch multi-pq
 does not, so let's level the playing field.

   John's on the road with limited net connectivity, but we'll have
 some numbers to
 compare more over the weekend for sure.

   -jake


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
Mike:
   I did just post with what I saw, feel free to read and comment on it.

   I am simply trying to work with Michael on this and trying to
understand the code.

   As I have expressed previously, I have seen a difference between 1.5
and 1.6 that is significant. Since Mike has posted some numbers on jdk 1.6,
I was hoping to eliminate all variables relating to the index and
environment and see if he sees the same thing.

I guess I should be more clear in the email.

-John

On Thu, Oct 22, 2009 at 7:39 PM, Mark Miller markrmil...@gmail.com wrote:

 I am patient :) And I'm not speaking for Mike, I'm speaking for me. I'm
 wondering what you're seeing. Asking Mike to rerun the tests without
 giving any further info (you didn't even say that you're seeing something
 different) is unfair to the rest of us ;)

 Giving 0 info along with your request just makes 0 sense to me and I
 said as much.


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
For some reason I guess this didn't go thru and caused all the confusion.

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|all|100|rand string|10|91.76|108.63|{color:green}18.4%{color}|
|log|all|100|rand string|25|92.39|106.79|{color:green}15.6%{color}|
|log|all|100|rand string|50|91.30|104.02|{color:green}13.9%{color}|
|log|all|100|rand string|500|86.16|63.27|{color:red}-26.6%{color}|
|log|all|100|rand string|1000|76.92|64.85|{color:red}-15.7%{color}|
|log|all|100|country|10|92.42|108.78|{color:green}17.7%{color}|
|log|all|100|country|25|92.60|106.26|{color:green}14.8%{color}|
|log|all|100|country|50|92.64|103.76|{color:green}12.0%{color}|
|log|all|100|country|500|83.92|50.30|{color:red}-40.1%{color}|
|log|all|100|country|1000|74.78|46.59|{color:red}-37.7%{color}|
|log|all|100|rand int|10|114.03|114.85|{color:green}0.7%{color}|
|log|all|100|rand int|25|113.77|112.92|{color:red}-0.7%{color}|
|log|all|100|rand int|50|113.36|109.56|{color:red}-3.4%{color}|
|log|all|100|rand int|500|103.90|66.29|{color:red}-36.2%{color}|
|log|all|100|rand int|1000|89.52|70.67|{color:red}-21.1%{color}|
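
The Pct change column above is consistent with the relative change in QPS,
(new - old) / old, rounded to one decimal place. A small sketch to recompute
it; QpsChange and pctChange are names invented here, not part of any
benchmark tool:

```java
// Hypothetical helper for re-deriving the "Pct change" column.
final class QpsChange {
    // Relative change from oldQps to newQps, in percent, rounded to 0.1.
    static double pctChange(double oldQps, double newQps) {
        return Math.round((newQps - oldQps) / oldQps * 1000.0) / 10.0;
    }
}
```

For example, the first row (91.76 to 108.63) comes out as 18.4 and the
country / top 500 row (83.92 to 50.30) as -40.1, matching the table.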

On Thu, Oct 22, 2009 at 7:43 PM, John Wang john.w...@gmail.com wrote:

 Mike:
I did just post with what I saw, feel free to read and comment on
 it.

I am simply trying to work with Michael on this and trying to
 understand the code.

As I have expressed previously, I have seen a difference between 1.5
 and 1.6 that is significant. Since Mike has posted some numbers on jdk 1.6,
 I was hoping to eliminate all variables relating to the index and
 environment and see if he sees the same thing.

 I guess I should be more clear in the email.

 -John


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
   I guess I should be more clear in the email.

No - if you mentioned the other info before and I missed it, just say:
"Mark, you don't know what you're talking about and you missed the info."
That's what I'd do.

You just caught me at a time when I'm trying to get these tests going
myself, and a little frustrated at the lack of info. I'd consider trying
Java 6 vs Java 1.5 or something on Linux, but with no reason why I
should, it's like ... come on - throw me a bone.

John Wang wrote:
 Mike:

I did just post with what I saw, feel free to read and comment
 on it.

I am simply trying to work with Michael on this and trying to
 understand the code.

As I have expressed previously, I have seen a difference
 between 1.5 and 1.6 that is significant. Since Mike has posted some
 numbers on jdk 1.6, I was hoping to eliminate all variables relating
 to the index and environment and see if he sees the same thing.

 I guess I should be more clear in the email.

 -John

 On Thu, Oct 22, 2009 at 7:39 PM, Mark Miller markrmil...@gmail.com wrote:

 I am patient :) And I'm not speaking for Mike, I'm speaking for me. I'm
 wondering what you're seeing. Asking Mike to rerun the tests without
 giving any further info (you didn't even say that you're seeing something
 different) is unfair to the rest of us ;)

 Giving 0 info along with your request just makes 0 sense to me and I
 said as much.

 John Wang wrote:
  Mark:
 
  Please be patient with me. I am seeing a difference and was
  wondering if Mike would see the same thing. I thought Michael would be
  willing to because he expressed interest in understanding what the
  performance discrepancies are.

  Again, it is only a request. It is perfectly fine if Michael
  refuses to. But it would be great if Michael speaks for himself.
 
  Thanks
 
  -John
 
  On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller markrmil...@gmail.com wrote:

  Why? What might he find? What's with the cryptic request?

  Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
  gains?

  I know point 2 certainly doesn't. Cards on the table?
 
  John Wang wrote:
   Hey Michael:
  
  Would you mind rerunning the test you have with jdk1.5?
  
  Also, if you would, change the comparator method to avoid
   branching for int and string comparators, e.g.
  
  
 return index.order[i.doc] - index.order[j.doc];
  
  
   Thanks
  
  
   -John
  
  

Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
Mark:
There is no reason for me to withhold information. I just want to
understand and share my findings.

My bad for not being clear.

Mike's test is actually very well written; I just followed the
instructions in the jira and got it running. I think the tests have good
coverage and show the symptoms the algorithms would suggest.

-John

On Thu, Oct 22, 2009 at 7:42 PM, Mark Miller markrmil...@gmail.com wrote:

 Thanks - that's all I'm asking for. A simple explanation of why you'd ask
 for a retest with those two things changed. It just seems like holding
 your cards a little too close to say - please do this with 0 explanation.

 As to point 2, that's fine - I'm sure it helps - I was just saying I
 didn't buy that it helps by 20-40%. Not arguing against doing it, but since
 the request had no info, the only thing I could assume was that that was
 supposed to change things.

 I was about to run some of these tests myself (if I can find what darn
 revision to patch), and it's a bit frustrating to see you guys knew
 something but were not telling ...

 Jake Mannix wrote:
  Mark,
 
  We're not seeing exactly the numbers that Mike is seeing in his tests,
  running with jdk 1.5 on intel macs, so we're trying to eliminate
  factors of difference.

  Point 2 does indeed make a difference, we've seen it, and it's only
  fair: the single pq comparator does this branch optimization but the
  current patch multi-pq does not, so let's level the playing field.

  John's on the road with limited net connectivity, but we'll have
  some numbers to compare more over the weekend for sure.

  -jake
 
  On Thu, Oct 22, 2009 at 7:29 PM, Mark Miller markrmil...@gmail.com wrote:

  Why? What might he find? What's with the cryptic request?
 
  Why would Java 1.5 perform better than 1.6? It erases 20 and 40%
  gains?
 
  I know point 2 certainly doesn't. Cards on the table?
 
  John Wang wrote:
   Hey Michael:
  
  Would you mind rerunning the test you have with jdk1.5?
  
  Also, if you would, change the comparator method to avoid
   branching for int and string comparators, e.g.
  
  
 return index.order[i.doc] - index.order[j.doc];
  
  
   Thanks
  
  
   -John
  
  
   On Thu, Oct 22, 2009 at 2:38 AM, Michael McCandless
   luc...@mikemccandless.com wrote:

   On Thu, Oct 22, 2009 at 2:17 AM, John Wang john.w...@gmail.com wrote:

 I have been playing with the patch, and I think I have some information
 that you might like.
 Let me spend some time and gather some more numbers and update in jira.

   Excellent!

 say bottom has ords 23, 45, 76, each corresponding to a string. When
 moving to the next segment, you need to make bottom to have ords that can
 be comparable to other docs in this new segment, so you would need to find
 the new ords for the values in 23, 45 and 76, don't you? To find it,
 assuming the values are s1,s2,s3, you would do a bin. search on the new
 val array, and find index for s1,s2,s3.

   It's that inversion (from ord->Comparable in first seg, and
   Comparable->ord in second seg) that I'm trying to avoid (w/ this new
   proposal).

 Which is 3 bin searches per convert, I am not sure how you can short
 circuit it. Are you suggesting we call Comparable on compareBottom until
 some doc beats it?

   I'm saying on seg transition you indeed get the Comparable for current
   bottom, but, don't attempt to invert it.  Instead, as seg 2 finds a
   hit, you get that hit's Comparables and compare to bottom.  If it
   beats bottom, it goes into the queue.  If it does not, you use the ord
   (in seg 2's ord space) to learn a bottom in the ord space of seg 2.

 That would hurt performance a lot though, no?

   Yeah I think likely it would, since we're talking about a binary
   search on transition VS having to do possibly many
   upgrade-to-Comparable and compare-Comparables to slowly learn the
   equivalent ord in the new segment.  I was proposing it for cases where
   inversion is very difficult.  But realistically, since you must keep
   around the full ord -> Comparable for every segment anyway (in order to
   merge in the end), 
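
For reference, the conversion debated in the quoted exchange (re-mapping the
bottom entry's values into the next segment's ord space, one binary search
per value) could look roughly like the sketch below. It assumes each segment
exposes its terms as a sorted String[]; the names OrdRemapSketch,
remapBottom and newSegmentTerms are made up for illustration and are not
Lucene API.

```java
import java.util.Arrays;

final class OrdRemapSketch {
    /**
     * Maps the bottom entry's values into the ord space of a new segment.
     * A non-negative result is the ord of an exact match. A negative result
     * is Arrays.binarySearch's encoding -(insertionPoint) - 1, meaning the
     * value does not occur in this segment and would sort just before the
     * term at insertionPoint.
     */
    static int[] remapBottom(String[] bottomValues, String[] newSegmentTerms) {
        int[] ords = new int[bottomValues.length];
        for (int i = 0; i < bottomValues.length; i++) {
            // One binary search per value: O(log T) string comparisons each.
            ords[i] = Arrays.binarySearch(newSegmentTerms, bottomValues[i]);
        }
        return ords;
    }

    public static void main(String[] args) {
        // Hypothetical data: sorted terms of the new segment, and the
        // values held by the current bottom entry.
        String[] newSegmentTerms = {"apple", "cherry", "grape", "peach"};
        String[] bottomValues = {"cherry", "fig", "peach"};
        System.out.println(Arrays.toString(remapBottom(bottomValues, newSegmentTerms)));
    }
}
```

With three values in bottom this is the "3 bin searches per convert" cost
John mentions; a value missing from the new segment still yields a usable
position, because the negative result encodes where it would be inserted.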

Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
John Wang wrote:
 Mark:

There is no reason for me to withhold information. I just want
 to understand and share my findings.
Right, I didn't mean to accuse you of that ;) Not that you were doing it
on purpose. I was just trying to string out more :) Which I've managed
to do - in my usual awkward ending up email thread way. Success :)

 My bad for not being clear.

 Mike's test is actually very well written, I just followed
 instructions in the jira and got it running. I think the tests have
 good coverage and show the symptoms the algorithms would suggest.
Yeah, I'm not complaining about his tests - I'm just trying to find a
version of Lucene that it will patch into cleanly.

 -John


Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
bq. I just followed instructions in the jira and got it running.

Heh - I didn't read down far enough - first comment says 2.9 branch.
Thanks ; ) I've been flipping through revisions for a while now,
wondering how the heck the revs in the patch match up with trunk.



Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Yonik Seeley
On Thu, Oct 22, 2009 at 10:35 PM, John Wang john.w...@gmail.com wrote:
        Please be patient with me. I am seeing a difference and was wondering
 if Mike would see the same thing.

Some differences are bound to be seen... with your changes (JVM
changes, branch optimizations), are you seeing better average
performance with multiPQ?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
It's hard to read the column format, but if you look up above in the thread
from tonight, you can see that yes, for PQ sizes less than 100 elements,
multiPQ is better, and only starts to be worse at around 100 for strings,
and 50 for ints.

  -jake





[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769039#action_12769039 ]

Mark Miller commented on LUCENE-1997:
-

Results from John Wang:

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|all|100|rand string|10|91.76|108.63|{color:green}18.4%{color}|
|log|all|100|rand string|25|92.39|106.79|{color:green}15.6%{color}|
|log|all|100|rand string|50|91.30|104.02|{color:green}13.9%{color}|
|log|all|100|rand string|500|86.16|63.27|{color:red}-26.6%{color}|
|log|all|100|rand string|1000|76.92|64.85|{color:red}-15.7%{color}|
|log|all|100|country|10|92.42|108.78|{color:green}17.7%{color}|
|log|all|100|country|25|92.60|106.26|{color:green}14.8%{color}|
|log|all|100|country|50|92.64|103.76|{color:green}12.0%{color}|
|log|all|100|country|500|83.92|50.30|{color:red}-40.1%{color}|
|log|all|100|country|1000|74.78|46.59|{color:red}-37.7%{color}|
|log|all|100|rand int|10|114.03|114.85|{color:green}0.7%{color}|
|log|all|100|rand int|25|113.77|112.92|{color:red}-0.7%{color}|
|log|all|100|rand int|50|113.36|109.56|{color:red}-3.4%{color}|
|log|all|100|rand int|500|103.90|66.29|{color:red}-36.2%{color}|
|log|all|100|rand int|1000|89.52|70.67|{color:red}-21.1%{color}|
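
The "Pct change" column in the table above appears to be the relative
difference between the two QPS columns. A minimal sketch of that
arithmetic follows; it assumes "QPS old" is the single-PQ run and "QPS new"
is the multi-PQ run, and the class and method names are made up for
illustration.

```java
import java.util.Locale;

final class PctChangeSketch {
    /** Relative change of the new QPS against the old QPS, in percent. */
    static double pctChange(double qpsOld, double qpsNew) {
        return (qpsNew - qpsOld) / qpsOld * 100.0;
    }

    public static void main(String[] args) {
        // "rand string" rows for Top N = 10 and Top N = 500 from the table.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctChange(91.76, 108.63)));
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctChange(86.16, 63.27)));
    }
}
```

A positive value means the "new" run answered more queries per second than
the "old" one; a negative value means it answered fewer.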

 Explore performance of multi-PQ vs single-PQ sorting API
 

 Key: LUCENE-1997
 URL: https://issues.apache.org/jira/browse/LUCENE-1997
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-1997.patch, LUCENE-1997.patch


 Spinoff from recent lucene 2.9 sorting algorithm thread on java-dev,
 where a simpler (non-segment-based) comparator API is proposed that
 gathers results into multiple PQs (one per segment) and then merges
 them in the end.
 I started from John's multi-PQ code and worked it into
 contrib/benchmark so that we could run perf tests.  Then I generified
 the Python script I use for running search benchmarks (in
 contrib/benchmark/sortBench.py).
 The script first creates indexes with 1M docs (based on
 SortableSingleDocSource, and based on wikipedia, if available).  Then
 it runs various combinations:
   * Index with 20 balanced segments vs index with the normal log
 segment size
   * Queries with different numbers of hits (only for wikipedia index)
   * Different top N
   * Different sorts (by title, for wikipedia, and by random string,
 random int, and country for the random index)
 For each test, 7 search rounds are run and the best QPS is kept.  The
 script runs singlePQ then multiPQ, records the resulting best QPS
 for each, and produces a table (in Jira format) as output.
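
The multi-PQ idea described in this issue (one queue per segment, merged at
the end) can be illustrated with the sketch below. It is not the code from
the attached patch: segments are modelled as plain int arrays of sort
values, the smallest values win, and all names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class MultiPqSketch {
    /** Collects the n smallest values of one segment in a bounded max-heap. */
    static PriorityQueue<Integer> topNForSegment(int[] segmentValues, int n) {
        // Max-heap: peek() is the current "bottom", the worst kept value.
        PriorityQueue<Integer> pq = new PriorityQueue<>(n, Collections.reverseOrder());
        for (int v : segmentValues) {
            if (pq.size() < n) {
                pq.add(v);
            } else if (v < pq.peek()) {
                pq.poll(); // evict the old bottom
                pq.add(v);
            }
        }
        return pq;
    }

    /** Runs one queue per segment, then merges them into a global top n. */
    static List<Integer> search(int[][] segments, int n) {
        List<Integer> merged = new ArrayList<>();
        for (int[] segment : segments) {
            // Within a segment only that segment's own values are compared.
            merged.addAll(topNForSegment(segment, n));
        }
        // Cross-segment comparison happens only here, at merge time.
        Collections.sort(merged);
        return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
    }

    public static void main(String[] args) {
        int[][] segments = {{5, 1, 9}, {3, 7}, {2, 8, 4}};
        System.out.println(search(segments, 3));
    }
}
```

The design point under test is visible in the structure: no comparison
crosses a segment boundary until the final merge, at the cost of keeping up
to n entries per segment instead of n entries in total.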

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Jake Mannix (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769042#action_12769042 ]

Jake Mannix commented on LUCENE-1997:
-

Hah!  Thanks for posting that, Mark!   Much easier to read. :)

Hey John, can you comment with your hardware specs on this, so it can be 
recorded for posterity? ;)




Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
Hi Yonik
I am, but I don't think I should. Even with branching etc., I shouldn't see
that much of a consistent difference.
I am traveling with my macbook pro, I wanted to eliminate all variables.
It really does not make sense to me...

-John





[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread John Wang (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769045#action_12769045 ]

John Wang commented on LUCENE-1997:
---

My machine HW spec:

  Model Name:            MacBook Pro
  Model Identifier:      MacBookPro3,1
  Processor Name:        Intel Core 2 Duo
  Processor Speed:       2.4 GHz
  Number Of Processors:  1
  Total Number Of Cores: 2
  L2 Cache:              4 MB
  Memory:                4 GB
  Bus Speed:             800 MHz




Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Yonik Seeley
On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix jake.man...@gmail.com wrote:
 It's hard to read the column format, but if you look up above in the thread
 from tonight, you can see that yes, for PQ sizes less than 100 elements,
 multiPQ is better, and only starts to be worse at around 100 for strings,
 and 50 for ints.

Ah, OK, I had missed John's followup with the numbers.

I assume this is for Java5 + optimizations?
What does Java6 show?

bq. Point 2 does indeed make a difference, we've seen it, and it's
only fair: the single pq comparator does this branch optimization but
the current patch multi-pq does not, so let's level the playing field.

Of course - it's not about leveling the playing field, but finding the
best solution for the average case - so everything should be optimized
as much as possible.  There are probably further optimizations
possible in both the single and multi PQ cases.
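
For readers following along, the "point 2" change under discussion (John's
suggested return index.order[i.doc] - index.order[j.doc]) replaces a
branching comparison with a subtraction. The sketch below contrasts the two
forms; the class, method and parameter names are assumptions for
illustration and are not taken from the patch.

```java
final class OrdCompareSketch {
    /** Branching form: two conditional jumps per comparison. */
    static int compareBranching(int[] order, int docA, int docB) {
        int a = order[docA];
        int b = order[docB];
        if (a < b) {
            return -1;
        }
        if (a > b) {
            return 1;
        }
        return 0;
    }

    /**
     * Subtraction form: no branch. Only the sign of the result matters.
     * This is safe here because ords are non-negative array positions, so
     * the difference cannot overflow; it is NOT safe for arbitrary ints
     * (e.g. Integer.MIN_VALUE - 1 wraps around to a positive number).
     */
    static int compareSubtract(int[] order, int docA, int docB) {
        return order[docA] - order[docB];
    }

    public static void main(String[] args) {
        int[] order = {5, 2, 9, 2}; // hypothetical doc -> ord mapping
        System.out.println(compareBranching(order, 0, 1));
        System.out.println(Integer.signum(compareSubtract(order, 0, 1)));
    }
}
```

Both forms give results with the same sign for non-negative ords, which is
all a priority queue needs; whether the subtraction form is measurably
faster depends on the JVM and hardware, which is exactly what the
benchmarks in this thread are trying to establish.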

My biggest reservation is that we've gone down the road of telling
people to implement a new style of comparators, and told them that the
old style comparators would be deleted in the next release (which is
where we are).  Reversing that will be a bit of a headache/question...
the new stuff isn't deprecated, and having *both* isn't desirable, but
that's a separate decision to be made apart from performance testing.

Is there also an option of using a multiPQ approach with the new style
comparators?

-Yonik
http://www.lucidimagination.com




Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix jake.man...@gmail.com wrote:
  It's hard to read the column format, but if you look up above in the
  thread from tonight, you can see that yes, for PQ sizes less than 100
  elements, multiPQ is better, and only starts to be worse at around 100
  for strings, and 50 for ints.

 Ah, OK, I had missed John's followup with the numbers.

 I assume this is for Java5 + optimizations?


Yeah, this was for Java5 + optimizations.


 What does Java6 show?


Java6 on Mac showed close to what Mike posted in his report on the Jira
ticket: single-PQ performs a little better for small pq, and more like
30-40% better for large pq.


 My biggest reservation is that we've gone down the road of telling
 people to implement a new style of comparators, and told them that the
 old style comparators would be deleted in the next release (which is
 where we are).  Reversing that will be a bit of a headache/question...
 the new stuff isn't deprecated, and having *both* isn't desirable, but
 that's a separate decision to be made apart from performance testing.


Well the issue comes down to: if the performance is *basically comparable*
between the two approaches, then the new API is much harder for the
average user to use, and even for the experienced user, it's not terribly
fun, and more importantly: for the user who has already implemented custom
sorts on the old API, upgrading is enough trouble that people may decide
it's not worth it.  It probably *is* worth it, but if you're going to even
put that kind of thinking in the user's head, you've got to ask yourself:
what's the reasoning for going with a more complex API if you can get equal
(slightly better in some cases, slightly worse in others) performance with
a simpler API?

Yes, as Mike says, the new API is *not* breaking back-compat in a
functional sense, but how many users have converted to the new sorting
api already?  2.9 has barely just come out, and while it's work for the
community as a whole to reconsider the multi-segment sorting api, and
work to implement a change at this level, if it's the right thing to do,
we shouldn't let the question of which method is deprecated dictate
which one *should* be deprecated.


 Is there also an option of using a multiPQ approach with the new style
 comparators?


For the record: that would be the worst of all worlds, in my view: harder
API with only better performance in some cases, and sometimes worse
performance.

  -jake


[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769051#action_12769051 ]

Mark Miller commented on LUCENE-1997:
-

Another run:

I made the changes to int/string comparator to do the faster compare.
Java 1.5.0_20
Laptop
Quad Core - 2.0 GHz
Ubuntu 9.10 Kernel 2.6.31
4 GB RAM

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|1|317925|title|10|87.38|75.42|{color:red}-13.7%{color}|
|log|1|317925|title|25|86.55|74.49|{color:red}-13.9%{color}|
|log|1|317925|title|50|90.49|71.90|{color:red}-20.5%{color}|
|log|1|317925|title|100|88.07|83.08|{color:red}-5.7%{color}|
|log|1|317925|title|500|76.67|54.34|{color:red}-29.1%{color}|
|log|1|317925|title|1000|69.29|38.54|{color:red}-44.4%{color}|
|log|all|100|title|10|109.01|92.78|{color:red}-14.9%{color}|
|log|all|100|title|25|108.30|89.43|{color:red}-17.4%{color}|
|log|all|100|title|50|107.19|85.86|{color:red}-19.9%{color}|
|log|all|100|title|100|94.84|80.25|{color:red}-15.4%{color}|
|log|all|100|title|500|78.84|49.10|{color:red}-37.7%{color}|
|log|all|100|title|1000|72.52|26.90|{color:red}-62.9%{color}|
|log|all|100|rand string|10|115.32|101.53|{color:red}-12.0%{color}|
|log|all|100|rand string|25|115.22|91.82|{color:red}-20.3%{color}|
|log|all|100|rand string|50|114.40|89.70|{color:red}-21.6%{color}|
|log|all|100|rand string|100|91.30|81.04|{color:red}-11.2%{color}|
|log|all|100|rand string|500|76.31|43.94|{color:red}-42.4%{color}|
|log|all|100|rand string|1000|67.33|28.29|{color:red}-58.0%{color}|
|log|all|100|country|10|115.40|101.46|{color:red}-12.1%{color}|
|log|all|100|country|25|115.06|92.15|{color:red}-19.9%{color}|
|log|all|100|country|50|114.03|90.06|{color:red}-21.0%{color}|
|log|all|100|country|100|99.30|80.07|{color:red}-19.4%{color}|
|log|all|100|country|500|75.64|43.44|{color:red}-42.6%{color}|
|log|all|100|country|1000|66.05|27.94|{color:red}-57.7%{color}|
|log|all|100|rand int|10|118.47|109.30|{color:red}-7.7%{color}|
|log|all|100|rand int|25|118.72|99.37|{color:red}-16.3%{color}|
|log|all|100|rand int|50|118.25|95.14|{color:red}-19.5%{color}|
|log|all|100|rand int|100|97.57|83.39|{color:red}-14.5%{color}|
|log|all|100|rand int|500|86.55|46.21|{color:red}-46.6%{color}|
|log|all|100|rand int|1000|78.23|28.94|{color:red}-63.0%{color}|
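
As a rough check on how the Pct change column can be read, the sketch below recomputes it from the two QPS columns. It assumes that "QPS old" is the single-PQ run and "QPS new" is the multi-PQ run, and that the column is the relative change of new versus old; the class and method names are hypothetical and are not taken from sortBench.py.

```java
import java.util.Locale;

// Hypothetical helper for reading the table; not part of sortBench.py or Lucene.
class PctChange {
    /** Relative change of the new QPS versus the old QPS, in percent. */
    static double pctChange(double qpsOld, double qpsNew) {
        return 100.0 * (qpsNew - qpsOld) / qpsOld;
    }

    /** Formats a percentage with one decimal place, as in the table. */
    static String format(double pct) {
        return String.format(Locale.ROOT, "%.1f%%", pct);
    }

    public static void main(String[] args) {
        // Row: title sort, query 1, top N = 10.
        System.out.println(format(pctChange(87.38, 75.42)));   // -13.7%
        // Row: title sort, all docs, top N = 1000.
        System.out.println(format(pctChange(72.52, 26.90)));   // -62.9%
    }
}
```

A negative value means the run in the "QPS new" column answered fewer queries per second than the run in the "QPS old" column.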



 Explore performance of multi-PQ vs single-PQ sorting API
 

 Key: LUCENE-1997
 URL: https://issues.apache.org/jira/browse/LUCENE-1997
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-1997.patch, LUCENE-1997.patch


 Spinoff from recent lucene 2.9 sorting algorithm thread on java-dev,
 where a simpler (non-segment-based) comparator API is proposed that
 gathers results into multiple PQs (one per segment) and then merges
 them in the end.
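
 A minimal sketch of the multi-PQ idea described above, using only
 java.util.PriorityQueue rather than the actual Lucene collector or
 John's patch. The class name, the use of plain String sort values, and
 the ascending sort order are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: one bounded queue per segment, merged at the end.
class MultiPqSketch {
    /** Keeps the topN smallest values of one segment in a bounded max-heap. */
    static PriorityQueue<String> collectSegment(List<String> segmentValues, int topN) {
        PriorityQueue<String> pq =
                new PriorityQueue<String>(Collections.<String>reverseOrder());
        if (topN <= 0) {
            return pq;
        }
        for (String value : segmentValues) {
            if (pq.size() < topN) {
                pq.add(value);
            } else if (value.compareTo(pq.peek()) < 0) {
                // The new value beats the current worst entry of this segment.
                pq.poll();
                pq.add(value);
            }
        }
        return pq;
    }

    /** Collects each segment independently, then merges the per-segment queues. */
    static List<String> search(List<List<String>> segments, int topN) {
        List<String> merged = new ArrayList<String>();
        for (List<String> segment : segments) {
            merged.addAll(collectSegment(segment, topN));
        }
        // Values from different segments are compared only here, at the end.
        Collections.sort(merged);
        return new ArrayList<String>(merged.subList(0, Math.max(0, Math.min(topN, merged.size()))));
    }
}
```

 The point of the sketch is that no comparison crosses a segment
 boundary during collection; the cost is that up to topN entries per
 segment are kept and merged, which is why large top N values matter in
 the benchmark.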
 I started from John's multi-PQ code and worked it into
 contrib/benchmark so that we could run perf tests.  Then I generified
 the Python script I use for running search benchmarks (in
 contrib/benchmark/sortBench.py).
 The script first creates indexes with 1M docs (based on
 SortableSingleDocSource, and based on wikipedia, if available).  Then
 it runs various combinations:
   * Index with 20 balanced segments vs index with the normal log
 segment size
   * Queries with different numbers of hits (only for wikipedia index)
   * Different top N
   * Different sorts (by title, for wikipedia, and by random string,
 random int, and country for the random index)
 For each test, 7 search rounds are run and the best QPS is kept.  The
 script runs singlePQ then multiPQ, records the resulting best QPS
 for each, and produces a table (in Jira format) as output.
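
 The "best of 7 rounds" measurement can be sketched as follows. This is
 not the sortBench.py script (which is Python and drives
 contrib/benchmark); the class name, the queriesPerRound parameter, and
 the use of a Runnable to stand in for one search are assumptions.

```java
// Illustration only: keep the best queries-per-second figure over several rounds.
class BestQps {
    /** Runs `rounds` rounds of `queriesPerRound` calls and returns the best QPS seen. */
    static double bestQps(int rounds, int queriesPerRound, Runnable query) {
        double best = 0.0;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            for (int q = 0; q < queriesPerRound; q++) {
                query.run();
            }
            // Guard against a zero elapsed time on very fast rounds.
            long elapsedNanos = Math.max(1L, System.nanoTime() - start);
            double qps = queriesPerRound * 1e9 / elapsedNanos;
            best = Math.max(best, qps);
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in workload; a real run would execute a sorted search here.
        double qps = bestQps(7, 1000, new Runnable() {
            public void run() {
                Math.sqrt(12345.678);
            }
        });
        System.out.println("best QPS over 7 rounds: " + qps);
    }
}
```

 Keeping the best round rather than the mean reduces the influence of
 warm-up and background noise, at the price of reporting an optimistic
 figure for both approaches.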

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
 the new API is much harder for the
 average user to use, and even for the experienced user, it's not
 terribly fun,
 and more importantly:

Do we have enough info to support that though? In all the cases I have seen
on the list, people have figured it out pretty easily - I haven't really
seen any complaints in that regard (not counting you and John - that is
two). The only other complaints I have noticed are from those who happened
to count on unsupported behavior (e.g. people counting on no MultiSearcher
use).

I think Uwe had some good ideas for exposing an easier API with the new one.


Jake Mannix wrote:
 On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:

 On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix
 jake.man...@gmail.com wrote:
  It's hard to read the column format, but if you look up above in the
  thread from tonight, you can see that yes, for PQ sizes less than 100
  elements, multiPQ is better, and only starts to be worse at around 100
  for strings, and 50 for ints.

 Ah, OK, I had missed John's followup with the numbers.

 I assume this is for Java5 + optimizations?


 Yeah, this was for Java5 + optimizations.
  

 What does Java6 show?


 Java6 on Mac showed close to what Mike posted in his report on the
 Jira ticket - that single-PQ performs a little better for small pq,
 and more like 30-40% better for large pq.
  

 My biggest reservation is that we've gone down the road of telling
 people to implement a new style of comparators, and told them that the
 old style comparators would be deleted in the next release (which is
 where we are).  Reversing that will be a bit of a headache/question...
 the new stuff isn't deprecated, and having *both* isn't desirable, but
 that's a separate decision to be made apart from performance testing.


 Well the issue comes down to: if the performance is *basically comparable*
 between the two approaches, then the new API is much harder for the
 average user to use, and even for the experienced user, it's not
 terribly fun, and more importantly: for the user who has already
 implemented custom sorts on the old API, upgrading is enough trouble
 that people may decide it's not worth it.  It probably *is* worth it,
 but if you're going to even put that kind of thinking in the user's
 head, you've got to ask yourself: what's the reasoning for going with a
 more complex API if you can get equal (slightly better in some cases,
 slightly worse in others) performance with a simpler API?

 Yes, as Mike says, the new API is *not* breaking back-compat in a
 functional sense, but how many users have converted to the new sorting
 api already?  2.9 has barely just come out, and while it's work for the
 community as a whole to reconsider the multi-segment sorting api, and
 work to implement a change at this level, if it's the right thing to do,
 we shouldn't let the question of which method is deprecated dictate
 which one *should* be deprecated.


 Is there also an option of using a multiPQ approach with the new style
 comparators?


 For the record: that would be the worst of all worlds, in my view: harder
 API with only better performance in some cases, and sometimes worse
 performance.

   -jake





[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769053#action_12769053
 ] 

Yonik Seeley commented on LUCENE-1997:
--

While Java5 numbers are still important, I'd say that Java6 (-server of course) 
should be weighted far more heavily?  That must be what a majority of people are 
running in production for new systems?





[jira] Issue Comment Edited: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769051#action_12769051
 ] 

Mark Miller edited comment on LUCENE-1997 at 10/23/09 4:29 AM:
---

Another run:

I made the changes to int/string comparator to do the faster compare.
Java 1.5.0_20
Laptop - 64bit OS - 64bit JVM - 64bit
Quad Core - 2.0 GHz
Ubuntu 9.10 Kernel 2.6.31
4 GB RAM

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|1|317925|title|10|87.38|75.42|{color:red}-13.7%{color}|
|log|1|317925|title|25|86.55|74.49|{color:red}-13.9%{color}|
|log|1|317925|title|50|90.49|71.90|{color:red}-20.5%{color}|
|log|1|317925|title|100|88.07|83.08|{color:red}-5.7%{color}|
|log|1|317925|title|500|76.67|54.34|{color:red}-29.1%{color}|
|log|1|317925|title|1000|69.29|38.54|{color:red}-44.4%{color}|
|log|all|100|title|10|109.01|92.78|{color:red}-14.9%{color}|
|log|all|100|title|25|108.30|89.43|{color:red}-17.4%{color}|
|log|all|100|title|50|107.19|85.86|{color:red}-19.9%{color}|
|log|all|100|title|100|94.84|80.25|{color:red}-15.4%{color}|
|log|all|100|title|500|78.84|49.10|{color:red}-37.7%{color}|
|log|all|100|title|1000|72.52|26.90|{color:red}-62.9%{color}|
|log|all|100|rand string|10|115.32|101.53|{color:red}-12.0%{color}|
|log|all|100|rand string|25|115.22|91.82|{color:red}-20.3%{color}|
|log|all|100|rand string|50|114.40|89.70|{color:red}-21.6%{color}|
|log|all|100|rand string|100|91.30|81.04|{color:red}-11.2%{color}|
|log|all|100|rand string|500|76.31|43.94|{color:red}-42.4%{color}|
|log|all|100|rand string|1000|67.33|28.29|{color:red}-58.0%{color}|
|log|all|100|country|10|115.40|101.46|{color:red}-12.1%{color}|
|log|all|100|country|25|115.06|92.15|{color:red}-19.9%{color}|
|log|all|100|country|50|114.03|90.06|{color:red}-21.0%{color}|
|log|all|100|country|100|99.30|80.07|{color:red}-19.4%{color}|
|log|all|100|country|500|75.64|43.44|{color:red}-42.6%{color}|
|log|all|100|country|1000|66.05|27.94|{color:red}-57.7%{color}|
|log|all|100|rand int|10|118.47|109.30|{color:red}-7.7%{color}|
|log|all|100|rand int|25|118.72|99.37|{color:red}-16.3%{color}|
|log|all|100|rand int|50|118.25|95.14|{color:red}-19.5%{color}|
|log|all|100|rand int|100|97.57|83.39|{color:red}-14.5%{color}|
|log|all|100|rand int|500|86.55|46.21|{color:red}-46.6%{color}|
|log|all|100|rand int|1000|78.23|28.94|{color:red}-63.0%{color}|




Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 9:25 PM, Mark Miller markrmil...@gmail.com wrote:

  the new API is much harder for the
  average user to use, and even for the experienced user, it's not
  terribly fun,
  and more importantly:

 Do we have enough info to support that though? In all the cases I have seen
 on the list, people have figured it out pretty easily - I haven't really
 seen any complaints in that regard (not counting you and John - that is
 two). The only other complaints I have noticed are from those who happened
 to count on unsupported behavior (e.g. people counting on no MultiSearcher
 use).


John, TomS and I all found it complex, and we're all pretty serious
users of inner lucene apis.

You see *core developers* saying the api seems fine.  Have you seen *any
users* of the new sorting api say anything positive about it?  Do you know
of *anyone* who has implemented the new comparator interface at all, let
alone *likes* it?

3 negative votes by users, in comparison to *zero* positive votes by users
together with a bunch of core developers saying, "yeah it looks easy, what
are you guys complaining about?".

Internal apis take a while to percolate out to the user base - we're only
the first few running into this, and while the sample size is small, it
shouldn't be discounted.

Yes, of course it is possible to migrate to the new APIs - which is what we,
as well as many others, were in the process of doing.  This is just an
example of an API which got more complex in going to 2.9, and unlike the
Collector API, it's possible that in this case it wasn't necessary for it
to be as complex as it became.

  -jake


[jira] Issue Comment Edited: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769055#action_12769055
 ] 

Mark Miller edited comment on LUCENE-1997 at 10/23/09 4:37 AM:
---

Hey John, did you pull from a wiki dump or use the random index?

*edit*

NM - that explains your shortened table - no wiki results - I got it.




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769055#action_12769055
 ] 

Mark Miller commented on LUCENE-1997:
-

Hey John, did you pull from a wiki dump or use the random index?




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769056#action_12769056
 ] 

Jake Mannix commented on LUCENE-1997:
-

Java6 is standard in production servers, since when?  What justified lucene 
staying java1.4 for so long if this is the case?  In my own experience, my last 
job only moved to java1.5 a year ago, and at my current company, we're still on 
1.5, and I've seen that be pretty common, and I'm in the Valley, where things 
update pretty quickly.




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769058#action_12769058
 ] 

Jake Mannix commented on LUCENE-1997:
-

I would say that of course results on Linux and Solaris should be weighted 
more highly than results on Macs, because while I love my Mac, I've yet to see a 
production cluster running on MacBook Pros... :)




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769059#action_12769059
 ] 

Yonik Seeley commented on LUCENE-1997:
--

bq. Java6 is standard in production servers, since when?

Maybe I'm wrong... it was just a guess. It's just what I've seen most 
customers deploying new projects on.

bq. What justified lucene staying java1.4 for so long if this is the case?

The decision of what JVM a business should use to deploy their new app is a 
very different one than what Lucene should require.
A minority of users may be justification enough to avoid requiring a new JVM... 
unless the benefits are really that huge.  Lucene does not target the JVM that 
most people will be deploying on - if that were the case, I have a feeling we'd 
be switching to Java6 instead of Java5.




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769060#action_12769060
 ] 

Mark Miller commented on LUCENE-1997:
-

Same system, Java 1.6.0_15

||Seg size||Query||Tot hits||Sort||Top N||QPS old||QPS new||Pct change||
|log|1|317925|title|10|105.46|97.11|{color:red}-7.9%{color}|
|log|1|317925|title|25|109.08|98.34|{color:red}-9.8%{color}|
|log|1|317925|title|50|108.01|93.99|{color:red}-13.0%{color}|
|log|1|317925|title|100|105.79|84.08|{color:red}-20.5%{color}|
|log|1|317925|title|500|91.12|50.28|{color:red}-44.8%{color}|
|log|1|317925|title|1000|80.51|33.59|{color:red}-58.3%{color}|
|log|all|100|title|10|113.89|105.39|{color:red}-7.5%{color}|
|log|all|100|title|25|113.14|102.13|{color:red}-9.7%{color}|
|log|all|100|title|50|111.30|96.51|{color:red}-13.3%{color}|
|log|all|100|title|100|86.77|83.86|{color:red}-3.4%{color}|
|log|all|100|title|500|78.00|42.15|{color:red}-46.0%{color}|
|log|all|100|title|1000|70.50|27.02|{color:red}-61.7%{color}|
|log|all|100|rand string|10|107.78|106.09|{color:red}-1.6%{color}|
|log|all|100|rand string|25|103.09|102.53|{color:red}-0.5%{color}|
|log|all|100|rand string|50|106.42|95.17|{color:red}-10.6%{color}|
|log|all|100|rand string|100|86.28|85.41|{color:red}-1.0%{color}|
|log|all|100|rand string|500|76.69|37.76|{color:red}-50.8%{color}|
|log|all|100|rand string|1000|68.48|22.95|{color:red}-66.5%{color}|
|log|all|100|country|10|103.36|106.79|{color:green}3.3%{color}|
|log|all|100|country|25|103.43|102.69|{color:red}-0.7%{color}|
|log|all|100|country|50|102.93|94.97|{color:red}-7.7%{color}|
|log|all|100|country|100|108.49|85.71|{color:red}-21.0%{color}|
|log|all|100|country|500|80.87|38.23|{color:red}-52.7%{color}|
|log|all|100|country|1000|67.24|22.79|{color:red}-66.1%{color}|
|log|all|100|rand int|10|120.59|112.03|{color:red}-7.1%{color}|
|log|all|100|rand int|25|119.80|107.49|{color:red}-10.3%{color}|
|log|all|100|rand int|50|119.96|98.84|{color:red}-17.6%{color}|
|log|all|100|rand int|100|88.58|89.24|{color:green}0.7%{color}|
|log|all|100|rand int|500|83.50|40.13|{color:red}-51.9%{color}|
|log|all|100|rand int|1000|74.80|23.83|{color:red}-68.1%{color}|





Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Mark Miller
Jake Mannix wrote:


 On Thu, Oct 22, 2009 at 9:25 PM, Mark Miller markrmil...@gmail.com wrote:

  the new API is much harder for the
  average user to use, and even for the experienced user, it's not
  terribly fun,
  and more importantly:

 Do we have enough info to support that though? In all the cases I have
 seen on the list, people have figured it out pretty easily - I haven't
 really seen any complaints in that regard (not counting you and John -
 that is two). The only other complaints I have noticed are from those
 who happened to count on unsupported behavior (e.g. people counting on
 no MultiSearcher use).


 John, TomS and I all found it complex, and we're all pretty serious
 users of inner lucene apis.

 You see *core developers* saying the api seems fine.  Have you seen
 *any users* of the new sorting api say anything positive about it?  Do
 you know of *anyone* who has implemented the new comparator interface
 at all, let alone *likes* it?

 3 negative votes by users, in comparison to *zero* positive votes by
 users together with a bunch of core developers saying, "yeah it looks
 easy, what are you guys complaining about?".

 Internal apis take a while to percolate out to the user base - we're
 only the first few running into this, and while the sample size is
 small, it shouldn't be discounted.

 Yes, of course it is possible to migrate to the new APIs - which is
 what we, as well as many others, were in the process of doing.  This is
 just an example of an API which got more complex in going to 2.9, and
 unlike the Collector API, it's possible that in this case it wasn't
 necessary for it to be as complex as it became.

   -jake

Yes - I've seen a handful of non core devs report back that they
upgraded with no complaints on the difficulty. It's in the mailing list
archives. The only core dev I've seen say it's easy is Uwe. He's super
sharp though, so I wasn't banking my comment on him ;)

-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Hudson build is back to normal: Lucene-trunk #987

2009-10-22 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/987/changes






Re: lucene 2.9 sorting algorithm

2009-10-22 Thread Jake Mannix
On Thu, Oct 22, 2009 at 9:58 PM, Mark Miller markrmil...@gmail.com wrote:

 Yes - I've seen a handful of non-core devs report back that they
 upgraded with no complaints on the difficulty. It's in the mailing list
 archives. The only core dev I've seen say it's easy is Uwe. He's super
 sharp though, so I wasn't banking my comment on him ;)


Upgrade custom sorting?  Where has anyone talked about this?

2.9 is great, I like the new apis, they're great in general.  It's just this
multi-segment sorting we're talking about here.

  -jake


[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769085#action_12769085 ]

Mark Miller commented on LUCENE-1997:
-

bq. Java6 is standard in production servers, since when?

bq. Maybe I'm wrong... it was just a guess. It's just what I've seen most 
customers deploying new projects on.

That's my impression too - Java 1.6 is mainly just a bug fix and performance
release and has been out for a while, so it's usually the choice I've seen.
Sounds like Uwe thinks it's more buggy though, so who knows if that's a good
idea :)

 Explore performance of multi-PQ vs single-PQ sorting API
 

 Key: LUCENE-1997
 URL: https://issues.apache.org/jira/browse/LUCENE-1997
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-1997.patch, LUCENE-1997.patch


 Spinoff from recent lucene 2.9 sorting algorithm thread on java-dev,
 where a simpler (non-segment-based) comparator API is proposed that
 gathers results into multiple PQs (one per segment) and then merges
 them in the end.
 I started from John's multi-PQ code and worked it into
 contrib/benchmark so that we could run perf tests.  Then I generified
 the Python script I use for running search benchmarks (in
 contrib/benchmark/sortBench.py).
 The script first creates indexes with 1M docs (based on
 SortableSingleDocSource, and based on wikipedia, if available).  Then
 it runs various combinations:
   * Index with 20 balanced segments vs index with the normal log
 segment size
   * Queries with different numbers of hits (only for wikipedia index)
   * Different top N
   * Different sorts (by title, for wikipedia, and by random string,
 random int, and country for the random index)
 For each test, 7 search rounds are run and the best QPS is kept.  The
 script runs singlePQ then multiPQ, records the resulting best QPS
 for each, and produces a table (in Jira format) as output.
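
 For readers following the thread, the multi-PQ idea described above can be
 sketched roughly as follows. This is only an illustration and not the code in
 LUCENE-1997.patch: the names MultiPqSketch, collectSegment and topN are made
 up, and plain ints stand in for sort values so that no Lucene classes are
 needed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: one bounded queue per segment, merged at the end.
class MultiPqSketch {

    /** Keeps the n smallest values of one segment in a max-heap bounded at n. */
    static PriorityQueue<Integer> collectSegment(int[] segmentValues, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(Collections.reverseOrder());
        if (n <= 0) {
            return pq; // nothing to keep
        }
        for (int v : segmentValues) {
            if (pq.size() < n) {
                pq.add(v);               // queue not full yet
            } else if (v < pq.peek()) {
                pq.poll();               // evict this segment's current worst hit
                pq.add(v);
            }
        }
        return pq;
    }

    /** Merges the per-segment queues and returns the overall n smallest values. */
    static List<Integer> topN(int[][] segments, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (int[] segment : segments) {
            candidates.addAll(collectSegment(segment, n));
        }
        // Cross-segment comparison happens only here, at merge time.
        Collections.sort(candidates);
        int keep = Math.max(0, Math.min(n, candidates.size()));
        return new ArrayList<>(candidates.subList(0, keep));
    }

    public static void main(String[] args) {
        int[][] segments = { {5, 1, 9}, {3, 8}, {2, 7, 4} };
        System.out.println(topN(segments, 3)); // prints [1, 2, 3]
    }
}
```

 Note that values from different segments are never compared during
 collection in this sketch, only in the final merge. That is the property the
 simpler comparator API relies on.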

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: lucene 2.9 sorting algorithm

2009-10-22 Thread John Wang
Hi Yonik:
I have been neck-deep in this, trying to find a good solution, for the
better part of the past two days. It's been hard because there are so many
variables:

1) how optimized the code is in each of the implementations
2) VM differences
3) hardware, etc.

Also, there are quite a few dimensions this issue is being discussed on:

Algorithm:

I think we should NOT jump to the conclusion that my numbers on the
multiQ are valid until others reproduce them (which is one of the reasons I
asked Mike to run his benchmark again with 1.5). I am going to try to run it
on server machines when I get back to my office next week.

Overall, I think the single Q algorithm is better (it does, however, pay
a price for some string compares etc.). Its benefit becomes more and more
significant as the product of PQ size and segment count increases, which
makes complete sense from the algorithm. However, when the PQ size is small
(which is most cases, so the multiplier on the segment count is also small),
the benefit is not as obvious, and sometimes the trade-off for the constant
string compare cost may not be worth it. (This remains a hypothesis.)
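
To make the "product of PQ size and segment count" point concrete, here is a
toy count of how many values each approach accepts into a queue. It is only a
sketch under simplifying assumptions: the names QueueWorkSketch, offer,
singleQueueInserts and multiQueueInserts are made up, ints stand in for sort
values, and the string compare and ord conversion costs on the single Q side
are not modeled at all.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical sketch: counts queue insertions, not real search cost.
class QueueWorkSketch {

    /** Offers v to a max-heap bounded at n; returns true if v was accepted. */
    static boolean offer(PriorityQueue<Integer> pq, int v, int n) {
        if (pq.size() < n) {
            pq.add(v);
            return true;
        }
        if (n > 0 && v < pq.peek()) {
            pq.poll();   // evict the current bottom
            pq.add(v);
            return true;
        }
        return false;
    }

    /** One queue shared by all segments: later segments compete with the global bottom. */
    static int singleQueueInserts(int[][] segments, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(Collections.reverseOrder());
        int inserts = 0;
        for (int[] segment : segments) {
            for (int v : segment) {
                if (offer(pq, v, n)) {
                    inserts++;
                }
            }
        }
        return inserts;
    }

    /** A fresh queue per segment: each segment's first n hits are always accepted. */
    static int multiQueueInserts(int[][] segments, int n) {
        int inserts = 0;
        for (int[] segment : segments) {
            PriorityQueue<Integer> pq = new PriorityQueue<>(Collections.reverseOrder());
            for (int v : segment) {
                if (offer(pq, v, n)) {
                    inserts++;
                }
            }
        }
        return inserts;
    }

    public static void main(String[] args) {
        int[][] segments = { {1, 2}, {3, 4}, {5, 6} };
        System.out.println(singleQueueInserts(segments, 2)); // prints 2
        System.out.println(multiQueueInserts(segments, 2));  // prints 6
    }
}
```

With a fresh queue per segment, every segment accepts at least
min(n, segment size) values, so the multi Q side grows with PQ size times
segment count. The shared queue can reject later hits against the global
bottom, but in the real comparators it pays for that with the extra compares
mentioned above.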

With Java 1.6, maybe the singleQ approach is a winner in all cases.

 I will spend more time looking for a more definitive answer.

API:

The new FieldComparator API is not difficult to understand (especially
for Lucene experts such as yourselves), but it is more involved than the
ScoreDocComparator API. I think anyone would agree with that. Furthermore,
some custom comparators (examples of which I have given earlier in this
thread) can be difficult to implement while maintaining performance.

I understand that changing an API is hard; that is why I am trying to raise
this as soon as possible, and it could very well be that the current API is
fine.

Lucene's Collector API allows anyone to plug in any sorting algorithm,
kinda like what Mike has done with the tests. So it is OK if the API selected
does not fit the needs of everyone.

In conclusion, please understand that I am not trying to be right on this,
just trying to learn and to understand, which I did from reading the code,
along with guidance from Mike and Yonik. I am more than impressed with the
thought and code tuning that went into it.

Thanks

-John

On Thu, Oct 22, 2009 at 8:30 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Oct 22, 2009 at 11:11 PM, Jake Mannix jake.man...@gmail.com
 wrote:
  It's hard to read the column format, but if you look up above in the
 thread
  from tonight,
  you can see that yes, for PQ sizes less than 100 elements, multiPQ is
  better, and only
  starts to be worse at around 100 for strings, and 50 for ints.

 Ah, OK, I had missed John's followup with the numbers.

 I assume this is for Java5 + optimizations?
 What does Java6 show?

 bq. Point 2 does indeed make a difference, we've seen it, and it's
 only fair: the single pq comparator does this branch optimization but
 the current patch multi-pq does not, so let's level the playing field.

 Of course - it's not about leveling the playing field, but finding the
 best solution for the average case - so everything should be optimized
 as much as possible.  There are probably further optimizations
 possible in both the single and multi PQ cases.

 My biggest reservation is that we've gone down the road of telling
 people to implement a new style of comparators, and told them that the
 old style comparators would be deleted in the next release (which is
 where we are).  Reversing that will be a bit of a headache/question...
 the new stuff isn't deprecated, and having *both* isn't desirable, but
 that's a separate decision to be made apart from performance testing.

 Is there also an option of using a multiPQ approach with the new style
 comparators?

 -Yonik
 http://www.lucidimagination.com





[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Mark Miller (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769088#action_12769088 ]

Mark Miller commented on LUCENE-1997:
-

John, what happened to your topn:100 results?




[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API

2009-10-22 Thread Yonik Seeley (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769089#action_12769089 ]

Yonik Seeley commented on LUCENE-1997:
--

There was a bad stretch in Java6... they plopped in a major JVM upgrade (not 
just bug fixes) and there were bugs.  I think that's been behind us for a 
little while now though.  If someone were starting a project today, I'd 
recommend the latest Java6 JVM.



