Re: Lucene's default settings & back compatibility

2009-05-21 Thread Shai Erera
>
> Your example confused me.


You're right. I wrote it with one eye closed already. I meant to say that if
I'm a 2.4 user and something gets deprecated in trunk (afterwards), it is
carried through 2.4.X and 2.5 and then removed in 2.6. So only 1 full minor
release.

> It's somewhat crazy, but what if we deprecate stuff and rename it?

I absolutely love that idea! But it means that:
1) We cannot support jar drop-in ability in those cases (which I'm fine with
because people can upgrade to 2.4.X to get bug fixes) not just because the
API does something different, but because it may not compile. For example,
the changes I'm doing in 1614 would have changed next() and skipTo()
signature, and so someone who wrote a DISI which has a next() that returns
boolean will fail to compile.
2) We give the deprecated API the mediocre names. (A funny thought: we can
give those methods/classes really stupid/nasty names, to emphasize the
beauty of the existing API, to encourage people to stick with the better API
:) ).
3) We document clearly what needs to be done in order to use the deprecated
API.
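A minimal sketch of the deprecate-and-rename idea from the list above. The
class and method names here are hypothetical stand-ins, not the real DISI API
or the actual LUCENE-1614 patch: the good name gets the new signature, and the
old boolean behavior moves to a deliberately clumsy, deprecated name.

```java
// Hypothetical stand-in for an iterator such as DISI; not Lucene's real class.
abstract class DocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** New-style API under the good name: returns the doc id, or NO_MORE_DOCS. */
    abstract int next();

    /** Old boolean-returning behavior, kept under a deliberately clumsy name. */
    @Deprecated
    boolean nextOldStyleBoolean() {
        return next() != NO_MORE_DOCS;
    }
}

// A user-side implementation written against the new signature.
class ArrayDocIterator extends DocIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIterator(int... docs) {
        this.docs = docs;
    }

    @Override
    int next() {
        if (pos + 1 < docs.length) {
            return docs[++pos];
        }
        return NO_MORE_DOCS;
    }
}
```

A subclass that still declares `boolean next()` would not compile against
this version, because Java does not allow an override to change the return
type from int to boolean. That is the loss of jar drop-in ability described
in point 1.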

One thing we didn't fully address here is methods added to
interfaces/abstract classes. When we add a method to an abstract class with
a default impl, that's ok. But what if we need to make it abstract (like we
had to do in 1575 for the Collector versions)?

I guess for interfaces we should first move all of them to abstract classes.
I like interfaces, but abstract classes give us slightly more freedom when
we face back-compat issues. Maybe to support Earwin's idea, we use the name
for a new abstract class, and give the interface a different name? That way
to upgrade people just need to change implements to extends (I hope that
won't cause any problems if their classes already extend something else).
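The difference between the two cases can be sketched as follows. The names
are hypothetical and are not Lucene's actual Collector API; the point is only
that a method added with a default body leaves existing subclasses untouched.

```java
// Hypothetical collector, released in version N with a single abstract method.
abstract class HitCollector {
    abstract void collect(int doc);

    // Added in version N+1 with a default body: subclasses written against
    // version N keep compiling and keep running against the new jar.
    boolean acceptsDocsOutOfOrder() {
        return false;
    }
}

// User code written against version N; it knows nothing about the new method.
class CountingCollector extends HitCollector {
    int count;

    @Override
    void collect(int doc) {
        count++;
    }
}
```

Had acceptsDocsOutOfOrder() been declared abstract instead, CountingCollector
would stop compiling until its author implemented it. An interface in the
Java versions Lucene targeted at the time offers no way to supply such a
default body at all.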

But if we apply this policy to interfaces, I think more users will need to
touch their code when upgrading even minor releases.

So Mike, about actsAsVersion ... I think I'm starting to get used to it. I
do relate to what Marvin writes though, about two different apps running in
the same JVM with different settings. We have such a case - two teams
develop two search solutions (for two back-ends). They live in the same JVM
but have different development plans/schedules. So it's not just a
hypothetical problem to me.

If we could have the app say something like
Version.getInstance(appId).actAsVersion(2.4), that would solve it because
each will have its own Id, and the Version class would maintain a map
between the Id and an instance. But I've still yet to resolve (in my mind)
how the Lucene code will use it, since the same code runs in two apps with
different IDs, and so won't know which appId to pass.
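A rough sketch of what such a per-application registry could look like. This
is purely illustrative: no such class exists in Lucene, and the "latest"
default and the String version values are assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-application version registry; not part of Lucene.
final class Version {
    // One Version instance per application id, created on first request.
    private static final Map<String, Version> INSTANCES = new ConcurrentHashMap<>();

    private volatile String actsAs = "latest"; // assumed default

    private Version() {
    }

    static Version getInstance(String appId) {
        return INSTANCES.computeIfAbsent(appId, id -> new Version());
    }

    Version actAsVersion(String version) {
        this.actsAs = version;
        return this;
    }

    String actsAs() {
        return actsAs;
    }
}
```

The registry itself is the easy part. The open question raised above remains:
code deep inside the library has no appId in hand, so it cannot tell which
entry of the map to consult.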

Oh well .. we're going to change the way those two teams work anyway, so for
me at least, this problem will be gone soon :)

I also agree that actsAsVersion breaks the locality principle: when you see
a bug, you should be able to look in the surroundings where the bug happened,
and not discover that it stems from a setting made files away. But I don't
like passing version information in the constructors either ...

What if we continue to pursue Marvin's proposal of saving that information
in the index? I think, Mike, that I asked you a similar question a while
ago, about whether Lucene has the ability to store index versions. Index
versions are important and can save some of the problems here - not just
with storing stopwords list, but also code that manipulates the index, or
makes decisions about scoring etc.

For the two apps in the same JVM it should solve the problem since I think we
can safely assume each operates on its own index.

Arggh .. but again we face the same problem - how do we pass that
information to the different classes? How is a TokenStream expected to read
that info?

I think we may have to settle on the static Version class, even if it will
read the information from the index (by doing some Version.init(File
indexDir)).

Shai

On Fri, May 22, 2009 at 1:53 AM, Marvin Humphrey wrote:

> On Thu, May 21, 2009 at 05:19:43PM -0400, Michael McCandless wrote:
>
> > Marvin, which solution would you prefer?
>
> Between the two, I'd prefer settings constructor arguments, though I would
> be inclined to have settings classes that are specific to individual
> classes rather than Lucene-wide.
>
> At least that scheme gets locality right.  The global actsAsVersion
> variable violates that principle and has the potential to saddle a small
> number of users who have done absolutely nothing wrong with bugs that are
> very, very hard to hunt down.  That's unfair.
>
> As far as analyzers and token streams, the theoretical answer is making
> indexes self-describing via serializable schemas, as discussed on the Lucy
> dev list, and as implemented in KinoSearch svn trunk.  With versioning
> metadata attached to the index, there is no longer any worry about
> upgrading analysis modules provided that those modules handle their own
> versioning correctly.

[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Wouter Heijke (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711933#action_12711933
 ] 

Wouter Heijke commented on LUCENE-1636:
---

I'm on holiday now, but as far as I recollect (as I was not the author of the 
code) it was done on some filters that would be used in another situation 
(similar to a filter) to use the filter's functionality. Also it was used with 
filters that could not be extended, so a new filter was created; here, too, the 
original filter's public methods would be called.

In a way it doesn't matter, it could be done with the api without any problems 
with the latest few releases that I know of.

A more elegant way if one would like to introduce this new behaviour is to at 
least log some kind of error message in the 2.9 release so users would be 
alarmed that they use the Lucene api in a way that is not supported anymore.


> TokenFilters with a null value in the constructor fail
> --
>
> Key: LUCENE-1636
> URL: https://issues.apache.org/jira/browse/LUCENE-1636
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.9
> Reporter: Wouter Heijke
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1636.patch
>
>
> While migrating from 2.4.x to 2.9-dev I found a lot of failing unittests.
> One problem is with TokenFilters that do a super(null) in the constructor.
> I fixed it by changing the constructor to super(new EmptyTokenStream())
> This will cause problems and frustration to others while migrating to 2.9.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711920#action_12711920
 ] 

Uwe Schindler edited comment on LUCENE-1636 at 5/21/09 6:53 PM:


Mike:
Would this affect backwards compatibility? If we make it final and nobody 
changes the stream, everything is OK. Is this also the case when using 
lucene.jar as a drop-in replacement without recompilation? Will changing a final 
variable from code compiled before finalization be detected by the JVM? Is 
the compiled code with final still binary compatible with code compiled against 
non-final members?
I think we should try this out before committing!




[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711920#action_12711920
 ] 

Uwe Schindler commented on LUCENE-1636:
---

Mike:
Would this affect backwards compatibility? If we make it final and nobody 
changes the stream, everything is OK. Is this also the case when using 
plugin-replacement without recompilation? Will changing a final variable from 
code compiled before finalization be detected by the JVM? Is the compiled 
code with final still binary compatible with code compiled against non-final 
members?
I think we should try this out before committing!




[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711919#action_12711919
 ] 

Uwe Schindler commented on LUCENE-1636:
---

Hi Wouter,
I still want to find out what you are trying to do with a TokenFilter without 
a delegate! Can you explain why you want to initialize with super(null)?
If it is because you want to change it later to something non-null, it will not 
work anymore (this is why I want to make the delegate stream final). So please 
explain!
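A minimal sketch of the situation under discussion, using hypothetical
stand-ins rather than Lucene's real classes. The explicit null check is an
assumption added for clarity and is not taken from the LUCENE-1636 patch; the
sketch only shows why a final delegate rules out the "pass null now, assign
later" pattern.

```java
import java.util.Locale;

// Hypothetical stand-ins for TokenStream/TokenFilter, for illustration only.
abstract class TokenStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();
}

final class EmptyTokenStream extends TokenStream {
    @Override
    String next() {
        return null;
    }
}

abstract class TokenFilter extends TokenStream {
    // final: the delegate cannot be swapped in after construction.
    protected final TokenStream input;

    protected TokenFilter(TokenStream input) {
        if (input == null) {
            // Assumed behavior for this sketch: fail early and clearly.
            throw new IllegalArgumentException("input must not be null");
        }
        this.input = input;
    }
}

class LowerCaseFilter extends TokenFilter {
    LowerCaseFilter(TokenStream input) {
        super(input);
    }

    @Override
    String next() {
        String token = input.next();
        return token == null ? null : token.toLowerCase(Locale.ROOT);
    }
}
```

With this shape, a filter that only wants to reuse another filter's public
methods can pass new EmptyTokenStream() to super(...), which is the
workaround described in the issue.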




[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Wouter Heijke (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711915#action_12711915
 ] 

Wouter Heijke commented on LUCENE-1636:
---

I only hope users will understand this and realize that 2.9 is not 
backwards compatible with previous versions! This code in our codebase was added 
by knowledgeable Lucene developers!




[jira] Issue Comment Edited: (LUCENE-1474) Incorrect SegmentInfo.delCount when IndexReader.flush() is used

2009-05-21 Thread Erik van Zijst (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711868#action_12711868
 ] 

Erik van Zijst edited comment on LUCENE-1474 at 5/21/09 4:47 PM:
-

I have attached the output of CheckIndex on all our index directories, which 
seems to report quite a few errors:

erik:cache ervzijst$ grep "CorruptIndexException\|AssertionError" CheckIndex.txt
java.lang.AssertionError: delete count mismatch: info=1263 vs BitVector=1262
java.lang.AssertionError: delete count mismatch: info=496 vs BitVector=493
java.lang.AssertionError: delete count mismatch: info=101 vs BitVector=100
java.lang.AssertionError: delete count mismatch: info=300 vs BitVector=298
java.lang.AssertionError: delete count mismatch: info=109 vs BitVector=108
java.lang.AssertionError: delete count mismatch: info=140 vs BitVector=139
java.lang.AssertionError: delete count mismatch: info=122 vs BitVector=121
java.lang.AssertionError: delete count mismatch: info=91 vs BitVector=89
java.lang.AssertionError: delete count mismatch: info=1411 vs BitVector=1409
java.lang.AssertionError: delete count mismatch: info=801 vs BitVector=800
java.lang.AssertionError: delete count mismatch: info=630 vs BitVector=629
java.lang.AssertionError: delete count mismatch: info=510 vs BitVector=508
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _0: fieldsReader shows 12365 but segmentInfo shows 12232
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _1: fieldsReader shows 10144 but segmentInfo shows 8766
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _2: fieldsReader shows 4616 but segmentInfo shows 7006
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _3: fieldsReader shows 6681 but segmentInfo shows 4854
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _4: fieldsReader shows 2652 but segmentInfo shows 8808
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _5: fieldsReader shows 11500 but segmentInfo shows 14551
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _6: fieldsReader shows 16225 but segmentInfo shows 4375
erik:cache ervzijst$

P.S.
The full log is attached as CheckIndex.txt


> Incorrect SegmentInfo.delCount when IndexReader.flush() is used
> ---
>
> Key: LUCENE-1474
> URL: https://issues.apache.org/jira/browse/LUCENE-1474
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.4
> Reporter: Marcel Reutegger
> Assignee: Michael McCandless
> Fix For: 2.4.1, 2.9
>
>

[jira] Commented: (LUCENE-1474) Incorrect SegmentInfo.delCount when IndexReader.flush() is used

2009-05-21 Thread Erik van Zijst (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711868#action_12711868
 ] 

Erik van Zijst commented on LUCENE-1474:


I have attached the output of CheckIndex on all our index directories, which 
seems to report quite a few errors:

erik:cache ervzijst$ grep "CorruptIndexException\|AssertionError" CheckIndex.txt
java.lang.AssertionError: delete count mismatch: info=1263 vs BitVector=1262
java.lang.AssertionError: delete count mismatch: info=496 vs BitVector=493
java.lang.AssertionError: delete count mismatch: info=101 vs BitVector=100
java.lang.AssertionError: delete count mismatch: info=300 vs BitVector=298
java.lang.AssertionError: delete count mismatch: info=109 vs BitVector=108
java.lang.AssertionError: delete count mismatch: info=140 vs BitVector=139
java.lang.AssertionError: delete count mismatch: info=122 vs BitVector=121
java.lang.AssertionError: delete count mismatch: info=91 vs BitVector=89
java.lang.AssertionError: delete count mismatch: info=1411 vs BitVector=1409
java.lang.AssertionError: delete count mismatch: info=801 vs BitVector=800
java.lang.AssertionError: delete count mismatch: info=630 vs BitVector=629
java.lang.AssertionError: delete count mismatch: info=510 vs BitVector=508
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _0: fieldsReader shows 12365 but segmentInfo shows 12232
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _1: fieldsReader shows 10144 but segmentInfo shows 8766
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _2: fieldsReader shows 4616 but segmentInfo shows 7006
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _3: fieldsReader shows 6681 but segmentInfo shows 4854
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _4: fieldsReader shows 2652 but segmentInfo shows 8808
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _5: fieldsReader shows 11500 but segmentInfo shows 14551
org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _6: fieldsReader shows 16225 but segmentInfo shows 4375
erik:cache ervzijst$


> Incorrect SegmentInfo.delCount when IndexReader.flush() is used
> ---
>
> Key: LUCENE-1474
> URL: https://issues.apache.org/jira/browse/LUCENE-1474
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.4
> Reporter: Marcel Reutegger
> Assignee: Michael McCandless
> Fix For: 2.4.1, 2.9
>
> Attachments: CheckIndex.txt, IndexReaderTest.java
>
>
> When deleted documents are flushed using IndexReader.flush() the delCount in 
> SegmentInfo is updated based on the current value and 
> SegmentReader.pendingDeleteCount (introduced by LUCENE-1267). It seems that 
> pendingDeleteCount is not reset after the commit, which means after a second 
> flush() or close() of an index reader the delCount in SegmentInfo is 
> incorrect. A subsequent IndexReader.open() call will fail with an error when 
> assertions are enabled. E.g.:
> java.lang.AssertionError: delete count mismatch: info=3 vs BitVector=2
>   at org.apache.lucene.index.SegmentReader.loadDeletedDocs(SegmentReader.java:405)
> [...]




[jira] Updated: (LUCENE-1474) Incorrect SegmentInfo.delCount when IndexReader.flush() is used

2009-05-21 Thread Erik van Zijst (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik van Zijst updated LUCENE-1474:
---

Attachment: CheckIndex.txt




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Marvin Humphrey
On Thu, May 21, 2009 at 05:19:43PM -0400, Michael McCandless wrote:

> Marvin, which solution would you prefer?

Between the two, I'd prefer settings constructor arguments, though I would be
inclined to have settings classes that are specific to individual classes
rather than Lucene-wide.  

At least that scheme gets locality right.  The global actsAsVersion variable
violates that principle and has the potential to saddle a small number of
users who have done absolutely nothing wrong with bugs that are very, very
hard to hunt down.  That's unfair.

As far as analyzers and token streams, the theoretical answer is making
indexes self-describing via serializable schemas, as discussed on the Lucy dev
list, and as implemented in KinoSearch svn trunk.  With versioning metadata
attached to the index, there is no longer any worry about upgrading analysis
modules provided that those modules handle their own versioning correctly.

For instance, in KS the Stopalizer always embeds the complete stoplist in the
schema file, so even if we update the "English" stoplist, we don't get invalid
search results for indexes which were created with the old stoplist.
Similarly, it may not be possible to keep around multiple variants of
Snowball, but at least we can fail catastrophically instead of subtly if we
detect that the Snowball version has changed.

Full-on schema serialization isn't feasible for Lucene, but attaching an
actsAsVersion variable to an index and feeding that to your analyzers would be
a decent start.
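As a rough illustration of that last suggestion, here is a sketch in which
the version travels with the index and is handed to the analyzer, rather than
living in a JVM-wide global. All names and the stop lists are invented for
the example; this is neither KinoSearch's schema mechanism nor any Lucene API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical metadata stored alongside an index.
final class IndexMetadata {
    final String createdWithVersion;

    IndexMetadata(String createdWithVersion) {
        this.createdWithVersion = createdWithVersion;
    }
}

// Hypothetical analyzer that picks its stop list from the index's version.
final class VersionedStopAnalyzer {
    private final Set<String> stopWords;

    VersionedStopAnalyzer(IndexMetadata meta) {
        // Invented rule for illustration: indexes built with "2.4" keep the
        // shorter stop list they were created with.
        if ("2.4".equals(meta.createdWithVersion)) {
            stopWords = new HashSet<>(Arrays.asList("a", "the"));
        } else {
            stopWords = new HashSet<>(Arrays.asList("a", "the", "of"));
        }
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }
}
```

Two applications in the same JVM, each with its own index, would then get
their own behavior without sharing any static state.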

Lastly, I think a major Java Lucene release is justified already.  Won't this
discussion die down somewhat if you can get 3.0 out?  If there are issues that
are half done, how about rolling back whatever's in the way?

Marvin Humphrey





[jira] Commented: (LUCENE-1646) QueryParser throws new exceptions even if custom parsing logic threw a better one

2009-05-21 Thread Trejkaz (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711850#action_12711850
 ] 

Trejkaz commented on LUCENE-1646:
-

Our improvements are (so far) specific to our subclass of QueryParser, in that 
we use it when getFieldQuery() gets a value which doesn't make sense for the 
given field.

So in a sense, in our case the query was parsed successfully by the parser, but 
the input was invalid within one of the fields.  As such our custom 
ParseException subclass has the field name and field value, but it isn't useful 
to the Lucene project as-is, as the only things throwing it are called from our 
subclass. :-(


> QueryParser throws new exceptions even if custom parsing logic threw a better 
> one
> -
>
> Key: LUCENE-1646
> URL: https://issues.apache.org/jira/browse/LUCENE-1646
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 2.4.1
> Reporter: Trejkaz
>
> We have subclassed QueryParser and have various custom fields.  When these 
> fields contain invalid values, we throw a subclass of ParseException which 
> has a more useful message (and also a localised message.)
> Problem is, Lucene's QueryParser is doing this:
> {code}
> catch (ParseException tme) {
>   // rethrow to include the original query:
>   throw new ParseException("Cannot parse '" + query + "': " + tme.getMessage());
> }
> {code}
> Thus, our nice and useful ParseException is thrown away, replaced by one with 
> no information about what's actually wrong with the query (it does append 
> getMessage() but that isn't localised.  And it also throws away the 
> underlying cause for the exception.)
> I am about to patch our copy to simply remove these four lines; the caller 
> knows what the query string was (they have to have a copy of it because they 
> are passing it in!) so having it in the error message itself is not useful.  
> Furthermore, when the query string is very big, what the user wants to know 
> is not that the whole query was bad, but which part of it was bad.
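One alternative to deleting the rethrow, sketched here with hypothetical
stand-in classes (this is not the change proposed in the issue, and not the
real QueryParser code): keep the wrapping message but attach the original
exception as the cause, so callers can still reach the more specific
subclass.

```java
// Hypothetical stand-ins; the real classes live in org.apache.lucene.queryParser.
class ParseException extends Exception {
    ParseException(String message) {
        super(message);
    }
}

// A custom subclass carrying the field-level detail described in the issue.
class FieldValueException extends ParseException {
    final String field;

    FieldValueException(String field, String value) {
        super("Invalid value '" + value + "' for field '" + field + "'");
        this.field = field;
    }
}

class SketchParser {
    // Simulates custom parsing logic that rejects a field value.
    private void parseInternal(String query) throws ParseException {
        throw new FieldValueException("date", query);
    }

    // Wraps the failure but keeps the original reachable via getCause().
    void parse(String query) throws ParseException {
        try {
            parseInternal(query);
        } catch (ParseException e) {
            ParseException wrapped =
                new ParseException("Cannot parse '" + query + "': " + e.getMessage());
            wrapped.initCause(e);
            throw wrapped;
        }
    }
}
```

A caller can then test e.getCause() instanceof FieldValueException and read
the field name from it, while the outer message still includes the query.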




RE: Lucene's default settings & back compatibility

2009-05-21 Thread Steven A Rowe
On 5/21/2009 at 7:17 AM, Michael McCandless wrote:
> OK so it sounds like we've boiled the proposal down to two concrete
> changes to the back-compat policy:
> 
>   1) Default settings can change; we will always choose defaults
>  based on "latest & greatest for new users".  This only
>  affects "runtime behavior".  EG in 2.9, when sorting by
>  field you won't get scores by default.  When we do this we
>  should clearly document the change, and what settings one
>  could use to get back to the old behavior, in CHANGES.txt.
> 
>   2) An API, once released as deprecated, is fair game to be
>  removed in the next minor release.
> 
> We still only make bug fixes on point releases, support the index
> file format until the next major release -- those don't change.

Contrasting the upgrade experience of existing "maintenance" users (i.e., users 
not using new Lucene features) under current policy with their experience under 
the above proposals:

Currently there are two upgrade experiences for these users: a) upgrading 
within the same major release; and b) major release upgrades.  

For a), the user reads CHANGES for back-compat exceptions, but otherwise has 
drop-in compatibility.  For b), the user performs two upgrades: first, just 
like in a), to the last minor release in the same major release, addressing all 
deprecation warnings; and second, to the major release, with drop-in 
compatibility, modulo CHANGES.

Here's the upgrade procedure under the above proposals, from version X.Y to X.Z:

1. Address all deprecation warnings against the currently used Lucene version 
(call it version X.Y[0]).

2. Upgrade to X.(++Y), addressing all deprecation warnings and checking CHANGES 
for exceptions to the back-compat policy, including mechanisms to maintain 
X.Y[0] defaults. 

3. Iterate #2 until Y==Z.
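The three steps above amount to a simple loop over minor versions. The
following sketch only spells that out; the class and the wording of the steps
are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the upgrade steps from X.Y to X.Z.
final class UpgradePath {
    static List<String> steps(int major, int fromMinor, int toMinor) {
        List<String> steps = new ArrayList<>();
        // Step 1: clean up against the version currently in use.
        steps.add("fix deprecation warnings against " + major + "." + fromMinor);
        // Steps 2 and 3: one upgrade per intermediate minor release.
        for (int minor = fromMinor + 1; minor <= toMinor; minor++) {
            steps.add("upgrade to " + major + "." + minor
                + ", fix deprecation warnings, check CHANGES");
        }
        return steps;
    }
}
```

For example, UpgradePath.steps(2, 4, 6) yields three steps, one for the
initial cleanup and one for each of 2.5 and 2.6, so no minor release can be
skipped.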

One consequence of these changes is that major version upgrades become the same 
as minor version upgrades, with the exception that index format support (and 
default settings support?) will potentially require attention.

Another consequence is that upgrade effort will no longer be amortizable.  
Currently, maintenance users can skip minor version upgrades with almost no 
penalty, and defer the upgrade pain to major release upgrades, since 
deprecation warnings can be safely ignored.  (Not advocating this practice, 
just noting that it's possible.)

Steve





Re: Lucene's default settings & back compatibility

2009-05-21 Thread Robert Muir
On Thu, May 21, 2009 at 5:55 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Thu, May 21, 2009 at 5:44 PM, Robert Muir  wrote:
> > and what if your analyzer needs a third-party library (or two)?
>
> In such cases the back-compat of your analyzer is your responsibility,
> right?


ICUCollationFilter is a simple example. just saying, pretend there was a
back-compat issue with that, maybe even specific to a certain Locale.
that could get nasty quick


-- 
Robert Muir
rcm...@gmail.com


Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 5:44 PM, Robert Muir  wrote:
> and what if your analyzer needs a third-party library (or two)?

In such cases the back-compat of your analyzer is your responsibility,
right?

> i mean this isn't unique to analyzers, if something changes/bug is fixed in
> the guts of some query/scorer that affects scoring in the slightest then
> thats a potential issue too, right?
>
> for a big index burying a result deep is effectively the same as the
> stopword example...

If it's a bug fix, or a change in order-of-operations causing slightly
different floating point truncations, we are free to make those fixes
(even under the current back-compat policy)?  Ie, nothing is changing
for those cases.

But, say we found some improvement to how Lucene does scoring, and by
and large it improves relevance so we want to do it.  New users should
see this benefit.  Back-compat users, I think, should be able to set
actsAsVersion to get back to the old scoring model.

So yeah I think I agree it's not just changes that affect what gets
indexed, but also changes that affect how scores are computed, where
we need a way to specify a back-compat version on upgrading.

I think we can't get away with only policy changes here... I think we
need actsAsVersion to preserve back-compat.

Mike




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Robert Muir
and what if your analyzer needs a third-party library (or two)?

i mean this isn't unique to analyzers, if something changes/bug is fixed in
the guts of some query/scorer that affects scoring in the slightest then
thats a potential issue too, right?

for a big index burying a result deep is effectively the same as the
stopword example...

On Thu, May 21, 2009 at 5:27 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Thu, May 21, 2009 at 5:19 PM, Earwin Burrfoot  wrote:
> >> Why not store an "actsAs" in the index, just for the changes that
> >> affect what's in the index?  Ie the index records the
> >> version that created it, and by default TokenStreams emulate their
> >> behavior as of that version?
> >
> > Because you don't always have access to index at the time you create
> > your TokenStreams?
>
> Such places would have to pass in their own actsAs when they ask the
> Analyzer for the tokenStream?
>
> Ie, the benefit of this approach vs the single global default is it'd
> be per-instance.
>
> Mike
>


-- 
Robert Muir
rcm...@gmail.com


Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 5:19 PM, Earwin Burrfoot  wrote:
>> Why not store an "actsAs" in the index, just for the changes that
>> affect what's in the index?  Ie the index records the
>> version that created it, and by default TokenStreams emulate their
>> behavior as of that version?
>
> Because you don't always have access to index at the time you create
> your TokenStreams?

Such places would have to pass in their own actsAs when they ask the
Analyzer for the tokenStream?

Ie, the benefit of this approach vs the single global default is it'd
be per-instance.
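
A rough Java sketch of the per-instance variant, where the caller hands its own actsAs to the analyzer when asking for tokens. All names here (CompatVersion, VersionedStopAnalyzer) are hypothetical and not Lucene API, and the behavior difference is assumed for illustration only: the old version drops the position gap left by a removed stop word, the new one keeps it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; these types are illustrative and are not Lucene classes.
enum CompatVersion { V24, V29 }

final class VersionedStopAnalyzer {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "the", "of"));

    private VersionedStopAnalyzer() {}

    // Returns "term@position" strings. Assumption for illustration: V24 discards
    // the position of removed stop words, V29 preserves the gap they leave.
    static List<String> tokenStream(String text, CompatVersion actsAs) {
        List<String> out = new ArrayList<>();
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // leading whitespace yields an empty first element
            }
            if (STOP_WORDS.contains(term)) {
                if (actsAs == CompatVersion.V29) {
                    position++; // keep the hole where the stop word was
                }
                continue;
            }
            out.add(term + "@" + position);
            position++;
        }
        return out;
    }
}
```

Two callers in the same JVM can then ask for different behavior without touching any global state, which is the point of per-instance over a single static default.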

Mike




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 1:59 PM, Marvin Humphrey  wrote:

> That bug has led to 'base' having a compromised reputation among elite users
> because of intermittent, inexplicable flakiness.  Is that what you want for
> Lucene?

While I agree a single global default is not great, I do think it's
the lesser of all evils here.

It really bothers me that our new users must wait so long (years) to
see improvements to our default settings, because we are so careful
about back-compat.

Marvin, which solution would you prefer?

Mike




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Earwin Burrfoot
> Why not store an "actsAs" in the index, just for the changes that
> affect what's in the index?  Ie the index records the
> version that created it, and by default TokenStreams emulate their
> behavior as of that version?
Because you don't always have access to index at the time you create
your TokenStreams?


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711810#action_12711810
 ] 

Michael McCandless commented on LUCENE-1436:


OK why don't we make both package private?  Are there other classes 
BooleanScorer/2 are using that should be made package private?

> Make ReqExclScorer package private, and use DocIdSetIterator for excluded 
> part.
> ---
>
> Key: LUCENE-1436
> URL: https://issues.apache.org/jira/browse/LUCENE-1436
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Trivial
> Attachments: LUCENE-1436.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 4:34 PM, Shai Erera  wrote:

> Changes to the index file formats need to be supported for 2 major releases. 
> I.e. 2.X indexes need to be read by 3.Y code, but not by 4.0.

Agreed.

> Method deprecations last for one full minor release.

Your example confused me.  I think if in 2.1 we deprecate a method,
then in 2.2 we can remove it?  Or are you saying it's not until 2.3 that
we can remove it (2 full minor releases)?

> What I don't like about actsAs, and perhaps I just don't understand the 
> proposal well, is that I'm not sure where it's added. Will it be added to 
> IndexWriter, which will pass it on to all the classes it will meet/use?

We would add, say, an oal.Versions class that has a static actsAs
method and static constants.  If back-compat is vital to your app
you'd do:

  Versions.actsAs(Version.LUCENE_24)

on upgrading to 2.9.  Else, you do nothing to get "latest &
greatest".

You call this once in your app up front, and then use Lucene
normally.  Then, when IndexSearcher is asked to do field sorting, it
consults actsAs to decide whether it should do scoring or not.
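
To make the single-global-default idea concrete, here is a minimal Java sketch. Nothing in it exists in Lucene: the Versions and Version types only mirror the names used above, and FieldSortPolicy is a made-up stand-in for the check IndexSearcher would do. Which direction the scoring default flips is also an assumption for illustration.

```java
// Hypothetical sketch: neither Versions nor Version below is real Lucene API.
enum Version { LUCENE_24, LUCENE_29 }

final class Versions {
    // Single process-wide setting; "latest & greatest" unless the app opts out.
    private static volatile Version current = Version.LUCENE_29;

    private Versions() {}

    // Call once, up front, before using anything else.
    static void actsAs(Version v) {
        current = v;
    }

    static boolean onOrAfter(Version v) {
        return current.compareTo(v) >= 0;
    }
}

final class FieldSortPolicy {
    private FieldSortPolicy() {}

    // Assumption for illustration: releases before 2.9 computed scores
    // while sorting by field, and 2.9 stops doing so by default.
    static boolean scoreWhileSorting() {
        return !Versions.onOrAfter(Version.LUCENE_29);
    }
}

class ActsAsDemo {
    public static void main(String[] args) {
        System.out.println(FieldSortPolicy.scoreWhileSorting()); // false
        Versions.actsAs(Version.LUCENE_24);
        System.out.println(FieldSortPolicy.scoreWhileSorting()); // true
    }
}
```

The weakness of the static approach shows up right in the sketch: any library sharing the JVM sees whatever the last caller of actsAs chose.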

> Changes to default behaviors, bug fixes or improvements, that may compromise 
> the index structure or indexed data (such as InvalidAcronym) will last for at 
> least one major release, if not 2

I think this harms new users unnecessarily (ie, I'd rather do actsAs
than this).  I'd like StopFilter to not discard positional
information, fixes for bugs in StandardAnalyzer, and
a correction to the default stopwords list, to be immediately available
for new users on the next release.

Why not store an "actsAs" in the index, just for the changes that
affect what's in the index?  Ie the index records the
version that created it, and by default TokenStreams emulate their
behavior as of that version?

Mike




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Earwin Burrfoot
Sounds like a good proposition.

There's one problem I'd like to address. Good names for
classes/members matter, and matter much. They directly affect how fast
a newcomer is able to understand a particular API, and how comfortably
you work with it once you do understand it. When we're
deprecating existing methods and adding new, 'better' ones, bad or
mediocre names replace good names in the parts of the code that are most
often used. And there's no way around it.

It's somewhat crazy, but what if we deprecate stuff and rename it? New
stuff gets best names, old stuff is still accessible and with a
"Migration Guide" it's easy to patch client code.

On Fri, May 22, 2009 at 00:34, Shai Erera  wrote:
> I thought we were actually on the track towards not introducing any Settings
> and/or actAs, but instead just change the policy?
>
> Can we agree on the following:
>
> * Changes to the index file formats need to be supported for 2 major
> releases. I.e. 2.X indexes need to be read by 3.Y code, but not by 4.0.
>
> * Method deprecations last for one full minor release. That is a deprecation
> in 2.X lasts through 2.X.1, 2.X+1 but is removed in 2.X+2. If all those X's
> are confusing --> a deprecation in 2.4 is kept in 2.4.X and 2.5, but we're
> free to remove it in 2.6.
>
> * Changes to default behaviors (whether they are bug fixes or improvements),
> where it only affects runtime code, and not the index structure or indexed
> data (such as the InvalidAcronym bug fix) are ok to go into any minor
> release, w/o deprecation - so long we're documenting the change in CHANGES
> along with some sample code on how to migrate easily.
>
> * Changes to default behaviors, bug fixes or improvements, that may
> compromise the index structure or indexed data (such as InvalidAcronym) will
> last for at least one major release, if not 2 (just like supporting file
> formats). The reason is - rebuilding indexes, besides that it might be a
> heavy process, is not often acceptable by the customers of those who develop
> search solutions. Therefore it may be out of our hands. Personally, I don't
> think those will happen a lot, but when they will we can choose between:
> (1) Deprecating a class entirely in favor of a new one, such that anyone who
> upgrades can still use that class
> (2) Introduce a static setter for that behavior, like for InvalidAcronym
> (3) Add a actsAs to that class only.
>
> Am I missing a back-compat issue?
>
> What I don't like about actsAs, and perhaps I just don't understand the
> proposal well, is that I'm not sure where it's added. Will it be added to
> IndexWriter, which will pass it on to all the classes it will meet/use?
>
> If I covered all the back-compat issues above, and we agree on them, then
> for the first 3 we just need to document them on the back-compat page, no
> code to develop.
>
> For the last one, if we choose to adopt (1) or (2), then we don't need to
> develop any mechanism up-front, but decide on a per-case basis what's the
> best alternative. For example, for the InvalidAcronym we could have
> deprecated that particular TokenFilter in favor of a new one and give a code
> example on how to create a TokenStream with that deprecated TokenFilter.
>
> Shai
>
> On Thu, May 21, 2009 at 10:55 PM, Jason Rutherglen
>  wrote:
>>
>> I'm having trouble visualizing the various methods people are talking
>> about.  It seems like we could open an issue and post patches with code
>> illustrating what each person is talking about?
>>
>> On Thu, May 21, 2009 at 10:02 AM, Michael McCandless
>>  wrote:
>>>
>>> Actually, we started with the *Settings classes (to hold defaults),
>>> but then realized a simple actsAsVersion (single static method) would
>>> suffice for just the back-compat settings and then pushed further and
>>> thought perhaps we should relax our back-compat policy entirely so
>>> emulating older versions is not needed.
>>>
>>> So we no longer have the "defaults" class (*Settings).  We may still
>>> do it for the future (for its own benefits), but for just back-compat
>>> of default settings, it seems like overkill.
>>>
>>> But I agree, the index altering cases are spooky.  I think this'd make
>>> me favor going back to the actsAsVersion option instead of the hard
>>> flip on our back compat policy (at least for default settings; for API
>>> changes I think 1 whole minor release may be reasonable).
>>>
>>> Mike
>>>
>>> On Thu, May 21, 2009 at 12:54 PM, Matthew Hall
>>>  wrote:
>>> > Sorry, I wasn't quite sure what to call this new class you guys have
>>> > been
>>> > talking about.
>>> >
>>> > I was referring to the class that's being discussed to encapsulate all
>>> > of
>>> > the defaults for a given lucene release.  (Its caching strategies etc
>>> > etc)
>>> >
>>> > I'm just not certain that something like a static list of words belongs
>>> > in a
>>> > higher level defaults class like you guys are talking about, especially
>>> > considering that anyone using a stop enabled analy

[jira] Created: (LUCENE-1653) Change DateTools to not create a Calendar in every call to dateToString or timeToString

2009-05-21 Thread Shai Erera (JIRA)
Change DateTools to not create a Calendar in every call to dateToString or 
timeToString
---

 Key: LUCENE-1653
 URL: https://issues.apache.org/jira/browse/LUCENE-1653
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.9


DateTools creates a Calendar instance on every call to dateToString and 
timeToString. Specifically:

# timeToString calls Calendar.getInstance on every call.
# dateToString calls timeToString(date.getTime()), which then instantiates a 
new Date(). I think we should change the order of the calls, or not have each 
call the other.
# round(), which is called from timeToString (after creating a Calendar 
instance) creates another (!) Calendar instance ...

Seems that if we synchronize the methods and create the Calendar instance once 
(static), it should solve it.
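
A minimal sketch of the "one static Calendar plus synchronized methods" idea, in Java. This is not the DateTools source and not the eventual patch: the class name SharedCalendarFormatter is made up, and the Resolution parameter and rounding are left out to keep it short.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch, not the actual DateTools code.
final class SharedCalendarFormatter {
    // One Calendar for the whole class. Calendar is not thread-safe, so
    // every access goes through a synchronized static method.
    private static final Calendar CAL =
        new GregorianCalendar(TimeZone.getTimeZone("GMT"), Locale.ROOT);

    private SharedCalendarFormatter() {}

    /** Formats epoch milliseconds as yyyyMMddHHmmssSSS in GMT. */
    static synchronized String timeToString(long time) {
        CAL.setTimeInMillis(time);
        return String.format(Locale.ROOT, "%04d%02d%02d%02d%02d%02d%03d",
            CAL.get(Calendar.YEAR),
            CAL.get(Calendar.MONTH) + 1, // Calendar months are zero-based
            CAL.get(Calendar.DAY_OF_MONTH),
            CAL.get(Calendar.HOUR_OF_DAY),
            CAL.get(Calendar.MINUTE),
            CAL.get(Calendar.SECOND),
            CAL.get(Calendar.MILLISECOND));
    }
}
```

The trade-off is allocation versus contention: no Calendar is created per call, but all callers now serialize on the class lock.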




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711791#action_12711791
 ] 

Shai Erera commented on LUCENE-1614:


{quote}
Are both new in 2.9?

Yes.
{quote}

Oh that's great - and here I was deprecating next(int) in favor of the new 
advance. I'll just delete it then.

bq. I'm going to open the follow-up issue to that

Opened LUCENE-1652 and copied what's relevant from this issue to there. If I 
missed something, please add it.

> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch, LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly
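
The description above can be sketched roughly as follows in Java. This is an illustration only, not the DocIdSetIterator source or the attached patch: SketchDocIdSetIterator and ArrayDocIdSetIterator are made-up names, and the -1 "no more docs" value follows the issue description as quoted here.

```java
// Hypothetical sketch of the idea; not Lucene code.
abstract class SketchDocIdSetIterator {
    // Old style: move, then report whether a doc is available.
    abstract boolean next();
    abstract boolean skipTo(int target);
    abstract int doc();

    // New style: default impls delegate to the old methods and return the
    // doc directly, so callers save the follow-up doc() call.
    int nextDoc() {
        return next() ? doc() : -1;
    }

    int advance(int target) {
        return skipTo(target) ? doc() : -1;
    }
}

final class ArrayDocIdSetIterator extends SketchDocIdSetIterator {
    private final int[] docs; // sorted ascending
    private int idx = -1;

    ArrayDocIdSetIterator(int... docs) {
        this.docs = docs;
    }

    @Override
    boolean next() {
        if (idx + 1 >= docs.length) {
            idx = docs.length; // stay parked past the end
            return false;
        }
        idx++;
        return true;
    }

    @Override
    boolean skipTo(int target) {
        // Always moves at least one doc forward, then scans to target.
        while (next()) {
            if (docs[idx] >= target) {
                return true;
            }
        }
        return false;
    }

    @Override
    int doc() {
        return docs[idx];
    }
}
```

A caller loop then becomes `for (int doc = it.nextDoc(); doc >= 0; doc = it.nextDoc()) { ... }`, which is the `>= 0` comparison mentioned in the last bullet.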




[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711792#action_12711792
 ] 

Shai Erera commented on LUCENE-1595:


Ok I'll make sure it's 1.4 compatible then.

> Split DocMaker into ContentSource and DocMaker
> --
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Mark Miller
> Fix For: 2.9
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split up to ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just the reuse of DocState). Also, other DocMakers do not use that DocState 
> today, something they could have gotten for free with this refactoring 
> proposed.
> So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
> instance and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.
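
The proposed split might look roughly like this in Java. Only the names ContentSource, DocMaker, getNextDocData and makeDocument come from the description; the signatures, the LineContentSource and SimpleDocMaker classes, and the use of a plain Map in place of Lucene's Document are assumptions made to keep the sketch self-contained.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interfaces sketched from the description; not the benchmark API.
interface ContentSource {
    /** Returns the next raw document body, or null when exhausted. */
    String getNextDocData();
}

interface DocMaker {
    /** Builds the next document, or returns null when the source is exhausted. */
    Map<String, String> makeDocument();
}

// Reads content one line per document; knows nothing about document creation.
final class LineContentSource implements ContentSource {
    private final Iterator<String> lines;

    LineContentSource(List<String> lines) {
        this.lines = lines.iterator();
    }

    @Override
    public String getNextDocData() {
        return lines.hasNext() ? lines.next() : null;
    }
}

// Creates documents; works unchanged with any ContentSource.
final class SimpleDocMaker implements DocMaker {
    private final ContentSource source;

    SimpleDocMaker(ContentSource source) {
        this.source = source;
    }

    @Override
    public Map<String, String> makeDocument() {
        String body = source.getNextDocData();
        if (body == null) {
            return null;
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("body", body);
        return doc;
    }
}
```

With this shape, a content source can also be timed on its own by calling getNextDocData in a loop, without paying for document creation.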




[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.

2009-05-21 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711793#action_12711793
 ] 

Paul Elschot commented on LUCENE-1436:
--

The reason to make things package private is to have more freedom to change 
things later on that are not really part of the public api now.

> Make ReqExclScorer package private, and use DocIdSetIterator for excluded 
> part.
> ---
>
> Key: LUCENE-1436
> URL: https://issues.apache.org/jira/browse/LUCENE-1436
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Trivial
> Attachments: LUCENE-1436.patch
>
>





[jira] Created: (LUCENE-1652) Enhancements to Scorers following the changes to DocIdSetIterator

2009-05-21 Thread Shai Erera (JIRA)
Enhancements to Scorers following the changes to DocIdSetIterator
-

 Key: LUCENE-1652
 URL: https://issues.apache.org/jira/browse/LUCENE-1652
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 3.0


In LUCENE-1614, we changed the semantics of DocIdSetIterator's methods to 
return a sentinel NO_MORE_DOCS (= Integer.MAX_VALUE) when the iterator is 
exhausted. Due to backward compatibility issues, we couldn't implement those 
semantics in doc(). Therefore this issue, which can only be introduced in 3.0, 
will:
# Implement the new semantics in all extending classes, such that doc() will 
return NO_MORE_DOCS when the iterator is exhausted.
# Change BooleanScorer to take advantage of that by removing sub.done from 
SubScorer and operate under the assumption that NO_MORE_DOCS is larger than any 
doc ID (Integer.MAX_VALUE).
# Change ConjunctionScorer to operate under the same assumptions and remove 
'more'.
# Change ReqExclScorer to not rely on reqScorer in doc(), since the latter may 
be null.
# Make more changes to ConjunctionScorer's init() and remove 'firstTime' to 
improve the performance of nextDoc(), score(), advance().
# Add start()/finish() to DISI?

A snippet from LUCENE-1614 regarding the change in BooleanScorer. Change this:

{code}
int doc = sub.done ? -1 : scorer.doc();
while (!sub.done && doc < end) {
  sub.collector.collect(doc);
  doc = scorer.nextDoc();
  sub.done = doc < 0;
}
{code}

To this:

{code}
int doc = scorer.doc();
while (doc < end) {
  sub.collector.collect(doc);
  doc = scorer.nextDoc();
}
{code}

And in ConjunctionScorer, change this:

{code}
while (more && (firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) 
{
  more = firstScorer.advance(lastDoc) >= 0;
  lastScorer = firstScorer;
  first = (first == (scorers.length-1)) ? 0 : first+1;
}
return more;
{code}

To this:

{code}
while ((firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) {
  firstScorer.advance(lastDoc);
  lastScorer = firstScorer;
  first = (first == (scorers.length-1)) ? 0 : first+1;
}
return lastDoc != DOC_SENTINEL;
{code}
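
Why the sentinel removes the extra flag can be seen in a small self-contained Java sketch. SentinelIterator and WindowCollector are toy classes invented here, not Lucene code; only the NO_MORE_DOCS = Integer.MAX_VALUE convention is taken from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy iterator; not Lucene code.
final class SentinelIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // sorted ascending
    private int idx = -1;

    SentinelIterator(int... docs) {
        this.docs = docs;
    }

    int nextDoc() {
        if (idx < docs.length) {
            idx++;
        }
        return doc();
    }

    // -1 before the first nextDoc(), NO_MORE_DOCS once exhausted.
    int doc() {
        if (idx < 0) {
            return -1;
        }
        return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }
}

final class WindowCollector {
    private WindowCollector() {}

    // Collects docs below 'end'. No 'done' flag is needed: an exhausted
    // iterator reports NO_MORE_DOCS, which fails 'doc < end' on its own.
    static List<Integer> collectUpTo(SentinelIterator it, int end) {
        List<Integer> out = new ArrayList<>();
        int doc = it.doc();
        if (doc < 0) {
            doc = it.nextDoc(); // position on the first doc
        }
        while (doc < end) {
            out.add(doc);
            doc = it.nextDoc();
        }
        return out;
    }
}
```

Calling collectUpTo repeatedly with growing 'end' values walks the iterator window by window, and the call after exhaustion simply returns an empty list.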




[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.

2009-05-21 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711790#action_12711790
 ] 

Paul Elschot commented on LUCENE-1436:
--

This should only affect external code that uses the ReqExclScorer class.
From the number of reactions on that so far I think such code probably does 
not exist.

The move to DocIdSetIterator (superclass of Scorer) can be done because no 
score values are used on the excluded part. This is only a consequence of the 
extraction of the DocIdSetIterator superclass from Scorer.

So I don't expect LUCENE-1614 or new semantics of DISI to collide with this.
Also, I don't expect LUCENE-1614 to collide with LUCENE-1345 that allows filter 
as clause to boolean query.

Some time ago DisjunctionSumScorer was made private in a similar way.

ReqOptSumScorer could also be made package private, perhaps even at the same 
time as ReqExclScorer.




> Make ReqExclScorer package private, and use DocIdSetIterator for excluded 
> part.
> ---
>
> Key: LUCENE-1436
> URL: https://issues.apache.org/jira/browse/LUCENE-1436
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Trivial
> Attachments: LUCENE-1436.patch
>
>





Re: Lucene's default settings & back compatibility

2009-05-21 Thread Shai Erera
I thought we were actually on track towards not introducing any Settings
and/or actsAs, but instead just changing the policy?

Can we agree on the following:

* Changes to the index file formats need to be supported for 2 major
releases. I.e. 2.X indexes need to be read by 3.Y code, but not by 4.0.

* Method deprecations last for one full minor release. That is a deprecation
in 2.X lasts through 2.X.1, 2.X+1 but is removed in 2.X+2. If all those X's
are confusing --> a deprecation in 2.4 is kept in 2.4.X and 2.5, but we're
free to remove it in 2.6.

* Changes to default behaviors (whether they are bug fixes or improvements),
where it only affects runtime code, and not the index structure or indexed
data (such as the InvalidAcronym bug fix) are ok to go into any minor
release, w/o deprecation - so long as we're documenting the change in CHANGES
along with some sample code on how to migrate easily.

* Changes to default behaviors, bug fixes or improvements, that may
compromise the index structure or indexed data (such as InvalidAcronym) will
last for at least one major release, if not 2 (just like supporting file
formats). The reason is that rebuilding indexes, besides being a heavy
process, is often not acceptable to the customers of those who develop
search solutions. Therefore it may be out of our hands. Personally, I don't
think those will happen a lot, but when they do we can choose between:
(1) Deprecating a class entirely in favor of a new one, such that anyone who
upgrades can still use that class
(2) Introduce a static setter for that behavior, like for InvalidAcronym
(3) Add an actsAs to that class only.
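
To pin down the deprecation window from the second bullet, here is a tiny Java sketch. DeprecationWindow is a made-up helper, not part of Lucene, and it assumes both releases share the same major version, since the rule as stated only talks about minor releases.

```java
// Hypothetical helper, not part of Lucene. Encodes the proposed rule: a method
// deprecated in X.Y stays through X.Y.* and X.(Y+1), and may go in X.(Y+2).
final class DeprecationWindow {
    private DeprecationWindow() {}

    /**
     * Returns true if an API deprecated in minor release deprecatedMinor may
     * already be gone in minor release targetMinor of the same major version.
     */
    static boolean mayBeRemoved(int deprecatedMinor, int targetMinor) {
        return targetMinor >= deprecatedMinor + 2;
    }
}
```

So `DeprecationWindow.mayBeRemoved(4, 5)` is false (deprecated in 2.4, still present in 2.5) and `DeprecationWindow.mayBeRemoved(4, 6)` is true.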

Am I missing a back-compat issue?

What I don't like about actsAs, and perhaps I just don't understand the
proposal well, is that I'm not sure where it's added. Will it be added to
IndexWriter, which will pass it on to all the classes it will meet/use?

If I covered all the back-compat issues above, and we agree on them, then
for the first 3 we just need to document them on the back-compat page, no
code to develop.

For the last one, if we choose to adopt (1) or (2), then we don't need to
develop any mechanism up-front, but decide on a per-case basis what's the
best alternative. For example, for the InvalidAcronym we could have
deprecated that particular TokenFilter in favor of a new one and give a code
example on how to create a TokenStream with that deprecated TokenFilter.

Shai

On Thu, May 21, 2009 at 10:55 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> I'm having trouble visualizing the various methods people are talking
> about.  It seems like we could open an issue and post patches with code
> illustrating what each person is talking about?
>
> On Thu, May 21, 2009 at 10:02 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Actually, we started with the *Settings classes (to hold defaults),
>> but then realized a simple actsAsVersion (single static method) would
>> suffice for just the back-compat settings and then pushed further and
>> thought perhaps we should relax our back-compat policy entirely so
>> emulating older versions is not needed.
>>
>> So we no longer have the "defaults" class (*Settings).  We may still
>> do it for the future (for its own benefits), but for just back-compat
>> of default settings, it seems like overkill.
>>
>> But I agree, the index altering cases are spooky.  I think this'd make
>> me favor going back to the actsAsVersion option instead of the hard
>> flip on our back compat policy (at least for default settings; for API
>> changes I think 1 whole minor release may be reasonable).
>>
>> Mike
>>
>> On Thu, May 21, 2009 at 12:54 PM, Matthew Hall
>>  wrote:
>> > Sorry, I wasn't quite sure what to call this new class you guys have
>> been
>> > talking about.
>> >
>> > I was referring to the class that's being discussed to encapsulate all
>> of
>> > the defaults for a given lucene release.  (Its caching strategies etc
>> etc)
>> >
>> > I'm just not certain that something like a static list of words belongs
>> in a
>> > higher level defaults class like you guys are talking about, especially
>> > considering that anyone using a stop enabled analyzer really should be
>> > familiar with this list, and oftentimes needs to override it.
>> >
>> > Meh, now that I'm actually typing it out though, perhaps I'm incorrect
>> here,
>> > assuming this class you guys are describing will be well
>> > advertised/documented maybe it will actually make it easier for end
>> > developers to twiddle around with this list, or at least certainly make
>> them
>> > more aware that its even something that they have the ability to
>> actually
>> > change.
>> >
>> > Matt
>> >
>> > Michael McCandless wrote:
>> >>
>> >> What is the "lucene defaults class"?
>> >>
>> >> Mike
>> >>
>> >> On Thu, May 21, 2009 at 12:37 PM, Matthew Hall
>> >>  wrote:
>> >>
>> >>>
>> >>> For extreme examples like this, couldn't the stopword list be
>> >>> encapsulated
>> >>> into a

[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711785#action_12711785
 ] 

Michael McCandless commented on LUCENE-1614:


bq. Are you sure about it? 

Yes.

bq. If so, then why test-tag failed on it?

Unfortunately, we came up with the idea of the back-compat branch after 2.4 was 
released, so we cut the branch at that point (in 2.9), so the back-compat 
branch does contain tests for early 2.9-only features. 

bq. Are both new in 2.9?

Yes.

{quote}
BTW, I'm going to open the follow-up issue to that, so we can discuss whatever 
improvements we want to make to the Scorers following the MAX_VAL sentinel 
there. Otherwise, they will get lost in this issue, and when we'll handle the 
follow-up one, we might not remember everything.
{quote}

Agreed!

> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch, LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly
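
A minimal sketch of the idea in the issue description, assuming hypothetical stand-in classes (SketchDocIdSetIterator, ArrayDocIdSetIterator and DisiSketch are illustrative names, not Lucene classes). The new methods get a default implementation that delegates to the old boolean-returning ones and returns -1 when the iterator is exhausted, so a caller needs one call per step and can test with '>= 0'.

```java
// Hypothetical stand-in for DocIdSetIterator; not the actual Lucene class.
abstract class SketchDocIdSetIterator {
    /** @deprecated old style: advance, then ask doc() separately. */
    @Deprecated
    abstract boolean next();

    /** @deprecated old style: skip to the first doc >= target. */
    @Deprecated
    abstract boolean skipTo(int target);

    abstract int doc();

    // Default implementations delegate to the old methods, so subclasses
    // written against the old API keep working. -1 means "no more docs".
    int nextDoc() {
        return next() ? doc() : -1;
    }

    int advance(int target) {
        return skipTo(target) ? doc() : -1;
    }
}

// Array-backed iterator over sorted doc ids, implementing only the old API.
class ArrayDocIdSetIterator extends SketchDocIdSetIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdSetIterator(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    @Override
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    @Override
    boolean skipTo(int target) {
        // Always moves at least one position forward.
        do {
            if (!next()) {
                return false;
            }
        } while (docs[pos] < target);
        return true;
    }

    @Override
    int doc() {
        return docs[pos];
    }
}

class DisiSketch {
    // Sums all doc ids using the single-call loop from the description.
    static int sumDocs(SketchDocIdSetIterator it) {
        int sum = 0;
        int doc;
        while ((doc = it.nextDoc()) >= 0) {
            sum += doc;
        }
        return sum;
    }
}
```

The loop in sumDocs never calls doc() itself; that saved call is the point of the proposal.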

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711783#action_12711783
 ] 

Shai Erera commented on LUCENE-1436:


I just hope this does not collide with LUCENE-1614. Note that in the follow-up 
issue to 1614 (the number is yet to be generated) I intend to change the logic 
in ReqExclScorer so that it conforms to the new semantics of DISI.

Why do we need to make this class package-private? Any particular reason 
besides not needing to worry about changes to back-compat in the future?

> Make ReqExclScorer package private, and use DocIdSetIterator for excluded 
> part.
> ---
>
> Key: LUCENE-1436
> URL: https://issues.apache.org/jira/browse/LUCENE-1436
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Paul Elschot
>Priority: Trivial
> Attachments: LUCENE-1436.patch
>
>





[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711782#action_12711782
 ] 

Shai Erera commented on LUCENE-1614:


bq. Oh, it turns out OBSI.nextDoc is new in 2.9!

Are you sure about it? If so, then why did test-tag fail on it? Notice that there 
are two methods, nextDoc() and next(int). Are both new in 2.9? If so, it means 
somebody added them to the tag, for some reason ...

BTW, I'm going to open the follow-up issue to that, so we can discuss whatever 
improvements we want to make to the Scorers following the MAX_VAL sentinel 
there. Otherwise, they will get lost in this issue, and when we'll handle the 
follow-up one, we might not remember everything.

bq. It's called - @Since

That's absolutely a great idea !




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Jason Rutherglen
I'm having trouble visualizing the various methods people are talking
about.  It seems like we could open an issue and post patches with code
illustrating what each person is talking about?

On Thu, May 21, 2009 at 10:02 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Actually, we started with the *Settings classes (to hold defaults),
> but then realized a simple actsAsVersion (single static method) would
> suffice for just the back-compat settings and then pushed further and
> thought perhaps we should relax our back-compat policy entirely so
> emulating older versions is not needed.
>
> So we no longer have the "defaults" class (*Settings).  We may still
> do it for the future (for its own benefits), but for just back-compat
> of default settings, it seems like overkill.
>
> But I agree, the index altering cases are spooky.  I think this'd make
> me favor going back to the actsAsVersion option instead of the hard
> flip on our back compat policy (at least for default settings; for API
> changes I think 1 whole minor release may be reasonable).
>
> Mike
>

[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711773#action_12711773
 ] 

Earwin Burrfoot commented on LUCENE-1614:
-

bq. Oh, it turns out OBSI.nextDoc is new in 2.9!
The phrase sounds all too familiar :)
There's one absolutely cool javadoc tag, which I suggest we start using for all 
user-visible classes and their members. It's called @since. Suddenly, 
everything that's not yet released (and that's a big bunch) is clearly marked 
as free for changes and amendments.
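
A small sketch of what that tagging could look like. SinceTagSketch is a hypothetical class, and the version numbers are only illustrative assumptions; the point is that a reader of the javadoc can see which members first appear in an unreleased version.

```java
/**
 * Hypothetical iterator, used only to show where the tag goes.
 *
 * @since 2.4
 */
class SinceTagSketch {
    private final int[] docs;
    private int pos = -1;

    SinceTagSketch(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    /**
     * Moves to the next doc and returns it, or -1 if there is none.
     *
     * @since 2.9 (illustrative: not yet released, so still free to change)
     */
    int nextDoc() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : -1;
    }
}
```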




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Earwin Burrfoot
> That bug has led to 'base' having a compromised reputation among elite users
> because of intermittent, inexplicable flakiness.  Is that what you want for
> Lucene?
While I agree with that point, Lucene already has lots and lots of
static configuration.
Having actsAsVersion won't add any new woes. Well, it won't remove them either.

I dislike both the single static variable approach and the Settings object you
have to pass into each and every constructor.
If I absolutely must choose, and have no right to just throw all these
back-compat crutches out of the window, I'd choose a single static
variable.

On a side note, it's amusing to see how the discussion is calming down
and will seemingly end in minor amendments to the policy, if any at
all :)

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Commented: (LUCENE-1436) Make ReqExclScorer package private, and use DocIdSetIterator for excluded part.

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711753#action_12711753
 ] 

Michael McCandless commented on LUCENE-1436:


Paul, this is technically a change in back-compat (since it's a public class, 
that this patch makes package private).  Are you thinking it's just very 
unlikely this is used by anything except BooleanScorer2?

We also have at least ReqOptSumScorer (and maybe others?) which is public.




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Marvin Humphrey
Mike McCandless:

> Well this is what I love about the actsAsVersion solution.  There's no
> pain for our back-compat users (besides the one-time effort to set
> actsAsVersion), and new users always get the best settings.

When some mad-as-hell user complains to this list after spending an inordinate
amount of time chasing down an action-at-a-distance bug because of this
insidious and irresponsible OO design decision, I intend to follow up their
email with an I-told-you-so.

There's an action-at-a-distance bug in the Perl core module 'base.pm' that
bedeviled people for years before I finally cornered it.  Turns out it can't
be fixed, but at least now we know what's happening:

http://rt.cpan.org/Public/Bug/Display.html?id=28799

While this error does not occur frequently in the wild, when it does, the
cost to the user is high because the debug path is obscure. I personally
encountered it after failing to wrap a "use_ok" test in a BEGIN block;
isolating it took me... longer than I would have liked. ;)

That bug has led to 'base' having a compromised reputation among elite users
because of intermittent, inexplicable flakiness.  Is that what you want for
Lucene?

Marvin Humphrey





[jira] Resolved: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1636.


   Resolution: Fixed
Fix Version/s: 2.9

> TokenFilters with a null value in the constructor fail
> --
>
> Key: LUCENE-1636
> URL: https://issues.apache.org/jira/browse/LUCENE-1636
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.9
>Reporter: Wouter Heijke
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1636.patch
>
>
> While migrating from 2.4.x to 2.9-dev I found a lot of failing unittests.
> One problem is with TokenFilters that do a super(null) in the constructor.
> I fixed it by changing the constructor to super(new EmptyTokenStream())
> This will cause problems and frustration to others while migrating to 2.9.




[jira] Commented: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711704#action_12711704
 ] 

Michael McCandless commented on LUCENE-1636:


I think we should change this in 2.9, for the reasons Uwe pointed out, to 
disallow changing the delegate after construction.
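
A rough sketch of what a fixed delegate means, using hypothetical stand-in classes (none of these are the Lucene TokenStream/TokenFilter classes, and this is not the attached patch). The filter's input is a final field set once in the constructor, and a filter with nothing real to wrap is given an empty stream instead of null, as in the reporter's workaround.

```java
import java.util.Locale;

// Hypothetical stand-in for a token stream.
abstract class SketchTokenStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();
}

// A stream with no tokens; passed instead of null.
final class EmptySketchTokenStream extends SketchTokenStream {
    @Override
    String next() {
        return null;
    }
}

// A stream over a fixed array of tokens.
final class ArraySketchTokenStream extends SketchTokenStream {
    private final String[] tokens;
    private int pos = 0;

    ArraySketchTokenStream(String... tokens) {
        this.tokens = tokens;
    }

    @Override
    String next() {
        return pos < tokens.length ? tokens[pos++] : null;
    }
}

// Filter whose delegate is fixed at construction time.
class SketchTokenFilter extends SketchTokenStream {
    protected final SketchTokenStream input; // final: cannot be swapped later

    SketchTokenFilter(SketchTokenStream input) {
        this.input = input;
    }

    @Override
    String next() {
        return input.next();
    }
}

// Example filter that lower-cases every token from its delegate.
class LowerCaseSketchFilter extends SketchTokenFilter {
    LowerCaseSketchFilter(SketchTokenStream input) {
        super(input);
    }

    @Override
    String next() {
        String token = input.next();
        return token == null ? null : token.toLowerCase(Locale.ROOT);
    }
}
```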




[jira] Assigned: (LUCENE-1636) TokenFilters with a null value in the constructor fail

2009-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1636:
--

Assignee: Michael McCandless




[jira] Commented: (LUCENE-1637) Getting an IndexReader from a committed IndexWriter

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711701#action_12711701
 ] 

Michael McCandless commented on LUCENE-1637:


Couldn't you simply call IW.getReader() just after calling IW.commit()?

Or just IndexReader.reopen?  (The extra work that reopen does vs 
IndexWriter.getReader is fairly minor -- the former must find & open the 
segments_N file; after that it's the same).

> Getting an IndexReader from a committed IndexWriter
> ---
>
> Key: LUCENE-1637
> URL: https://issues.apache.org/jira/browse/LUCENE-1637
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
>Reporter: Shay Banon
>
> I just had a look at the job done in IndexWriter in order to get an 
> IndexReader with all the current ongoing changes done using the IndexWriter. 
> This feature is very useful, and I was wondering if another feature, which (I 
> think) is simple to implement (compared to the previous one) might make 
> sense. 
>Many times, an application opens an IndexWriter, does whatever changes it 
> does, and then commits the changes. It would be nice to get an IndexReader 
> (read only one is fine) that corresponds to the committed (or even closed) 
> IndexWriter. This will allow for a cache of IndexReader that is already used 
> to be updated with a fresh IndexReader, without the need to reopen one (which 
> should be slower than opening one based on the IndexWriter information). The 
> main difference is the fact that the mentioned IndexReader could still be 
> reopened without the need to throw an AlreadyClosedException. 
>More information can be found here: 
> http://www.nabble.com/Getting-an-IndexReader-from-a-committed-IndexWriter-td23551978.html




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711695#action_12711695
 ] 

Michael McCandless commented on LUCENE-1614:


Oh, it turns out OBSI.nextDoc is new in 2.9!  So we are free to change it...




Re: Lucene's default settings & back compatibility

2009-05-21 Thread DM Smith

Michael McCandless wrote:
> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote:
>> even as simple as changing default stopword list for some analyzer could be
>> an issue, if the user doesn't re-index in response to that change.
>
> OK, right.
>
> So say we forgot to include "the" in the default English stopwords
> list (yes, an extreme example...).

"The" would be a bug fix. I think most users would expect that to be 
fixed. They might be willing, as I would be, to require all their 
indexes using that stopword list to be rebuilt.


How about a change that would be a bit more controversial, one that some 
would agree with and others would not?


I wonder how many people are creating metadata about indexes so that 
they can track when an index could/should/must be rebuilt? Some kind of 
"versioned tool chain info" for the index. If analyzers and filters can 
change output then it needs to be tracked.
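
One way such "versioned tool chain info" could be kept is sketched here, purely as an assumption: record a few properties next to the index when it is built and compare them with the current tool chain before reuse. The class name, the property keys and the stop-list identifiers are all hypothetical.

```java
import java.util.Properties;

final class ToolChainInfoSketch {
    // Describes the tool chain that produced (or would produce) an index.
    static Properties describe(String luceneVersion, String analyzerClass, String stopListId) {
        Properties p = new Properties();
        p.setProperty("lucene.version", luceneVersion);
        p.setProperty("analyzer.class", analyzerClass);
        p.setProperty("stoplist.id", stopListId);
        return p;
    }

    // Any difference means the analysis output may have changed,
    // so the index is flagged for a rebuild.
    static boolean mustRebuild(Properties stored, Properties current) {
        return !stored.equals(current);
    }
}
```

A real policy would likely be finer grained (could/should/must), but even an all-or-nothing check makes a silent change in analyzer output visible.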


-- DM





Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
Actually, we started with the *Settings classes (to hold defaults),
but then realized a simple actsAsVersion (single static method) would
suffice for just the back-compat settings and then pushed further and
thought perhaps we should relax our back-compat policy entirely so
emulating older versions is not needed.

So we no longer have the "defaults" class (*Settings).  We may still
do it for the future (for its own benefits), but for just back-compat
of default settings, it seems like overkill.

But I agree, the index altering cases are spooky.  I think this'd make
me favor going back to the actsAsVersion option instead of the hard
flip on our back compat policy (at least for default settings; for API
changes I think 1 whole minor release may be reasonable).

Mike

On Thu, May 21, 2009 at 12:54 PM, Matthew Hall
 wrote:
> Sorry, I wasn't quite sure what to call this new class you guys have been
> talking about.
>
> I was referring to the class that's being discussed to encapsulate all of
> the defaults for a given lucene release.  (Its caching strategies etc etc)
>
> I'm just not certain that something like a static list of words belongs in a
> higher level defaults class like you guys are talking about, especially
> considering that anyone using a stop enabled analyzer really should be
> familiar with this list, and oftentimes needs to override it.
>
> Meh, now that I'm actually typing it out though, perhaps I'm incorrect here,
> assuming this class you guys are describing will be well
> advertised/documented maybe it will actually make it easier for end
> developers to twiddle around with this list, or at least certainly make them
> more aware that its even something that they have the ability to actually
> change.
>
> Matt
>



Re: Lucene's default settings & back compatibility

2009-05-21 Thread Robert Muir
yeah, i was thinking the more likely case of where something like "teh" is
in the list...

On Thu, May 21, 2009 at 12:25 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Thu, May 21, 2009 at 12:19 PM, Robert Muir  wrote:
> > even as simple as changing default stopword list for some analyzer could be
> > an issue, if the user doesn't re-index in response to that change.
>
> OK, right.
>
> So say we forgot to include "the" in the default English stopwords
> list (yes, an extreme example...).
>
> Under the proposed changes 1 & 2 to back-compat policy, we would add
> "the" to the default stopword list, so new users get the fix, but
> still keep the the-less list accessible (deprecated).  We'd add an
> entry in CHANGES.txt saying this happened, and then show code on how
> to get back to the the-less stopword list.
>
> New users using that StopFilter would properly see "the" filtered out.
>  Users who upgraded would need to fix their code to switch back to the
> deprecated the-less list.
>
> Mike


-- 
Robert Muir
rcm...@gmail.com


Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 12:46 PM, DM Smith  wrote:
> I'm looking forward to the repackaging effort.

I'm looking forward to it too!  I can't wait for NumericRangeQuery...

But: someone with serious ant skill set, and some time, needs to get
the itch here and start iterating...

Mike




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 12:43 PM, Mark Miller  wrote:

> Hmmm - thats starting to sound nastier. Its another barrier to upgrading to
> a new jar. I have to monitor/hunt down and not miss all these little flags
> so that docs/terms don't disappear from my index? There is already some of
> that and I'd hate to see it grow. I'd rather have a stronger back compat
> policy here I think. Its just one thing after another stacking up to make
> upgrading a risk/pain at each jar update. I used to work on a project where
> I upgraded Lucene often, and it was because it was so easy to keep dropping
> in and picking new features as I wanted. We will really start pushing a
> heavy onus onto our users if we fully adopt 1 and 2. New users will benefit,
> but old users, unless they are Lucene hackers like you guys, will suffer.
>  Eventually our new users will be our old users.

Well this is what I love about the actsAsVersion solution.  There's no
pain for our back-compat users (besides the one-time effort to set
actsAsVersion), and new users always get the best settings.

Or... we could consider encoding "actsAsVersion" into the index by
default.  Then, when IndexWriter asks the Analyzer for a tokenStream,
it'd pass in the actsAsVersion, so that any tokenizers/filters in the
chain would preserve their behavior as of that Lucene version.  (You'd
have to be able to turn this off, too).
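
To make the two variants concrete, here is a minimal sketch under stated assumptions: SketchVersion and CompatSketch are hypothetical names, and the stop words reuse the "the" example from this thread. One overload reads a single static actsAsVersion setting; the other takes the version explicitly, which is roughly what passing it along the analysis chain would amount to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical version constants, ordered oldest to newest.
enum SketchVersion { V_2_4, V_2_9 }

final class CompatSketch {
    // Single static setting; new users get the newest behavior by default.
    private static volatile SketchVersion actsAs = SketchVersion.V_2_9;

    static void setActsAsVersion(SketchVersion v) {
        actsAs = v;
    }

    // Default stop words as of a given version: "the" only from 2.9 on.
    static Set<String> defaultStopWords(SketchVersion v) {
        Set<String> words = new HashSet<>(Arrays.asList("a", "an", "and"));
        if (v.compareTo(SketchVersion.V_2_9) >= 0) {
            words.add("the");
        }
        return words;
    }

    // Variant 1: behavior depends on the global static setting.
    static List<String> filter(List<String> tokens) {
        return filter(tokens, actsAs);
    }

    // Variant 2: the version is passed in explicitly.
    static List<String> filter(List<String> tokens, SketchVersion v) {
        Set<String> stop = defaultStopWords(v);
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stop.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }
}
```

With the static variant, a call to setActsAsVersion anywhere in the JVM changes what filter(tokens) returns everywhere else; the explicit variant avoids that at the cost of threading the version through every call.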

Mike




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Matthew Hall
Sorry, I wasn't quite sure what to call this new class you guys have 
been talking about.


I was referring to the class that's being discussed to encapsulate all 
of the defaults for a given lucene release.  (Its caching strategies etc 
etc)


I'm just not certain that something like a static list of words belongs 
in a higher level defaults class like you guys are talking about, 
especially considering that anyone using a stop enabled analyzer really 
should be familiar with this list, and oftentimes needs to override it.


Meh, now that I'm actually typing it out, perhaps I'm incorrect here. 
Assuming this class you guys are describing will be well 
advertised/documented, maybe it will actually make it easier for end 
developers to twiddle around with this list, or at least make them more 
aware that it's even something they have the ability to change.


Matt

Michael McCandless wrote:
> What is the "lucene defaults class"?
>
> Mike
>
> On Thu, May 21, 2009 at 12:37 PM, Matthew Hall wrote:
>> For extreme examples like this, couldn't the stopword list be encapsulated
>> into a single class that's used by the lucene defaults class.
>>
>> That way if you folks released updates to mostly static content like a
>> stopword list, new or old users could get it easily with a simple drop in
>> fix?
>>
>> Just my two cents.
>>
>> Matt
>>
>> Michael McCandless wrote:
>>> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote:
>>>> even as simple as changing default stopword list for some analyzer could be
>>>> an issue, if the user doesn't re-index in response to that change.
>>>
>>> OK, right.
>>>
>>> So say we forgot to include "the" in the default English stopwords
>>> list (yes, an extreme example...).
>>>
>>> Under the proposed changes 1 & 2 to back-compat policy, we would add
>>> "the" to the default stopword list, so new users get the fix, but
>>> still keep the the-less list accessible (deprecated).  We'd add an
>>> entry in CHANGES.txt saying this happened, and then show code on how
>>> to get back to the the-less stopword list.
>>>
>>> New users using that StopFilter would properly see "the" filtered out.
>>>  Users who upgraded would need to fix their code to switch back to the
>>> deprecated the-less list.
>>>
>>> Mike



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012






Re: Lucene's default settings & back compatibility

2009-05-21 Thread DM Smith

Michael McCandless wrote:
> On Thu, May 21, 2009 at 8:24 AM, DM Smith wrote:
>> On May 21, 2009, at 7:17 AM, Michael McCandless wrote:
>>>  1) Default settings can change; we will always choose defaults based
>>>    on "latest & greatest for new users".  This only affects "runtime
>>>    behavior".  EG in 2.9, when sorting by field you won't get scores
>>>    by default.  When we do this we should clearly document the
>>>    change, and what settings one could use to get back to the old
>>>    behavior, in CHANGES.txt.
>>
>> I'd reverse 1 and 2 and note in 1 that the old behavior might be deprecated.
>
> OK.
>
>>>  2) An API, once released as deprecated, is fair game to be removed
>>>    in the next minor release.
>>
>> I presume you mean that it will be present for at least one full minor
>> release. So, if at 3.1.5 a deprecation is introduced, then it won't be
>> removed until 3.3 at the earliest, because 3.2 was the first minor release
>> in which it appeared at the start. I don't think it is fair to expect users
>> to get every last point release.
>
> Right.
>
>>> We still only make bug fixes on point releases, support the index file
>>> format until the next major release -- those don't change.
>>
>> Is it just the index file format? I would hope that the behavior of filters,
>> analyzers and such would not change so as to invalidate an index.
>
> Can you give an example of such changes?  EG if we fix a bug in
> StandardAnalyzer, we will default it to fixed for new users and expect
> you on upgrading to read CHANGES.txt and change your app to set that
> setting to its non-defaulted value.

I guess I'm not too concerned with bug fixes. I'm kind of a nut when it 
comes to correctness. But, I'd want to know that such a bug broke strict 
backward compatibility. I guess I don't want backward compatibility to 
get too much in the way of fixing bugs. (I think sometimes it has.) I 
wouldn't expect a compatibility flag to preserve buggy behavior. I guess 
I'm willing to go to extra effort to work with bug fixes. But I wouldn't 
expect others to feel the same way.


Off the top of my head, in addition to Robert's stop word list, let's 
say that the filter that strips accents (I can't remember the name) is 
changed to be more than Latin-1 to ASCII folding. That would invalidate 
existing indexes.


Or a new and improved filter is created to replace a class I use and the 
old class is deprecated. If that old class goes away, my index is 
invalidated.


So if the stream of tokens out of an analyzer changes or the results of 
a filter are different, an index built with them is invalidated. If the 
output remains the same, I shouldn't care what has changed internally 
and probably don't care if the API has changed.


I don't know if it matters to this discussion, but there's a lot in 
contrib that people (of which I am one :) expect to be stable. I'm 
looking forward to the repackaging effort.


-- DM






Re: Lucene's default settings & back compatibility

2009-05-21 Thread Mark Miller

Michael McCandless wrote:
> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote:
>> even as simple as changing default stopword list for some analyzer could be
>> an issue, if the user doesn't re-index in response to that change.
>
> OK, right.
>
> So say we forgot to include "the" in the default English stopwords
> list (yes, an extreme example...).
>
> Under the proposed changes 1 & 2 to back-compat policy, we would add
> "the" to the default stopword list, so new users get the fix, but
> still keep the the-less list accessible (deprecated).  We'd add an
> entry in CHANGES.txt saying this happened, and then show code on how
> to get back to the the-less stopword list.
>
> New users using that StopFilter would properly see "the" filtered out.
>  Users who upgraded would need to fix their code to switch back to the
> deprecated the-less list.
>
> Mike

Hmmm - that's starting to sound nastier. It's another barrier to upgrading 
to a new jar. I have to monitor/hunt down and not miss all these little 
flags so that docs/terms don't disappear from my index? There is already 
some of that and I'd hate to see it grow. I'd rather have a stronger 
back compat policy here I think. It's just one thing after another 
stacking up to make upgrading a risk/pain at each jar update. I used to 
work on a project where I upgraded Lucene often, and it was because it 
was so easy to keep dropping in and picking new features as I wanted. We 
will really start pushing a heavy onus onto our users if we fully adopt 
1 and 2. New users will benefit, but old users, unless they are Lucene 
hackers like you guys, will suffer.  Eventually our new users will be 
our old users.


I'm fully on the fence. I think relaxing will help development, but 
Lucene's stability has also been a strong quality. It would be nice to 
see it remain in some form.


--
- Mark

http://www.lucidimagination.com







Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
What is the "lucene defaults class"?

Mike

On Thu, May 21, 2009 at 12:37 PM, Matthew Hall
 wrote:
> For extreme examples like this, couldn't the stopword list be encapsulated
> into a single class that's used by the lucene defaults class.
>
> That way if you folks released updates to mostly static content like a
> stopword list, new or old users could get it easily with a simple drop in
> fix?
>
> Just my two cents.
>
> Matt
>
> Michael McCandless wrote:
>>
>> On Thu, May 21, 2009 at 12:19 PM, Robert Muir  wrote:
>>
>>>
>>> even as simple as changing default stopword list for some analyzer could
>>> be
>>> an issue, if the user doesn't re-index in response to that change.
>>>
>>
>> OK, right.
>>
>> So say we forgot to include "the" in the default English stopwords
>> list (yes, an extreme example...).
>>
>> Under the proposed changes 1 & 2 to back-compat policy, we would add
>> "the" to the default stopword list, so new users get the fix, but
>> still keep the the-less list accessible (deprecated).  We'd add an
>> entry in CHANGES.txt saying this happened, and then show code on how
>> to get back to the the-less stopword list.
>>
>> New users using that StopFilter would properly see "the" filtered out.
>>  Users who upgraded would need to fix their code to switch back to the
>> deprecated the-less list.
>>
>> Mike
>>
>>
>>
>
>
> --
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> mh...@informatics.jax.org
> (207) 288-6012
>
>
>
>
>




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711682#action_12711682
 ] 

Michael McCandless commented on LUCENE-1614:


bq. If they are all on -1 to start with, they are already all sorted.

Right but that defeats the optimization.  I'm talking about this code in 
ConjunctionScorer:
{code}
Arrays.sort(scorers, new Comparator() { // sort the array
  public int compare(Object o1, Object o2) {
    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
  }
});

doNext();

// If first-time skip distance is any predictor of
// scorer sparseness, then we should always try to skip first on
// those scorers.
// Keep last scorer in its last place (it will be the first
// to be skipped on), but reverse all of the others so that
// they will be skipped on in order of original high skip.
int end=(scorers.length-1);
for (int i=0; i<(end>>1); i++) {
  Scorer tmp = scorers[i];
  scorers[i] = scorers[end-i-1];
  scorers[end-i-1] = tmp;
}
{code}

Ie it sets things up so that "typically" the rarest sub-scorer drives the 
intersection.  If they are all on -1 then this heuristic won't work.
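
To see what the quoted loop does, here is a small stand-alone sketch that
applies the same sort-then-reverse steps to plain ints standing in for each
sub-scorer's first doc id. SkipOrderSketch is a hypothetical name, and the
real code works on Scorer objects rather than ints:

```java
import java.util.Arrays;

/** Hypothetical illustration of the ordering heuristic; not Lucene code. */
final class SkipOrderSketch {
    /**
     * Sorts the first doc ids ascending, then reverses every element except
     * the last one, mirroring the loop in the quoted snippet.
     */
    static int[] skipOrder(int[] firstDocs) {
        int[] docs = firstDocs.clone();
        Arrays.sort(docs);
        int end = docs.length - 1;
        for (int i = 0; i < (end >> 1); i++) {
            int tmp = docs[i];
            docs[i] = docs[end - i - 1];
            docs[end - i - 1] = tmp;
        }
        return docs;
    }

    public static void main(String[] args) {
        // First docs 3, 40, 7, 15, 22 end up ordered 22, 15, 7, 3, 40.
        System.out.println(Arrays.toString(skipOrder(new int[] {3, 40, 7, 15, 22})));
        // With every scorer still on -1 the values carry no information.
        System.out.println(Arrays.toString(skipOrder(new int[] {-1, -1, -1})));
    }
}
```

When every entry is -1 the result is the same array of -1s, which is why the
heuristic has nothing to work with if iterators start unpositioned.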

{quote}
We could do some smart sorting in the constructor so that we skip in cheap and 
fast scorers first (TermScorers first, ordered by df, followed by simple 
conjunctions of terms, followed by other more expensive stuff like sloppy 
phrase queries and complex boolean queries). Perhaps in the future, even a 
method on Scorer that estimates its cost?
{quote}

Right, we'd need to do something along these lines if we switch DISI to start 
with doc() = -1.

> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch, LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly
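
A rough sketch of the proposal in the issue description, for readers who want
to see the shape of it. The class names below are hypothetical and heavily
simplified; only next(), skipTo(), doc(), nextDoc() and skipToDoc(int) come
from the description, and the default implementations simply delegate to the
old boolean methods as the summary suggests:

```java
/** Hypothetical, simplified iterator; mirrors the proposal, not Lucene's class. */
abstract class SketchDocIdSetIterator {
    /** Old style: move to the next doc and report whether one exists. */
    abstract boolean next();

    /** Old style: move to the first doc at or beyond target, past the current one. */
    abstract boolean skipTo(int target);

    /** Current doc id; only valid after a successful next() or skipTo(). */
    abstract int doc();

    /** Proposed variant: returns the doc id, or -1 when there are no more docs. */
    int nextDoc() {
        return next() ? doc() : -1;
    }

    /** Proposed variant of skipTo() that returns the doc id, or -1. */
    int skipToDoc(int target) {
        return skipTo(target) ? doc() : -1;
    }
}

/** Toy implementation over a sorted array of doc ids. */
final class ArrayDocIdSetIterator extends SketchDocIdSetIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdSetIterator(int[] sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    @Override
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    @Override
    boolean skipTo(int target) {
        while (next()) {
            if (docs[pos] >= target) {
                return true;
            }
        }
        return false;
    }

    @Override
    int doc() {
        return docs[pos];
    }
}
```

A caller can then loop with `int doc; while ((doc = it.nextDoc()) >= 0) { ... }`,
which is the '>= 0' comparison the last bullet refers to.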

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Lucene's default settings & back compatibility

2009-05-21 Thread Matthew Hall
For extreme examples like this, couldn't the stopword list be 
encapsulated into a single class that's used by the lucene defaults class?


That way if you folks released updates to mostly static content like a 
stopword list, new or old users could get it easily with a simple drop 
in fix?


Just my two cents.

Matt

Michael McCandless wrote:
> On Thu, May 21, 2009 at 12:19 PM, Robert Muir wrote:
>> even as simple as changing default stopword list for some analyzer could be
>> an issue, if the user doesn't re-index in response to that change.
>
> OK, right.
>
> So say we forgot to include "the" in the default English stopwords
> list (yes, an extreme example...).
>
> Under the proposed changes 1 & 2 to back-compat policy, we would add
> "the" to the default stopword list, so new users get the fix, but
> still keep the the-less list accessible (deprecated).  We'd add an
> entry in CHANGES.txt saying this happened, and then show code on how
> to get back to the the-less stopword list.
>
> New users using that StopFilter would properly see "the" filtered out.
>  Users who upgraded would need to fix their code to switch back to the
> deprecated the-less list.
>
> Mike



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mh...@informatics.jax.org
(207) 288-6012






[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-05-21 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711678#action_12711678
 ] 

Mark Miller commented on LUCENE-1595:
-

Right - the back compat for each contrib is completely up to that contrib. In 
the past though, anything that's 1.4 has stayed 1.4 unless there was a good 
reason to move, so that users are not jolted (there are probably more people 
out there using Java 1.4 than you might think).

On 3.0, when core goes 1.5, it will make sense to allow 1.5 in all the contribs 
that are 1.4 now.

> Split DocMaker into ContentSource and DocMaker
> --
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Mark Miller
> Fix For: 2.9
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split up to ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just the reuse of DocState). Also, other DocMakers do not use that DocState 
> today, something they could have gotten for free with this refactoring 
> proposed.
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
> instance and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.
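
A minimal sketch of the split described in the issue, assuming hypothetical,
much-simplified interfaces. The names ContentSourceSketch, DocMakerSketch,
LineSource, TitleBodyDocMaker and BenchmarkSketch are invented for
illustration, and a document is modelled as a plain field map rather than a
Lucene Document:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical: reads raw records and knows nothing about documents. */
interface ContentSourceSketch {
    /** Returns the next raw record, or null when the source is exhausted. */
    String nextRecord();
}

/** Hypothetical: turns one raw record into a document (here a field map). */
interface DocMakerSketch {
    Map<String, String> makeDocument(String record);
}

/** A source backed by a list of lines. */
final class LineSource implements ContentSourceSketch {
    private final Iterator<String> lines;

    LineSource(List<String> lines) {
        this.lines = lines.iterator();
    }

    public String nextRecord() {
        return lines.hasNext() ? lines.next() : null;
    }
}

/** Splits a "title<TAB>body" record into two fields. */
final class TitleBodyDocMaker implements DocMakerSketch {
    public Map<String, String> makeDocument(String record) {
        int tab = record.indexOf('\t');
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("title", tab < 0 ? record : record.substring(0, tab));
        doc.put("body", tab < 0 ? "" : record.substring(tab + 1));
        return doc;
    }
}

/** Any source can be paired with any maker. */
final class BenchmarkSketch {
    static List<Map<String, String>> run(ContentSourceSketch source, DocMakerSketch maker) {
        List<Map<String, String>> docs = new ArrayList<Map<String, String>>();
        String record;
        while ((record = source.nextRecord()) != null) {
            docs.add(maker.makeDocument(record));
        }
        return docs;
    }
}
```

Because the source is consumed through nextRecord() alone, it could also be
drained without any maker at all, which is the "perf test content sources
alone" case mentioned in the description.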




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 12:19 PM, Robert Muir  wrote:
> even as simple as changing default stopword list for some analyzer could be
> an issue, if the user doesn't re-index in response to that change.

OK, right.

So say we forgot to include "the" in the default English stopwords
list (yes, an extreme example...).

Under the proposed changes 1 & 2 to back-compat policy, we would add
"the" to the default stopword list, so new users get the fix, but
still keep the the-less list accessible (deprecated).  We'd add an
entry in CHANGES.txt saying this happened, and then show code on how
to get back to the the-less stopword list.

New users using that StopFilter would properly see "the" filtered out.
 Users who upgraded would need to fix their code to switch back to the
deprecated the-less list.
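
A rough sketch of what this could look like, and of why Robert's concern
matters. StopWordDefaults and its two constants are hypothetical stand-ins,
not Lucene's StopFilter or StopAnalyzer, and the word lists are shortened for
illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical holder for default stop word lists; not a Lucene class. */
final class StopWordDefaults {
    /** New default: includes "the". */
    static final Set<String> ENGLISH = words("a", "an", "and", "of", "the");

    /** Old default without "the", kept so existing indexes stay searchable. */
    @Deprecated
    static final Set<String> ENGLISH_WITHOUT_THE = words("a", "an", "and", "of");

    private static Set<String> words(String... w) {
        return new HashSet<String>(Arrays.asList(w));
    }

    /** Lower-cases the text, splits on whitespace, and removes stop words. */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<String>();
        for (String t : text.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }
}
```

An index built with the old list contains the term "the" for a document such
as "The Who". After the upgrade, analyzing the query "The" with the new
default yields no terms at all, so the query cannot match unless the
application switches back to the deprecated list or re-indexes.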

Mike




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711676#action_12711676
 ] 

Yonik Seeley commented on LUCENE-1614:
--

bq. But: wouldn't ConjunctionScorer still need an init() to sort its 
sub-scorers?

If they are all on -1 to start with, they are already all sorted.

We could do some smart sorting in the constructor so that we skip in cheap and 
fast scorers first (TermScorers first, ordered by df, followed by simple 
conjunctions of terms, followed by other more expensive stuff like sloppy 
phrase queries and complex boolean queries).  Perhaps in the future, even a 
method on Scorer that estimates its cost?



> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch, LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly




[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711674#action_12711674
 ] 

Michael McCandless commented on LUCENE-1595:


Probably it's best to stick w/ 1.4.  Someday, I hope, we will get to 3.0 :)

> Split DocMaker into ContentSource and DocMaker
> --
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Mark Miller
> Fix For: 2.9
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split up to ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just the reuse of DocState). Also, other DocMakers do not use that DocState 
> today, something they could have gotten for free with this refactoring 
> proposed.
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
> instance and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Robert Muir
even as simple as changing default stopword list for some analyzer could be
an issue, if the user doesn't re-index in response to that change.


> Can you give an example of such changes?  EG if we fix a bug in
> StandardAnalyzer, we will default it to fixed for new users and expect
> you on upgrading to read CHANGES.txt and change your app to set that
> setting to its non-defaulted value.
>
> Mike
>
>
>


-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711670#action_12711670
 ] 

Michael McCandless commented on LUCENE-1614:


We could also consider adding DISI.start (we discussed this under another 
issue).

And maybe likewise DISI.finish -- there's a question on the user's list now "Do 
TermDocs and TermEnum need to be closed?" that notes that DISI never gives one 
a chance to close.

> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch, LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 8:24 AM, DM Smith  wrote:
>
> On May 21, 2009, at 7:17 AM, Michael McCandless wrote:
>
>>  1) Default settings can change; we will always choose defaults based
>>    on "latest & greatest for new users".  This only affects "runtime
>>    behavior".  EG in 2.9, when sorting by field you won't get scores
>>    by default.  When we do this we should clearly document the
>>    change, and what settings one could use to get back to the old
>>    behavior, in CHANGES.txt.
>
> I'd reverse 1 and 2 and note in 1 that the old behavior might be deprecated.

OK.

>>  2) An API, once released as deprecated, is fair game to be removed
>>    in the next minor release.
>
> I presume you mean that it will be present for at least one full minor
> release. So, if at 3.1.5 a deprecation is introduced, then it won't be
> removed until 3.3 at the earliest, because 3.2 was the first minor release
> in which it appeared at the start. I don't think it is fair to expect users
> to get every last point release.

Right.
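
The rule agreed on here can be written down as a small calculation. This is
only a sketch: DeprecationPolicy is a hypothetical helper, not part of
Lucene, and it assumes versions of the form major.minor.point and ignores
major-version boundaries. A deprecation must be visible from the start of one
full minor release before it may be removed in the following one:

```java
/** Hypothetical helper illustrating the removal rule; not a Lucene API. */
final class DeprecationPolicy {
    /**
     * Returns the earliest "major.minor" release in which an API deprecated
     * in major.minor.point may be removed.
     */
    static String earliestRemoval(int major, int minor, int point) {
        // A deprecation that first ships in a point release has not yet been
        // present for a whole minor release, so the next minor is the first
        // full one.
        int firstFullMinor = (point == 0) ? minor : minor + 1;
        return major + "." + (firstFullMinor + 1);
    }

    public static void main(String[] args) {
        System.out.println(earliestRemoval(3, 1, 5)); // 3.3, as in the example above
        System.out.println(earliestRemoval(3, 2, 0)); // 3.3
    }
}
```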

>> We still only make bug fixes on point releases, support the index file
>> format until the next major release -- those don't change.
>
> Is it just the index file format? I would hope that the behavior of filters,
> analyzers and such would not change so as to invalidate an index.

Can you give an example of such changes?  EG if we fix a bug in
StandardAnalyzer, we will default it to fixed for new users and expect
you on upgrading to read CHANGES.txt and change your app to set that
setting to its non-defaulted value.

Mike




[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1651:
---

Fix Version/s: 2.9

> Make IndexReader.open() always return MSR to simplify (re-)opens.
> -
>
> Key: LUCENE-1651
> URL: https://issues.apache.org/jira/browse/LUCENE-1651
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Earwin Burrfoot
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1651.patch
>
>
> As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
> always return MSR, even for single-segment indexes.
> While theoretically valid in the past (if you make sure to keep your index 
> constantly optimized) this feature is made practically obsolete by 
> per-segment collection.
> The patch somewhat de-hairies (re-)open logic for MSR/SR.
> SR no longer needs an ability to pose as toplevel directory-owning IR.
> All related logic is moved from DIR to MSR.
> DIR becomes almost empty, and copying two or three remaining fields over to 
> MSR/SR, I remove it.
> Lots of tests fail, as they rely on SR returned from IR.open(), I fix by 
> introducing SR.getOnlySegmentReader static package-private method.
> Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
> (partially fixed in LUCENE-1648) is fixed in this patch. 




[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711668#action_12711668
 ] 

Michael McCandless commented on LUCENE-1651:


Excellent!  Thanks Earwin.

bq. test-tag fails due to no-longer-correct test assumptions. Should I checkout 
the branch and make a separate patch, or how is it done?

Yes, please make mods to the tags/XXX checkout, and include those mods in the 
patch.

> Make IndexReader.open() always return MSR to simplify (re-)opens.
> -
>
> Key: LUCENE-1651
> URL: https://issues.apache.org/jira/browse/LUCENE-1651
> Project: Lucene - Java
>  Issue Type: Task
>Affects Versions: 2.9
>Reporter: Earwin Burrfoot
>Assignee: Michael McCandless
> Attachments: LUCENE-1651.patch
>
>
> As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
> always return MSR, even for single-segment indexes.
> While theoretically valid in the past (if you make sure to keep your index 
> constantly optimized) this feature is made practically obsolete by 
> per-segment collection.
> The patch somewhat de-hairies (re-)open logic for MSR/SR.
> SR no longer needs an ability to pose as toplevel directory-owning IR.
> All related logic is moved from DIR to MSR.
> DIR becomes almost empty, and copying two or three remaining fields over to 
> MSR/SR, I remove it.
> Lots of tests fail, as they rely on SR returned from IR.open(), I fix by 
> introducing SR.getOnlySegmentReader static package-private method.
> Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
> (partially fixed in LUCENE-1648) is fixed in this patch. 




[jira] Assigned: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1651:
--

Assignee: Michael McCandless




Re: SegmentReader instantiation

2009-05-21 Thread DM Smith

Michael McCandless wrote:
> On Thu, May 21, 2009 at 10:53 AM, Earwin Burrfoot  wrote:
>
>>> I agree we should probably remove it, unless there are users relying
>>> on this.  Maintaining side-by-side sources is difficult with time.
>>
>> As I said in the initial message, this feature introduces no runtime
>> behaviour changes, so you can't really 'rely' on it and break if it's
>> removed.
>
> Well maybe someone loves the performance improvement... and took
> it further by making their own native code extensions.  I'm not
> sure how much these gains are.  But people can get quite crazy when
> it comes to performance :)
>
>>> Can you send an email to java-user to take a quick survey on whether
>>> anyone is somehow needing this?
>>
>> Never subscribed there. Too low signal-to-noise ratio. I can, but ..
>> is it a must? :)
>
> In fact I find many good ideas for improving Lucene come from our
> users, and one can't really understand what's important in Lucene
> without being grounded on how it's used.  "Development" and "using" go
> hand in hand.
>
> The discussions that take place there spawn still more ideas, and
> following those discussions causes me to think harder about the areas
> being discussed, so I learn more myself about Lucene and find
> more things to improve and ponder.
>
> Not to mention when there's a sneaky bug, it usually appears on the
> users list first.  I jump on those ;)
>
> So, yeah, I think it is a must.  It's likely nobody will respond after
> a few days, then we should remove gcj.
>
> I'll go ask if anyone is relying on gcj native code on java-user.


Fedora uses Lucene for Eclipse and uses gcj for Eclipse. It might be 
used elsewhere. Don't know if that means they need the gcj stuff in 
Lucene. I just wish they'd rework to use openjdk.


-- DM




[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-05-21 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-1651:


Attachment: LUCENE-1651.patch

Okay, here's the first patch, against latest trunk.
test-core, test-contrib pass. test-tag fails due to no-longer-correct test 
assumptions. Should I checkout the branch and make a separate patch, or how is 
it done?





Re: SegmentReader instantiation

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 10:53 AM, Earwin Burrfoot  wrote:

>> I agree we should probably remove it, unless there are users relying
>> on this.  Maintaining side-by-side sources is difficult with time.
>
> As I said in the initial message, this feature introduces no runtime
> behaviour changes, so you can't really 'rely' on it and break if it's
> removed.

Well maybe someone loves the performance improvement... and took
it further by making their own native code extensions.  I'm not
sure how much these gains are.  But people can get quite crazy when
it comes to performance :)

>> Can you send an email to java-user to take a quick survey on whether
>> anyone is somehow needing this?
> Never subscribed there. Too low signal-to-noise ratio. I can, but ..
> is it a must? :)

In fact I find many good ideas for improving Lucene come from our
users, and one can't really understand what's important in Lucene
without being grounded on how it's used.  "Development" and "using" go
hand in hand.

The discussions that take place there spawn still more ideas, and
following those discussions causes me to think harder about the areas
being discussed, so I learn more myself about Lucene and find
more things to improve and ponder.

Not to mention when there's a sneaky bug, it usually appears on the
users list first.  I jump on those ;)

So, yeah, I think it is a must.  It's likely nobody will respond after
a few days, then we should remove gcj.

I'll go ask if anyone is relying on gcj native code on java-user.

Mike




[jira] Created: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.

2009-05-21 Thread Earwin Burrfoot (JIRA)
Make IndexReader.open() always return MSR to simplify (re-)opens.
-

 Key: LUCENE-1651
 URL: https://issues.apache.org/jira/browse/LUCENE-1651
 Project: Lucene - Java
  Issue Type: Task
Affects Versions: 2.9
Reporter: Earwin Burrfoot


As per discussion in mailing list, I'm making DirectoryIndexReader.open() 
always return MSR, even for single-segment indexes.
While theoretically valid in the past (if you make sure to keep your index 
constantly optimized) this feature is made practically obsolete by per-segment 
collection.

The patch somewhat de-hairies (re-)open logic for MSR/SR.
SR no longer needs an ability to pose as toplevel directory-owning IR.
All related logic is moved from DIR to MSR.
DIR becomes almost empty, and copying two or three remaining fields over to 
MSR/SR, I remove it.
Lots of tests fail, as they rely on SR returned from IR.open(), I fix by 
introducing SR.getOnlySegmentReader static package-private method.
Some previous bugs are uncovered, one is fixed in LUCENE-1645, another 
(partially fixed in LUCENE-1648) is fixed in this patch. 
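
For readers following the test changes described above, here is a minimal sketch of what a getOnlySegmentReader-style helper could look like. Every type in it (Reader, Segment, MultiSegment) is a hypothetical stand-in invented for illustration; the real signature and behaviour of the method in the patch are not shown in this message, so treat this as an assumption about its intent: unwrap a top-level reader that holds exactly one segment.

```java
// All types here are hypothetical stand-ins, not Lucene's IndexReader,
// MultiSegmentReader or SegmentReader classes.
final class OnlySegmentSketch {

  static class Reader {
    // Sub-readers in index order, or null for an atomic (segment) reader.
    Reader[] subReaders() {
      return null;
    }
  }

  static final class Segment extends Reader {
    final String name;

    Segment(String name) {
      this.name = name;
    }
  }

  static final class MultiSegment extends Reader {
    private final Reader[] subs;

    MultiSegment(Reader... subs) {
      this.subs = subs;
    }

    @Override
    Reader[] subReaders() {
      return subs;
    }
  }

  private OnlySegmentSketch() {}

  // Returns the single segment behind a reader, or fails loudly so a test
  // that assumed a one-segment index does not silently pick the wrong one.
  static Segment getOnlySegmentReader(Reader reader) {
    if (reader instanceof Segment) {
      return (Segment) reader;
    }
    Reader[] subs = reader.subReaders();
    if (subs == null || subs.length != 1 || !(subs[0] instanceof Segment)) {
      throw new IllegalArgumentException("expected exactly one segment");
    }
    return (Segment) subs[0];
  }
}
```

A test that used to cast the result of open() to a segment reader would call the helper instead, and would get an exception rather than a ClassCastException when the index has more than one segment.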




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711653#action_12711653
 ] 

Michael McCandless commented on LUCENE-1614:


{quote}
On the other end of the scale... getting rid of "if (firstTime)" is another 
check I've long wanted to eliminate.
if doc() produced -1 the first time, before any calls to next() or skipTo(), we 
could get rid of the if (firstTime) code in ConjunctionScorer and others I 
think. The question is, would this be a burden to any scorers or DISI 
implementations?
{quote}

+1

Since we're changing DISI's semantics, now seems like a great time to make this 
change too.  Eliminating the "if (firstTime)" from next() would be great.

But: wouldn't ConjunctionScorer still need an init() to sort its sub-scorers?  
(Though, really, we ought to do that sort based on more accurate criteria, eg 
add a DIS.approxCount() (the first docID of each sub-scorer is an approximation 
that could easily be very wrong).  If we had that, then in the 
ConjunctionScorer's ctor we would do the ordering).
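
A rough sketch of the ordering idea in the parenthetical above, under stated assumptions: approxCount() is the method being proposed here and does not exist on DocIdSetIterator, the SubScorer interface is a made-up stand-in, and sorting sparsest-first is my assumption about the intended order. The point is only that with an up-front estimate the sort can happen once, at construction time, without advancing any sub-scorer.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: approxCount() is the method proposed in the comment
// above, not an existing DocIdSetIterator method.
final class ApproxCountOrderingSketch {

  interface SubScorer {
    String name();

    long approxCount(); // estimated number of matching docs
  }

  private ApproxCountOrderingSketch() {}

  // Returns a copy ordered sparsest-first (assumed ordering). Nothing is
  // advanced, so this could run in a constructor.
  static SubScorer[] orderByApproxCount(SubScorer... subs) {
    SubScorer[] sorted = subs.clone();
    Arrays.sort(sorted, Comparator.comparingLong(SubScorer::approxCount));
    return sorted;
  }

  // Small factory so the sketch can be exercised without real scorers.
  static SubScorer sub(final String name, final long approxCount) {
    return new SubScorer() {
      @Override
      public String name() {
        return name;
      }

      @Override
      public long approxCount() {
        return approxCount;
      }
    };
  }
}
```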

> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch, LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly




Re: SegmentReader instantiation

2009-05-21 Thread Earwin Burrfoot
2009/5/21 Michael McCandless :
> It looks like this was done in order to implement
> SegmentTermDocs.read(int[], int[]) natively, when using a gcj
> environment, since that gave performance improvements?
Yup, you're right. But something tells me, since Lucene 1.9 many
things changed and this is no longer relevant.

> I agree we should probably remove it, unless there are users relying
> on this.  Maintaining side-by-side sources is difficult with time.
As I said in the initial message, this feature introduces no runtime
behaviour changes, so you can't really 'rely' on it and break if it's
removed.

> Can you send an email to java-user to take a quick survey on whether
> anyone is somehow needing this?
Never subscribed there. Too low signal-to-noise ratio. I can, but ..
is it a must? :)

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: SegmentReader instantiation

2009-05-21 Thread Michael McCandless
It looks like this was done in order to implement
SegmentTermDocs.read(int[], int[]) natively, when using a gcj
environment, since that gave performance improvements?

I agree we should probably remove it, unless there are users relying
on this.  Maintaining side-by-side sources is difficult with time.

Also, this seems like something source code specialization
(LUCENE-1594) should cover.

Can you send an email to java-user to take a quick survey on whether
anyone is somehow needing this?

Mike

On Thu, May 21, 2009 at 10:17 AM, Earwin Burrfoot  wrote:
> Right now a set of system properties and Class.newInstance() is used
> to create SegmentReader.
>
> I've tracked down this code's origins to:
> r150531 | cutting | 2004-09-22 22:32:27 +0400 (Wed, 22 Sep 2004) | 2 lines
> Add GCJ native code for SegmentTermDocs.read(int[],int[]) to
> accellerate TermScorer.  TermScorer and BooleanScorer are now usually
> a bit faster under GCJ than under Sun's JVM.
>
> Anybody uses this? Let's just go ahead, scrap it, and construct
> classes using er.. constructors? :)
> No adequate docs for this feature exist, except a very brief mention
> in CHANGES.txt.
> Even more - when ReadonlySegmentReader was introduced, instantiation
> code was blindly copypasted, but no matching GCJReadonlySegmentReader
> was created and corresponding Makefile wasn't altered, thus
> GCJ-optimized Lucene is currently broken on trunk.
> Removal of this feature also won't break any theoretical old apps that
> magically happen to use it, because we'll simply ignore the property
> with zero change in runtime behaviour.
>
> --
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>




SegmentReader instantiation

2009-05-21 Thread Earwin Burrfoot
Right now a set of system properties and Class.newInstance() is used
to create SegmentReader.

I've tracked down this code's origins to:
r150531 | cutting | 2004-09-22 22:32:27 +0400 (Wed, 22 Sep 2004) | 2 lines
Add GCJ native code for SegmentTermDocs.read(int[],int[]) to
accellerate TermScorer.  TermScorer and BooleanScorer are now usually
a bit faster under GCJ than under Sun's JVM.

Does anybody use this? Let's just go ahead, scrap it, and construct
classes using er.. constructors? :)
No adequate docs for this feature exist, except a very brief mention
in CHANGES.txt.
Even more - when ReadonlySegmentReader was introduced, instantiation
code was blindly copypasted, but no matching GCJReadonlySegmentReader
was created and corresponding Makefile wasn't altered, thus
GCJ-optimized Lucene is currently broken on trunk.
Removal of this feature also won't break any theoretical old apps that
magically happen to use it, because we'll simply ignore the property
with zero change in runtime behaviour.
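
To make the two instantiation styles concrete, here is a minimal sketch. All names in it (ReaderFactorySketch, BasicReader, NativeReader, the sketch.reader.class property) are made up for illustration and are not Lucene's actual classes or property names. It also uses getDeclaredConstructor().newInstance() rather than the Class.newInstance() call mentioned above, since the latter is deprecated in current JDKs.

```java
// Hypothetical names throughout; this is not Lucene's SegmentReader code.
final class ReaderFactorySketch {

  static class BasicReader {
    String kind() {
      return "basic";
    }
  }

  static class NativeReader extends BasicReader {
    @Override
    String kind() {
      return "native";
    }
  }

  // Assumed property name, for illustration only.
  static final String IMPL_PROPERTY = "sketch.reader.class";

  private ReaderFactorySketch() {}

  // Property-driven style: the implementation class is looked up by name
  // at runtime and created reflectively through its no-arg constructor.
  static BasicReader openReflectively() {
    String name = System.getProperty(IMPL_PROPERTY, BasicReader.class.getName());
    try {
      return Class.forName(name)
          .asSubclass(BasicReader.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("cannot instantiate " + name, e);
    }
  }

  // Proposed style: a plain constructor call. The property is simply never
  // read, so setting it changes nothing.
  static BasicReader openDirectly() {
    return new BasicReader();
  }
}
```

The second method is the sense in which removal is harmless for old setups: an application that still sets the property gets the default reader instead of an error.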

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: DateTools performance

2009-05-21 Thread Michael McCandless
Yes, please fix :)

I think there may already be an issue open on the single instance /
synchronization / ThreadLocal issue.

Mike

On Thu, May 21, 2009 at 9:52 AM, Shai Erera  wrote:
> How much is DateTools in use? I noticed a couple of potential improvements
> to it, which at least for the benchmark package can improve performance:
>
> 1. timeToString calls Calendar.getInstance on every call? That's a very
> expensive call to make. Why not store it as a static member? We always call
> it with GMT timezone, and it reads internally the default Locale, so I don't
> think it will change when the JVM is up, unless someone calls
> Locale.setDefault() at some point.
>
> If we do this then we will need to make the method synchronized though
> ... but I don't think that's too critical.
>
> 2. dateToString calls timeToString(date.getTime()), which then instantiates
> a new Date(). Kind of wasteful, isn't it?
>
> 3. round(), which is called from timeToString (after creating a Calendar
> instance) creates another (!) Calendar instance ...
>
> I found one usage in QueryParser when it parses range queries and some more
> in the test package.
>
> I don't mind fixing those.
>
> Shai
>




[jira] Updated: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1614:
---

Attachment: LUCENE-1614.patch

MAX_VAL as sentinel + the documentation changes + a new entry to CHANGES 
"back-compat change" on OBSI.nextDoc() and next(int) + tag fixes.

All tests pass
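
For anyone skimming the thread, a minimal sketch of what a MAX_VAL sentinel buys. The class below is a hypothetical array-backed stand-in, not code from the patch, and the constant name NO_MORE_DOCS is my assumption; only the idea (return Integer.MAX_VALUE instead of -1 when exhausted) comes from the discussion.

```java
// Hypothetical stand-in for a DocIdSetIterator; class and constant names
// are assumptions for illustration, not the patch's actual code.
final class SortedIntIterator {

  // Assumed sentinel: larger than any valid doc ID.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // sorted ascending, no duplicates
  private int idx = -1;
  private int doc = -1;

  SortedIntIterator(int... docs) {
    this.docs = docs;
  }

  // Moves to the next doc and returns it, or NO_MORE_DOCS when exhausted.
  int nextDoc() {
    if (doc == NO_MORE_DOCS) {
      return doc;
    }
    idx++;
    doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    return doc;
  }

  // Moves to the first doc >= target that lies beyond the current one.
  int advance(int target) {
    while (nextDoc() < target) {
      // NO_MORE_DOCS is never < target, so exhaustion ends the loop
      // without a separate "has more" check.
    }
    return doc;
  }

  // Sums doc IDs just to show the single-comparison loop shape.
  static long sumAll(SortedIntIterator it) {
    long sum = 0;
    int d;
    while ((d = it.nextDoc()) != NO_MORE_DOCS) {
      sum += d;
    }
    return sum;
  }
}
```

With a -1 sentinel the loop in advance() would need an extra exhaustion test, because -1 compares as smaller than every target.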




DateTools performance

2009-05-21 Thread Shai Erera
How much is DateTools in use? I noticed a couple of potential improvements
to it, which at least for the benchmark package can improve performance:

1. timeToString calls Calendar.getInstance on every call? That's a very
expensive call to make. Why not store it as a static member? We always call
it with GMT timezone, and it reads internally the default Locale, so I don't
think it will change when the JVM is up, unless someone calls
Locale.setDefault() at some point.

If we do this then we will need to make the method synchronized though
... but I don't think that's too critical.

2. dateToString calls timeToString(date.getTime()), which then instantiates
a new Date(). Kind of wasteful, isn't it?

3. round(), which is called from timeToString (after creating a Calendar
instance) creates another (!) Calendar instance ...

I found one usage in QueryParser when it parses range queries and some more
in the test package.

I don't mind fixing those.
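
A minimal sketch of points 1 and 2, offered as one possible shape rather than the eventual fix. It swaps the synchronized static Calendar suggested above for a per-thread Calendar (ThreadLocal), which avoids the lock. CachedCalendarSketch and its methods are hypothetical names, not DateTools' API; it only formats to day resolution, and it pins Locale.ROOT instead of reading the default Locale.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not Lucene's DateTools: it only illustrates reusing
// one GMT Calendar per thread instead of calling Calendar.getInstance()
// on every conversion.
final class CachedCalendarSketch {

  private static final TimeZone GMT = TimeZone.getTimeZone("GMT");

  // One Calendar per thread, so no synchronization is needed.
  private static final ThreadLocal<Calendar> CALENDAR =
      ThreadLocal.withInitial(() -> Calendar.getInstance(GMT, Locale.ROOT));

  private CachedCalendarSketch() {}

  // Formats a timestamp as yyyyMMdd in GMT (day resolution only).
  static String timeToDayString(long timeMillis) {
    Calendar cal = CALENDAR.get();
    cal.setTimeInMillis(timeMillis);
    return String.format(Locale.ROOT, "%04d%02d%02d",
        cal.get(Calendar.YEAR),
        cal.get(Calendar.MONTH) + 1, // Calendar months are zero-based
        cal.get(Calendar.DAY_OF_MONTH));
  }

  // Same conversion for a Date; passes the millis through, so no
  // intermediate Date is allocated.
  static String dateToDayString(Date date) {
    return timeToDayString(date.getTime());
  }
}
```

The trade-off is one Calendar per thread that ever calls the method, against contention on a single shared instance.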

Shai


[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711615#action_12711615
 ] 

Shai Erera commented on LUCENE-1614:


I plan to open another issue for 3.0 to take advantage of MAX_VAL being 
returned from doc() also (we cannot rely on doc() returning MAX_VAL today when 
there are no more docs, hence why we need to wait with these changes until 3.0).

You're proposing to add another contract to doc() - to return -1 before 
nextDoc() and advance(int) were called. I can do that, but we can use this 
contract only in 3.0.

Unless the community decides to change back-compat policy starting with 2.9, 
which will give us the opportunity to take advantage of "latest & greatest" 
right away.




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711611#action_12711611
 ] 

Yonik Seeley commented on LUCENE-1614:
--

I'm warming to some of the simplifications that a MAX_VAL sentinel can bring.

On the other end of the scale... getting rid of "if (firstTime)" is another 
check I've long wanted to eliminate.
if doc() produced -1 the first time, before any calls to next() or skipTo(), we 
could get rid of the if (firstTime) code in ConjunctionScorer and others I 
think.  The question is, would this be a burden to any scorers or DISI 
implementations?
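
A small sketch of why a -1 starting position removes the need for a firstTime flag, assuming the MAX_VAL sentinel as well. The Iter class and nextMatch() below are hypothetical and far simpler than ConjunctionScorer; they only show that "one past the current doc" is a valid first target when doc() starts at -1.

```java
// Hypothetical two-way conjunction; not ConjunctionScorer's actual code.
final class ConjunctionSketch {

  // Assumed sentinel: larger than any valid doc ID.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static final class Iter {
    private final int[] docs; // sorted ascending, no duplicates
    private int idx = -1;
    private int doc = -1; // -1 before the first advance() call

    Iter(int... docs) {
      this.docs = docs;
    }

    int doc() {
      return doc;
    }

    // Moves to the first doc >= target; stays put if already there.
    int advance(int target) {
      while (doc < target) {
        idx++;
        doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
      }
      return doc;
    }
  }

  private ConjunctionSketch() {}

  // Returns the next doc present in both iterators, or NO_MORE_DOCS.
  // No firstTime flag: on fresh iterators a.doc() is -1, so the first
  // target is 0; on later calls it is one past the previous match.
  static int nextMatch(Iter a, Iter b) {
    if (a.doc() == NO_MORE_DOCS) {
      return NO_MORE_DOCS;
    }
    int da = a.advance(a.doc() + 1);
    int db = b.advance(da);
    while (da != db) {
      // Leapfrog: move whichever iterator is behind up to the other.
      if (da < db) {
        da = a.advance(db);
      } else {
        db = b.advance(da);
      }
    }
    return da; // equals NO_MORE_DOCS once either side is exhausted
  }
}
```

Whether this is a burden for real scorers depends on whether each implementation can cheaply report -1 before it is positioned, which is the open question raised above.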





[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711605#action_12711605
 ] 

Michael McCandless commented on LUCENE-1614:


bq. So Mike - does that mean I can change nextDoc() behavior in OBSI and 
document it?

OK let's tentatively go forward with that?




[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711604#action_12711604
 ] 

Michael McCandless commented on LUCENE-1648:


OK -- good catch!  I've reopened this (to remember this bug is present); let's 
fix this issue under "always use MSR".

> when you clone or reopen an IndexReader with pending changes, the new reader 
> doesn't commit the changes
> ---
>
> Key: LUCENE-1648
> URL: https://issues.apache.org/jira/browse/LUCENE-1648
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1648-followup.patch, LUCENE-1648-followup.patch, 
> LUCENE-1648.patch
>
>
> While working on LUCENE-1647, I came across this issue... we are failing to 
> carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader.
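
A toy illustration of the bug class described in the issue, under my own assumptions: PendingChangesSketch is a hypothetical class whose field names merely echo the flags mentioned above, and the "dropping" variant is my reading of the symptom, not the actual reader code.

```java
// Hypothetical model of a reader's pending-change flags; not Lucene code.
final class PendingChangesSketch {

  boolean hasChanges;
  boolean deletionsDirty;
  boolean normsDirty;

  // Shape of the bug as I read it: the copy starts clean, so a later
  // commit on the copy has nothing to write.
  PendingChangesSketch cloneDroppingState() {
    return new PendingChangesSketch();
  }

  // Shape of the fix: the copy inherits every pending-change flag.
  PendingChangesSketch cloneCarryingState() {
    PendingChangesSketch copy = new PendingChangesSketch();
    copy.hasChanges = hasChanges;
    copy.deletionsDirty = deletionsDirty;
    copy.normsDirty = normsDirty;
    return copy;
  }

  // True if a commit on this reader would have something to flush.
  boolean needsCommit() {
    return hasChanges || deletionsDirty || normsDirty;
  }
}
```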




[jira] Reopened: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1648:






[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711597#action_12711597
 ] 

Shai Erera commented on LUCENE-1614:


bq. I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code.

So Mike - does that mean I can change nextDoc() behavior in OBSI and document 
it?

> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly
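
The proposal above can be sketched roughly as follows. SketchDocIdIterator and ArrayDocIdIterator are hypothetical classes written for this illustration, not Lucene's DocIdSetIterator; the method names nextDoc() and skipToDoc(int) and the -1 sentinel are taken from the description.

```java
// Hypothetical, simplified iterator modeled on the proposal; not Lucene's
// actual DocIdSetIterator.
abstract class SketchDocIdIterator {
    // Old-style API: advance, then report whether a doc exists.
    abstract boolean next();
    abstract boolean skipTo(int target);
    abstract int doc();

    // Proposed variants: return the current doc, or -1 when exhausted.
    // The default implementations simply delegate to the old methods.
    int nextDoc() {
        return next() ? doc() : -1;
    }

    int skipToDoc(int target) {
        return skipTo(target) ? doc() : -1;
    }
}

// Iterates over a sorted array of doc IDs.
class ArrayDocIdIterator extends SketchDocIdIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdIterator(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    boolean next() {
        return ++pos < docs.length;
    }

    boolean skipTo(int target) {
        // Like next(), this always moves forward at least one position.
        while (next()) {
            if (docs[pos] >= target) {
                return true;
            }
        }
        return false;
    }

    int doc() {
        return docs[pos];
    }
}

class AdvanceSketch {
    // Consumes the iterator with the '(doc = ...) >= 0' loop from the summary.
    static int sum(SketchDocIdIterator it) {
        int total = 0;
        int doc;
        while ((doc = it.nextDoc()) >= 0) {
            total += doc;
        }
        return total;
    }
}
```

The caller needs one call per document instead of a next() followed by a doc(), which is the saving the description is after.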




[jira] Updated: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-1648:


Attachment: LUCENE-1648-followup.patch

And here's the fix.
The problem is that it's not elegant at all. We can scrap it and I can do the 
fix within the "always use MSR" patch; it looks way cleaner there.





Re: Lucene's default settings & back compatibility

2009-05-21 Thread DM Smith


On May 21, 2009, at 7:17 AM, Michael McCandless wrote:


 1) Default settings can change; we will always choose defaults based
on "latest & greatest for new users".  This only affects "runtime
behavior".  EG in 2.9, when sorting by field you won't get scores
by default.  When we do this we should clearly document the
change, and what settings one could use to get back to the old
behavior, in CHANGES.txt.


I'd reverse 1 and 2 and note in 1 that the old behavior might be  
deprecated.




 2) An API, once released as deprecated, is fair game to be removed
in the next minor release.


I presume you mean that it will be present for at least one full minor  
release. So, if at 3.1.5 a deprecation is introduced, then it won't be  
removed until 3.3 at the earliest, because 3.2 was the first minor  
release in which it appeared at the start. I don't think it is fair to  
expect users to get every last point release.


If so +1 from a user.
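
One way to read the rule above, as a small sketch. The class and method names are hypothetical, and the handling of a deprecation that ships exactly at X.Y.0 is an assumption on my part; major-version boundaries are ignored.

```java
// Hypothetical helper expressing "present for at least one full minor release".
class DeprecationWindow {
    // Returns the earliest minor version (same major) allowed to remove an API
    // whose deprecation first shipped in release <major>.<minor>.<point>.
    static int earliestRemovalMinor(int minor, int point) {
        // A deprecation introduced in a point release was not there at the
        // start of that minor, so the next minor is the first full one.
        int firstFullMinor = (point == 0) ? minor : minor + 1;
        return firstFullMinor + 1;
    }
}
```

For the example in the text, a deprecation introduced in 3.1.5 gives earliestRemovalMinor(1, 5) == 3, i.e. removal no earlier than 3.3.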



We still only make bug fixes on point releases and support the index file
format until the next major release -- those don't change.


Is it just the index file format? I would hope that the behavior of  
filters, analyzers and such would not change so as to invalidate an  
index.


-- DM





[jira] Updated: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-1648:


Attachment: LUCENE-1648-followup.patch

bq. Bad news is something is wrong w/ your patch, because MSR needs to carry 
over hasChanges/deletions/norms on clone().
Something is wrong with trunk.
Try the test in the patch. It is absolutely identical to yours, except it adds 
another doc to the index so that there are two segments and consequently an MSR.

> when you clone or reopen an IndexReader with pending changes, the new reader 
> doesn't commit the changes
> ---
>
> Key: LUCENE-1648
> URL: https://issues.apache.org/jira/browse/LUCENE-1648
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1648-followup.patch, LUCENE-1648.patch
>
>
> While working on LUCENE-1647, I came across this issue... we are failing to 
> carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader.




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711588#action_12711588
 ] 

Shalin Shekhar Mangar commented on LUCENE-1614:
---

bq. Perhaps the Solr guys can state then if and how much they mind this change? 
Before we start the journey of finding a different name for DISI.nextDoc(), 
just to deprecate OBSI.nextDoc()

I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code.





[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711586#action_12711586
 ] 

Shai Erera commented on LUCENE-1595:


BTW, am I allowed to use Java 5 generics in benchmark? Or should benchmark stay 
on 1.4 until 3.0 as well? I'm asking because I have heard a couple of times that 
contrib is allowed to move to Java 5.

> Split DocMaker into ContentSource and DocMaker
> --
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Shai Erera
> Assignee: Mark Miller
> Fix For: 2.9
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split up to ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just the reuse of DocState). Also, other DocMakers do not use that DocState 
> today, something they could have gotten for free with this refactoring 
> proposed.
> So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
> instance and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.
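
The split described above might look roughly like this. All types here are hypothetical sketches, not the benchmark package's classes: only the names ContentSource, DocMaker, getNextDocData and makeDocument come from the description, a Map stands in for a Lucene Document, and the code uses Java 5 generics and varargs purely for brevity.

```java
// Hypothetical interfaces sketching the proposed split.
interface SketchContentSource {
    // Returns the next raw document text, or null when the source is exhausted.
    String getNextDocData();
}

interface SketchDocMaker {
    // Builds a document (here just a field map) from the next piece of content,
    // or returns null when the source has nothing left.
    java.util.Map<String, String> makeDocument(SketchContentSource source);
}

// One content source: serves lines from memory. A file, bzip or gzip backed
// source would implement the same interface.
class LineContentSource implements SketchContentSource {
    private final String[] lines;
    private int next;

    LineContentSource(String... lines) {
        this.lines = lines;
    }

    public String getNextDocData() {
        return next < lines.length ? lines[next++] : null;
    }
}

// One doc maker: indexes the raw text as a single "body" field.
class BodyOnlyDocMaker implements SketchDocMaker {
    public java.util.Map<String, String> makeDocument(SketchContentSource source) {
        String data = source.getNextDocData();
        if (data == null) {
            return null;
        }
        java.util.Map<String, String> doc = new java.util.HashMap<String, String>();
        doc.put("body", data);
        return doc;
    }
}

// Another doc maker over the same kind of source: adds a "title" field taken
// from the first word of the text.
class TitledDocMaker implements SketchDocMaker {
    public java.util.Map<String, String> makeDocument(SketchContentSource source) {
        String data = source.getNextDocData();
        if (data == null) {
            return null;
        }
        java.util.Map<String, String> doc = new java.util.HashMap<String, String>();
        doc.put("body", data);
        doc.put("title", data.split(" ")[0]);
        return doc;
    }
}
```

Any source can be paired with any maker, and a source can be drained on its own to measure read cost without building documents.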




[jira] Updated: (LUCENE-1647) IndexReader.undeleteAll can mess up the deletion count stored in the segments file

2009-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1647:
---

Attachment: LUCENE-1647.patch

Attached patch w/ a test showing the issue, and a fix that makes the test pass.

I plan to commit in a day or two.

> IndexReader.undeleteAll can mess up the deletion count stored in the segments 
> file
> --
>
> Key: LUCENE-1647
> URL: https://issues.apache.org/jira/browse/LUCENE-1647
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1647.patch, LUCENE-1647.patch
>
>
> Spinoff from LUCENE-1474.  I'll attach a test case showing the issue.




[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711583#action_12711583
 ] 

Shai Erera commented on LUCENE-1595:


bq. Maybe make the seed an optional config? If it's not present, let it pick a 
random seed?

I already went ahead and did that, with one difference: when the seed is not 
present, I use a fixed default of 13 instead of drawing a random one every time. 
That way, runs are consistent and can be compared to each other.
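
A rough sketch of an optional seed with a fixed fallback of 13. The class name and the config key "random.seed" are hypothetical and do not come from the patch; java.util.Properties stands in for the benchmark's config object.

```java
// Hypothetical holder for one shared, reproducibly seeded Random.
class SeededRandomSketch {
    static final long DEFAULT_SEED = 13L; // fallback value mentioned above

    private final java.util.Random random;

    SeededRandomSketch(java.util.Properties config) {
        // "random.seed" is an assumed key name, used only for illustration.
        String value = config.getProperty("random.seed");
        long seed = (value == null) ? DEFAULT_SEED : Long.parseLong(value.trim());
        // Created once and reused, rather than once per document.
        this.random = new java.util.Random(seed);
    }

    int nextInt(int bound) {
        return random.nextInt(bound);
    }
}
```

Two runs without a configured seed draw the same sequence, which is what makes their results comparable.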





[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711580#action_12711580
 ] 

Shai Erera commented on LUCENE-1614:


bq. My guess is eg Solr probably relies heavily on OBSI.nextDoc returning -1

Perhaps the Solr guys can state then if and how much they mind this change? 
Before we start the journey of finding a different name for DISI.nextDoc(), 
just to deprecate OBSI.nextDoc() ...





[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711581#action_12711581
 ] 

Michael McCandless commented on LUCENE-1595:


bq. While changing SortableSingleDocMaker I noticed it creates a new Random() in 
getNextDocData(). Shouldn't that Random be created once? Also, I think it 
should be created with a seed?

OK, sharing a Random instance seems good.

Maybe make the seed an optional config?  If it's not present, let it pick a 
random seed?





[jira] Resolved: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1648.


Resolution: Fixed

> when you clone or reopen an IndexReader with pending changes, the new reader 
> doesn't commit the changes
> ---
>
> Key: LUCENE-1648
> URL: https://issues.apache.org/jira/browse/LUCENE-1648
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1648.patch
>
>
> While working on LUCENE-1647, I came across this issue... we are failing to 
> carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader.




[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711578#action_12711578
 ] 

Michael McCandless commented on LUCENE-1648:


{quote}
Or to be more exact, it fixed the tests that were failing for me.
Your two new tests fail as soon as I apply "always use MSR", because MSR does 
not carry over hasChanges/hasDeletions on clone().
{quote}
Hmm, OK so it's a good news / bad news situation.

Good news is patch resolved your original issue.  I'll commit it shortly.

Bad news is something is wrong w/ your patch, because MSR needs to carry over 
hasChanges/deletions/norms on clone().





Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
On Thu, May 21, 2009 at 7:21 AM, Shai Erera wrote:
> I thought that the index file format is supposed to be supported until the
> 2nd major release. I.e. 3.0 will still read 2.0 indexes, but 4.0 won't. Is
> that what you meant, or am I wrong?

Whoops, you're correct:

   http://wiki.apache.org/jakarta-lucene/BackwardsCompatibility

Mike




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711576#action_12711576
 ] 

Michael McCandless commented on LUCENE-1614:


bq. I think I'll emphasize that in the javadocs, documenting the limitation of 
MAX_VAL so that people won't assume the wrong things.

+1

A docID is different from an "int", because docIDs must be 0 .. MAX_VAL-1.

{quote}
Why? just because OBSI declared a method which we wanted anyway? You know .. 
it's something we don't give much thought to when we add methods to abstract 
classes, but what if someone extended DISI and added his own advance(int) or 
nextDoc() which don't behave like we expect them to. When he passes his DISI 
to the search flow somehow, not knowing these have become the primary methods, 
something will break.

I'm not saying we should protect these cases too, because otherwise we won't be 
able to make any changes. But just because OBSI had nextDoc() declared doesn't 
mean we should go and find a different name. That's slightly unrelated to this 
issue, but our back-compat policy forces us to replace good names with moderate 
ones, just because we cannot change methods.
{quote}

You're right, a random subclass of an abstract class could very well choose the 
name we are wanting to add, and then their class fails to compile, or (if the 
sigs turn out to be identical) runs but possibly causes problems.

But in this case we know we have just such a class that has done so (OBSI).  
And of course it did so for exactly the reasons that we are now wanting to add 
nextDoc to DISI.  My guess is eg Solr probably relies heavily on OBSI.nextDoc 
returning -1 when it's done and we're gonna cause AIOOB exceptions if we up and 
change to returning MAX_VAL.
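
To illustrate the AIOOB concern, here is a sketch of a caller written against the old "-1 means done" contract. SentinelSketch is hypothetical, and it assumes MAX_VAL means Integer.MAX_VALUE; no claim is made about how Solr actually consumes OBSI.

```java
// Hypothetical caller that relies on -1 as the end-of-iteration sentinel.
class SentinelSketch {
    // Marks each returned doc in 'seen' and counts the docs, stopping at -1.
    static int countOldStyle(int[] returnedDocs, boolean[] seen) {
        int count = 0;
        for (int doc : returnedDocs) {
            if (doc == -1) {
                break; // old contract: -1 signals exhaustion
            }
            // If the iterator now signals exhaustion with Integer.MAX_VALUE,
            // this index is out of bounds and the access throws.
            seen[doc] = true;
            count++;
        }
        return count;
    }
}
```

Passing {0, 2, -1} stops cleanly after two docs, whereas {0, 2, Integer.MAX_VALUE} throws an ArrayIndexOutOfBoundsException at the array write.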

bq.  That's slightly unrelated to this issue, but our back-compat policy forces 
us to replace good names with moderate ones, just because we cannot change 
methods.

As much as it bothers me having to accept inferior names (so they don't 
conflict with the existing names), I think it's very much the lesser-of-evils 
here.





[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711577#action_12711577
 ] 

Shai Erera commented on LUCENE-1595:


While changing SortableSingleDocMaker I noticed it creates a new Random() in 
getNextDocData(). Shouldn't that Random be created once? Also, I think it 
should be created with a seed?





[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711575#action_12711575
 ] 

Earwin Burrfoot commented on LUCENE-1648:
-

Or to be more exact, it fixed the tests that were failing for me.
Your two new tests fail as soon as I apply "always use MSR", because MSR does 
not carry over hasChanges/hasDeletions on clone().

> when you clone or reopen an IndexReader with pending changes, the new reader 
> doesn't commit the changes
> ---
>
> Key: LUCENE-1648
> URL: https://issues.apache.org/jira/browse/LUCENE-1648
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1648.patch
>
>
> While working on LUCENE-1647, I came across this issue... we are failing to 
> carry over hasChanges, norms/deletionsDirty, etc, when cloning the new reader.
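The kind of bug described above can be illustrated with a toy reader. This is only a sketch under stated assumptions: `PendingReader` and its two clone methods are hypothetical and are not Lucene's IndexReader; only the flag names `hasChanges` and `deletionsDirty` are borrowed from the issue text.

```java
// Hypothetical reader with pending-change flags; not Lucene's IndexReader.
class PendingReader {
    boolean hasChanges;
    boolean deletionsDirty;

    /** Marks a deletion as pending; nothing is written until a commit. */
    void deleteDocument(int doc) {
        hasChanges = true;
        deletionsDirty = true;
    }

    /** Buggy variant: builds a fresh reader and forgets the pending state. */
    PendingReader cloneDroppingState() {
        return new PendingReader();
    }

    /** Fixed variant: copies the flags so the clone knows it must commit. */
    PendingReader cloneCarryingState() {
        PendingReader copy = new PendingReader();
        copy.hasChanges = hasChanges;
        copy.deletionsDirty = deletionsDirty;
        return copy;
    }
}

class CloneDemo {
    public static void main(String[] args) {
        PendingReader reader = new PendingReader();
        reader.deleteDocument(7);
        System.out.println("dropping clone hasChanges: " + reader.cloneDroppingState().hasChanges);
        System.out.println("carrying clone hasChanges: " + reader.cloneCarryingState().hasChanges);
    }
}
```

A clone that starts with `hasChanges == false` has no reason to write anything when it is later closed or committed, which matches the symptom in the issue title.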




[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes

2009-05-21 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711571#action_12711571
 ] 

Earwin Burrfoot commented on LUCENE-1648:
-

bq. Try the patch?
Yup, it fixed everything.





[jira] Commented: (LUCENE-1646) QueryParser throws new exceptions even if custom parsing logic threw a better one

2009-05-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711570#action_12711570
 ] 

Michael McCandless commented on LUCENE-1646:


bq. I guess that's true if you look at exceptions as a logging mechanism, but 
in our case it's a parsing exception for text coming from the user. Because of 
this, our use case is for the user to get a useful error message, and it's not 
useful at all if we just tell them their entire query was bad. Thus we have 
inserted improvements (in our subclass) to make it complain only about the 
fragment of the query which is actually a problem, so they know which part to 
fix.

I agree, in general an app ought to be catching this ParseException
and messaging something much more succinct/digestible to the end
user.

When QueryParser throws ParseException, it does attempt to explain
what went wrong in many cases.  It sounds like you've modified it
further to do a better job?  Is this something you could donate back?

I agree we should make it easier to customize the ParseException, but
I think by default having the exception message include the full query
is useful for those apps that simply log the exception.  This way we
can have our cake and eat it too...

bq. Related, but is there any way it could at least be reduced to the portion 
of the query which caused the problem? In a way it would be nice if 
ParseException had methods to get out the problematic fragment (my subclass has 
it...) I'm guessing this is much easier for exceptions relating to values 
inside fields which otherwise parsed correctly, but a lot harder to do for 
exceptions from the parser proper.

It seems like it already makes some effort here?  Eg it has
attrs like currentToken, expectedTokenSequences,
etc., recorded when the exception is thrown.  And getMessage()
pulls from these attrs to build up the message?  Can you post a patch
with your improvements here?


> QueryParser throws new exceptions even if custom parsing logic threw a better 
> one
> -
>
> Key: LUCENE-1646
> URL: https://issues.apache.org/jira/browse/LUCENE-1646
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.4.1
>Reporter: Trejkaz
>
> We have subclassed QueryParser and have various custom fields.  When these 
> fields contain invalid values, we throw a subclass of ParseException which 
> has a more useful message (and also a localised message.)
> Problem is, Lucene's QueryParser is doing this:
> {code}
> catch (ParseException tme) {
>   // rethrow to include the original query:
>   throw new ParseException("Cannot parse '" + query + "': " + tme.getMessage());
> }
> {code}
> Thus, our nice and useful ParseException is thrown away, replaced by one with 
> no information about what's actually wrong with the query (it does append 
> getMessage() but that isn't localised.  And it also throws away the 
> underlying cause for the exception.)
> I am about to patch our copy to simply remove these four lines; the caller 
> knows what the query string was (they have to have a copy of it because they 
> are passing it in!) so having it in the error message itself is not useful.  
> Furthermore, when the query string is very big, what the user wants to know 
> is not that the whole query was bad, but which part of it was bad.
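One way to keep both the full-query message and the original exception is to attach the original as the cause when rethrowing. The sketch below uses hypothetical stand-in classes (`QueryParseException`, `FragmentParseException`, `RethrowDemo`), not Lucene's QueryParser or ParseException; only `Throwable.initCause` and `getCause` are real JDK calls.

```java
// Hypothetical stand-in for a parser's checked exception; not Lucene's class.
class QueryParseException extends Exception {
    QueryParseException(String message) { super(message); }
}

// Hypothetical subclass carrying the offending fragment, as an app might define.
class FragmentParseException extends QueryParseException {
    final String fragment;
    FragmentParseException(String message, String fragment) {
        super(message);
        this.fragment = fragment;
    }
}

class RethrowDemo {
    // Simulates custom field parsing that rejects the value "bad".
    static void parseField(String value) throws QueryParseException {
        if ("bad".equals(value)) {
            throw new FragmentParseException("invalid value", value);
        }
    }

    // Adds the full query to the message but keeps the original as the cause.
    static void parse(String query) throws QueryParseException {
        try {
            for (String term : query.split(" ")) {
                parseField(term);
            }
        } catch (QueryParseException e) {
            QueryParseException wrapped =
                new QueryParseException("Cannot parse '" + query + "': " + e.getMessage());
            wrapped.initCause(e); // the caller can still reach the richer exception
            throw wrapped;
        }
    }

    public static void main(String[] args) {
        try {
            parse("good bad");
        } catch (QueryParseException e) {
            Throwable cause = e.getCause();
            if (cause instanceof FragmentParseException) {
                System.out.println("problem fragment: " + ((FragmentParseException) cause).fragment);
            }
        }
    }
}
```

Apps that only log the exception still see the whole query in the message, while apps that want to show the user the bad fragment can unwrap the cause.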




Re: Lucene's default settings & back compatibility

2009-05-21 Thread Shai Erera
I thought that the index file format is supposed to be supported until the
2nd major release. I.e. 3.0 will still read 2.0 indexes, but 4.0 won't. Is
that what you meant, or am I wrong?

Shai

On Thu, May 21, 2009 at 2:17 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> OK so it sounds like we've boiled the proposal down to two concrete
> changes to the back-compat policy:
>
>  1) Default settings can change; we will always choose defaults based
> on "latest & greatest for new users".  This only affects "runtime
> behavior".  EG in 2.9, when sorting by field you won't get scores
> by default.  When we do this we should clearly document the
> change, and what settings one could use to get back to the old
> behavior, in CHANGES.txt.
>
>  2) An API, once released as deprecated, is fair game to be removed
> in the next minor release.
>
> We still only make bug fixes on point releases, support the index file
> format until the next major release -- those don't change.
>
> Mike

Re: Lucene's default settings & back compatibility

2009-05-21 Thread Michael McCandless
OK so it sounds like we've boiled the proposal down to two concrete
changes to the back-compat policy:

  1) Default settings can change; we will always choose defaults based
 on "latest & greatest for new users".  This only affects "runtime
 behavior".  EG in 2.9, when sorting by field you won't get scores
 by default.  When we do this we should clearly document the
 change, and what settings one could use to get back to the old
 behavior, in CHANGES.txt.

  2) An API, once released as deprecated, is fair game to be removed
 in the next minor release.

We still only make bug fixes on point releases, support the index file
format until the next major release -- those don't change.
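For item 2, a deprecated API would typically keep working by delegating to its replacement until it is removed. A minimal sketch, assuming a hypothetical `Widget` class; the method name `coolNewMethod` is only the placeholder used elsewhere in this thread, not a real Lucene method.

```java
// Hypothetical class used only to illustrate the deprecation cycle.
class Widget {
    /**
     * @deprecated Will be removed in the next minor release.
     *             Use {@link #coolNewMethod()} instead.
     */
    @Deprecated
    int oldMethod() {
        return coolNewMethod(); // delegate so both paths behave identically
    }

    /** Replacement API; callers should migrate to this. */
    int coolNewMethod() {
        return 42;
    }
}

class DeprecationDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        Widget w = new Widget();
        System.out.println(w.oldMethod() == w.coolNewMethod());
    }
}
```

Stating the removal release in the javadoc tells users exactly how long they have, which is what makes a one-release window workable.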

Mike

On Wed, May 20, 2009 at 11:34 PM, Shai Erera  wrote:
>> With the new way, you can get the first bug fix release, but then you will
>> quickly be left out of new bug fixes until you update your code.
>
> Mark, apologies for the late reference, but it struck me only after I left
> the computer yesterday. Again, I'm not sure how big of a problem it is.
> Suppose that I upgrade to 2.4 and the next version (6 months?) is 2.9. Then
> a bug is discovered and is fixed in 2.4.1 and 2.9. In addition, 2.9 contains
> all these changes in Collectors. When 2.9 is out I decide not to upgrade to
> 2.9 because I don't have time. When 3.0 comes out (3-4 months later?) I will
> be forced to upgrade. That means ~1 year since I last upgraded my Lucene
> code sections.
> (True, if there are any bug fixes in 2.9, I will miss them.)
>
> How unreasonable is it to ask this? Seriously, how many apps are not touched
> *at all* during one year? And even if these code segments are stable and no
> one touches them anymore, how problematic is it to ask users to clean up
> and update their code once a year?
>
>> In other words, we keep deprecated around for only one or two versions.
>
> That is a reasonable approach. Meaning, defaults may change between releases
> because we'd like Lucene users to get the latest & greatest stuff (and we
> also count on not many users strongly relying on the defaults?), but
> method removal/renaming causes a little more trouble, so we can give
> the users one release to arrange their time before they have to do anything.
>
> Maybe the TokenStream API needs to stay deprecated for longer, until we're
> sure everybody is happy with the new API.
>
> Shai
>
> On Thu, May 21, 2009 at 4:23 AM, Grant Ingersoll wrote:
>>
>> On May 20, 2009, at 4:06 PM, Michael McCandless wrote:
>>
>>> On Wed, May 20, 2009 at 3:24 PM, Shai Erera  wrote:

>>>> Then why go through all this trouble and not simply change the
>>>> back-compat policy?
>>>
>>> Back-compat is insanely costly, especially the longer it takes us to
>>> get to the next major release...  yet, the specific cost that bothers
>>> me the most is that we hurt our new users because of the back-compat
>>> users.  It hurts Lucene's adoption/growth.
>>>
>>> Another consideration on relaxing policy is that back-compat is well
>>> nigh impossible to actually achieve.  We spend an insane amount of our
>>> energy maintaining back-compat, but then one accidental breakage that
>>> slips through quickly causes many back-compat users to conclude we are
>>> not back-compat.  It's not much bang and a lot of buck.
>>>
>>> It is tempting to change our policy to something like:
>>>
>>>  * Bug fixes only on each 2.4.X release
>>>
>>>  * Anything can change on each 2.X release, but any prior 2.Y index
>>>   format is readable
>>>
>>> I think it's not unreasonable to say "if you want to take advantage of
>>> Lucene's perf improvements and new features, on upgrading you'll have
>>> to recompile, fix APIs, etc.".
>>
>>
>> All reasonable, Mike.  My take is that Lucene has always been pragmatic
>> about darn near everything, except back compat, where we are pretty
>> dogmatic.
>>
>> In general, I think it is reasonable to say that even from 2.x to 2.y we
>> will try to be back compatible, but when we deem it necessary, we reserve
>> the right to change things.  I don't think anyone here is suggesting we
>> would ever do something drastic like a complete overhaul of all the APIs in
>> a version change.  I also think it is reasonable to deprecate things by
>> saying @deprecated Will be removed in 2.Y.  Use coolNewMethod instead.   In
>> other words, we keep deprecated around for only one or two versions.  Of
>> course, the timing can vary.  Things like changing the Document stuff like
>> we've talked about might last longer (or shorter, actually) while minor
>> deprecations may only be kept for one.  The index compatibility stuff is a
>> must.
>>
>> It is probably worthwhile to ask on java-user@ how many people rely on our
>> back compat policies.
>>
>> Of course, we do already support back compat for all versions (tongue in
>> cheek): svn checkout
>> http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_3_1/
>>
>>
>> 

[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711555#action_12711555
 ] 

Shai Erera commented on LUCENE-1614:


BTW, regarding SortedVIntList - even though it extends DocIdSet, its javadocs 
start with "Store and iterate sorted integers in compressed form in RAM." - doc 
Ids are not mentioned. Also, the class is public, so nothing prevents someone 
from using it for integers that are not Doc Ids.

I think I'll emphasize that in the javadocs, documenting the limitation of 
MAX_VAL so that people won't assume the wrong things.

> Add next() and skipTo() variants to DocIdSetIterator that return the current 
> doc, instead of boolean
> 
>
> Key: LUCENE-1614
> URL: https://issues.apache.org/jira/browse/LUCENE-1614
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1614.patch
>
>
> See 
> http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html
>  for the full discussion. The basic idea is to add variants to those two 
> methods that return the current doc they are at, to save successive calls to 
> doc(). If there are no more docs, return -1. A summary of what was discussed 
> so far:
> # Deprecate those two methods.
> # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI 
> (calls next() and skipTo() respectively, and will be changed to abstract in 
> 3.0).
> #* I actually would like to propose an alternative to the names: advance() 
> and advance(int) - the first advances by one, the second advances to target.
> # Wherever these are used, do something like '(doc = advance()) >= 0' instead 
> of comparing to -1 for improved performance.
> I will post a patch shortly
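The proposal in the description can be sketched as below. `DocIterator` and `ArrayDocIterator` are hypothetical stand-ins rather than the real DocIdSetIterator, and the -1 sentinel follows the description as written here (other comments in the issue discuss a different sentinel).

```java
// Hypothetical stand-in for DocIdSetIterator; not the real Lucene class.
abstract class DocIterator {
    abstract boolean next();             // old API: true if positioned on a doc
    abstract boolean skipTo(int target); // old API: true if a doc >= target exists
    abstract int doc();                  // current doc id

    /** New variant: advances by one and returns the doc, or -1 when exhausted. */
    int nextDoc() { return next() ? doc() : -1; }

    /** New variant: advances to the first doc >= target, or -1 when exhausted. */
    int skipToDoc(int target) { return skipTo(target) ? doc() : -1; }
}

// Existing-style subclass that only implements the old methods.
class ArrayDocIterator extends DocIterator {
    private final int[] docs; // sorted ascending
    private int pos = -1;
    ArrayDocIterator(int[] docs) { this.docs = docs; }
    boolean next() { return ++pos < docs.length; }
    boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (docs[pos] < target);
        return true;
    }
    int doc() { return docs[pos]; }
}

class DocIteratorDemo {
    /** Sums all doc ids using the single-call loop form from the description. */
    static int sum(DocIterator it) {
        int doc, total = 0;
        while ((doc = it.nextDoc()) >= 0) {
            total += doc;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new ArrayDocIterator(new int[] {1, 4, 9})));
    }
}
```

The default implementations let existing subclasses keep compiling, while callers get one call per step instead of `next()` followed by `doc()`.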




[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

2009-05-21 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711553#action_12711553
 ] 

Shai Erera commented on LUCENE-1614:


bq. SortedVIntList subclasses DocIdSet

Sorry, did not notice that. It's just that the test confused me, since I thought 
it just stores VInts with no direct relation to doc Ids.

bq. maybe we need to choose a different name than nextDoc()

Why? Just because OBSI declared a method which we wanted anyway? You know, 
it's something we don't give much thought to when we add methods to abstract 
classes, but what if someone extended DISI and added his own advance(int) or 
nextDoc() which doesn't behave like we expect it to? When he passes his DISI 
to the search flow somehow, not knowing these have become the primary methods, 
something will break.
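The hazard can be reproduced with a toy hierarchy. Everything here is hypothetical (`BaseIterator`, `UserIterator`, `CollisionDemo`); it only shows how a user method that happens to share the name of a newly added base-class method silently becomes an override.

```java
// Hypothetical library base class. "Version 2" adds nextDoc() with a default
// implementation; version 1 only had next() and doc().
abstract class BaseIterator {
    abstract boolean next();
    abstract int doc();
    int nextDoc() { return next() ? doc() : -1; } // added in "version 2"
}

// User code written against version 1. It already had its own nextDoc(),
// meaning "peek at the doc after the current one, without advancing".
class UserIterator extends BaseIterator {
    private final int[] docs;
    private int pos = -1;
    UserIterator(int[] docs) { this.docs = docs; }
    boolean next() { return ++pos < docs.length; }
    int doc() { return docs[pos]; }
    // Compiles unchanged against version 2 and silently overrides the new method.
    int nextDoc() { return pos + 1 < docs.length ? docs[pos + 1] : -1; }
}

class CollisionDemo {
    // Library code that now drives iteration through the new method.
    // The limit guards against an iterator that never advances.
    static int count(BaseIterator it, int limit) {
        int n = 0;
        while (n < limit && it.nextDoc() >= 0) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Only 2 docs exist, but the user's peek never advances,
        // so the loop runs until the limit is reached.
        System.out.println(count(new UserIterator(new int[] {3, 7}), 10));
    }
}
```

Nothing fails at compile time, which is why this class of breakage is hard to guard against with naming alone.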

I'm not saying we should protect these cases too, because otherwise we won't be 
able to make any changes. But just because OBSI had nextDoc() declared doesn't 
mean we should go and find a different name. That's slightly unrelated to this 
issue, but our back-compat policy forces us to replace good names with moderate 
ones, just because we cannot change methods.

Can't we just document in CHANGES that nextDoc() now returns MAX_VAL when no 
more docs exist, and we fix the test in tag? I mean, how many users do we think 
use OBSI directly?

bq. MAX_VAL is not a valid docID

I'll remove the test then (from trunk and tag) and document on DISI this 
assumption.



