[jira] Updated: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1679:


Attachment: WildcardTermEnum_cleanup.patch
WildcardTermEnum.patch

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Priority: Minor
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached) but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which also contains some minor cleanup of WildcardTermEnum: I 
> removed unnecessary member initialization and made those members final 
> (WildcardTermEnum_cleanup.patch)
> Thanks simon




[jira] Created: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Simon Willnauer (JIRA)
Make WildcardTermEnum#difference() non-final


 Key: LUCENE-1679
 URL: https://issues.apache.org/jira/browse/LUCENE-1679
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9
Reporter: Simon Willnauer
Priority: Minor
 Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch

The method WildcardTermEnum#difference() is declared final. I found it very 
useful to subclass WildcardTermEnum to implement different scoring for exact 
vs. partial matches. The change is rather trivial (attached) but I guess it 
could make life easier for a couple of users.

I attached two patches:
 - one which contains the single change to make difference() non-final 
(WildcardTermEnum.patch)
 - one which also contains some minor cleanup of WildcardTermEnum: I 
removed unnecessary member initialization and made those members final 
(WildcardTermEnum_cleanup.patch)

Thanks simon
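
For illustration, a minimal sketch of the kind of subclass this change enables 
(hypothetical class name and scoring policy, assuming difference() is non-final 
as in the patch and the Lucene 2.9 constructor signature):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.WildcardTermEnum;

// Hypothetical subclass: score terms that equal the literal stem of the
// pattern (the text before the first wildcard) higher than partial,
// wildcard-expanded matches.
public class ScoringWildcardTermEnum extends WildcardTermEnum {
  private final String stem;

  public ScoringWildcardTermEnum(IndexReader reader, Term term) throws IOException {
    super(reader, term);
    String text = term.text();
    int star = text.indexOf('*');
    int quest = text.indexOf('?');
    int cut = star < 0 ? quest : (quest < 0 ? star : Math.min(star, quest));
    this.stem = cut < 0 ? text : text.substring(0, cut);
  }

  public float difference() {
    Term current = term(); // current term of the enumeration
    return current != null && current.text().equals(stem) ? 1.0f : 0.5f;
  }
}
{code}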




[jira] Created: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Simon Willnauer (JIRA)
Make prefixLength accessible to PrefixTermEnum subclasses
-

 Key: LUCENE-1680
 URL: https://issues.apache.org/jira/browse/LUCENE-1680
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Simon Willnauer
 Fix For: 2.9
 Attachments: PrefixTermEnum.patch

PrefixTermEnum#difference() offers a way to influence scoring based on the 
difference between the prefix Term and a term in the enumeration. To 
effectively use this facility the length of the prefix should be accessible to 
subclasses. Currently the prefix term is private to PrefixTermEnum. I added a 
getter for the prefix length and made PrefixTermEnum#endEnum(), 
PrefixTermEnum#termCompare() final for consistency with other TermEnum 
subclasses.

Patch is attached.

Simon
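
A hedged sketch of what this enables, assuming the patch's accessor is named 
getPrefixLength() (the exact name is not shown in this thread):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixTermEnum;

// Hypothetical subclass: score enumerated terms by how close their
// length is to the prefix length, so shorter completions rank higher.
public class LengthScoringPrefixTermEnum extends PrefixTermEnum {
  public LengthScoringPrefixTermEnum(IndexReader reader, Term prefix) throws IOException {
    super(reader, prefix);
  }

  public float difference() {
    Term current = term(); // current term of the enumeration
    if (current == null) {
      return 1.0f;
    }
    // getPrefixLength() is the accessor this issue proposes (name assumed).
    return (float) getPrefixLength() / current.text().length();
  }
}
{code}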




[jira] Updated: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1680:


Attachment: PrefixTermEnum.patch

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon




[jira] Updated: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1680:


Priority: Minor  (was: Major)

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon




[jira] Updated: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1453:
--

Attachment: LUCENE-1453.patch

Hi Earwin,
attached is a patch that simply reuses SegmentReader.Ref. Factoring it out to 
o.a.l.util would be harder to do at the moment (some test cases rely on this 
class), and SegmentReader seems to be the only class that uses such a Ref 
construct; other classes have their refCounter as a field.
As the Filter is just a deprecated wrapper that will be removed in 3.0, I think 
reusing SegmentReader.Ref for it is ok.

Closeable is a Java 1.5-only interface, so this refactoring must wait until 
3.0, but the idea is good!
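
For readers unfamiliar with it, SegmentReader.Ref is essentially a tiny 
synchronized counter; a sketch from memory (illustrative, not the verbatim 
Lucene source):

{code}
// Sketch of a SegmentReader.Ref-style reference counter.
final class Ref {
  private int refCount = 1;

  public synchronized int refCount() {
    return refCount;
  }

  public synchronized int incRef() {
    assert refCount > 0; // must not resurrect a released reference
    return ++refCount;
  }

  public synchronized int decRef() {
    assert refCount > 0;
    return --refCount;
  }
}
{code}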

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Issue Comment Edited: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718004#action_12718004
 ] 

Uwe Schindler edited comment on LUCENE-1453 at 6/10/09 2:40 AM:


Hi Earwin,
attached is a patch that simply reuses SegmentReader.Ref. Factoring it out to 
o.a.l.util would be harder to do at the moment (some test cases rely on this 
class), and SegmentReader seems to be the only class that uses such a Ref 
construct; other classes have their refCounter as a field.
As the Filter is just a deprecated wrapper that will be removed in 3.0, I think 
reusing SegmentReader.Ref for it is ok.

This patch also contains a new test for clone() that does the same as the 
reopen test (checking the refcounts).

Closeable is a Java 1.5-only interface, so this refactoring must wait until 
3.0, but the idea is good!

  was (Author: thetaphi):
Hi Earwin,
attached is a patch that simply reuses SegmentReader.Ref. Factoring it out to 
o.a.l.util would be harder to do at the moment (some test cases rely on this 
class), and SegmentReader seems to be the only class that uses such a Ref 
construct; other classes have their refCounter as a field.
As the Filter is just a deprecated wrapper that will be removed in 3.0, I think 
reusing SegmentReader.Ref for it is ok.

Closeable is a Java 1.5-only interface, so this refactoring must wait until 
3.0, but the idea is good!
  
> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718008#action_12718008
 ] 

Michael McCandless commented on LUCENE-1453:


bq. Mike, will you do this, or should I assign myself to this issue?

Go ahead & assign to yourself!

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718009#action_12718009
 ] 

Earwin Burrfoot commented on LUCENE-1453:
-

bq. As the Filter is just a deprecated wrapper that will be removed in 3.0, I 
think reusing SegmentReader.Ref for it is ok.
Ok. Maybe you are right.

bq. Closeable is a Java 1.5-only interface, so this refactoring must wait until 
3.0, but the idea is good!
We can introduce our own Closeable and replace it with the native Java one in 
3.0; thank gods the interface is simple :)

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Assigned: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-1453:
-

Assignee: Uwe Schindler  (was: Michael McCandless)

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718013#action_12718013
 ] 

Uwe Schindler commented on LUCENE-1453:
---

Mike: OK, I will commit the latest patch soon!

Earwin:
{quote}
bq. Closeable is a Java 1.5-only interface, so this refactoring must wait until 
3.0, but the idea is good!

We can introduce our own Closeable and replace it with the native Java one in 
3.0; thank gods the interface is simple 
{quote}
I think you should open an issue about that. But our own Closeable should be 
declared deprecated from the beginning, with the note "will be replaced by 
java.io.Closeable in 3.0".

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Closed: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed LUCENE-1453.
-

Resolution: Fixed

Committed revision 783280.

The 2.4 branch is untouched; if backporting is needed (because somebody has 
problems with reopen/clone), reopen this issue!

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9, 2.4.1
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Assigned: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1680:
--

Assignee: Michael McCandless

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon




[jira] Commented: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718024#action_12718024
 ] 

Michael McCandless commented on LUCENE-1680:


Should we just add a getter for the prefix Term, to be more general?

Also, I think we can't suddenly change protected methods to private (that 
breaks back compat).

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon




[jira] Reopened: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1453:



I'm seeing a failure in back-compat tests ("ant test-tag 
-Dtestcase=TestIndexReader"):

{code}
[junit] Testcase: 
testFalseDirectoryAlreadyClosed(org.apache.lucene.index.TestIndexReader): FAILED
[junit] did not hit expected exception
[junit] junit.framework.AssertionFailedError: did not hit expected exception
[junit] at 
org.apache.lucene.index.TestIndexReader.testFalseDirectoryAlreadyClosed(TestIndexReader.java:1514)
{code}

(I assume, but I'm not certain, it's from this fix...).

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Assigned: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1679:
--

Assignee: Michael McCandless

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached) but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which also contains some minor cleanup of WildcardTermEnum: I 
> removed unnecessary member initialization and made those members final 
> (WildcardTermEnum_cleanup.patch)
> Thanks simon




[jira] Updated: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1679:
---

Fix Version/s: 2.9

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached) but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which also contains some minor cleanup of WildcardTermEnum: I 
> removed unnecessary member initialization and made those members final 
> (WildcardTermEnum_cleanup.patch)
> Thanks simon




[jira] Commented: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718029#action_12718029
 ] 

Michael McCandless commented on LUCENE-1679:


I like the cleanup patch, but I think we should not remove close()?  Even 
though it's basically a no-op, removing it breaks back-compat.

Technically, changing the members to final is also a break in back-compat, but 
I think it's acceptable because WildcardTermEnum basically requires that these 
are final (ie, you can't change, say, "pre" after creation, because the 
enum has already been set).

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached) but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which also contains some minor cleanup of WildcardTermEnum: I 
> removed unnecessary member initialization and made those members final 
> (WildcardTermEnum_cleanup.patch)
> Thanks simon




Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
Use them how?  (Sounds interesting...).

Mike

On Tue, Jun 9, 2009 at 10:32 PM, Jason Rutherglen wrote:
> At the SF Lucene User's group, Michael Busch mentioned using
> payloads with TrieRangeQueries. Is this something that's being
> worked on? I'm interested in what sort of performance benefits
> there would be with this method.
>




[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718031#action_12718031
 ] 

Michael McCandless commented on LUCENE-1678:


bq. So, given it is already broken, why not fix it the right way?

Because two wrongs don't make a right?

(I assume you're suggesting changing tokenStream to match reusableTokenStream, 
ie allowing it to return a reused TokenStream between calls, and then 
deprecating reusableTokenStream).

Apps that get multiple TokenStreams from a single Analyzer and then iterate 
through them would silently break if we made this second 
non-back-compatible change.
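
To make the breakage concrete, a sketch of the failure mode (hypothetical 
subclass; the extra filter is just an example):

{code}
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// A subclass written against an older Lucene, overriding only tokenStream.
public class MyAnalyzer extends StandardAnalyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Before reusableTokenStream existed, indexing always called this.
    // Now StandardAnalyzer.reusableTokenStream is invoked instead, so
    // this extra filter is silently skipped at index time.
    return new LowerCaseFilter(super.tokenStream(fieldName, reader));
  }
}
{code}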

> Deprecate Analyzer.tokenStream
> --
>
> Key: LUCENE-1678
> URL: https://issues.apache.org/jira/browse/LUCENE-1678
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> The addition of reusableTokenStream to the core analyzers unfortunately broke 
> back compat of external subclasses:
> 
> http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
> On upgrading, such subclasses would silently not be used anymore, since 
> Lucene's indexing invokes reusableTokenStream.
> I think we should at least deprecate Analyzer.tokenStream, today, so 
> that users see deprecation warnings if their classes override this method.  
> But going forward, when we want to change the API of core classes that are 
> extended, I think we have to introduce entirely new classes, to keep back 
> compatibility.




[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718032#action_12718032
 ] 

Michael McCandless commented on LUCENE-1678:


bq. The sane/smart way is to do it on a case by case basis.

Right, and the huge periodic discussions on back-compat do soften
"our" stance on these.  For example, LUCENE-1542 was just such a case,
where we chose to simply fix the [rather nasty] bug at the expense of
apps possibly relying on the broken behavior.

LUCENE-1679 is another (rather trivial) example, where we plan to
change certain fields in WildcardTermEnum to be final.


> Deprecate Analyzer.tokenStream
> --
>
> Key: LUCENE-1678
> URL: https://issues.apache.org/jira/browse/LUCENE-1678
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> The addition of reusableTokenStream to the core analyzers unfortunately broke 
> back compat of external subclasses:
> 
> http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
> On upgrading, such subclasses would silently not be used anymore, since 
> Lucene's indexing invokes reusableTokenStream.
> I think we should at least deprecate Analyzer.tokenStream, today, so 
> that users see deprecation warnings if their classes override this method.  
> But going forward, when we want to change the API of core classes that are 
> extended, I think we have to introduce entirely new classes, to keep back 
> compatibility.




[jira] Updated: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1680:


Attachment: PrefixTermEnum_2nd.patch

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch, PrefixTermEnum_2nd.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon




[jira] Updated: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1679:


Attachment: WildcardTermEnum_cleanup_2nd.patch

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch, 
> WildcardTermEnum_cleanup_2nd.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached) but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which also contains some minor cleanup of WildcardTermEnum: I 
> removed unnecessary member initialization and made those members final 
> (WildcardTermEnum_cleanup.patch)
> Thanks simon




[jira] Commented: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718038#action_12718038
 ] 

Simon Willnauer commented on LUCENE-1680:
-

You are right, adding a getter for the prefix Term is more general. I had a 
brief look at Term#set and missed that it is package-private; otherwise the 
Term would have been mutable, and I would have preferred to just return the 
prefix length.
I added a new version of the patch. Thanks

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch, PrefixTermEnum_2nd.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon




[jira] Commented: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718039#action_12718039
 ] 

Simon Willnauer commented on LUCENE-1679:
-

I created a new patch that keeps the #close() method.
I absolutely believe your comment, but I have a hard time understanding why 
removing it breaks back-compat. Could you give me a quick explanation, please? 
I might be missing something.

Thanks, Simon

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch, 
> WildcardTermEnum_cleanup_2nd.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached) but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which also contains some minor cleanup of WildcardTermEnum: I 
> removed unnecessary member initialization and made those members final 
> (WildcardTermEnum_cleanup.patch)
> Thanks simon




[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718041#action_12718041
 ] 

Uwe Schindler commented on LUCENE-1453:
---

Thanks Mike,
it is from this fix. The test should normally also fail with trunk, but it 
doesn't because we are using FSDir.getDirectory() in the trunk test. This is 
another test that relies on the refcounting of FSDir.getDirectory.

The problem:
If you do IndexReader.open() on an invalid index and IndexReader.open fails with 
an Exception, the Directory stays open (because the wrapper has no chance to 
close it). I'll fix this and also enable FSDir.getDirectory for this test in 
trunk.
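
In other words, the wrapper that opens the Directory must close it when 
IndexReader.open throws; an illustrative sketch of that shape (not the 
committed patch):

{code}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SafeOpen {
  // If IndexReader.open fails (e.g. on an invalid index), close the
  // Directory we just created so it does not stay open.
  public static IndexReader open(File indexDir) throws IOException {
    Directory dir = FSDirectory.open(indexDir);
    try {
      return IndexReader.open(dir);
    } catch (IOException e) {
      dir.close();
      throw e;
    }
  }
}
{code}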

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Updated: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1453:
--

Attachment: LUCENE-1453-fix-TestIndexReader.patch

This fixes this special case and changes the trunk test to also hit it. I will 
commit soon!

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-fix-TestIndexReader.patch, LUCENE-1453-with-FSDir-open.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Issue Comment Edited: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718041#action_12718041
 ] 

Uwe Schindler edited comment on LUCENE-1453 at 6/10/09 5:07 AM:


Thanks Mike,
it is from this fix. The test should normally also fail with trunk, but it 
doesn't because we are using FSDir.open() in the trunk test. This is another 
test that relies on the refcounting of FSDir.getDirectory.

The problem:
If you do IndexReader.open() on an invalid index and IndexReader.open fails with 
an Exception, the Directory stays open (because the wrapper has no chance to 
close it). I'll fix this and also enable FSDir.getDirectory for this test in 
trunk.

  was (Author: thetaphi):
Thanks Mike,
it is from this fix. The test should normally also fail with trunk, but it 
doesn't because we are using FSDir.getDirectory() in the trunk test. This is 
another test that relies on the refcounting of FSDir.getDirectory.

The problem:
If you do IndexReader.open() on an invalid index and IndexReader.open fails with 
an Exception, the Directory stays open (because the wrapper has no chance to 
close it). I'll fix this and also enable FSDir.getDirectory for this test in 
trunk.
  
> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-fix-TestIndexReader.patch, LUCENE-1453-with-FSDir-open.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Closed: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed LUCENE-1453.
-

Resolution: Fixed

Committed revision 783314. Thanks Mike! Next time I will also run test-tag, 
sorry.

I will now go through all other tests using FSDir.open() in trunk and check if 
there are more cases that rely on refcounting. They can be found easily 
because they catch AlreadyClosedException.

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9, 2.4.1
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-fix-TestIndexReader.patch, LUCENE-1453-with-FSDir-open.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Commented: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718051#action_12718051
 ] 

Michael McCandless commented on LUCENE-1679:


bq. I created a new patch that keeps the #close() method.

Whoops, sorry, I was wrong: removing close is fine, since super.close is still 
there, and since we will no longer assign nulls to the members.

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch, 
> WildcardTermEnum_cleanup_2nd.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached) but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which also contains some minor cleanup of WildcardTermEnum: I 
> removed unnecessary member initialization and made those members final 
> (WildcardTermEnum_cleanup.patch)
> Thanks simon




[jira] Commented: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718054#action_12718054
 ] 

Michael McCandless commented on LUCENE-1680:


bq. Also, I think we can't suddenly change protected methods to private (that 
breaks back compat).

I meant to say "can't suddenly change add 'final' to protected methods", but I 
see from your new patch that you understood what I meant anyway :)

Patch looks good -- I'll commit shortly.  Thanks!

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch, PrefixTermEnum_2nd.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon




[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718055#action_12718055
 ] 

Michael McCandless commented on LUCENE-1453:


OK thanks Uwe!

> When reopen returns a new IndexReader, both IndexReaders may now control the 
> lifecycle of the underlying Directory which is managed by reference counting
> -
>
> Key: LUCENE-1453
> URL: https://issues.apache.org/jira/browse/LUCENE-1453
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: Failing-testcase-LUCENE-1453.patch, 
> LUCENE-1453-fix-TestIndexReader.patch, LUCENE-1453-with-FSDir-open.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, 
> LUCENE-1453.patch, LUCENE-1453.patch
>
>
> Rough summary. Basically, FSDirectory tracks references to FSDirectory and 
> when IndexReader.reopen shares a Directory with a created IndexReader and 
> closeDirectory is true, FSDirectory's ref management will see two decrements 
> for one increment. You can end up getting an AlreadyClosed exception on the 
> Directory when the IndexReader is open.
> I have a test I'll put up. A solution seems fairly straightforward (at least 
> in what needs to be accomplished).




[jira] Updated: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-10 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1595:
---

Attachment: LUCENE-1595.patch

Some updates:
# Added a log.step config parameter to PerfTask, and implemented the logging 
messages in tearDown. Also introduced a getLogMessage(int recsCount) which can 
be overridden by subclasses.
#* Overrode getLogMessage in the relevant tasks which logged messages, such as 
AddDocTask, DeleteDocTask, WriteLineDocTask ... and removed the ad-hoc logging 
from these tasks.
# Added a ConsumeContentSource task together with a readContentSource.alg - 
this can be used to simply read from a content source, if we want to measure 
the performance of a particular impl.
# Removed the "xerces" class name from EnwikiContentSource (read more below).

I changed EnwikiContentSource to not specifically request a Xerces SAXParser. 
However, the default is to use the JRE's SAXParser, which is Xerces.

I wanted to remove the Xerces .jar, but when I attempted to read the 
enwiki-20090306-pages-articles.xml, it failed w/ an AIOOBE, so I don't think we 
can remove the .jar yet.
BTW, in LUCENE-1591 I reported that I was not able to parse that particular 
enwiki version, w/ and w/o Xerces; however, Mike succeeded. So I don't know if 
this enwiki version is defective, or if it's a problem on Windows.

Anyway, the bottom line is we cannot remove the Xerces .jar.

I think this patch is ready for commit. All benchmark tests pass.
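
To illustrate item 1, overriding the new hook in a custom task could look 
roughly like this (a sketch based on the description above; the exact 
signatures in the patch may differ):

    import org.apache.lucene.benchmark.byTask.PerfRunData;
    import org.apache.lucene.benchmark.byTask.tasks.PerfTask;

    public class MyAddDocTask extends PerfTask {
      public MyAddDocTask(PerfRunData runData) {
        super(runData);
      }
      public int doLogic() throws Exception {
        // index one document here; return the number of records processed
        return 1;
      }
      protected String getLogMessage(int recsCount) {
        // called from tearDown's logging every log.step records
        return "added " + recsCount + " docs";
      }
    }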

> Split DocMaker into ContentSource and DocMaker
> --
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split up to ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just the reuse of DocState). Also, other DocMakers do not use that DocState 
> today, something they could have gotten for free with this refactoring 
> proposed.
> So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
> instance and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718069#action_12718069
 ] 

Simon Willnauer commented on LUCENE-1679:
-

I see. I could not think of anything which would break backwards compat. when 
removing the close method. The only thing I could think of was some convention 
I did not know about. So I assume the WildcardTermEnum_cleanup.patch is fine.
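
For reference, this is the kind of subclass the cleanup patch enables -- a 
minimal sketch with made-up names, relying only on difference() being 
non-final:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.WildcardTermEnum;

    // score exact matches higher than partial (wildcard-expanded) matches
    public class ExactMatchWildcardTermEnum extends WildcardTermEnum {
      private final String queryText;
      public ExactMatchWildcardTermEnum(IndexReader reader, Term term) throws IOException {
        super(reader, term);
        queryText = term.text();
      }
      public float difference() {
        // full weight for an exact match, half weight otherwise
        return queryText.equals(term().text()) ? 1.0f : 0.5f;
      }
    }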

simon

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch, 
> WildcardTermEnum_cleanup_2nd.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached)  but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which does also contain some minor cleanup of WildcardTermEnum. I 
> removed unnecessary member initialization and made those final. ( 
> WildcardTermEnum_cleanup.patch)
> Thanks simon

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718081#action_12718081
 ] 

Michael McCandless commented on LUCENE-1678:


bq. Adopting a fixed release cycle with small intervals between releases 
(compared to what we have now). 

I think this is almost a good solution, though instead of "fixed" it
could be that we try [harder] to do major releases more frequently.
Let's face it: Lucene is changing quite quickly now, so it seems
reasonable that the major releases also come quickly.

I say "almost" because alot of the pain in implementing our
current policy is the need to have a "stepping stone" between old and
new.  Ie, we now must always do a release that deprecates old APIs and
introduces new ones so that you can upgrade to that, fix deprecations,
and you know you're set for the next major release.  So eg changes to
interfaces is a big problem.  If we were free to suddenly make a new
major releases, with instructions on how to migrate old -> new, that'd
be very liberating.

I think nearly everyone agrees our back-compat policy is exceptionally
costly.  On a given interesting change to Lucene, a very large part of
the effort is spent on preserving back-compat. It causes all kinds of
spooky code, pollutes the APIs, causes us to go forward with sub-par
names, etc.  The freedom Marvin has to make changes to Lucy is
fabulous, though in exchange, it's not yet released...

I think most would also agree that it's far from easy even carrying
out the policy we have without making mistakes: this change (addition
of reusableTokenStream) violated our policy (I did it by accident and
nobody noticed until now).  I actually believe programming languages /
runtime envs need to provide more support for developers; we have
inadequate tools now.  But we can't wait for that...


> Deprecate Analyzer.tokenStream
> --
>
> Key: LUCENE-1678
> URL: https://issues.apache.org/jira/browse/LUCENE-1678
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> The addition of reusableTokenStream to the core analyzers unfortunately broke 
> back compat of external subclasses:
> 
> http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
> On upgrading, such subclasses would silently not be used anymore, since 
> Lucene's indexing invokes reusableTokenStream.
> I think we should at least deprecate Analyzer.tokenStream, today, so 
> that users see deprecation warnings if their classes override this method.  
> But going forward when we want to change the API of core classes that are 
> extended, I think we have to introduce entirely new classes, to keep back 
> compatibility.
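
To make the failure mode concrete, here is a sketch (illustrative, not taken 
from the issue) of an external subclass that silently stops working, plus the 
workaround of overriding both methods:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MyAnalyzer extends StandardAnalyzer {
      // before reusableTokenStream existed, overriding this was enough; now
      // indexing calls reusableTokenStream and silently bypasses it
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // LowerCaseFilter stands in for whatever custom filtering is added
        return new LowerCaseFilter(super.tokenStream(fieldName, reader));
      }
      // workaround: override the reusable variant too, here simply delegating
      // to the non-reusable path (gives up reuse, keeps correct behavior)
      public TokenStream reusableTokenStream(String fieldName, Reader reader)
          throws IOException {
        return tokenStream(fieldName, reader);
      }
    }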

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718079#action_12718079
 ] 

Michael McCandless commented on LUCENE-1678:



bq. Mike was gung ho for it for a while, and even he backed off. 

Well... my particular itch (most recently!) was an addition to Lucene
that'd let us conditionalize the default settings so that new users
get the latest & greatest, but back-compat users can easily preserve
old behavior.

Ie, it was a software change, not a policy change; I tried hard to
steer clear of any proposed changes to back-compat policy.

But, for better or worse, back-compat policy is one of those
"magnetic" topics: whenever you get too close to it, it suddenly
sticks to you and takes over your thread.

And in the end we arrived at a workable solution to my particular
itch, which is to make such settings explicit or switch to new APIs
that change the defaults (eg the new FSDir.open).

That said, improving our back compat policy *is* an important and
amazingly complex topic.


> Deprecate Analyzer.tokenStream
> --
>
> Key: LUCENE-1678
> URL: https://issues.apache.org/jira/browse/LUCENE-1678
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> The addition of reusableTokenStream to the core analyzers unfortunately broke 
> back compat of external subclasses:
> 
> http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
> On upgrading, such subclasses would silently not be used anymore, since 
> Lucene's indexing invokes reusableTokenStream.
> I think we should at least deprecate Analyzer.tokenStream, today, so 
> that users see deprecation warnings if their classes override this method.  
> But going forward when we want to change the API of core classes that are 
> extended, I think we have to introduce entirely new classes, to keep back 
> compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718080#action_12718080
 ] 

Michael McCandless commented on LUCENE-1678:



bq. The way Lucene stuff generally goes, if someone like Grant or Mike really 
wanted to push changes, the changes would happen. 

Well, it's consensus that we all need to reach (at least enough
consensus to vote on it), and on complex topics it's not easy to get
to consensus.

bq. Giving up is really not the answer though - that's why the discussion has 
come and gone in the past.

I don't think anyone has given up.  The issue still smoulders and
flares up here and there (like, this issue).  Eventually we'll get
enough consensus for something concrete to change.


bq. I have no moral right to hammer my ideals into heads that did tremendously 
more for the project, than I did.

In fact you do & should.  This is exactly how change happens.  Here's
a great (though sexist) quote:

"The reasonable man adapts himself to the world; the unreasonable one persists 
to adapt the world to himself. Therefore all progress depends on the 
unreasonable man." - George Bernard Shaw




> Deprecate Analyzer.tokenStream
> --
>
> Key: LUCENE-1678
> URL: https://issues.apache.org/jira/browse/LUCENE-1678
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> The addition of reusableTokenStream to the core analyzers unfortunately broke 
> back compat of external subclasses:
> 
> http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
> On upgrading, such subclasses would silently not be used anymore, since 
> Lucene's indexing invokes reusableTokenStream.
> I think we should at least deprecate Analyzer.tokenStream, today, so 
> that users see deprecation warnings if their classes override this method.  
> But going forward when we want to change the API of core classes that are 
> extended, I think we have to introduce entirely new classes, to keep back 
> compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Created: (LUCENE-1681) DocValues infinite loop caused by - a call to getMinValue | getMaxValue | getAverageValue

2009-06-10 Thread Simon Willnauer (JIRA)
DocValues infinite loop caused by - a call to getMinValue | getMaxValue | 
getAverageValue
-

 Key: LUCENE-1681
 URL: https://issues.apache.org/jira/browse/LUCENE-1681
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.3.3, 2.4.2, 2.9, 3.0
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 2.9


org.apache.lucene.search.function.DocValues offers 3 public (optional) methods 
to access value statistics like the min, max and average of the internal 
values. A call to any of these methods results in an infinite loop because the 
internal counter is never incremented. 
I added a testcase, javadoc and a slightly different implementation. I guess 
this is not breaking any back compat., as a call to those methods would have 
caused an infinite loop anyway.
I changed the return value of all of those methods to Float.NaN if the 
DocValues implementation does not contain any values.

It might be worth considering fixing this in 2.4.2 and 2.3.3 as well.
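
To make that concrete, here is a rough reconstruction of what a terminating 
getMinValue() could look like (illustrative only, not the attached patch; 
numValues() stands in for the DocValues internals):

    // iterate with a counter that actually advances, and return Float.NaN
    // when the DocValues instance holds no values
    public float getMinValue() {
      float min = Float.NaN;
      for (int doc = 0; doc < numValues(); doc++) {  // the bug: this counter never advanced
        float val = floatVal(doc);
        if (Float.isNaN(min) || val < min) {
          min = val;
        }
      }
      return min;  // NaN signals "no values", per the change described above
    }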



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Updated: (LUCENE-1681) DocValues infinite loop caused by - a call to getMinValue | getMaxValue | getAverageValue

2009-06-10 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1681:


Attachment: DocValues.patch

> DocValues infinite loop caused by - a call to getMinValue | getMaxValue | 
> getAverageValue
> -
>
> Key: LUCENE-1681
> URL: https://issues.apache.org/jira/browse/LUCENE-1681
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 
> 3.0
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 2.9
>
> Attachments: DocValues.patch
>
>
> org.apache.lucene.search.function.DocValues offers 3 public (optional) 
> methods to access value statistics like the min, max and average of the 
> internal values. A call to any of these methods results in an infinite 
> loop because the internal counter is never incremented. 
> I added a testcase, javadoc and a slightly different implementation. I 
> guess this is not breaking any back compat., as a call to those methods 
> would have caused an infinite loop anyway.
> I changed the return value of all of those methods to Float.NaN if the 
> DocValues implementation does not contain any values.
> It might be worth considering fixing this in 2.4.2 and 2.3.3 as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Mark Miller

Michael McCandless (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718080#action_12718080 ] 


Michael McCandless commented on LUCENE-1678:



bq. The way Lucene stuff generally goes, if someone like Grant or Mike really wanted to push changes, the changes would happen. 


Well, it's consensus that we all need to reach (at least enough
consensus to vote on it), and on complex topics it's not easy to get
to consensus.
  
Right. I didn't mean you guys could ram it down everyone's throats - I 
basically meant, if you wanted to build the consensus, and you thought 
the idea was good - you could do it more easily than many of the new guys 
might think. I've seen it happen before.

bq. Giving up is really not the answer though - that's why the discussion has 
come and gone in the past.

I don't think anyone has given up.  The issue still smoulders and
flares up here and there (like, this issue).  Eventually we'll get
enough consensus for something concrete to change.
  
I think some of the newer people in the community do sink into a give-up 
mentality (based on the comments I've seen). I think the issue is, and 
the reason I even responded to this to begin with, that people jump to 
conclusions about what's going on here. They think the committers are 
stubborn and/or stuck in our old ways. That we are too in love with our 
back compat policy ;) It's common for some of us to point out things that 
slow issues down, and we don't always contribute much towards pushing 
some issues forward. Some of the newer guys in the community have gotten 
the wrong idea about that. Things tend to happen slowly, but with 
persistence they do happen. Lucene is kind of a conservative project, 
but I don't like the idea that some of the newer guys see things as 
locked up. I've been around long enough to know they are not. Everything 
is up for debate, and things have been moving steadily towards progress 
in Lucene land. Again, it's like a constitution though - if it was easy 
to whip around the rules, we would have a lot of problems. When I make 
comments for or against something, I try and think about what's best for 
the community. I think others likely do the same thing.


Anyway, when I see those comments, I think - there is no need to lash 
out with little jibes. Persistence will move things forward. It is about 
consensus building, and that takes time and effort. More for some than 
others. The funny thing is, from what I've seen, when push comes to 
shove, it's easier to get consensus around here than some of the email 
discussions might suggest. It just takes effort and persistence.


bq. I have no moral right to hammer my ideals into heads that did tremendously 
more for the project, than I did.

In fact you do & should.  This is exactly how change happens.  Here's
a great (though sexist) quote:

"The reasonable man adapts himself to the world; the unreasonable one persists to 
adapt the world to himself. Therefore all progress depends on the unreasonable man." 
- George Bernard Shaw
Right. Even though I think some of the newer guys have an odd (minor) 
disrespect for what came before them (when Lucene was younger, most of 
these issues didn't yet exist! And the project has been very 
successful/stable thus far), I am extremely happy that there are a bunch 
of new people shaking things up. I'd rather they didn't go away (thinking 
Lucene is locked up in insanity) or stop talking about improving back 
compat :)


--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718090#action_12718090
 ] 

Mark Miller commented on LUCENE-1595:
-

Someone else can nab this from me if they want to go ahead before I get a 
chance.

Otherwise, I'll try and take a look by the end of the weekend. I started 
looking at it before, but just haven't yet had a chance to go back to it.

In either case, we will get it in soon.

> Split DocMaker into ContentSource and DocMaker
> --
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split up into ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just the reuse of DocState). Also, other DocMakers do not use that DocState 
> today, something they could have gotten for free with this refactoring 
> proposed.
> So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
> instance and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Resolved: (LUCENE-1680) Make prefixLength accessible to PrefixTermEnum subclasses

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1680.


Resolution: Fixed

Thanks Simon!

> Make prefixLength accessible to PrefixTermEnum subclasses
> -
>
> Key: LUCENE-1680
> URL: https://issues.apache.org/jira/browse/LUCENE-1680
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: PrefixTermEnum.patch, PrefixTermEnum_2nd.patch
>
>
> PrefixTermEnum#difference() offers a way to influence scoring based on the 
> difference between the prefix Term and a term in the enumeration. To 
> effectively use this facility the length of the prefix should be accessible 
> to subclasses. Currently the prefix term is private to PrefixTermEnum. I 
> added a getter for the prefix length and made PrefixTermEnum#endEnum(), 
> PrefixTermEnum#termCompare() final for consistency with other TermEnum 
> subclasses.
> Patch is attached.
> Simon

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Resolved: (LUCENE-1679) Make WildcardTermEnum#difference() non-final

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1679.


Resolution: Fixed

Thanks Simon!

> Make WildcardTermEnum#difference() non-final
> 
>
> Key: LUCENE-1679
> URL: https://issues.apache.org/jira/browse/LUCENE-1679
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: WildcardTermEnum.patch, WildcardTermEnum_cleanup.patch, 
> WildcardTermEnum_cleanup_2nd.patch
>
>
> The method WildcardTermEnum#difference() is declared final. I found it very 
> useful to subclass WildcardTermEnum to implement different scoring for exact 
> vs. partial matches. The change is rather trivial (attached)  but I guess it 
> could make life easier for a couple of users.
> I attached two patches:
>  - one which contains the single change to make difference() non-final 
> (WildcardTermEnum.patch)
>  - one which does also contain some minor cleanup of WildcardTermEnum. I 
> removed unnecessary member initialization and made those final. ( 
> WildcardTermEnum_cleanup.patch)
> Thanks simon

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Some thoughts around the use of reader.isDeleted and hasDeletions

2009-06-10 Thread Shai Erera
>
> it makes sense because isDeleted() is essentially the *only* thing
> being done in the loop, and hence we can eliminate the loop entirely
>

You mean that in case there is a matching segment, we can call
matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?
But in case it does not have a matching segment, we'd still need to iterate
on the docs, and copy the term vectors one by one, right?

I'm not very familiar w/ the code, so I'd like to confirm my understanding.

Shai

On Tue, Jun 9, 2009 at 9:54 PM, Yonik Seeley <
[email protected]> wrote:

> 2009/6/9 Shai Erera :
> >> If there are no deletions, it's just a null pointer check, right?
> >
> > Well ... one null pointer check here, one null pointer check there and at
> > some point you will see a difference. My point wasn't the null pointer
> check
> > itself, but the pointer check for *every* document in mergeFields() and
> > *every* document in mergeVectors().
>
> I like performance, but it does seem like anything that complicates
> the code (duplication and specialization) should result in an actual
> measurable performance increase.
>
> But in this specific case (I just looked at the code for mergeVectors)
> it makes sense because isDeleted() is essentially the *only* thing
> being done in the loop, and hence we can eliminate the loop entirely
> (an algorithmic change, not just eliminating a null pointer check
> per-document in the context of doing something else per-document).
>
> patch away ;-)
>
> -Yonik
> http://www.lucidimagination.com
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


[jira] Updated: (LUCENE-1682) unit tests should use private directories

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1682:
---

Attachment: LUCENE-1682.patch

I plan to commit later today...

> unit tests should use private directories
> -
>
> Key: LUCENE-1682
> URL: https://issues.apache.org/jira/browse/LUCENE-1682
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1682.patch
>
>
> This only affects our unit tests...
> I run "ant test" and "ant test-tag" concurrently, but some tests have false 
> failures (eg TestPayloads) because they use a fixed test directory in the 
> filesystem for testing.
> I've added a simple method to _TestUtil to get a temp dir, and switched over 
> those tests that I've hit false failures on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Created: (LUCENE-1682) unit tests should use private directories

2009-06-10 Thread Michael McCandless (JIRA)
unit tests should use private directories
-

 Key: LUCENE-1682
 URL: https://issues.apache.org/jira/browse/LUCENE-1682
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9
 Attachments: LUCENE-1682.patch

This only affects our unit tests...

I run "ant test" and "ant test-tag" concurrently, but some tests have false 
failures (eg TestPayloads) because they use a fixed test directory in the 
filesystem for testing.

I've added a simple method to _TestUtil to get a temp dir, and switched over 
those tests that I've hit false failures on.
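
For the curious, such a helper might look roughly like this (a sketch; the 
actual _TestUtil method in the patch may differ):

    import java.io.File;
    import java.util.Random;

    // give each test (and each concurrent "ant test" run) its own directory
    public static File getTempDir(String desc) {
      File dir = new File(System.getProperty("java.io.tmpdir"),
                          desc + "." + new Random().nextInt(Integer.MAX_VALUE));
      dir.mkdirs();
      return dir;
    }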

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Some thoughts around the use of reader.isDeleted and hasDeletions

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera wrote:
>> it makes sense because isDeleted() is essentially the *only* thing
>> being done in the loop, and hence we can eliminate the loop entirely
>
> You mean that in case there is a matching segment, we can call
> matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?

Right, except you'll have to do it in chunks of rawDocLengths.length,
until you get to maxDoc.
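
Roughly like this (a sketch reusing the call form from your mail; the real 
merge code will differ in buffer and stream handling):

    // bulk-copy term vectors in chunks, with no per-document isDeleted() loop
    int docNum = 0;
    while (docNum < maxDoc) {
      int len = Math.min(rawDocLengths.length, maxDoc - docNum);
      matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, docNum, len);
      docNum += len;
    }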

> But in case it does not have a matching segment, we'd still need to iterate
> on the docs, and copy the term vectors one by one, right?

Yes.

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Some thoughts around the use of reader.isDeleted and hasDeletions

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera  wrote:
>> it makes sense because isDeleted() is essentially the *only* thing
>> being done in the loop, and hence we can eliminate the loop entirely
>
> You mean that in case there is a matching segment, we can call
> matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?

Right... or rather directly calculate numDocs and docNum instead of
using the loop.

> But in case it does not have a matching segment, we'd still need to iterate
> on the docs, and copy the term vectors one by one, right?

Right, and that's the case where I think duplicating the code to
remove a single branch-predictable boolean flag isn't warranted as it
won't result in a measurable performance increase.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718122#action_12718122
 ] 

Shai Erera commented on LUCENE-1678:


We've had this thread 
http://www.nabble.com/Lucene%27s-default-settings---back-compatibility-td23605466.html,
 and in the latest post 
(http://www.nabble.com/Re%3A-Lucene%27s-default-settings---back-compatibility-p23792927.html)
 I tried to put together some wording for a revised (and relaxed) back-compat 
policy. I believe it was Grant who asked for some writeup to get to the users' 
list, and I also read that we may want to discuss each item separately, to get 
to a consensus.

Perhaps we can continue the discussion on that thread, and try to get to a 
consensus on any of the items? We don't necessarily need to change all of it in 
one day, but getting some feedback from you on any of the items can help bring 
that discussion back to life, and hopefully reach a consensus.

As was said on this thread, persistence will eventually drive us to reach a 
consensus, so I'm being persistent :).

> Deprecate Analyzer.tokenStream
> --
>
> Key: LUCENE-1678
> URL: https://issues.apache.org/jira/browse/LUCENE-1678
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> The addition of reusableTokenStream to the core analyzers unfortunately broke 
> back compat of external subclasses:
> 
> http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html
> On upgrading, such subclasses would silently not be used anymore, since 
> Lucene's indexing invokes reusableTokenStream.
> I think we should at least deprecate Analyzer.tokenStream, today, so 
> that users see deprecation warnings if their classes override this method.  
> But going forward when we want to change the API of core classes that are 
> extended, I think we have to introduce entirely new classes, to keep back 
> compatibility.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Lucene's default settings & back compatibility

2009-06-10 Thread Mark Miller
No one really responded to this Shai? And I take it that the user list 
never saw it?


Perhaps we should just ask for opinion from the user list based on what 
you already have - just to gauge the reaction on different points. 
Unless someone responds shortly, we could take a year waiting to shake 
it out.

The threat of sending should prompt anyone with any issues to speak up.

I think we should add though:
explicitly what has changed (eg if we switch something, what was the 
policy before - most users won't even know)

an overview of why we are interested in relaxing back compat

- Mark

Shai Erera wrote:
Ok, so digging back in this thread, I think the following proposals 
were made (if I missed some, please add them):


1. API deprecations last *at least* one full minor release. Example: if 
we deprecate an API in 2.4, we can remove it in 2.5. BUT, we are also 
free to keep it there and remove it in 2.6, 2.9, 3.0, 3.5. I would 
like to reserve that option for controversial deprecations, like 
TokenStream, and maybe even the HitCollector recent changes. Those 
that we feel will have a large impact on the users, we might want to 
keep around for a bit longer until we get enough feedback from the 
field and are more confident with that change.


2. Bugs are fixed backwards on the last "dot" release only. Example: a 
bug that's discovered after 2.4 is released, is fixed on 2.4.X branch. 
Once 2.5 is released, any bug fixes happen on trunk and 2.5.X. A 
slight relaxation would be adding something like "we may still fix 
bugs on the 2.4.X branch if we feel it's important enough". For 
example if 2.5 contains a lot of API changes and we think a 
considerable portion of our users are still on 2.4.


3. Jar drop-in ability is only guaranteed on point releases (this is 
slightly of an outcome of (1) and (2), but (6) will also affect it).


4. Changes to the index format last at least one full major release. 
Example: a change to the index format in 2.X, is supported in all 3.Y 
releases, and removed in 4.0. Again, I write "at least" since we 
should have the freedom to extend support for a particular change.


5. Changes to the default settings are allowed between minor releases, 
provided that we give the users a way to revert back to the old 
behavior. Examples are LUCENE-1542 and the latest issues Mike opened. 
Those changes will be applied out-of-the-box. The provided API to 
revert to the old behavior may be a supported API, or a deprecated 
API. For deprecation we can decide to keep the API longer than one 
minor release.


5.1) An exception to (5) are bug fixes which break back-compat - those 
are always visible, w/ a way to revert to the buggy behavior. That way 
may be deprecated or not, and its support lifetime can be made on a 
case-by-case basis.


6. Minor changes to APIs can happen w/o any deprecation. Example: 
LUCENE-1614, adding 1-2 methods to an interface with good 
documentation and a trivial proposal for implementation etc.


You will notice that almost every proposal has a "we may decide to 
keep it for longer" - I wrote it following one of the early responses 
on this thread (I think it was Grant's) - we should not attempt to set 
things in stone. Our back-compat policy should ensure some level of 
SLA to our users, but otherwise we should not act as robots, and if we 
think a certain case requires a different handling than the policy 
states (only for the user's benefit though), it should be done that 
way. The burden is still put on the committers, only now the policy is 
relaxed a bit, and handles different cases in different ways, and the 
committers/contributors don't need to feel that their hands are tied.


These set the ground/basis, but otherwise we should decide on a 
case-by-case basis on any extension/relaxation of the policy, for our 
users' benefits. After quite some time I've been following the 
discussions on this mailing list, I don't remember ever seeing an 
issue being driven against our users' benefit. All issues attempt to 
improve Lucene's performance and our users' experience (end users as 
well as search application developers). I think it's only fair to ask 
this "users" community be more forgiving and open to make changes on 
their side too, making the life of the committers/contributors a bit 
easier.


I also agree that the next step would be taking this to java-user and 
get a sense of whether our "users" community agree with those changes 
or not. I hope that the above summary captures what's needed to be 
sent to this list.


Shai

On Sat, May 30, 2009 at 2:21 PM, Michael McCandless wrote:


Actually, I think this is a common, and in fact natural/expected
occurrence in open-source.  When a tricky topic is discussed, and the
opinions are often divergent, frequently the conversation never
"converges" to a consensus and the discussion dies.  Only if
discussion reaches a semblance of consensus do we vote on it

Re: [jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Mark Miller

Michael McCandless (JIRA) wrote:



bq. Adopting a fixed release cycle with small intervals between releases (compared to what we have now). 


I think this is almost a good solution, though instead of "fixed" it
could be that we try [harder] to do major releases more frequently.
Let's face it: Lucene is changing quite quickly now, so it seems
reasonable that the major releases also come quickly.

  



though instead of "fixed" it
could be that we try [harder] to do major releases more frequently.



I've heard that one before ;) In fact, we pretty much committed to 
releasing more often. Now if 2.9 would just fall into line with our darn 
commitments :)


--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



back compat is good

2009-06-10 Thread Yonik Seeley
I'm starting to feel like the lone holdout that thinks back compat for
commonly used interfaces and index formats is important.  So I'll sum
up some of my thoughts and leave it at that:

- I doubt that the number of new users for each release of Lucene
exceeds the sum total of all existing users of Lucene.  Lucene is
already the dominant open source search library, so we're never going
to hit that type of exponential growth going forward.  Existing users
are very important.
- Good back compat makes the lives of all Lucene users easier
- Good back compat makes the lives of Lucene developers easier in some
ways also.  We don't *need* to go back and patch older releases, since
we can say "use a newer release".  If things change too much, that
will no longer be an easy option for many users, and more people will
get stuck in the past because upgrading is too painful.
- The difficulty of change can also be a good thing - it forces people
to really think if changes are worth it and only add them where it
really makes sense.

The last threads on back compat generated so much volume that I
couldn't keep up, and I expect there are many others that couldn't
either.  I'm not personally interested in discussing it in the
abstract further... I'm more interested in actual code
patches/proposals.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 12:45 PM, Mark Miller wrote:

> I've heard that one before ;) In fact, we pretty much committed to releasing
> more often. Now if 2.9 would just fall into line with our darn commitments
> :)

I hear you!

So... how about we try to wrap up 2.9/3.0 and ship with what we have,
now? It's been 8 months since 2.4.0 was released, and 2.9's got plenty
of new stuff, and we are all itching to remove these deprecated APIs,
switch to Java 1.5, etc.

We should try to finish the issues that are open and underway... but I
think many of the issues marked 2.9 now, especially those not even
started, should not in fact block 2.9.

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Jason Rutherglen
I think instead of ORing postings (trie range, rangequery, etc), have a
custom Query + Scorer that examines the payload (somehow)?  It could encode
the multiple levels of trie bits in it?  (I'm just guessing here).

On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless <
[email protected]> wrote:

> Use them how?  (Sounds interesting...).
>
> Mike
>
> On Tue, Jun 9, 2009 at 10:32 PM, Jason
> Rutherglen wrote:
> > At the SF Lucene User's group, Michael Busch mentioned using
> > payloads with TrieRangeQueries. Is this something that's being
> > worked on? I'm interested in what sort of performance benefits
> > there would be to this method?
> >
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


Re: back compat is good

2009-06-10 Thread Michael McCandless
Well... Lucene still seems to be experiencing strong adoption/growth,
eg combined user+dev email traffic:

  http://lucene.markmail.org/

Net/net, I also think that back-compat is important and we shouldn't
up and abandon it or relax our policy too much.

However, I wish we had better tools for *implementing* our policy.
Really, the programming language should provide facilities... but it
won't (for a looong time), so we discuss our own solutions like
actsAsVersion.

And it pains me when our back compat policy forces us to sacrifice new
users' experience (not being able to change default settings; not being
able to fix bugs in analyzers; etc).  At least we have an OK
workaround for that, and I also think we have softened our stance on
when to make exceptions here.

Mike

On Wed, Jun 10, 2009 at 1:00 PM, Yonik Seeley wrote:
> I'm starting to feel like the lone holdout that thinks back compat for
> commonly used interfaces and index formats is important.  So I'll sum
> up some of my thoughts and leave it at that:
>
> - I doubt that the number of new users for each release of Lucene
> exceeds the sum total of all existing users of Lucene.  Lucene is
> already the dominant open source search library, so we're never going
> to hit that type of exponential growth going forward.  Existing users
> are very important.
> - Good back compat makes the lives of all Lucene users easier
> - Good back compat makes the lives of Lucene developers easier in some
> ways also.  We don't *need* to go back and patch older releases, since
> we can say "use a newer release".  If things change too much, that
> will no longer be an easy option for many users, and more people will
> get stuck in the past because upgrading is too painful.
> - The difficulty of change can also be a good thing - it forces people
> to really think if changes are worth it and only add them where it
> really makes sense.
>
> The last threads on back compat generated so much volume that I
> couldn't keep up, and I expect there are many others that couldn't
> either.  I'm not personally interested in discussing it in the
> abstract further... I'm more interested in actual code
> patches/proposals.
>
> -Yonik
> http://www.lucidimagination.com
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: back compat is good

2009-06-10 Thread Mark Miller

Yonik Seeley wrote:

I'm starting to feel like the lone holdout that thinks back compat for
commonly used interfaces and index formats is important.  

I think the fact that you're not the only one is why things got stymied.

I wouldn't personally support anything that didn't try and maintain 
stability in commonly used interfaces,
and it appeared that consensus easily favored maintaining strong index 
back compat.


The current policy has much stronger hooks than just common interfaces 
and index formats though.


For really important things, we make exceptions anyway, and that will 
probably still be the case.


The win we can probably get, I think, is a policy that makes things 
easier where we pay a lot for a little. It's worth a lot of pain to 
support common interfaces

and index formats. That doesn't cover all of the ground though.

We have already dealt with a lot of this by making special exceptions, 
using abstract classes, and 'experimental APIs'.


Perhaps it makes sense to just bring our back compat policy up to date 
with the reality of what has been happening anyway.


Or maybe nothing needs to be done after all. But I think we need to 
address the out-of-the-box performance in some manner.


--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: back compat is good

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 2:01 PM, Michael McCandless wrote:
> Well... Lucene still seems to be experiencing strong adoption/growth,
> eg combined user+dev email traffic:
> http://lucene.markmail.org/

I think that includes all Lucene sub-projects (Solr, Tika, Mahout,
Nutch, Droids, etc).

http://lucene.markmail.org/search/?q=list%3Aorg.apache.lucene.java-user

> And it pains me when our back compat policy forces us to sacrifice new
> users' experience (not being to change default settings; not being
> able to fix bugs in analyzers; etc).

As far as default settings go, it seems like this can mostly be fixed with
documentation (i.e. recommended settings for maximum performance).
That seems like a very small burden for people writing new
applications with Lucene anyway (compare to the cost of writing the
whole application).  On the other hand, existing users may be
essentially "done" with the Lucene development in their project, and
want to upgrade for bug fixes, performance increases, and maybe to
incrementally add new features.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
Hi, sorry I missed the first mail.

 

The idea we discussed in Amsterdam during ApacheCon was:

 

Instead of indexing all trie precisions from e.g. the leftmost 8 bits down to
the full 64 bits, the TrieTokenStream only creates terms for e.g. precisions 8
to 56. The last precision is left out. Instead, the last term (precision 56)
carries the full-precision value as a payload.

On the query side, TrieRangeQuery would create the filter bitmap as before
until it reaches the lowest available precision, the one with the payloads.
Instead of splitting this precision further into terms, TermPositions
(instead of just TermDocs) are enumerated, and only those docs whose payload
lies inside the range bounds are set in the result BitSet. This way the trie
query first selects the large ranges in the middle as before, but uses the
highest-precision (though not full-precision) terms to select more docids
than needed and then filters them by payload.

 

With String Dates (the simplified example Michael Busch shows in his talk):

Searching all docs from 2005-11-10 to 2008-03-11 with current trierange
variant would select terms 2005-11-10 to 2005-11-30, then the whole
December, the whole years 2006 and 2007 and so on. With payloads, trierange
would select only whole months (November, December, 2006, 2007, Jan, Feb,
Mar). At the ends the payloads are used to filter out the days in Nov 2005
and Mar 2008.

 

With the latest TrieRange impl this would be possible to implement (because
the TrieTokenStreams now used for indexing could create the payloads). Only
the searching side would no longer be as "simple" to implement as it is now.
My biggest problem is how to configure this optimally and keep the API clean.

 

Was it understandable? (It's complicated, I know)
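
In code, the indexing side might look something like this (a sketch against 
the 2.9 Token/Payload API; encodePrefixCoded and encodeLong are assumed 
helpers here, not real TrieUtils methods):

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.index.Payload;

    // emit one term per precision; the lowest indexed precision (56 bits,
    // i.e. shift 8) carries the full-precision value as a payload
    Token makeTrieToken(long value, int shift) {
      Token t = new Token();
      t.setTermBuffer(encodePrefixCoded(value >>> shift));
      if (shift == 8) {
        t.setPayload(new Payload(encodeLong(value)));
      }
      return t;
    }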

 

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
eMail: [email protected]


From: Jason Rutherglen [mailto:[email protected]] 
Sent: Wednesday, June 10, 2009 7:59 PM
To: [email protected]
Subject: Re: Payloads and TrieRangeQuery

 

I think instead of ORing postings (trie range, rangequery, etc), have a
custom Query + Scorer that examines the payload (somehow)?  It could encode
the multiple levels of trie bits in it?  (I'm just guessing here).

On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless
 wrote:

Use them how?  (Sounds interesting...).

Mike


On Tue, Jun 9, 2009 at 10:32 PM, Jason
Rutherglen wrote:
> At the SF Lucene User's group, Michael Busch mentioned using
> payloads with TrieRangeQueries. Is this something that's being
> worked on? I'm interested in what sort of performance benefits
> there would be to this method?
>

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

 



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
Ooh that sounds compelling!

So you would not need to use payloads for the "inside" brackets,
right?  Only for the edges?

I wonder how performance would compare.  Without payloads, there are
many more terms (for the tiny ranges) in the index, and your OR query
will have lots of these tiny terms.  But then these tiny terms don't
hit many docs, and with BooleanScorer (which we should switch to for
OR queries) they ought not be very costly.  Vs w/ payloads having to use
TermPositions, having to load, decode & check the payload, and I guess
assuming on average that 1/2 the docs are filtered out.

Mike

On Wed, Jun 10, 2009 at 2:28 PM, Uwe Schindler wrote:
> Hi, sorry I missed the first mail.
>
>
>
> The idea we discussed in Amsterdam during ApacheCon was:
>
>
>
> Instead of indexing all trie precisions from e.g. the leftmost 8 bits down to
> the full 64 bits, the TrieTokenStream only creates terms for e.g. precisions 8
> to 56. The last precision is left out. Instead, the last term (precision 56)
> carries the full-precision value as a payload.
>
> On the query side, TrieRangeQuery would create the filter bitmap as before
> until it reaches the lowest available precision, the one with the payloads.
> Instead of splitting this precision further into terms, TermPositions
> (instead of just TermDocs) are enumerated, and only those docs whose payload
> lies inside the range bounds are set in the result BitSet. This way the trie
> query first selects the large ranges in the middle as before, but uses the
> highest-precision (though not full-precision) terms to select more docids
> than needed and then filters them by payload.
>
>
>
> With String Dates (the simplified example Michael Busch shows in his talk):
>
> Searching all docs from 2005-11-10 to 2008-03-11 with current trierange
> variant would select terms 2005-11-10 to 2005-11-30, then the whole
> December, the whole years 2006 and 2007 and so on. With payloads, trierange
> would select only whole months (November, December, 2006, 2007, Jan, Feb,
> Mar). At the ends the payloads are used to filter out the days in Nov 2005
> and Mar 2008.
>
>
>
> With the latest TrieRange impl this would be possible to implement (because
> the TrieTokenStreams now used for indexing could create the payloads). Only
> the searching side would no longer be as “simple” to implement as it is now.
> My biggest problem is how to configure this optimally and keep the API clean.
>
>
>
> Was it understandable? (It's complicated, I know)
>
>
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
> 
>
> From: Jason Rutherglen [mailto:[email protected]]
> Sent: Wednesday, June 10, 2009 7:59 PM
> To: [email protected]
> Subject: Re: Payloads and TrieRangeQuery
>
>
>
> I think instead of ORing postings (trie range, rangequery, etc), have a
> custom Query + Scorer that examines the payload (somehow)?  It could encode
> the multiple levels of trie bits in it?  (I'm just guessing here).
>
> On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless
>  wrote:
>
> Use them how?  (Sounds interesting...).
>
> Mike
>
> On Tue, Jun 9, 2009 at 10:32 PM, Jason
> Rutherglen wrote:
>> At the SF Lucene User's group, Michael Busch mentioned using
>> payloads with TrieRangeQueries. Is this something that's being
>> worked on? I'm interested in what sort of performance benefits
>> there would be to this method?
>>
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: back compat is good

2009-06-10 Thread Simon Willnauer
On Wed, Jun 10, 2009 at 7:00 PM, Yonik Seeley wrote:
> I'm starting to feel like the lone holdout that thinks back compat for
> commonly used interfaces and index formats is important.  So I'll sum
> up some of my thoughts and leave it at that:
>
> - I doubt that the number of new users for each release of Lucene
> exceeds the sum total of all existing users of Lucene.  Lucene is
> already the dominant open source search library, so we're never going
> to hit that type of exponential growth going forward.  Existing users
> are very important.
> - Good back compat makes the lives of all Lucene users easier
> - Good back compat makes the lives of Lucene developers easier in some
> ways also.  We don't *need* to go back and patch older releases, since
> we can say "use a newer release".  If things change too much, that
> will no longer be an easy option for many users, and more people will
> get stuck in the past because upgrading is too painful.
> - The difficulty of change can also be a good thing - it forces people
> to really think if changes are worth it and only add them where it
> really makes sense.
I have been around since 1.4, and looking back from today I assume it
is/was worth all the pain. Being able to not look at Lucene for 1
1/2 years and then use it again without thinking too much about what has
changed is a huge advantage!

On the other hand, I really appreciate the decision of the Python
community to move forward and get rid of legacy code, functions,
interfaces etc. in Py3K. Each time you decide to take such a step you
will be in the same situation with back compatibility. I would not
change the policy, but rather go a similar way as the Python community
went with Py3K.
A clean cut can have major advantages, but after breaking compatibility,
sticking to the policy again is a must I guess. The bad thing about
APIs is that you have only one chance to get them right.

I did not follow the thread about back compat at all, so if this has
already been proposed / discussed, just ignore it.


>
> The last threads on back compat generated so much volume that I
> couldn't keep up, and I expect there are many others that couldn't
> either.  I'm not personally interested in discussing it in the
> abstract further... I'm more interested in actual code
> patches/proposals.
>
> -Yonik
> http://www.lucidimagination.com
>




Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
Yep, makes sense.  It could be a little slower, but it would decrease
the number of unique terms indexed by up to a factor of 256 (for 8 bits).

But the payload part... seems like another case of using that because
CSF isn't there yet, right?
(well, perhaps except if you didn't want to store the field...)

-Yonik
http://www.lucidimagination.com




Lucene / Solr Function API

2009-06-10 Thread Simon Willnauer
Hey there,

I'm curious if anybody is working on the issue
https://issues.apache.org/jira/browse/LUCENE-1085
and the blocker https://issues.apache.org/jira/browse/LUCENE-1085 ?
I would love to see both Solr and Lucene using the same API for search
functions.
The issues have been idle for a while, so I would like to take over and
try to make it happen in 3.0.

simon




RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> Ooh that sounds compelling!
> 
> So you would not need to use payloads for the "inside" brackets,
> right?  Only for the edges?

Exactly.

> I wonder how performance would compare.  Without payloads, there are
> many more terms (for the tiny ranges) in the index, and your OR query
> will have lots of these tiny terms.  But then these tiny terms don't
> hit many docs, and with BooleanScorer (which we should switch to for
> OR queries) ought not be very costly. 

That is true. The main idea was also to limit seeking during the query.
When splitting the range, you often need to start new TermEnums and iterate
over a lot of terms. By catching many docs with fewer terms, you only need to
scan forward in the payloads.

> Vs w/ payloads having to use
> TermPositions, having to load, decode & check the payload, and I guess
> assuming on average that 1/2 the docs are filtered out.

Maybe decoding the payload is not needed, I would encode the bounds as
byte[] and compare the arrays. But you would filter about half of the docs
out.

My problem with all this is how to choose the optimal shift value at which
to switch between terms and payloads. And this information about the trie
structure and where payloads are should be stored in FieldInfos.

As we now search on each segment separately, this information can be stored
per segment and also used for each per-segment Filter/Scorer.

The whole thing works out of the box with TrieRangeFilter (it's just
iterating over terms, getting TermDocs/TermPositions, and setting bits,
after checking the payloads where available); for TrieRangeQuery using
BooleanQuery it is more complicated (MTQ cannot simply add the terms from
the FilteredTermEnum to a BooleanQuery).
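
A minimal sketch of that filter-side loop, combined with the byte[]
bounds comparison suggested above (the class and method names, and the
byte[] payload encoding, are assumptions for illustration, not the
actual TrieRange code):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

class PayloadEdgeFilter {
  // Enumerate one lowest-precision "edge" term and set a doc's bit only if
  // its payload (the full-precision value) lies inside the range bounds.
  static void collectEdgeTerm(IndexReader reader, Term edgeTerm,
                              byte[] lower, byte[] upper, BitSet bits)
      throws IOException {
    TermPositions tp = reader.termPositions(edgeTerm);
    try {
      while (tp.next()) {
        tp.nextPosition(); // trie fields index one position per doc
        if (tp.isPayloadAvailable()) {
          byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
          if (compare(payload, lower) >= 0 && compare(payload, upper) <= 0) {
            bits.set(tp.doc());
          }
        }
      }
    } finally {
      tp.close();
    }
  }

  // Unsigned lexicographic byte[] comparison: the bounds can be checked
  // without decoding the payload.
  private static int compare(byte[] a, byte[] b) {
    for (int i = 0; i < a.length && i < b.length; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) return diff;
    }
    return a.length - b.length;
  }
}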

Until now I have had no time to think about it in detail, but with the
possibility of having TrieRange in core and storing trie-specific FieldInfos
per segment, it will become clearer to me how to manage this in the API.

Uwe


Re: back compat is good

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 2:23 PM, Yonik Seeley wrote:

>> Well... Lucene still seems to be experiencing strong adoption/growth,
>> eg combined user+dev email traffic:
>> http://lucene.markmail.org/
>
> I think that includes all Lucene sub-projects (Solr, Tika, Mahout,
> Nutch, Droids, etc).
>
> http://lucene.markmail.org/search/?q=list%3Aorg.apache.lucene.java-user

Woops you're right.  java-user alone looks to have flattened out
recently... though usage of eg Solr is also usage of Lucene:

  
http://lucene.markmail.org/search/?q=list%3Aorg.apache.lucene.java-user+list%3Aorg.apache.lucene.solr-user

What I'd really love to see is "how many cumulative searches have
been done by Lucene, everywhere" as a function of time...

>> And it pains me when our back compat policy forces us to sacrifice new
>> users' experience (not being able to change default settings; not being
>> able to fix bugs in analyzers; etc).
>
> As far as default settings, it seems like it can be mostly fixed with
> documentation (i.e. recommended settings for maximum performance).
> That seems like a very small burden for people writing new
> applications with Lucene anyway (compare to the cost of writing the
> whole application).  On the other hand, existing users may be
> essentially "done" with the Lucene development in their project, and
> want to upgrade for bug fixes, performance increases, and maybe to
> incrementally add new features.

I think we need to do both.  We should doc things like "use a big RAM
buffer", "turn off CFS", "use an SSD", "use threads", etc.

But for things like "open a readOnly reader", "turn on the acronym fix
in StandardAnalyzer", "use BooleanScorer not BooleanScorer2", "don't
discard positions in StopFilter", "use NIOFSDirectory not
FSDirectory", "turn off scoring when sorting by field", we should fix
Lucene to do those by default.

I'd like for Lucene to make a good first impression.

Mike




Re: Lucene / Solr Function API

2009-06-10 Thread Michael McCandless
Well, it's unassigned and has no comments so my guess is: it's all yours!

This would be a great step forward.  The line between Solr & Lucene
ought to be more "crisp" and this issue is a step towards that...

Mike





Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler  wrote:
> My problem with all this is how to choose the optimal shift value at which
> to switch between terms and payloads.

Just make it a configurable number of bits at the end that are
"stored" instead of indexed.  People will want to select different
tradeoffs anyway.

What about using the position (as opposed to a payload) to encode the
last bits?  Should be faster, no?
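
A tiny sketch of that position trick, as an assumption rather than an
existing API: emit the truncated trie term with a position increment
that encodes the low byte dropped from the term, so the query side can
read nextPosition() instead of loading and decoding a payload.

import org.apache.lucene.analysis.Token;

class PositionEncodedTrieToken {
  static Token make(String truncatedTerm, long value) {
    Token t = new Token(truncatedTerm, 0, truncatedTerm.length());
    int lowBits = (int) (value & 0xFF);  // the 8 bits left out of the term
    t.setPositionIncrement(lowBits + 1); // first-token position == lowBits
    return t;
  }
}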

> And this information about the trie
> structure and where payloads are should be stored in FieldInfos.

As is the case today, the info is encoded in the class you use (and
its settings)... no need to add it to the index structure.  In any
case, it's a completely different issue and shouldn't be tied to
TrieRange improvements.

-Yonik
http://www.lucidimagination.com




Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler wrote:

>> I wonder how performance would compare.  Without payloads, there are
>> many more terms (for the tiny ranges) in the index, and your OR query
>> will have lots of these tiny terms.  But then these tiny terms don't
>> hit many docs, and with BooleanScorer (which we should switch to for
>> OR queries) ought not be very costly.
>
> That is true. The main idea was also to limit seeking during the query.
> When splitting the range, you often need to start new TermEnums and iterate
> over a lot of terms. By catching many docs with fewer terms, you only need to
> scan forward in the payloads.

OK, though we should separately test "cold" searches (seeking matters)
and "hot" searches (seeking doesn't).  And we should separately test
SSD vs spinning drive for the cold case.  Seeking is much less costly
(though still more costly than "hot" searches) with SSDs...

>> Vs w/ payloads having to use
>> TermPositions, having to load, decode & check the payload, and I guess
>> assuming on average that 1/2 the docs are filtered out.
>
> Maybe decoding the payload is not needed, I would encode the bounds as
> byte[] and compare the arrays. But you would filter about half of the docs
> out.

Yonik's idea (encoding in the position) seems great here.

> My problem with all this is how to choose the optimal shift value at which
> to switch between terms and payloads.

Presumably you'd "roughly" balance seek time vs "wasted doc filtered
out" time, to set the default, and make it configurable.

> And this information about the trie
> structure and where payloads are should be stored in FieldInfos.
>
> As we now search on each segment separately, this information can be stored
> per segment and also used for each per-segment Filter/Scorer.

Right, I think it should, but I agree w/ Yonik (partially) that it's orthogonal.

> The whole thing works out of the box with TrieRangeFilter (it's just
> iterating over terms, getting TermDocs/TermPositions, and setting bits,
> after checking the payloads where available); for TrieRangeQuery using
> BooleanQuery it is more complicated (MTQ cannot simply add the terms from
> the FilteredTermEnum to a BooleanQuery).

Seems like we should generalize MTQ so that the subclass could return
which clause should be added to the BQ for each term?  (We also still
need to improve MTQ to decouple constant-scoring from "use BQ or
filter"... there's an issue open for that).

> Until now I have had no time to think about it in detail, but with the
> possibility of having TrieRange in core and storing trie-specific FieldInfos
> per segment, it will become clearer to me how to manage this in the API.

I'd really like to see TrieRange in core for 2.9...

Mike




Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley wrote:

>> And this information about the trie
>> structure and where payloads are should be stored in FieldInfos.
>
> As is the case today, the info is encoded in the class you use (and
> its settings)... no need to add it to the index structure.  In any
> case, it's a completely different issue and shouldn't be tied to
> TrieRange improvements.

The problem is, because the details of Trie* at index time affect
what's in each segment, this information needs to be stored per
segment.

Mike




Re: Payloads and TrieRangeQuery

2009-06-10 Thread Earwin Burrfoot
>>> And this information about the trie
>>> structure and where payloads are should be stored in FieldInfos.
>>
>> As is the case today, the info is encoded in the class you use (and
>> its settings)... no need to add it to the index structure.  In any
>> case, it's a completely different issue and shouldn't be tied to
>> TrieRange improvements.
>
> The problem is, because the details of Trie* at index time affect
> what's in each segment, this information needs to be stored per
> segment.

And then, when you merge segments indexed with different Trie*
settings, you need to convert them to some common form.
Sounds like something too complex, with minimal returns.

-- 
Kirill Zakharenko/Кирилл Захаренко ([email protected])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718175#action_12718175
 ] 

Michael McCandless commented on LUCENE-1609:


Alas... the big problem with doing this up-front is the 
IndexReader.setTermIndexInterval, which relies on the fact that the index is 
loaded lazily.

So, maybe we need to wait until 3.0 to do this up-front.

But perhaps for this issue we should make it possible to pass in the 
termIndexInterval to IndexReader.open, and deprecate the current methods, and 
then in 3.0 we could switch to up-front loading.

> Eliminate synchronization contention on initial index reading in 
> TermInfosReader ensureIndexIsRead 
> ---
>
> Key: LUCENE-1609
> URL: https://issues.apache.org/jira/browse/LUCENE-1609
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.9
> Environment: Solr 
> Tomcat 5.5
> Ubuntu 2.6.20-17-generic
> Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM
>Reporter: Dan Rosher
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1609.patch, LUCENE-1609.patch
>
>
> The synchronized method ensureIndexIsRead in TermInfosReader causes contention 
> under heavy load.
> Simple to reproduce: e.g. under Solr, with all caches turned off, do a simple 
> range search, e.g. id:[0 TO 99], on even a small index (in my case 28K 
> docs) under a load/stress test application; later, examining the 
> thread dump (kill -3), many threads are blocked on 'waiting for monitor 
> entry' for this method.
> Rather than using Double-Checked Locking, which is known to have issues, this 
> implementation uses a state pattern, where only one thread can move the 
> object from the IndexNotRead state to IndexRead, and in doing so alters the 
> object's behavior, i.e. once the index is loaded, the index no longer needs a 
> synchronized method. 
> In my particular test, this increased throughput at least 30 times.
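
A minimal sketch of that state idea (hypothetical names, not the
attached patch): lookups go through a volatile reference, and the first
thread to load the index swaps in a state whose lookups are lock-free.

class LazyTermIndex {
  private interface State {
    long indexPointer(String term);
  }

  private boolean loaded;

  // IndexNotRead: every lookup funnels through the synchronized load, once.
  private volatile State state = new State() {
    public long indexPointer(String term) {
      loadIndex();
      return state.indexPointer(term); // now the IndexRead state
    }
  };

  private synchronized void loadIndex() {
    if (loaded) return; // another thread already moved us to IndexRead
    final String[] terms = readIndexTerms(); // assumption: real .tii loading here
    final long[] pointers = readIndexPointers();
    // IndexRead: lookups need no synchronization from now on
    state = new State() {
      public long indexPointer(String term) {
        int pos = java.util.Arrays.binarySearch(terms, term);
        return pos < 0 ? -1 : pointers[pos];
      }
    };
    loaded = true;
  }

  public long lookup(String term) {
    return state.indexPointer(term);
  }

  // stand-ins so the sketch is self-contained
  private String[] readIndexTerms() { return new String[0]; }
  private long[] readIndexPointers() { return new long[0]; }
}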






Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless
 wrote:
> On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley 
> wrote:
>
>>> And this information about the trie
>>> structure and where payloads are should be stored in FieldInfos.
>>
>> As is the case today, the info is encoded in the class you use (and
>> its settings)... no need to add it to the index structure.  In any
>> case, it's a completely different issue and shouldn't be tied to
>> TrieRange improvements.
>
> The problem is, because the details of Trie* at index time affect
> what's in each segment, this information needs to be stored per
> segment.

That's the case with the analysis for every field.  If you change your
analyzer in a non-compatible fashion, you need to re-index.

-Yonik
http://www.lucidimagination.com




[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718176#action_12718176
 ] 

Michael McCandless commented on LUCENE-1448:


Michael, are you going to get to this soonish?  Else let's push it until after 3.0?

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are then all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know, so you figure it 
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.
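
A hedged sketch of what the proposed hook could look like for a stream
that knows the true end of its input (getFinalOffset() is the proposed
method, not a current API; the class here is made up for illustration):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

class KnownLengthTokenStream extends TokenStream {
  private final String text; // e.g. "foo bar   " - trailing spaces matter
  private int pos;

  KnownLengthTokenStream(String text) { this.text = text; }

  public Token next() throws IOException {
    while (pos < text.length() && text.charAt(pos) == ' ') pos++;
    if (pos == text.length()) return null;
    int start = pos;
    while (pos < text.length() && text.charAt(pos) != ' ') pos++;
    return new Token(text.substring(start, pos), start, pos);
  }

  // The proposed addition: report the real end of input so IndexWriter can
  // compute the next field instance's offset base correctly.
  public int getFinalOffset() { return text.length(); }
}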






[jira] Updated: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1584:
---

Fix Version/s: (was: 2.9)

Moving out.

> Callback for intercepting merging segments in IndexWriter
> -
>
> Key: LUCENE-1584
> URL: https://issues.apache.org/jira/browse/LUCENE-1584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1584.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> For things like merging field caches or bitsets, it's useful to
> know which segments were merged to create a new segment.
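
A sketch of the kind of hook the issue asks for (hypothetical names,
not the attached patch):

// Implementations could remap per-segment data they cache externally
// (field caches, bitsets) when source segments are merged into a new one.
public interface SegmentMergeCallback {
  void merged(String[] sourceSegmentNames, String newSegmentName);
}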






Re: back compat is good

2009-06-10 Thread Mark Miller



I agree that this needs to be fixed in Lucene. For a bit, I also
thought that documentation was enough.

But on further thought, it's a bit absurd. The computer should handle
that for me. It really should be as easy
as saying, look I want the best new defaults, or I want the back compat
defaults. The computer should figure
out the rest for me. I know it's not as easy as typing it, but that's
still a doable goal I would think. It's got to be possible somehow.
I know it comes down to different inconveniences, but I refuse to
believe that there is not a solution.


- Mark




[jira] Updated: (LUCENE-1577) Benchmark of different in RAM realtime techniques

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1577:
---

Fix Version/s: (was: 2.9)

Moving out.

> Benchmark of different in RAM realtime techniques
> -
>
> Key: LUCENE-1577
> URL: https://issues.apache.org/jira/browse/LUCENE-1577
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1577.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> A place to post code that benchmarks the differences in the speed of indexing 
> and searching using different realtime techniques.






Re: Lucene's default settings & back compatibility

2009-06-10 Thread Shai Erera
Well... to be honest I haven't monitored java-user for quite some time, so I
don't know whether it has been raised there.

But now there's the other thread that Yonik started, so I'm not really sure
where to answer.

I think that if we look back at 2.0 and compare to 2.9, anyone upgrading
from that version to 2.9 is going to need to learn a lot about Lucene. It's
not just deprecation, but best practices, different approaches for different
situations etc. For example, ConstantScoreQuery is not a *default* thing - I
need to know it exists and what benefits it gives me, in order to use
it. So no back-compat / deprecation stuff would teach me how to use it. Nor
will I miraculously understand that I'd better not score when sorting. Yes,
the API has changed, but not in a way that by itself teaches me all this.
Maybe we've documented it well, dunno ...

If people upgrade from 2.0 to 2.9, then their lives would be a lot easier if
2.9 provided the latest and greatest right out of the box. So yes, they'd
need to fix all the deprecations, but that's easy because we document the
alternative. Add that to the "best defaults" and we've got a good code
migration story.

Again, as long as we release every ~6 months (and I don't think we should
release sooner), I don't think it's such a problem to ask someone to
make minor modifications/maintenance to his code every year (!). Especially
since we believe a major release will come every ~2 years, by which time I
need to re-build my indices, which is by far a more costly operation
(sometimes out of your hands) than updating code.

So relaxing the back-compat a bit overall does not seem like a great "crime
against the Lucene users" to me - all is done (>98% of the time?) for the
better.

But maybe these days will pass soon. If we continue to get rid of interfaces
and adopt abstract classes, perhaps we won't have to work so hard to improve
things. In LUCENE-1614 it was quite easy to improve DISI since it is an
abstract class.

Shai

On Wed, Jun 10, 2009 at 7:32 PM, Mark Miller  wrote:

> No one really responded to this Shai? And I take it that the user list
> never saw it?
>
> Perhaps we should just ask for opinion from the user list based on what you
> already have - just to gauge the reaction on different points. Unless
> someone responds shortly, we could take a year waiting to shake it out.
> The threat of sending should prompt anyone with any issues to speak up.
>
> I think we should add though:
> explicitly what has changed (eg if we switch something, what was the policy
> before - most users won't even know)
> an overview of why we are interested in relaxing back compat
>
> - Mark
>
> Shai Erera wrote:
>
>> Ok, so digging back in this thread, I think the following proposals were
>> made (if I missed some, please add them):
>>
>> 1. API deprecation lasts *at least* one full minor release. Example: if we
>> deprecate an API in 2.4, we can remove it in 2.5. BUT, we are also free to
>> keep it there and remove it in 2.6, 2.9, 3.0, 3.5. I would like to reserve
>> that option for controversial deprecations, like TokenStream, and maybe even
>> the HitCollector recent changes. Those that we feel will have a large impact
>> on the users, we might want to keep around for a bit longer until we get
>> enough feedback from the field and are more confident with that change.
>>
>> 2. Bugs are fixed backwards on the last "dot" release only. Example, A bug
>> that's discovered after 2.4 is released, is fixed on 2.4.X branch. Once 2.5
>> is released, any bug fixes happen on trunk and 2.5.X. A slight relaxation
>> would be adding something like "we may still fix bugs on the 2.4.X branch if
>> we feel it's important enough". For example if 2.5 contains a lot of API
>> changes and we think a considerable portion of our users are still on 2.4.
>>
>> 3. Jar drop-in ability is only guaranteed on point releases (this is
>> partly an outcome of (1) and (2), but (6) will also affect it).
>>
>> 4. Changes to the index format last at least one full major release.
>> Example: a change to the index format in 2.X, is supported in all 3.Y
>> releases, and removed in 4.0. Again, I write "at least" since we should have
>> the freedom to extend support for a particular change.
>>
>> 5. Changes to the default settings are allowed between minor releases,
>> provided that we give the users a way to revert back to the old behavior.
>> Examples are LUCENE-1542 and the latest issues Mike opened. Those changes
>> will be applied out-of-the-box. The provided API to revert to the old
>> behavior may be a supported API, or a deprecated API. For deprecation we can
>> decide to keep the API longer than one minor release.
>>
>> 5.1) An exception to (5) are bug fixes which break back-compat - those are
>> always visible, w/ a way to revert to the buggy behavior. That way may be
>> deprecated or not, and its support lifetime can be made on a case-by-case
>> basis.
>>
>> 6. Minor changes to APIs can happen w/o any deprecation. Example,
>> L

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
Great! If I understand correctly it looks like RAM savings? Will
there be an improvement in lookup speed? (We're using binary
search here?).

Is there precedent in database systems for what was mentioned
about placing the term dict, delDocs, and filters onto disk and
reading them from there (with the IO cache taking care of
keeping the data in RAM)? (Would there be a future advantage to
this approach when SSDs are more prevalent?) It seems like we
could have some generalized pluggable system where one could try
out this or the current heap approach, and benchmark.

Given our continued inability to properly measure Java RAM
usage, this approach may be a good one for Lucene? Where heap
based LRU caches are a shot in the dark when it comes to mem
size, as we never really know how much they're using.

Once we generalize delDocs, filters, and field caches
(LUCENE-831?), then perhaps CSF is a good place to test out this
approach? We could have a generic class that handles the
underlying IO that simply returns values based on a position or
iteration.

On Wed, Jun 10, 2009 at 11:26 AM, Michael McCandless <
[email protected]> wrote:

> Roughly, the current approach for the default terms dict codec in
> LUCENE-1458 is:
>
>  * Create a separate class per-field (the String field in each Term
>    is redundant).  This is a big change over Lucene today
>
>  * That class has String[] indexText and long[] indexPointer, each
>    length = the number of index terms.  No TermInfo instance nor Term
>    instance are used.
>
>  * Modify the tis format to also store its data by field
>
>  * Modify the tis format so that at a seek point (ie an indexed
>    term), absolute values are written for freq/prox pointer, but
>    continue to delta-code in between indexed terms.  EG this is how
>    video codecs work (every so often they write a "key frame" which
>    you can seek to & immediately decode w/ no prior context).
>
>  * tii then just stores text/long (delta coded) for all indexed
>    terms, and is slurped into the arrays on init.
>
> This is a sizable RAM savings over what's done now because you save 2
> objects, 3 pointers, 2 longs, 2 ints (I think), per indexed term.
>
> Mike
>
> On Wed, Jun 10, 2009 at 2:02 PM, Jason
> Rutherglen wrote:
> >> LUCENE-1458 (flexible indexing) has these improvements,
> >
> > Mike, can you explain how it's different?  I looked through the code once
> > but yeah, it's in with a lot of other changes.
> >
> > On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless <
> > [email protected]> wrote:
> >
> >> This (very large number of unique terms) is a problem for Lucene
> currently.
> >>
> >> There are some simple improvements we could make to the terms dict
> >> format to not require so much RAM per term in the terms index...
> >> LUCENE-1458 (flexible indexing) has these improvements, but
> >> unfortunately tied in w/ lots of other changes.  Maybe we should break
> >> out a separate issue for this... this'd be a great contained
> >> improvement, if anyone out there has "the itch" :)
> >>
> >> One simple workaround is to call IndexReader.setTermIndexInterval
> >> immediately after opening the reader; this simply loads fewer terms in
> >> the index, using far less RAM, but at the expense of somewhat slower
> >> searching.
> >>
> >> Also: you should peek at your index, eg using Luke, to understand why
> >> you have so many terms.  It could be legitimate (indexing a massive
> >> catalog with eg part numbers), or, it could be your document filtering
> >> / analyzer are accidentally producing garbage terms.
> >>
> >> Mike
> >>
> >> On Wed, Jun 10, 2009 at 8:23 AM, Benedikt Boss wrote:
> >> > Hej hej,
> >> >
> >> > I have a question regarding Lucene's memory usage
> >> > when launching a query. When I execute my query,
> >> > Lucene eats up over 1 GB of heap memory even
> >> > when my result set is only a single hit. I
> >> > found out that this is due to the "ensureIndexIsRead()"
> >> > method-call in the "TermInfosReader" class, which
> >> > iterates over all Terms found in the index and saves
> >> > them (including all value-strings) in a Term-Array.
> >> > Is it possible to not read all that stuff
> >> > into memory at all?
> >> >
> >> > I'm doing the query like in the following pseudo-code:
> >> >
> 
> >> >
> >> > TopScoreDocCollector collector = new TopScoreDocCollector(10);
> >> >
> >> > QueryParser parser = new QueryParser(field, new WhitespaceAnalyzer());
> >> > Directory fsDir = FSDirectory.getDirectory(indexDir);
> >> > IndexSearcher is = new IndexSearcher(fsDir);
> >> >
> >> > Query query = parser.parse(q);
> >> >
> >> > is.search(query, collector);
> >> > ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >> >
> >> > ... < iterate over hits and print results >
> >> >
> >> >
> >> > Thanks in advance
> >> > Benedikt
> >> >
> >> > --

[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718181#action_12718181
 ] 

Michael McCandless commented on LUCENE-1607:


Yonik is this ready to go in...?

> String.intern() faster alternative
> --
>
> Key: LUCENE-1607
> URL: https://issues.apache.org/jira/browse/LUCENE-1607
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Earwin Burrfoot
> Fix For: 2.9
>
> Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can 
> be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned 
> strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.






[jira] Resolved: (LUCENE-1682) unit tests should use private directories

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1682.


Resolution: Fixed

> unit tests should use private directories
> -
>
> Key: LUCENE-1682
> URL: https://issues.apache.org/jira/browse/LUCENE-1682
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1682.patch
>
>
> This only affects our unit tests...
> I run "ant test" and "ant test-tag" concurrently, but some tests have false 
> failures (eg TestPayloads) because they use a fixed test directory in the 
> filesystem for testing.
> I've added a simple method to _TestUtil to get a temp dir, and switched over 
> those tests that I've hit false failures on.
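
A sketch of what such a helper can look like (hypothetical body; the
actual patch may differ):

import java.io.File;
import java.util.Random;

public class _TestUtil {
  private static final Random RND = new Random();

  // Returns a fresh, test-private directory under java.io.tmpdir so that
  // concurrent "ant test" runs cannot collide on a fixed path.
  public static File getTempDir(String desc) {
    File dir = new File(System.getProperty("java.io.tmpdir"),
                        desc + "." + RND.nextInt(Integer.MAX_VALUE));
    dir.mkdirs();
    return dir;
  }
}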






[jira] Updated: (LUCENE-1671) FSDirectory internally caches and clones FSIndexInput

2009-06-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1671:
---

Fix Version/s: (was: 2.9)

Moving out.

> FSDirectory internally caches and clones FSIndexInput
> -
>
> Key: LUCENE-1671
> URL: https://issues.apache.org/jira/browse/LUCENE-1671
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Trivial
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The patch will fix this small problem where if FSDirectory.openInput is 
> called, a new unnecessary file descriptor is opened (whereas an 
> IndexInput.clone would work).






[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718185#action_12718185
 ] 

Jason Rutherglen commented on LUCENE-1584:
--

Can we put this one in 2.9?  It seems like a fairly straightforward change.  Or 
make it a protected method?

> Callback for intercepting merging segments in IndexWriter
> -
>
> Key: LUCENE-1584
> URL: https://issues.apache.org/jira/browse/LUCENE-1584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1584.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> For things like merging field caches or bitsets, it's useful to
> know which segments were merged to create a new segment.






Re: Lucene's default settings & back compatibility

2009-06-10 Thread Mark Miller
Right - I'd actually hold off now. I figured the threat of sending might 
prompt some action ;)


It still wouldn't hurt to know what the users think, perhaps at a more
digestible, overview level though.


I do think Yonik torpedoed something this liberal :)

That's not a bad thing though. We will find the right answer somewhere 
between the two of you I hope.


We may already be at some halfway point - we have experimental APIs and
exceptions at an ever-growing rate.


As you also mention, as more of the code moves to abstract classes, back 
compat is eased anyway.


- Mark


Re: back compat is good

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 4:11 PM, Mark Miller  wrote:
> The computer should handle that
> for me. It really should be as easy
> as saying, look I want the best new defaults, or I want the back compat
> defaults. The computer should figure
> out the rest for me.

actsAsVersion ;-)
nice and back compatible.
Introduce Settings classes in the future when+where it makes sense.
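
A hedged sketch of the actsAsVersion idea - a hypothetical knob, not an
existing API: one constant selects either the best current defaults or
the defaults an older release shipped with.

enum ActsAsVersion { LUCENE_24, LUCENE_29 }

class SearcherSettings {
  final boolean readOnlyReader;          // e.g. "open a readOnly reader"
  final boolean scoreWhenSortingByField; // e.g. "turn off scoring when sorting"

  SearcherSettings(ActsAsVersion v) {
    // Each release changes defaults only behind a new version constant, so
    // existing apps keep old behavior until they bump the version they pass.
    boolean atLeast29 = v.compareTo(ActsAsVersion.LUCENE_29) >= 0;
    readOnlyReader = atLeast29;
    scoreWhenSortingByField = !atLeast29;
  }
}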

-Yonik
http://www.lucidimagination.com




Re: Lucene memory usage

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 4:13 PM, Jason
Rutherglen wrote:
> Great! If I understand correctly it looks like RAM savings? Will
> there be an improvement in lookup speed? (We're using binary
> search here?).

Yes, sizable RAM reduction for apps that have many unique terms.  And,
init'ing (warming) the reader should be faster.

Lookup speed should be faster (binary search against the terms in a
single field, not all terms).
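
A minimal sketch of that per-field index shape (the class and field
names are assumptions for illustration, not the LUCENE-1458 code):

import java.util.Arrays;

class FieldTermIndex {
  final String[] indexText;   // every termIndexInterval'th term's text
  final long[] indexPointer;  // file pointer of the corresponding seek point

  FieldTermIndex(String[] indexText, long[] indexPointer) {
    this.indexText = indexText;
    this.indexPointer = indexPointer;
  }

  // Returns the seek point at or before the requested term; the caller then
  // scans forward in the terms dict, delta-decoding until it hits the term.
  long seekPointer(String term) {
    int pos = Arrays.binarySearch(indexText, term);
    if (pos < 0) pos = -pos - 2;             // insertion point -> previous entry
    return pos < 0 ? -1 : indexPointer[pos]; // -1: before all indexed terms
  }
}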

> Is there precedent in database systems for what was mentioned
> about placing the term dict, delDocs, and filters onto disk and
> reading them from there (with the IO cache taking care of
> keeping the data in RAM)? (Would there be a future advantage to
> this approach when SSDs are more prevalent?) It seems like we
> could have some generalized pluggable system where one could try
> out this or the current heap approach, and benchmark.

LUCENE-1458 creates exactly such a pluggable system.  Ie it lets you
swap in your own codec for terms, freq, prox, etc.

But: I'm leery of having the terms dict live entirely on disk, though we
should certainly explore it.

> Given our continued inability to properly measure Java RAM
> usage, this approach may be a good one for Lucene? Where heap
> based LRU caches are a shot in the dark when it comes to mem
> size, as we never really know how much they're using.

Well remember mmap uses an LRU policy to decide when pages are swapped
to disk... so a search that's unlucky can easily hit many page faults
just in consulting the terms dict.  You could be at 200 msec cost
before you even hit a postings list... I prefer to have the terms
index RAM resident (of course the OS can still swap THAT out too...).

> Once we generalize delDocs, filters, and field caches
> (LUCENE-831?), then perhaps CSF is a good place to test out this
> approach? We could have a generic class that handles the
> underlying IO that simply returns values based on a position or
> iteration.

I agree, a CSF codec that uses mmap seems like a good place to
start...
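
A hedged sketch of such an mmap-backed CSF reader: fixed-width values
addressed by docID, with the OS page cache deciding what stays in RAM.
The one-long-per-doc file format here is an assumption for illustration.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MMapLongColumn {
  private final MappedByteBuffer buf;

  MMapLongColumn(File file) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    buf = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
    raf.close(); // the mapping stays valid after close
  }

  // Random access by docID; the OS faults pages in and evicts them as needed.
  long get(int docID) {
    return buf.getLong(docID << 3); // 8 bytes per doc
  }
}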

Mike




[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718188#action_12718188
 ] 

Yonik Seeley commented on LUCENE-1607:
--

I think so... but I was waiting for some kind of feedback on whether people in 
general thought it was the right approach.  It introduces another static, and 
people tend to not like that.  I accidentally didn't upload the latest version 
with the javadoc + helper method.  I'll do that now.
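
A hedged sketch of the general idea (not the attached patch): a small,
lossy cache in front of String.intern(). An unsynchronized write can
drop an entry under contention, but every returned string has been
through intern(), so identity comparisons stay valid.

public final class InternCache {
  private static final String[] cache = new String[1 << 16];

  public static String intern(String s) {
    int slot = s.hashCode() & (cache.length - 1);
    String cached = cache[slot];
    if (s.equals(cached)) return cached; // fast path: already pooled
    String interned = s.intern();        // slow path: the JVM pool
    cache[slot] = interned;              // benign race: worst case we intern again
    return interned;
  }
}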

> String.intern() faster alternative
> --
>
> Key: LUCENE-1607
> URL: https://issues.apache.org/jira/browse/LUCENE-1607
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Earwin Burrfoot
> Fix For: 2.9
>
> Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of default, String.intern() can 
> be greatly optimized.
> On my setup (java 6) this alternative runs ~15.8x faster for already interned 
> strings, and ~2.2x faster for 'new String(interned)'
> For java 5 and 4 speedup is lower, but still considerable.






Re: back compat is good

2009-06-10 Thread Grant Ingersoll
I'm not against back compatibility.  In fact, I agree with your  
points, especially the use of the phrase "commonly used interfaces".


My main problem is our approach seems to be very dogmatic and  
detrimental for _less_ commonly used interfaces (more importantly less  
commonly _implemented_ Interfaces) and it creates a whole lot of cruft  
in the code.  Code that is only released every 6-12 months anyway.


Specific examples include:
1. Fieldable
2. FieldCache and ExtendedFieldCache
3. The five gazillion IndexWriter constructors
4. The Analyzer.tokenStream stuff.

The thing is, we have this false sense about back compatibility  
anyway.  We think we are doing it, but time and again it slips through  
because there is _NO WAY_ we can know all of the myriad of uses of  
Lucene.  My take:  be strict about index compatibility, take API  
changes on a case-by-case basis, favoring _preserving_ back  
compatibility unless it is too expensive.  Communicate any changes  
loudly.


So, yes, back compatibility as part of a pragmatic approach that  
recognizes our release timeframes and the ability for modern IDEs to  
help in refactoring is good.  Back compatibility for the sake of back  
compatibility is harmful and will ultimately be the downfall of  
Lucene, IMO, because it won't keep up simply because it will take  
twice as long to develop new ways of doing things and it will scare  
away new contributors who can't possibly fathom all of the back  
compatibility requirements (heck, us committers who have been around  
for a long time can't even do it).


At any rate, I also am promoting the case by case approach.  And I  
will kick it off by opening an issue that gets rid of the stupid  
ExtendedFieldCache abomination and breaks the FieldCache back compat.  
interface construct.



-Grant








[jira] Updated: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-1607:
-

Attachment: LUCENE-1607.patch

latest patch - could use a multi-threaded testcase to ensure no exceptions are 
thrown and that intern() always returns the same instance.

> String.intern() faster alternative
> --
>
> Key: LUCENE-1607
> URL: https://issues.apache.org/jira/browse/LUCENE-1607
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Earwin Burrfoot
> Fix For: 2.9
>
> Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of the default one,
> String.intern() can be greatly optimized.
> On my setup (Java 6) this alternative runs ~15.8x faster for already interned 
> strings, and ~2.2x faster for 'new String(interned)'.
> For Java 5 and 4 the speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718198#action_12718198
 ] 

Earwin Burrfoot commented on LUCENE-1607:
-

bq. but I was waiting for some kind of feedback on whether people in general
thought it was the right approach. It introduces another static, and people
tend not to like that.

I just somehow forgot about this issue.
You're right about the static: it's not clear how and when to initialize it,
plus it introduces some public classes we'll be unable to change/remove later.
I still have a feeling we should expose a single static method - intern() - and
hide the implementation away, possibly tuning it to be advantageous for ...

> String.intern() faster alternative
> --
>
> Key: LUCENE-1607
> URL: https://issues.apache.org/jira/browse/LUCENE-1607
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Earwin Burrfoot
> Fix For: 2.9
>
> Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, 
> LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch
>
>
> By using our own interned string pool on top of the default one,
> String.intern() can be greatly optimized.
> On my setup (Java 6) this alternative runs ~15.8x faster for already interned 
> strings, and ~2.2x faster for 'new String(interned)'.
> For Java 5 and 4 the speedup is lower, but still considerable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote:

> And then, when you merge segments indexed with different Trie*
> settings, you need to convert them to some common form.
> Sounds like something too complex and with minimum returns.

Oh yeah... tricky.  So... there are various situations to handle with
trie:

  * Was the field even indexed w/ Trie, or indexed as "simple text"?
It's useful to know this "automatically" at search time, so eg a
RangeQuery can do the right thing by default.  FieldInfos seems
like the natural place to store this.  It's basically Lucene's
per-segment write-once schema.  Eg we use this to record "did any
token in this field have a Payload?", which is analogous.

  * How did you tune your payload-vs-trie-range setting?  OK, I agree:
this is most similar to "you changed your analyzer in an
incompatible way, so, you have to reindex".  Plus, during merging
we can't [easily] translate this.  So we shouldn't try to keep
track of this.

  * We have a bug (or an important improvement) in how Trie encodes
terms that we need to fix.  This one is not easy to handle, since
such a change could alter the term order, and merging segments
then becomes problematic.  Not sure how to handle that.  Yonik,
has Solr ever had to make a change to NumberUtils?

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless
>  wrote:
> > On Wed, Jun 10, 2009 at 3:19 PM, Yonik
> Seeley wrote:
> >
> >>> And this information about the trie
> >>> structure and where payloads are should be stored in FieldInfos.
> >>
> >> As is the case today, the info is encoded in the class you use (and
> >> its settings)... no need to add it to the index structure.  In any
> >> case, it's a completely different issue and shouldn't be tied to
> >> TrieRange improvements.
> >
> > The problem is, because the details of Trie* at index time affect
> > what's in each segment, this information needs to be stored per
> > segment.
> 
> That's the case with the analysis for every field.  If you change your
> analyzer in a non-compatible fashion, you need to re-index.

I agree with Mike that information like the data type should be stored in the
index, but on the other hand, Yonik is correct, too. If I change my analyzer
(and TrieTokenStream is in fact one, an analyzer that creates tokens out of a
number), I have to reindex.

The problem is that storing different indexing settings (precisionStep,
payload/position bits) per segment makes merging nearly impossible, so I
would not do this (see also Earwin's comment about that).

About releasing 2.9:
I would really like to leave this optimization out of 2.9; we can still add it
afterwards. The number of bits encoded into the TermPosition (this is really a
cool idea - thanks, Yonik, I was missing exactly that, because you do not need
to convert the bits: you can put them directly into the index as an int and use
them on the query side!) is simply 0 for indexes created with 2.9. With later
versions, you could then shift the lower bits into the TermPosition and tell
TrieRange to filter on them.
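
To make the bit split concrete, a hypothetical sketch (these helper names are
invented, not Lucene API): the high bits become the prefix-coded indexed term,
the low positionBits bits go into the TermPosition, and positionBits == 0
degenerates to the 2.9 behavior of encoding everything in the term.

public final class TriePositionSplit {

  private TriePositionSplit() {}

  // High bits: what would be prefix-coded and indexed as the term.
  public static long termPart(long value, int positionBits) {
    return value >>> positionBits;
  }

  // Low bits: what would be stored in the term position and filtered at
  // query time; with positionBits == 0 this is always 0.
  public static int positionPart(long value, int positionBits) {
    return (int) (value & ((1L << positionBits) - 1));
  }
}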

I would like to go forward with moving the classes into the right packages
and optimizing the way queries and analyzers are created (only one class for
each). The idea from LUCENE-1673 to use static factories to create these
classes for the different data types seems more elegant and simpler to
maintain than the current way (having a class for each bit size).

So I think I will start with 1673 and try to present something usable soon
(but without payloads, so the payload/position-bits setting is "0").
Now the open question: which name for the numeric range queries/fields? :-(

Uwe


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 5:07 PM, Uwe Schindler wrote:
> I would really like to leave this optimization out for 2.9. We can still add
> this after 2.9 as an optimization. The number of bits encoded into the
> TermPosition (this is really a cool idea, thanks Yonik, I was missing
> exactly that, because you do not need to convert the bits, you can directly
> put them into the index as int and use them on the query side!) is simply 0
> for indexes created with 2.9. With later versions, you could also shift the
> lower bits into the TermPosition and tell TrieRange to filter them.

I agree, let's aim for after 3.0 for this.  (Note that, in theory, 3.0
should follow quickly after 2.9, having "only" removed deprecated
APIs, changed settings, etc.)  Can you open an issue and mark it as 3.1?

> I would like to go forward with moving the classes into the right packages
> and optimizing the way queries and analyzers are created (only one class for
> each). The idea from LUCENE-1673 to use static factories to create these
> classes for the different data types seems more elegant and simpler to
> maintain than the current way (having a class for each bit size).

+1

> So I think I will start with 1673 and try to present something usable soon
> (but without payloads, so the payload/position-bits setting is "0").
> Now the open question: which name for the numeric range queries/fields? :-(

How about:

  Range* -> TermRange*
  TrieRange* -> NumericRange*
  FieldCacheRangeFilter -> FieldCacheTermRangeFilter
  ConstantScoreRangeQuery stays as is (it's deprecated)

Are there any others that need renaming?

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless
 wrote:
> On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote:
>  * Was the field even indexed w/ Trie, or indexed as "simple text"?

Why the special treatment for Trie?

>    It's useful to know this "automatically" at search time, so eg a
>    RangeQuery can do the right thing by default.  FieldInfos seems
>    like the natural place to store this.  It's basically Lucene's
>    per-segment write-once schema.  Eg we use this to record "did any
>    token in this field have a Payload?", which is analogous.

It doesn't seem analogous to me.  Trie is just another implementation
for numerics, with its own tradeoffs.

>  * We have a bug (or an important improvement) in how Trie encodes
>    terms that we need to fix.  This one is not easy to handle, since
>    such a change could alter the term order, and merging segments
>    then becomes problematic.  Not sure how to handle that.  Yonik,
>    has Solr ever had to make a change to NumberUtils?

Nope.  If we needed to, we would make a new field type so that
existing schemas/indexes would continue to work.

-Yonik

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> I would like to go forward with moving the classes into the right packages
> and optimizing the way queries and analyzers are created (only one class for
> each). The idea from LUCENE-1673 to use static factories to create these
> classes for the different data types seems more elegant and simpler to
> maintain than the current way (having a class for each bit size).
> 
> So I think I will start with 1673 and try to present something usable soon
> (but without payloads, so the payload/position-bits setting is "0").

Another question, not so simple to answer: when embedding these TermPositions
into the whole process, how would this work with MultiTermQuery? The current
algorithm is simple: the TrieRangeTermEnum enumerates the possible terms from
the index, and MTQ creates the BitSet or a BooleanQuery of TermQueries. How do
we do this with positions? Both cases need special handling (the TermEnum must
indicate that the current term is a payload/position one, and matching must
filter using TermPositions). For the filter it's then easy; in the other case,
the TermQueries added to the BooleanQuery must also use the payloads.
Questions & more questions.

I tend to release TrieRange with 2.9 without Positions/Payloads.

Uwe


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Earwin Burrfoot
>  * Was the field even indexed w/ Trie, or indexed as "simple text"?
>    It's useful to know this "automatically" at search time, so eg a
>    RangeQuery can do the right thing by default.  FieldInfos seems
>    like the natural place to store this.  It's basically Lucene's
>    per-segment write-once schema.  Eg we use this to record "did any
>    token in this field have a Payload?", which is analogous.
This should really be in a schema of some kind (like in my project, for
instance).
Why do you do autodetection for tries, but recently removed it for FieldCache?
Things should be consistent: either store all settings in the index (and die
in the process), or don't store them there at all.

>  * We have a bug (or an important improvement) in how Trie encodes
>    terms that we need to fix.  This one is not easy to handle, since
>    such a change could alter the term order, and merging segments
>    then becomes problematic.  Not sure how to handle that.  Yonik,
>    has Solr ever had to make a change to NumberUtils?
There are cases where reindexing is inevitable. What's so horrible about it
anyway? Even if you have a humongous index, you can rebuild it in a matter of
days, and you don't do this often.

-- 
Kirill Zakharenko/Кирилл Захаренко ([email protected])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
I think we'd need richer communication between MTQ and its subclasses,
so that eg your enum would return a Query instead of a Term?

Then you'd either return a TermQuery or a BooleanQuery that's
filtering the TermQuery?
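
Roughly, that richer contract might look like this sketch (all names here are
invented for illustration; today's FilteredTermEnum hands MTQ bare Terms):

import java.io.IOException;

import org.apache.lucene.search.Query;

public abstract class QueryProducingTermEnum {

  // Advance to the next matching term; returns false when exhausted.
  public abstract boolean next() throws IOException;

  // The query MTQ would OR into the rewritten BooleanQuery (or use while
  // building a filter) for the current term: a plain TermQuery for interior
  // trie terms, or a wrapper that additionally filters by TermPositions for
  // the boundary terms.
  public abstract Query currentQuery();
}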

But yes, doing after 3.0 seems good!

Mike

On Wed, Jun 10, 2009 at 5:26 PM, Uwe Schindler wrote:
>> I would like to go forward with moving the classes into the right packages
>> and optimizing the way queries and analyzers are created (only one class for
>> each). The idea from LUCENE-1673 to use static factories to create these
>> classes for the different data types seems more elegant and simpler to
>> maintain than the current way (having a class for each bit size).
>>
>> So I think I will start with 1673 and try to present something usable soon
>> (but without payloads, so the payload/position-bits setting is "0").
>
> Another question, not so simple to answer: when embedding these TermPositions
> into the whole process, how would this work with MultiTermQuery? The current
> algorithm is simple: the TrieRangeTermEnum enumerates the possible terms from
> the index, and MTQ creates the BitSet or a BooleanQuery of TermQueries. How do
> we do this with positions? Both cases need special handling (the TermEnum must
> indicate that the current term is a payload/position one, and matching must
> filter using TermPositions). For the filter it's then easy; in the other case,
> the TermQueries added to the BooleanQuery must also use the payloads.
> Questions & more questions.
>
> I tend to release TrieRange with 2.9 without Positions/Payloads.
>
> Uwe
>
>
> -
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718209#action_12718209
 ] 

Michael McCandless commented on LUCENE-1673:


bq. NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and so on.

Could we also do this for a "term range"?  Then, we could have a single 
RangeQuery that rewrites to the right impl based on what kind of range you are 
doing?

(And in fact it could fold in FieldCacheRangeFilter too).
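
For concreteness, the factory style under discussion would read like this at
call sites. The signature shown is the one eventually committed for
LUCENE-1673; at the time of this comment it was still being designed, so
treat this as a sketch of the goal.

import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class FactoryStyleExample {
  public static void main(String[] args) {
    // Matches 10.0f <= price <= 99.9f with precisionStep 4; passing null
    // for a bound makes that end of the range open.
    Query q = NumericRangeQuery.newFloatRange("price", 4, 10.0f, 99.9f, true, true);
    System.out.println(q);
  }
}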

> Move TrieRange to core
> --
>
> Key: LUCENE-1673
> URL: https://issues.apache.org/jira/browse/LUCENE-1673
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 2.9
>
>
> TrieRange was iterated many times and seems stable now (LUCENE-1470, 
> LUCENE-1582, LUCENE-1602). There is lots of user interest - Solr has added it
> to its default FieldTypes (SOLR-940) - and if possible I want to move it to
> core before the release of 2.9.
> Before this can be done, there are some things to think about:
> # There are now classes called LongTrieRangeQuery and IntTrieRangeQuery; what
> should they be called in core? I would suggest leaving them as they are. On
> the other hand, if this stays our only numeric query implementation, we could
> call them LongRangeQuery, IntRangeQuery, or NumericRangeQuery (see below -
> there are problems here). Same for the TokenStreams and Filters.
> # Maybe the pairs of classes for indexing and searching should be merged into
> one class each: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The
> problem here: ctors must be able to take int, long, double, or float as range
> parameters. For the end user, mixing these 4 types in one class is hard to
> handle. If somebody forgets to add an L to a long, it suddenly instantiates an
> int version of the range query, hits no results, and so on (a sketch of this
> pitfall follows below). Same with the other types. Maybe accept
> java.lang.Number as the parameter (because it is nullable for half-open
> bounds) plus one enum for the type.
> # Should TrieUtils move into o.a.l.util? Or o.a.l.document? Or somewhere else?
> # Move the TokenStreams into o.a.l.analysis, and ShiftAttribute into
> o.a.l.analysis.tokenattributes? Or somewhere else?
> # If we rename the classes, should Solr stay with Trie (because there are
> different impls)?
> # Maybe add a subclass of AbstractField that automatically creates these
> TokenStreams and omits norms/tf by default, for easier addition to Document
> instances?
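
To make the pitfall in point 2 concrete, a self-contained sketch with invented
constructors (no such overloaded class exists in Lucene; that is exactly the
design being questioned):

public class OverloadPitfall {

  // A stand-in with the overloaded ctors point 2 warns about.
  static class NumericRange {
    NumericRange(String field, int min, int max)   { System.out.println("int range"); }
    NumericRange(String field, long min, long max) { System.out.println("long range"); }
  }

  public static void main(String[] args) {
    // Both lines compile, but the first silently picks the int overload and
    // would match nothing in a field indexed with long trie terms.
    new NumericRange("size", 1000, 2000);    // prints "int range"
    new NumericRange("size", 1000L, 2000L);  // prints "long range"
  }
}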

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 5:24 PM, Yonik Seeley wrote:
> On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless
>  wrote:
>> On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote:
>>  * Was the field even indexed w/ Trie, or indexed as "simple text"?
>
> Why the special treatment for Trie?

So that at search time things default properly.  Ie, RangeFilter would
rewrite to the right impl (if we made a single RangeFilter that
handled both term & numeric ranges), and sorting could pick the right
parser.

Ie, ideally one simply adds a NumericField to their Document, indexes
it, and then range filtering & sorting "just work".  The separate steps
you must now go through to use trie are confusing, because Lucene
doesn't remember that you indexed with trie.
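
The "just works" usage described here would look like the sketch below.
NumericField did not exist yet when this was written; the sketch uses the API
as it later shipped in 2.9, so treat it as the goal rather than then-current
code.

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class NumericFieldSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);

    // One add; the trie encoding happens inside the field, with no separate
    // TokenStream or precisionStep plumbing at the call site.
    Document doc = new Document();
    doc.add(new NumericField("price").setFloatValue(19.95f));
    writer.addDocument(doc);
    writer.close();
  }
}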

But, I realize this is a stretch... eg we'd have to fix rewrite to be
per-segment, which certainly seems spooky.  A top-level schema would
definitely be cleaner.

>>  * We have a bug (or an important improvement) in how Trie encodes
>>terms that we need to fix.  This one is not easy to handle, since
>>such a change could alter the term order, and merging segments
>>then becomes problematic.  Not sure how to handle that.  Yonik,
>>has Solr ever had to make a change to NumberUtils?
>
> Nope.  If we needed to, we would make a new field type so that
> existing schemas/indexes would continue to work.

OK, it seems like Lucene should take the same approach.

Mike

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
> Another question not so simple to answer: When embedding these TermPositions
> into the whole process, how would this work with MultiTermQuery?

There's no reason why Trie has to use MultiTermQuery, right?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> I think we'd need richer communication between MTQ and its subclasses,
> so that eg your enum would return a Query instead of a Term?
> 
> Then you'd either return a TermQuery or a BooleanQuery that's
> filtering the TermQuery?
> 
> But yes, doing after 3.0 seems good!

There is one other thing that needs to wait for 3.x: if you then want to sort
against such a field, or use the trie values for function queries via a field
cache, we could have a really fast numeric UninverterValueSource, because there
are fewer terms, each with many documents. The value to store in the cache is
only (prefixCodedToLong(term) << positionBits | termPosition) - cool. It would
be really fast!

(But for that we need the new field cache stuff).
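
In code, the cache fill step would amount to something like this, where
prefixCodedToLong stands in for TrieUtils' decoder and all names are
illustrative, not existing API:

public final class TrieUninverterMath {

  private TrieUninverterMath() {}

  // Recombine a term's decoded high bits with a document's term position
  // (the low bits) to recover the original numeric value for the cache.
  public static long cachedValue(long decodedTerm, int positionBits, int termPosition) {
    return (decodedTerm << positionBits) | termPosition;
  }
}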

...now going to sleep with many ideas buzzing around.
Uwe


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> > Another question, not so simple to answer: when embedding these
> > TermPositions into the whole process, how would this work with
> > MultiTermQuery?
> 
> There's no reason why Trie has to use MultiTermQuery, right?

No, but it is elegant and simplifies things a lot (see the current code in trunk).

Uwe


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


