[jira] Resolved: (LUCENE-970) FilterIndexReader should overwrite isOptimized()

2007-07-30 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-970.
--

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> FilterIndexReader should overwrite isOptimized()
> 
>
> Key: LUCENE-970
> URL: https://issues.apache.org/jira/browse/LUCENE-970
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Trivial
> Fix For: 2.3
>
> Attachments: lucene-970.patch
>
>
> A call of FilterIndexReader.isOptimized() results in an NPE because 
> FilterIndexReader does not overwrite isOptimized().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-30 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516547
 ] 

Doron Cohen commented on LUCENE-965:


> Is there a way to plug in a patch into my local source repository, so I can 
> diff with my favorite diff tool?
: patch -p 0 < foo.patch  

Try with --dry-run first.
Another convenient option, if you are using Eclipse, is the Subclipse plugin, 
which lets you visually diff patches just before applying them.
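For reference, an end-to-end run of that command on a toy file (the file and patch names here are made up for illustration):

```shell
# Stand-ins for a checkout file and a JIRA patch attachment.
printf 'hello\n' > Foo.java
cat > foo.patch <<'EOF'
--- Foo.java
+++ Foo.java
@@ -1 +1 @@
-hello
+hello world
EOF

# First verify the patch applies cleanly without changing anything,
patch -p 0 --dry-run < foo.patch
# then apply it for real, from the directory the paths are relative to.
patch -p 0 < foo.patch
```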

> But may I suggest the alternative? 

I think you have a valid point here. I too don't understand the proposed 
"Axiomatic Retrieval Function" (ARF) in this regard: in Lucene, the norm value 
stored for a document (assuming all boosts are 1) is
norm(D) = 1 / sqrt(numTerms(D))
This value is ready to use at scoring time, multiplying it with  
tf(t in d) * idf(t)^2
as described in 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html

Now, the ARF paper in http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf 
describes Lucene scoring using |D| in place of norm(D) above, and describes ARF 
scoring using |D| again, the same as it seems to be implemented in this patch 
e.g. in TermScorer. However, the paper defines |D| as "the length of D". I find 
this confusing. Usually "|D|" really means the number of words in a document, 
and "avgdl" would mean the average of all |D|'s in the collection (see for 
instance "Okapi BM25" in Wikipedia). 

Now, your proposed change is something I can understand - it first translates 
norm(D) back into Length(D) (ignoring boosts), and only then averages it. 
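Concretely - ignoring boosts and the lossy byte encoding Lucene applies to stored norms - that translation just inverts norm(D) = 1/sqrt(|D|). A sketch (class and method names are mine, not the patch's):

```java
// Decode each stored norm back to a document length, then average the
// lengths themselves - rather than averaging the norm values.
class AvgDocLength {
    // Lucene's length norm with all boosts = 1.
    static float norm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
    // Invert it: |D| = 1 / norm(D)^2.
    static double lengthFromNorm(float norm) {
        return 1.0 / ((double) norm * norm);
    }
    static double avgdl(float[] norms) {
        double sum = 0.0;
        for (int i = 0; i < norms.length; i++) sum += lengthFromNorm(norms[i]);
        return sum / norms.length;
    }
}
```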

In any case - whether this gets fixed, or I am wrong and an explanation shows 
why no fix is needed - I have to admit I still don't understand the logic 
behind ARF: intuitively, why would it be better? I guess demonstrable search 
quality results would help in persuading...  (LUCENE-836 is resolved, btw).

> Implement a state-of-the-art retrieval function in Lucene
> -
>
> Key: LUCENE-965
> URL: https://issues.apache.org/jira/browse/LUCENE-965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.2
>Reporter: Hui Fang
> Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, a state-of-the-art 
> retrieval function, to replace the default similarity function in Lucene. 
> We compared the performance of the two functions and reported the results 
> at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function 
> is much better than that of the default function: it finds more relevant 
> documents, and users see more relevant documents among the top-ranked 
> results. Incorporating such a state-of-the-art retrieval function could 
> improve the search performance of all applications built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, 
> TermScorer and TermQuery.java.  However, many test cases are hand coded to 
> test whether the implementation of the default function is correct, so I 
> also modified many test files to make the new retrieval function pass 
> those cases. In fact, we found that some old test cases are not 
> reasonable. For example, in testQueries02 of TestBoolean2.java, the query 
> is "+w3 xx" and there are two documents, "w1 xx w2 yy w3" and 
> "w1 w3 xx w2 yy w3". The second document should be more relevant than the 
> first, because it has more occurrences of the query term "w3", but the 
> original test case requires us to rank the first document higher than the 
> second, which is not reasonable. 




Re: [jira] Updated: (LUCENE-743) IndexReader.reopen()

2007-07-30 Thread Mark Miller

https://issues.apache.org/jira/browse/LUCENE-743

Michael Busch (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-743:
-

Fix Version/s: 2.3

  

IndexReader.reopen()


Key: LUCENE-743
URL: https://issues.apache.org/jira/browse/LUCENE-743
Project: Lucene - Java
 Issue Type: Improvement
 Components: Index
   Reporter: Otis Gospodnetic
   Assignee: Michael Busch
   Priority: Minor
Fix For: 2.3

Attachments: IndexReaderUtils.java, lucene-743.patch, lucene-743.patch, 
MyMultiReader.java, MySegmentReader.java


This is Robert Engels' implementation of IndexReader.reopen() functionality, as 
a set of 3 new classes (this was easier for him to implement, but should 
probably be folded into the core, if this looks good).



  





[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

2007-07-30 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-969:
---

Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

CharSequence was introduced in 1.4: 
http://java.sun.com/j2se/1.4.2/docs/api/java/lang/CharSequence.html

> Optimize the core tokenizers/analyzers & deprecate Token.termText
> -
>
> Key: LUCENE-969
> URL: https://issues.apache.org/jira/browse/LUCENE-969
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-969.patch
>
>
> There is some "low hanging fruit" for optimizing the core tokenizers
> and analyzers:
>   - Re-use a single Token instance during indexing instead of creating
> a new one for every term.  To do this, I added a new method "Token
> next(Token result)" (Doron's suggestion) which means TokenStream
> may use the "Token result" as the returned Token, but is not
> required to (ie, can still return an entirely different Token if
> that is more convenient).  I added default implementations for
> both next() methods in TokenStream.java so that a TokenStream can
> choose to implement only one of the next() methods.
>   - Use "char[] termBuffer" in Token instead of the "String
> termText".
> Token now maintains a char[] termBuffer for holding the term's
> text.  Tokenizers & filters should retrieve this buffer and
> directly alter it to put the term text in or change the term
> text.
> I only deprecated the termText() method.  I still allow the ctors
> that pass in String termText, as well as setTermText(String), but
> added a NOTE about performance cost of using these methods.  I
> think it's OK to keep these as convenience methods?
> After the next release, when we can remove the deprecated API, we
> should clean up Token.java to no longer maintain "either String or
> char[]" (and the initTermBuffer() private method) and always use
> the char[] termBuffer instead.
>   - Re-use TokenStream instances across Fields & Documents instead of
> creating a new one for each doc.  To do this I added an optional
> "reusableTokenStream(...)" to Analyzer which just defaults to
> calling tokenStream(...), and then I implemented this for the core
> analyzers.
> I'm using the patch from LUCENE-967 for benchmarking just
> tokenization.
> The changes above give 21% speedup (742 seconds -> 585 seconds) for
> LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
> all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
> IO system (best of 2 runs).
> If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
> (1236 sec -> 774 sec), I think because of re-using TokenStreams across
> docs.
> I'm just running with this alg and recording the elapsed time:
>   analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
>   doc.tokenize.log.step=5
>   docs.file=/lucene/wikifull.txt
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>   doc.tokenized=true
>   doc.maker.forever=false
>   {ReadTokens > : *
> See this thread for discussion leading up to this:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/51283
> I also fixed Token.toString() to work correctly when termBuffer is
> used (and added unit test).
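The reuse pattern in the first bullet can be illustrated with simplified stand-ins (these are not the actual Lucene classes, and the real patch provides default implementations for both next() variants so a stream may implement either one):

```java
// Simplified Token: just a growable term buffer.
class Token {
    char[] termBuffer = new char[16];
    int termLength;
    void setTermBuffer(char[] buf, int offset, int len) {
        if (termBuffer.length < len) termBuffer = new char[len];
        System.arraycopy(buf, offset, termBuffer, 0, len);
        termLength = len;
    }
    public String toString() { return new String(termBuffer, 0, termLength); }
}

// Simplified TokenStream: next() allocates, next(Token) may reuse.
abstract class TokenStream {
    public Token next() { return next(new Token()); }
    // The stream may fill in and return "result", or return a different
    // Token entirely if that is more convenient.
    public abstract Token next(Token result);
}

// A toy tokenizer that reuses the caller's Token for every term.
class WhitespaceStream extends TokenStream {
    private final String[] words;
    private int i;
    WhitespaceStream(String text) { words = text.split("\\s+"); }
    public Token next(Token result) {
        if (i >= words.length) return null;
        char[] w = words[i++].toCharArray();
        result.setTermBuffer(w, 0, w.length);  // no per-term allocation
        return result;
    }
}
```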




[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

2007-07-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516517
 ] 

Yonik Seeley commented on LUCENE-969:
-

> [...] implement CharSequence
I think CharSequence is Java5





[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

2007-07-30 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516510
 ] 

Michael Busch commented on LUCENE-969:
--

Hi Mike,

this is just an idea to keep Token.java simpler, but I haven't really thought 
about all the consequences. So feel free to tell me that it's a bad idea ;)

Could you add a new class TermBuffer that encapsulates the char[] array and 
your resize() logic, and that implements CharSequence? Then you could get rid 
of the duplicate constructors and setters for String and char[], because 
String also implements CharSequence. And since CharSequence has the method 
charAt(int index), it should be almost as fast as directly accessing the char 
array when the TermBuffer is used. You would need to change the existing 
constructors and setter to take a CharSequence object instead of a String, 
but this is not an API change, as users can still pass in a String object. 
Then you would just need to add a new constructor with offset and length, and 
a similar setter. 
Thoughts?
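A sketch of what that might look like (the class name comes from the comment above; the growth policy and method names are my guesses, not a proposed patch):

```java
// Hypothetical TermBuffer: owns the char[] and the resize logic, and
// implements CharSequence so a single setter can accept a String, another
// TermBuffer, or any other character source.
class TermBuffer implements CharSequence {
    private char[] buffer;
    private int length;

    TermBuffer(int initialCapacity) { buffer = new char[initialCapacity]; }

    void set(CharSequence text) {
        resize(text.length());
        for (int i = 0; i < text.length(); i++) buffer[i] = text.charAt(i);
        length = text.length();
    }

    void set(char[] text, int offset, int len) {  // the offset/length variant
        resize(len);
        System.arraycopy(text, offset, buffer, 0, len);
        length = len;
    }

    private void resize(int minSize) {
        // No copy of old content needed: callers overwrite the buffer fully.
        if (buffer.length < minSize) {
            buffer = new char[Math.max(minSize, buffer.length * 2)];
        }
    }

    public int length() { return length; }
    public char charAt(int index) { return buffer[index]; }
    public CharSequence subSequence(int start, int end) {
        return new String(buffer, start, end - start);
    }
    public String toString() { return new String(buffer, 0, length); }
}
```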





[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-30 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516507
 ] 

Doug Cutting commented on LUCENE-965:
-

> Is there a way to plug in a patch into my local source repository, so I can 
> diff with my favorite diff tool?

patch -p 0 < foo.patch






[jira] Updated: (LUCENE-970) FilterIndexReader should overwrite isOptimized()

2007-07-30 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-970:
-

Attachment: lucene-970.patch

Trivial patch. I'm planning to commit this shortly.
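Presumably the patch simply forwards the call to the wrapped reader, as FilterIndexReader already does for its other methods. A simplified sketch with stand-in classes (not the actual Lucene code):

```java
// Stand-ins for IndexReader / FilterIndexReader. The filter wraps another
// reader in a field "in" and must forward isOptimized() to it; inheriting
// the base implementation instead touches state the filter reader never
// initializes, which is the reported NPE.
class ReaderStub {
    private final boolean optimized;
    ReaderStub(boolean optimized) { this.optimized = optimized; }
    public boolean isOptimized() { return optimized; }
}

class FilterReaderStub extends ReaderStub {
    protected final ReaderStub in;
    FilterReaderStub(ReaderStub in) { super(false); this.in = in; }
    // The one-line fix: delegate to the wrapped reader.
    public boolean isOptimized() { return in.isOptimized(); }
}
```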





[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516502
 ] 

Paul Elschot commented on LUCENE-584:
-

Some more remarks on the 20070730 patches.

To recap, this introduces Matcher as a superclass of Scorer, to take over the 
role that BitSet currently has in Filter.

The total number of java files changed/added by these patches is 47, so some 
extra care will be needed. The following issues are still pending:

What approach should be taken for the API change to Filter (see above, 2 
comments up)?

I'd like to get all test cases to pass again. TestRemoteCachingWrapperFilter 
still does not pass, and I don't know why.

For xml-query-parser in contrib I'd like to know in which direction to 
proceed (see 1 comment up). Does it make sense to try and get the 
TestQueryTemplateManager test to pass again?

The ..default.. patch has taken OpenBitSet and friends from Solr to provide a 
default implementation. However, I have not checked whether there is unused 
code in there, so some trimming may still be appropriate.

Once these issues have been resolved far enough, I would recommend 
introducing this shortly after a release, so there is some time to let things 
settle.
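For readers new to the thread, the Matcher role can be sketched roughly like this (method names mirror the Scorer iteration API of the time; the details are illustrative assumptions, not code from the patches):

```java
// A Matcher iterates over matching doc ids - like a Scorer without
// scores. In the patches Scorer would extend Matcher; here only the
// iteration side is sketched, with a sorted-array implementation
// standing in for a sparse filter result.
abstract class Matcher {
    public abstract boolean next();             // advance; false when exhausted
    public abstract int doc();                  // current matching doc id
    public abstract boolean skipTo(int target); // advance to first doc >= target
}

class SortedIntsMatcher extends Matcher {
    private final int[] docs;
    private int i = -1;
    SortedIntsMatcher(int[] sortedDocs) { this.docs = sortedDocs; }
    public boolean next() { return ++i < docs.length; }
    public int doc() { return docs[i]; }
    public boolean skipTo(int target) {
        while (++i < docs.length) if (docs[i] >= target) return true;
        return false;
    }
}
```

A Filter could then hand back such a Matcher instead of a BitSet, so a sparse filter never materializes one bit per document in the index.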



> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.
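A sparse implementation of the proposed interface could then be as small as the sketch below (hypothetical; the actual 20070730 patches take OpenBitSet and friends from Solr for the default implementation):

```java
// Hypothetical sparse AbstractBitSet: stores only the set indexes, so
// memory is proportional to the number of visible documents rather than
// to the size of the index.
interface AbstractBitSet {
    boolean get(int index);
}

class SparseBitSet implements AbstractBitSet {
    private final java.util.Set<Integer> on = new java.util.HashSet<Integer>();
    void set(int index) { on.add(Integer.valueOf(index)); }
    public boolean get(int index) { return on.contains(Integer.valueOf(index)); }
}
```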




[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher1-ground-20070730.patch)





[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher3-core-20070730.patch)





[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: Matcher6-contrib-xml-20070730.patch
Matcher5-contrib-queries-20070730.patch
Matcher4-contrib-misc-20070730.patch





[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher2-default-20070730.patch)

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Matcher6-contrib-xml-20070730.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-970) FilterIndexReader should overwrite isOptimized()

2007-07-30 Thread Michael Busch (JIRA)
FilterIndexReader should overwrite isOptimized()


 Key: LUCENE-970
 URL: https://issues.apache.org/jira/browse/LUCENE-970
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Trivial
 Fix For: 2.3


A call of FilterIndexReader.isOptimized() results in a NPE because 
FilterIndexReader does not overwrite isOptimized().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: Matcher3-core-20070730.patch
Matcher2-default-20070730.patch
Matcher1-ground-20070730.patch

Uploading the patches again, this time with the ASF license.

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Matcher6-contrib-xml-20070730.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher6-contrib-xml-20070730.patch)

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher4-contrib-misc-20070730.patch)

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Matcher6-contrib-xml-20070730.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher5-contrib-queries-20070730.patch)

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Matcher6-contrib-xml-20070730.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: Matcher6-contrib-xml-20070730.patch
Matcher5-contrib-queries-20070730.patch
Matcher4-contrib-misc-20070730.patch

Some 20070730 patches to contrib using BitSetFilter.
The contrib-misc and contrib-queries patches are reasonably good:
their tests pass, and replacing Filter by BitSetFilter is right for them.

However, I'm not happy with the contrib-xml patch to the xml-query parser.
I had to cripple some of the code and disable the TestQueryTemplateManager
test.
I don't know how to get around this, basically because I don't know whether
it is a good idea at all to move the xml-query-parser to BitSetFilter.
It might be better to move it to Filter.getMatcher() instead, but I have no
idea how to do this.


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher-ground20070725.patch)

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher-default20070725.patch)

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: (was: Matcher-core20070725.patch)

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher-default20070725.patch, Matcher-ground20070725.patch, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-584) Decouple Filter from BitSet

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-584:


Attachment: Matcher3-core-20070730.patch
Matcher2-default-20070730.patch
Matcher1-ground-20070730.patch

A different take in the patches of 20070730.

In this version class Filter has only one method:
public abstract Matcher getMatcher(IndexReader).

Class BitSetFilter is added as a subclass of Filter, and it has the familiar
public abstract BitSet bits(IndexReader),
as well as a default implementation of the getMatcher() method.

In the ..core.. patch, and in the ..contrib.. patches (to follow), most uses of
Filter are simply replaced by BitSetFilter. This turned out to be an easy way
of dealing with this API change in Filter.

This change to Filter and its replacement by BitSetFilter could well be taking
things too far for now, and I'd like to know whether other approaches
are preferred.

The ..default.. patch contains a default implementation of a Matcher from a 
BitSet, and it has OpenBitSet and friends from solr, as well as SortedVIntList 
as posted earlier.
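
The Filter/BitSetFilter split described in this comment can be sketched as
follows. This is a simplified, self-contained illustration: IndexReader is a
stub, and the Matcher interface here is an assumption shaped after the
description (the Matcher in the actual patch carries more methods).

```java
import java.util.BitSet;

// Stub standing in for org.apache.lucene.index.IndexReader.
class IndexReader {
    int maxDoc() { return 8; }
}

// Sketch of the Matcher iterator over matching doc ids; method names
// are assumptions, not the patch's exact API.
interface Matcher {
    boolean next(); // advance to the next matching doc, false at end
    int doc();      // current doc id
}

// After the change, Filter exposes only getMatcher().
abstract class Filter {
    public abstract Matcher getMatcher(IndexReader reader);
}

// BitSetFilter keeps the familiar bits() method and supplies a default
// getMatcher() implementation on top of it, so existing BitSet-based
// filters keep working with a one-word change to their superclass.
abstract class BitSetFilter extends Filter {
    public abstract BitSet bits(IndexReader reader);

    public Matcher getMatcher(IndexReader reader) {
        final BitSet docs = bits(reader);
        return new Matcher() {
            private int current = -1;
            public boolean next() {
                current = docs.nextSetBit(current + 1);
                return current >= 0;
            }
            public int doc() { return current; }
        };
    }
}

public class BitSetFilterDemo {
    public static void main(String[] args) {
        Filter f = new BitSetFilter() {
            public BitSet bits(IndexReader reader) {
                BitSet b = new BitSet(reader.maxDoc());
                b.set(2);
                b.set(5);
                return b;
            }
        };
        Matcher m = f.getMatcher(new IndexReader());
        while (m.next()) {
            System.out.println(m.doc()); // prints 2, then 5
        }
    }
}
```

This shows why replacing Filter by BitSetFilter in core and contrib was an
easy mechanical change: subclasses still only implement bits().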







> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher-core20070725.patch, Matcher-default20070725.patch, 
> Matcher-ground20070725.patch, Matcher1-ground-20070730.patch, 
> Matcher2-default-20070730.patch, Matcher3-core-20070730.patch, Some 
> Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-968) SpanFilter should not extend Filter

2007-07-30 Thread Grant Ingersoll
Right, I thought briefly about that one, but in the end wasn't sure  
how to handle it.  Having the SpanFilterResult change is no big deal,  
btw, at least not until it is officially released.  I would be fine  
w/ putting a note saying this is experimental and subject to change.


On Jul 30, 2007, at 12:24 PM, Paul Elschot (JIRA) wrote:



[ https://issues.apache.org/jira/browse/LUCENE-968? 
page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
tabpanel#action_12516429 ]


Paul Elschot commented on LUCENE-968:
-

Ok, I missed that possible use as a Filter. I'm busy with  
LUCENE-584, and I could not figure out how to deal with this one.
Since it is a Filter, I'll include it in there as one of the  
currently present Filters.




SpanFilter should not extend Filter
---

Key: LUCENE-968
URL: https://issues.apache.org/jira/browse/LUCENE-968
Project: Lucene - Java
 Issue Type: Bug
 Components: Search
   Affects Versions: 2.3
   Reporter: Paul Elschot
   Priority: Trivial
Fix For: 2.3

Attachments: SpanFilter20070729.patch


All tests pass with the patch applied.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [VOTE] Migrate Lucene to JDK 1.5 for 3.0 release

2007-07-30 Thread Grant Ingersoll


On Jul 30, 2007, at 8:18 AM, DM Smith wrote:

+1 from me, too. Not because I have a vote or that I am for going  
to 1.5, but because it is inevitable and this is a well thought  
out, fine plan. (excepting the aggressive timeline that has been  
hashed out already in this thread)


I'd like to point out that there is a consequence of this plan and  
how Lucene has done things in the past.


At 1.9 it was fully compatible with 1.4.3, with deprecations. 2.0  
mostly had deprecations removed and a few bug fixes. Then the 2.x  
series has been backwardly compatible but not with 1.x (except  
being able to read prior indexes, perhaps a few other things.).


If we continue that same pattern, then there will be no 1.5  
features in 2.9. (Otherwise it won't compile under 1.4). Thus, 3.0  
will have a 1.4.2 compatible interface. And except for new classes,  
new methods and compile equivalent features (such as Enums), 1.5  
features won't appear in the 3.x series API.




Yes, this is a slight variation from the 1.9 -> 2.0 migration.  I  
think the plan is to switch to 1.5 for compilation for 3.0-dev and  
then we will be immediately open for accepting 1.5 patches.  In fact,  
if someone submitted a patch that converted all collections to  
generics, I would be in favor of accepting it with all the usual  
caveats.  I don't see any other way around, as I don't think the  
intent is to say that 3.x contains no 1.5 features other than it  
compiles using JDK 1.5.



I think it is very important to preserve the Lucene API where  
possible and reasonable, not changing it without gain. Given that  
this has been the practice, I don't think it is an issue.




I agree.  I think method names, etc. will stay the same, but we will  
start adding Generics and Enums where appropriate and new code can be  
all 1.5.  For instance, though, the Field declaration parameters are  
a prime place for Enums.  So, the move would be to add in the new  
Enums and deprecate the old Field.Index and Field.Store static ints.   
Thus, they would not go away until 4.x (wow, that is weird to say)


Does that seem reasonable?
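
The Field migration Grant sketches could look roughly like this. The enum
constant names below mirror the 2.x Field.Store/Field.Index constants, but the
class is a stripped-down stand-in, not Lucene's actual Field, and the eventual
3.x API may differ.

```java
// Hypothetical sketch: Field parameters as JDK 1.5 enums instead of the
// pre-1.5 static constants, giving compile-time type checking.
class Field {
    enum Store { YES, NO }
    enum Index { TOKENIZED, UN_TOKENIZED, NO }

    private final String name;
    private final String value;
    private final Store store;
    private final Index index;

    Field(String name, String value, Store store, Index index) {
        this.name = name;
        this.value = value;
        this.store = store;
        this.index = index;
    }

    public String toString() {
        return name + "=" + value + " (store=" + store + ", index=" + index + ")";
    }
}

public class EnumFieldDemo {
    public static void main(String[] args) {
        // An enum parameter cannot be confused with an arbitrary constant,
        // and switch statements over it are checked for exhaustiveness
        // by most IDEs and by the compiler's warnings.
        Field f = new Field("title", "Lucene", Field.Store.YES, Field.Index.TOKENIZED);
        System.out.println(f); // title=Lucene (store=YES, index=TOKENIZED)
    }
}
```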

-Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-968) SpanFilter should not extend Filter

2007-07-30 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot resolved LUCENE-968.
-

   Resolution: Invalid
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> SpanFilter should not extend Filter
> ---
>
> Key: LUCENE-968
> URL: https://issues.apache.org/jira/browse/LUCENE-968
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.3
>Reporter: Paul Elschot
>Priority: Trivial
> Fix For: 2.3
>
> Attachments: SpanFilter20070729.patch
>
>
> All tests pass with the patch applied.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-968) SpanFilter should not extend Filter

2007-07-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516429
 ] 

Paul Elschot commented on LUCENE-968:
-

Ok, I missed that possible use as a Filter. I'm busy with LUCENE-584, and I 
could not figure out how to deal with this one.
Since it is a Filter, I'll include it in there as one of the currently present 
Filters.


> SpanFilter should not extend Filter
> ---
>
> Key: LUCENE-968
> URL: https://issues.apache.org/jira/browse/LUCENE-968
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.3
>Reporter: Paul Elschot
>Priority: Trivial
> Fix For: 2.3
>
> Attachments: SpanFilter20070729.patch
>
>
> All tests pass with the patch applied.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-871) ISOLatin1AccentFilter a bit slow

2007-07-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516403
 ] 

Michael McCandless commented on LUCENE-871:
---

OK, for LUCENE-969 I made yet a 3rd option for optimizing
ISOLatin1AccentFilter.

In that patch I reuse the Token instance, using the char[] API for the
Token's text instead of String, and I also re-use a single TokenStream
instance (I did this for all core tokenizers).

I just tested total time to tokenize all wikipedia content with
current trunk (1116 sec) vs with LUCENE-969 (500 sec), with a
WhitespaceTokenizer -> ISOLatin1AccentFilter chain.

I separately timed just creating the documents at 112 sec, to subtract
it off from the above times (so I can measure only cost of
tokenization).

This gives a net speedup for this filter of 2.97X (1004 sec -> 388 sec).
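
The two ideas being benchmarked here, accent folding (what
ISOLatin1AccentFilter does) and writing into a reusable char[] instead of
allocating a String per token, can be sketched together. This is an
illustration under stated assumptions, not the filter's actual code, and the
character table is deliberately abbreviated.

```java
// Sketch: fold a few ISO Latin-1 accented characters to ASCII, writing
// into a caller-supplied buffer that is reused across tokens.
public class AccentFoldDemo {
    // Returns the number of chars written to `out`. `out` is assumed to
    // be at least `len` chars long (this abbreviated table never expands
    // a character into more than one output char).
    static int fold(char[] in, int len, char[] out) {
        int o = 0;
        for (int i = 0; i < len; i++) {
            char c = in[i];
            switch (c) {
                case 'à': case 'á': case 'â': case 'ä': out[o++] = 'a'; break;
                case 'è': case 'é': case 'ê': case 'ë': out[o++] = 'e'; break;
                case 'ç':                               out[o++] = 'c'; break;
                default:                                out[o++] = c;
            }
        }
        return o;
    }

    public static void main(String[] args) {
        char[] buf = new char[16]; // reused for every token, no per-token String
        char[] token = "café".toCharArray();
        int n = fold(token, token.length, buf);
        System.out.println(new String(buf, 0, n)); // prints "cafe"
    }
}
```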


> ISOLatin1AccentFilter a bit slow
> 
>
> Key: LUCENE-871
> URL: https://issues.apache.org/jira/browse/LUCENE-871
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1, 2.2
>Reporter: Ian Boston
> Attachments: fasterisoremove1.patch, fasterisoremove2.patch, 
> ISOLatin1AccentFilter.java.patch
>
>
> The ISOLatin1AccentFilter is a bit slow giving 300+ ms responses when used in 
> a highlighter for output responses.
> Patch to follow

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

2007-07-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-969:
--

Lucene Fields: [New, Patch Available]  (was: [New])

> Optimize the core tokenizers/analyzers & deprecate Token.termText
> -
>
> Key: LUCENE-969
> URL: https://issues.apache.org/jira/browse/LUCENE-969
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-969.patch
>
>
> There is some "low hanging fruit" for optimizing the core tokenizers
> and analyzers:
>   - Re-use a single Token instance during indexing instead of creating
> a new one for every term.  To do this, I added a new method "Token
> next(Token result)" (Doron's suggestion) which means TokenStream
> may use the "Token result" as the returned Token, but is not
> required to (ie, can still return an entirely different Token if
> that is more convenient).  I added default implementations for
> both next() methods in TokenStream.java so that a TokenStream can
> choose to implement only one of the next() methods.
>   - Use "char[] termBuffer" in Token instead of the "String
> termText".
> Token now maintains a char[] termBuffer for holding the term's
> text.  Tokenizers & filters should retrieve this buffer and
> directly alter it to put the term text in or change the term
> text.
> I only deprecated the termText() method.  I still allow the ctors
> that pass in String termText, as well as setTermText(String), but
> added a NOTE about performance cost of using these methods.  I
> think it's OK to keep these as convenience methods?
> After the next release, when we can remove the deprecated API, we
> should clean up Token.java to no longer maintain "either String or
> char[]" (and the initTermBuffer() private method) and always use
> the char[] termBuffer instead.
>   - Re-use TokenStream instances across Fields & Documents instead of
> creating a new one for each doc.  To do this I added an optional
> "reusableTokenStream(...)" to Analyzer which just defaults to
> calling tokenStream(...), and then I implemented this for the core
> analyzers.
> I'm using the patch from LUCENE-967 for benchmarking just
> tokenization.
> The changes above give 21% speedup (742 seconds -> 585 seconds) for
> LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
> all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
> IO system (best of 2 runs).
> If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
> (1236 sec -> 774 sec), I think because of re-using TokenStreams across
> docs.
> I'm just running with this alg and recording the elapsed time:
>   analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
>   doc.tokenize.log.step=5
>   docs.file=/lucene/wikifull.txt
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>   doc.tokenized=true
>   doc.maker.forever=false
>   {ReadTokens > : *
> See this thread for discussion leading up to this:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/51283
> I also fixed Token.toString() to work correctly when termBuffer is
> used (and added unit test).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Token termBuffer issues

2007-07-30 Thread Michael McCandless

"Michael McCandless" <[EMAIL PROTECTED]> wrote:
> "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > > > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > > > > > > OK, I ran some benchmarks here.
> > > > > > >
> > > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK 
> > > > > > > 5 and
> > > > > > > 17.2% speedup using Sun's JDK 6, on Linux.  This is indexing all
> > > > > > > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > > > > > > PorterStemFilter.  I think it's worth pursuing!
> > > > > >
> > > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not
> > > > > > when creating new tokens from the tokenizer)?
> > > > >
> > > > > I haven't tried this variant yet.  I guess for long filter chains the
> > > > > GC cost of the tokenizer making the initial token should go down as
> > > > > overall part of the time.  Though I think we should still re-use the
> > > > > initial token since it should (?) only help.
> > > >
> > > > If it weren't any slower, that would be great... but I worry about
> > > > filters that need buffering (either on the input side or the output
> > > > side) and how that interacts with filters that try and reuse.
> > >
> > > OK I will tease out this effect & measure performance impact.
> > >
> > > This would mean that the tokenizer must not only produce new Token
> > > instance for each term but also cannot re-use the underlying char[]
> > > buffer in that token, right?
> > 
> > If the tokenizer can actually change the contents of the char[], then
> > yes, it seems like when next() is called rather than next(Token), a
> > new char[] would need to be allocated.
> 
> Right.  So I'm now testing "reuse all" vs "tokenizer makes a full copy
> but filters get to re-use it".

OK, I tested this case where CharTokenizer makes a new Token (and new
char[] array) for every token instead of re-using each.  This way is
6% slower than fully re-using the Token (585 sec -> 618 sec) -- using
same test as described in
https://issues.apache.org/jira/browse/LUCENE-969.

> > >  EG with mods for CharTokenizer I re-use
> > > its "char[] buffer" with every Token, but I'll change that to be a new
> > > buffer for each Token for this test.
> > 
> > It's not just for a test, right?  If next() is called, it can't reuse
> > the char[].  And there is no getting around the fact that some
> > tokenizers will need to call next() because of buffering.
> 
> Correct -- the way I'm doing this now is in TokenStream.java I have a
> default "Token next()" which calls "next(Token result)" but makes a
> complete copy before returning it.  This keeps full backwards
> compatibility even in the case where a consumer wants a private copy
> (calls next()) but the provider only provides the "re-use" API
> (next(Token result)).
> 
> Mike
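A tiny standalone demo of the hazard being guarded against here: if a consumer that buffers tokens holds onto a producer-recycled char[] without copying, every saved token silently mutates into the last term. This is a pure-Java illustration of the aliasing problem, not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        char[] shared = new char[8];          // producer's recycled buffer
        List<char[]> buffered = new ArrayList<>();
        for (String term : new String[] {"aa", "bb"}) {
            term.getChars(0, term.length(), shared, 0);
            // WRONG: buffering the shared array aliases every entry.
            buffered.add(shared);
            // RIGHT: a buffering consumer (or a copying next()) would do:
            // buffered.add(java.util.Arrays.copyOf(shared, term.length()));
        }
        // Both buffered entries now show the producer's last contents.
        System.out.println(new String(buffered.get(0), 0, 2));  // prints "bb"
        System.out.println(new String(buffered.get(1), 0, 2));  // prints "bb"
    }
}
```

This is exactly why the legacy next() must deep-copy before returning when the underlying producer only implements the reuse API.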
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

2007-07-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-969:
--

Attachment: LUCENE-969.patch

First-cut patch.  All tests pass.  I still need to fix some javadocs
but otherwise I think this is close...


> Optimize the core tokenizers/analyzers & deprecate Token.termText
> -
>
> Key: LUCENE-969
> URL: https://issues.apache.org/jira/browse/LUCENE-969
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-969.patch
>
>
> There is some "low hanging fruit" for optimizing the core tokenizers
> and analyzers:
>   - Re-use a single Token instance during indexing instead of creating
> a new one for every term.  To do this, I added a new method "Token
> next(Token result)" (Doron's suggestion) which means TokenStream
> may use the "Token result" as the returned Token, but is not
> required to (ie, can still return an entirely different Token if
> that is more convenient).  I added default implementations for
> both next() methods in TokenStream.java so that a TokenStream can
> choose to implement only one of the next() methods.
>   - Use "char[] termBuffer" in Token instead of the "String
> termText".
> Token now maintains a char[] termBuffer for holding the term's
> text.  Tokenizers & filters should retrieve this buffer and
> directly alter it to put the term text in or change the term
> text.
> I only deprecated the termText() method.  I still allow the ctors
> that pass in String termText, as well as setTermText(String), but
> added a NOTE about performance cost of using these methods.  I
> think it's OK to keep these as convenience methods?
> After the next release, when we can remove the deprecated API, we
> should clean up Token.java to no longer maintain "either String or
> char[]" (and the initTermBuffer() private method) and always use
> the char[] termBuffer instead.
>   - Re-use TokenStream instances across Fields & Documents instead of
> creating a new one for each doc.  To do this I added an optional
> "reusableTokenStream(...)" to Analyzer which just defaults to
> calling tokenStream(...), and then I implemented this for the core
> analyzers.
> I'm using the patch from LUCENE-967 for benchmarking just
> tokenization.
> The changes above give 21% speedup (742 seconds -> 585 seconds) for
> LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
> all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
> IO system (best of 2 runs).
> If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
> (1236 sec -> 774 sec), I think because of re-using TokenStreams across
> docs.
> I'm just running with this alg and recording the elapsed time:
>   analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
>   doc.tokenize.log.step=5
>   docs.file=/lucene/wikifull.txt
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>   doc.tokenized=true
>   doc.maker.forever=false
>   {ReadTokens > : *
> See this thread for discussion leading up to this:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/51283
> I also fixed Token.toString() to work correctly when termBuffer is
> used (and added unit test).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

2007-07-30 Thread Michael McCandless (JIRA)
Optimize the core tokenizers/analyzers & deprecate Token.termText
-

 Key: LUCENE-969
 URL: https://issues.apache.org/jira/browse/LUCENE-969
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.3


There is some "low hanging fruit" for optimizing the core tokenizers
and analyzers:

  - Re-use a single Token instance during indexing instead of creating
a new one for every term.  To do this, I added a new method "Token
next(Token result)" (Doron's suggestion) which means TokenStream
may use the "Token result" as the returned Token, but is not
required to (ie, can still return an entirely different Token if
that is more convenient).  I added default implementations for
both next() methods in TokenStream.java so that a TokenStream can
choose to implement only one of the next() methods.

  - Use "char[] termBuffer" in Token instead of the "String
termText".

Token now maintains a char[] termBuffer for holding the term's
text.  Tokenizers & filters should retrieve this buffer and
directly alter it to put the term text in or change the term
text.

I only deprecated the termText() method.  I still allow the ctors
that pass in String termText, as well as setTermText(String), but
added a NOTE about performance cost of using these methods.  I
think it's OK to keep these as convenience methods?

After the next release, when we can remove the deprecated API, we
should clean up Token.java to no longer maintain "either String or
char[]" (and the initTermBuffer() private method) and always use
the char[] termBuffer instead.

  - Re-use TokenStream instances across Fields & Documents instead of
creating a new one for each doc.  To do this I added an optional
"reusableTokenStream(...)" to Analyzer which just defaults to
calling tokenStream(...), and then I implemented this for the core
analyzers.

I'm using the patch from LUCENE-967 for benchmarking just
tokenization.

The changes above give 21% speedup (742 seconds -> 585 seconds) for
LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
IO system (best of 2 runs).

If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
(1236 sec -> 774 sec), I think because of re-using TokenStreams across
docs.

I'm just running with this alg and recording the elapsed time:

  analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
  doc.tokenize.log.step=5
  docs.file=/lucene/wikifull.txt
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  doc.tokenized=true
  doc.maker.forever=false

  {ReadTokens > : *

See this thread for discussion leading up to this:

  http://www.gossamer-threads.com/lists/lucene/java-dev/51283

I also fixed Token.toString() to work correctly when termBuffer is
used (and added unit test).
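The reuse pattern described above can be sketched as a self-contained toy model. Class and method names follow the issue text (termBuffer, next(Token), the two default next() methods); this is illustrative only, not the actual Lucene 2.3 source.

```java
import java.io.IOException;

// Minimal Token holding its term text in a growable char[] termBuffer.
class Token {
    private char[] termBuffer = new char[16];
    private int termLength;

    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }
    String term() { return new String(termBuffer, 0, termLength); }

    void setTermBuffer(char[] buf, int off, int len) {
        if (termBuffer.length < len) termBuffer = new char[len];
        System.arraycopy(buf, off, termBuffer, 0, len);
        termLength = len;
    }
}

abstract class TokenStream {
    // Reuse API: may fill in and return 'result', or any other Token.
    // A subclass must override at least one next() variant, otherwise
    // these two defaults recurse into each other.
    Token next(Token result) throws IOException { return next(); }

    // Legacy API: returns a private copy the caller may keep, even when
    // the producer only implements the reuse API.
    Token next() throws IOException {
        Token shared = next(new Token());
        if (shared == null) return null;
        Token copy = new Token();
        copy.setTermBuffer(shared.termBuffer(), 0, shared.termLength());
        return copy;
    }
}

// Toy tokenizer implementing only the reuse API.
class WordTokenStream extends TokenStream {
    private final String[] words;
    private int i;

    WordTokenStream(String... words) { this.words = words; }

    @Override
    Token next(Token result) {
        if (i == words.length) return null;
        char[] chars = words[i++].toCharArray();
        result.setTermBuffer(chars, 0, chars.length);
        return result; // same instance every call: no per-token garbage
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        TokenStream ts = new WordTokenStream("low", "hanging", "fruit");
        Token reusable = new Token();
        StringBuilder sb = new StringBuilder();
        for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
            sb.append(t.term()).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints "low hanging fruit"
    }
}
```

The indexing loop passes the same Token instance on every call, which is where the per-term allocation savings come from; a consumer that instead calls the no-argument next() still gets a private copy it may hold.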


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [VOTE] Migrate Lucene to JDK 1.5 for 3.0 release

2007-07-30 Thread DM Smith
+1 from me, too. Not because I have a vote, or because I am for going to  
1.5, but because it is inevitable and this is a well-thought-out,  
fine plan (excepting the aggressive timeline, which has already been  
hashed out in this thread).


I'd like to point out that there is a consequence of this plan, given  
how Lucene has done things in the past.


At 1.9, Lucene was fully compatible with 1.4.3, with deprecations. 2.0  
mostly removed those deprecations and added a few bug fixes. Since then  
the 2.x series has been backward compatible within itself but not with  
1.x (except for being able to read prior indexes, and perhaps a few  
other things).


If we continue that same pattern, then there will be no 1.5 features  
in 2.9 (otherwise it won't compile under 1.4). Thus, 3.0 will have a  
1.4.2-compatible interface. And except for new classes, new methods,  
and compile-equivalent features (such as enums), 1.5 features won't  
appear in the 3.x series API.


I think it is very important to preserve the Lucene API where  
possible and reasonable, not changing it without gain. Given that  
this has been the practice, I don't think it is an issue.


-- DM Smith


On Jul 26, 2007, at 8:36 PM, Grant Ingersoll wrote:

I  propose we take the following path for migrating Lucene Java to  
JDK 1.5:

1.  Put in any new deprecations we want, cleanups, etc.
2. Release 2.4 so all of Mike M's goodness is available to 1.4  
users within the next 2-4 weeks using our new release mechanism  
(i.e., code freeze, branch, documentation.  I tentatively volunteer  
to be the RM, but hope someone will be my wingman on it).

3. Announce that 2.9 will be the last version under JDK 1.4
4. Put in any other deprecations that we want and do as we did when  
moving from 1.4.3 to 1.9 by laying out a migration plan, etc.

5. Release 2.9 as the last official release on JDK 1.4
6. Switch 3.0-dev to be on JDK 1.5, removing any deprecated code  
and updating ANT to use 1.5 for source and target.

7. Start accepting JDK 1.5 patches on 3.0-dev

If possible, efforts should be made to identify people who are  
willing to backport 3.x changes to JDK 1.4 on 2.9 and give them  
branch commit rights, but this is not a strict requirement of this  
plan.


Thus:

+1 for JDK 1.5 as outlined in steps 1-7
0 if you don't care
-1 if you are against it

Since the weekend is coming up, how about we leave this vote open  
until Monday?


You can see discussions of this here: 
http://www.gossamer-threads.com/lists/lucene/java-dev/51421


Here is my +1.

Cheers,
Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[EMAIL PROTECTED]: Project lucene-java (in module lucene-java) failed

2007-07-30 Thread Jason van Zyl
To whom it may engage...

This is an automated request, but not an unsolicited one. For 
more information please visit http://gump.apache.org/nagged.html, 
and/or contact the folk at [EMAIL PROTECTED]

Project lucene-java has an issue affecting its community integration.
This issue affects 3 projects,
 and has been outstanding for 24 runs.
The current state of this project is 'Failed', with reason 'Build Failed'.
For reference only, the following projects are affected by this:
- eyebrowse :  Web-based mail archive browsing
- jakarta-lucene :  Java Based Search Engine
- lucene-java :  Java Based Search Engine


Full details are available at:
http://vmgump.apache.org/gump/public/lucene-java/lucene-java/index.html

That said, some information snippets are provided here.

The following annotations (debug/informational/warning/error messages) were 
provided:
 -DEBUG- Sole output [lucene-core-30072007.jar] identifier set to project name
 -DEBUG- Dependency on javacc exists, no need to add for property javacc.home.
 -INFO- Failed with reason build failed
 -INFO- Failed to extract fallback artifacts from Gump Repository



The following work was performed:
http://vmgump.apache.org/gump/public/lucene-java/lucene-java/gump_work/build_lucene-java_lucene-java.html
Work Name: build_lucene-java_lucene-java (Type: Build)
Work ended in a state of : Failed
Elapsed: 34 secs
Command Line: /usr/lib/jvm/java-1.5.0-sun/bin/java -Djava.awt.headless=true 
-Xbootclasspath/p:/srv/gump/public/workspace/xml-commons/java/external/build/xml-apis.jar:/srv/gump/public/workspace/xml-xerces2/build/xercesImpl.jar
 org.apache.tools.ant.Main -Dgump.merge=/srv/gump/public/gump/work/merge.xml 
-Dbuild.sysclasspath=only -Dversion=30072007 
-Djavacc.home=/srv/gump/packages/javacc-3.1 package 
[Working Directory: /srv/gump/public/workspace/lucene-java]
CLASSPATH: 
/usr/lib/jvm/java-1.5.0-sun/lib/tools.jar:/srv/gump/public/workspace/lucene-java/build/classes/java:/srv/gump/public/workspace/lucene-java/build/classes/demo:/srv/gump/public/workspace/lucene-java/build/classes/test:/srv/gump/public/workspace/lucene-java/contrib/db/bdb/lib/db-4.3.29.jar:/srv/gump/public/workspace/lucene-java/contrib/gdata-server/lib/gdata-client-1.0.jar:/srv/gump/public/workspace/lucene-java/build/contrib/analyzers/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/ant/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/benchmark/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/db/bdb/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/db/bdb-je/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/gdata-server/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/highlighter/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/javascript/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/lucli/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/memory/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/queries/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/regex/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/similarity/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/snowball/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/spellchecker/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/surround/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/swing/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/wordnet/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/xml-query-parser/classes/java:/srv/gump/public/workspace/ant/dist/lib/ant-jmf.jar:/srv/gump/public/workspace/ant/dist/lib/ant-swing.jar:/srv/gump/public/workspace/ant/dist/lib/ant-apache-resolver.jar:/srv/gump/public/workspace
/ant/dist/lib/ant-trax.jar:/srv/gump/public/workspace/ant/dist/lib/ant-junit.jar:/srv/gump/public/workspace/ant/dist/lib/ant-launcher.jar:/srv/gump/public/workspace/ant/dist/lib/ant-nodeps.jar:/srv/gump/public/workspace/ant/dist/lib/ant.jar:/srv/gump/packages/junit3.8.1/junit.jar:/srv/gump/public/workspace/xml-commons/java/build/resolver.jar:/srv/gump/packages/je-1.7.1/lib/je.jar:/srv/gump/public/workspace/apache-commons/digester/dist/commons-digester.jar:/srv/gump/public/workspace/jakarta-regexp/build/jakarta-regexp-30072007.jar:/srv/gump/packages/javacc-3.1/bin/lib/javacc.jar:/srv/gump/public/workspace/jline/target/jline-0.9.92-SNAPSHOT.jar:/srv/gump/packages/jtidy-04aug2000r7-dev/build/Tidy.jar:/srv/gump/public/workspace/junit/dist/junit-30072007.jar:/srv/gump/public/workspace/xml-commons/java/external/build/xml-apis-ext.jar:/srv/gump/public/workspace/apache-commons/logging/target/commons-logging-30072007.jar:/srv/gump/public/workspace/apache-commons/logging/target/commons-logging-api-30072007.jar:/srv/gump/public/workspace/jakarta-servletapi-5/jsr154/dist/lib/servlet-api.jar:/srv/gump/packages/nekoh
