
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-24 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837736#action_12837736
 ] 

Michael McCandless commented on LUCENE-2279:


Should we deprecate (eventually, remove) Analyzer.tokenStream?

Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

Or maybe now is an opportune time to create a separate standalone
analyzers package (subproject under the Lucene tlp)?  We've broached
this idea in the past, and I think it's compelling.  I think
Lucene/Solr/Nutch need to eventually get to this point (where they
share analyzers from a single source), so maybe now is the time.

It'd be a single place where we would pull in all of Lucene's
core/contrib analyzers, plus Solr's analyzers, plus the new analyzers
Robert keeps making ;) Robert's efforts to upgrade Solr's analyzers to
3.0 (currently a big patch waiting on SOLR-1657), plus his various other
pending analyzer bug fixes, could be done in this new analyzers
package.  And we could immediately fix problems we have with the
current analyzers API (like this reusable/tokenStream ambiguity).


 eliminate pathological performance on StopFilter when using a Set<String> 
 instead of CharArraySet
 -

 Key: LUCENE-2279
 URL: https://issues.apache.org/jira/browse/LUCENE-2279
 Project: Lucene - Java
 Issue Type: Improvement
 Reporter: thushara wijeratna
 Priority: Minor

 passing a Set<String> to a StopFilter instead of a CharArraySet results in a 
 very slow filter.
 this is because for each document, Analyzer.tokenStream() is called, which 
 ends up constructing a new StopFilter (if used). And if a regular Set<String> 
 is used in the StopFilter, all the elements of the set are copied to a 
 CharArraySet, as we can see in its ctor:

   public StopFilter(boolean enablePositionIncrements, TokenStream input,
                     Set stopWords, boolean ignoreCase)
   {
     super(input);
     if (stopWords instanceof CharArraySet) {
       // already a CharArraySet: share it, no copy
       this.stopWords = (CharArraySet)stopWords;
     } else {
       // plain Set: every element is copied on each construction
       this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
       this.stopWords.addAll(stopWords);
     }
     this.enablePositionIncrements = enablePositionIncrements;
     init();
   }

 i feel we should make the StopFilter signature specific, as in specifying 
 CharArraySet vs Set, and there should be a JavaDoc warning on using the other 
 variants of the StopFilter as they all result in a copy for each invocation 
 of Analyzer.tokenStream().
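
A minimal, self-contained Java sketch of the cost described in the report above. FastStopSet and SketchStopFilter are hypothetical stand-ins for CharArraySet and StopFilter (they are not Lucene classes). The sketch only mirrors the instanceof check from the quoted constructor and counts how often the stop-word set is copied when one filter is built per document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for CharArraySet; NOT Lucene's class.
final class FastStopSet extends HashSet<String> {
    // Counts how many times a set was copied into this specialised type.
    static int copies = 0;

    FastStopSet(Set<String> source) {
        super(source); // O(n) copy of every element
        copies++;
    }
}

// Hypothetical stand-in for StopFilter, mirroring the instanceof check
// in the constructor quoted in the issue description.
final class SketchStopFilter {
    private final FastStopSet stopWords;

    SketchStopFilter(Set<String> stopWords) {
        if (stopWords instanceof FastStopSet) {
            this.stopWords = (FastStopSet) stopWords;    // shared, no copy
        } else {
            this.stopWords = new FastStopSet(stopWords); // copied per filter
        }
    }

    boolean isStopWord(String term) {
        return stopWords.contains(term);
    }
}

final class StopSetCopyDemo {
    // Builds one filter per "document", the way a non-reusing
    // tokenStream() implementation would, and reports the copy count.
    static int copiesFor(Set<String> stopWords, int documents) {
        FastStopSet.copies = 0;
        for (int i = 0; i < documents; i++) {
            new SketchStopFilter(stopWords);
        }
        return FastStopSet.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        System.out.println("plain Set copies: " + copiesFor(plain, 1000));

        Set<String> converted = new FastStopSet(plain); // convert once, up front
        System.out.println("pre-converted copies: " + copiesFor(converted, 1000));
    }
}
```

Under these assumptions, a plain Set is copied once per document, while a set converted once up front is never copied again. The same effect is available without changing any signature: build the specialised set once and pass that instance to every filter.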

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-24 Thread Simon Willnauer (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837752#action_12837752
 ] 

Simon Willnauer commented on LUCENE-2279:
-

bq. Should we deprecate (eventually, remove) Analyzer.tokenStream? 
I would totally agree with that, but I guess we cannot remove this method 
until Lucene 4.0, which will be, hmm, in 2020 :) - just joking

bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?
That would be the logical consequence, but the problem with ReusableAnalyzerBase 
is that it will break bw compat if moved to Analyzer. It assumes both 
#reusableTokenStream and #tokenStream to be final and introduces a new factory 
method. Yet, as an analyzer developer you really want to use the new 
ReusableAnalyzerBase in favor of Analyzer in 99% of the cases: it requires 
you to write half of the code and gives you reusability of the 
tokenStream.
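
To illustrate the shape described above (both stream methods final, plus one new factory method), here is a rough Java sketch. All class and method names are hypothetical and simplified. This is not the ReusableAnalyzerBase source, and it takes a String where a real analyzer would take a Reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a tokenizer-plus-filters chain; not a Lucene class.
final class SketchComponents {
    private String text = "";

    // Points the existing chain at new input instead of rebuilding it.
    void reset(String newText) {
        this.text = newText;
    }

    // Splits on whitespace and lowercases, standing in for real analysis.
    List<String> tokens() {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}

// Hypothetical base class: subclasses implement only the factory method.
abstract class SketchReusableAnalyzer {
    // One cached chain per thread (this sketch ignores per-field chains).
    private final ThreadLocal<SketchComponents> cached = new ThreadLocal<>();

    // The single extension point an analyzer author has to write.
    protected abstract SketchComponents createComponents(String fieldName);

    // Final: reuse is handled here, so subclasses cannot get it wrong.
    public final SketchComponents reusableTokenStream(String fieldName, String text) {
        SketchComponents components = cached.get();
        if (components == null) {
            components = createComponents(fieldName);
            cached.set(components);
        }
        components.reset(text);
        return components;
    }

    // Final: the non-reusing path builds a fresh chain on every call.
    public final SketchComponents tokenStream(String fieldName, String text) {
        SketchComponents components = createComponents(fieldName);
        components.reset(text);
        return components;
    }
}

// Example subclass: the factory method is all it contains.
final class WhitespaceSketchAnalyzer extends SketchReusableAnalyzer {
    int created = 0; // how many chains were built, for illustration only

    @Override
    protected SketchComponents createComponents(String fieldName) {
        created++;
        return new SketchComponents();
    }
}
```

Because both public methods are final in this sketch, moving the same contract into an existing non-final base class would break subclasses that already override them, which is the compatibility concern raised above.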

bq. I think Lucene/Solr/Nutch need to eventually get to this point
Huge +1 from my side. This could also unify the factory pattern Solr uses to 
build tokenstreams. I would stop right here and ask to discuss it on the dev 
list, thoughts Mike?!




[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-24 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837759#action_12837759
 ] 

Robert Muir commented on LUCENE-2279:
-

bq. Yet, as an analyzer developer you really want to use the new 
ReusableAnalyzerBase in favor of Analyzer in 99% of the cases and it will 
require you writing half of the code plus gives you reusability of the 
tokenStream

and the 1% extremely advanced cases that can't reuse can just use TokenStreams 
directly when indexing, i.e. the Analyzer class could be reusable by 
definition. we shouldn't let these obscure cases slow down everyone else.

bq. It assumes both #reusableTokenStream and #tokenStream to be final

in my opinion all the core analyzers (you already fixed contrib) should be 
final. this is another trap: if you subclass one of these analyzers and 
implement 'tokenStream', it's immediately slow due to the backwards-compatibility code.

bq. I think Lucene/Solr/Nutch need to eventually get to this point

if this is what we should do to remove the code duplication, then i am all for 
it. i still don't quite understand how it gives us more freedom to break/change 
the APIs, i mean however we label this stuff, a break is a break to the user at 
the end of the day.


[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-24 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837792#action_12837792
 ] 

Michael McCandless commented on LUCENE-2279:


bq. I would stop right here and ask to discuss it on the dev list, thoughts 
mike?!

Agreed... I'll start a thread.

{quote}
bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence but the problem with ReusableAnalyzerBase 
is that it will break bw compat if moved to Analyzer.
{quote}

Right, this is why I was thinking if we make a new analyzers package, it's a 
chance to break/improve things.  We'd have a single abstract base class that 
only exposes reuse API.

bq. in my opinion all the core analyzers (you already fixed contrib) should be 
final. 

I agree, and we should consistently take this approach w/ the new analyzers 
package...

bq. i still don't quite understand how it gives us more freedom to break/change 
the APIs, i mean however we label this stuff, a break is a break to the user at 
the end of the day.

Because it'd be an entirely new package, so we can create a new base Analyzer 
class (in that package) that breaks/fixes things when compared to Lucene's 
Analyzer class.

We'd eventually deprecate the analyzers/tokenizers/token filters in 
Lucene/Solr/Nutch in favor of this new package, and users can switch over on 
their own schedule.



[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-24 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837859#action_12837859
 ] 

Michael McCandless commented on LUCENE-2279:


{quote}
bq. I would stop right here and ask to discuss it on the dev list, thoughts 
mike?!

Agreed... I'll start a thread.
{quote}

OK I just started a thread on general@


[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-23 Thread thushara wijeratna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837373#action_12837373
 ] 

thushara wijeratna commented on LUCENE-2279:


isn't the reusableTokenStream created again for a new Document, while there is 
no need to copy the list of stopwords for a new document? or did i miss 
something?


[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-23 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837410#action_12837410
 ] 

Robert Muir commented on LUCENE-2279:
-

reusableTokenStream() is called again for each document. if you don't implement 
it, the default is to defer to tokenStream(), which must create new instances 
of StopFilter, LowerCaseFilter, whatever else you have going on in your 
analyzer.

instead, if you implement reusableTokenStream(), you can keep a reference to 
these things, and just reset() your tokenfilters, and pass the reader to your 
tokenizer's reset(Reader) method.

of course, for this to work, you must implement reset() correctly in any custom 
filters you have: if they keep some state such as accumulated offsets or 
something, then this state should be reset to what it would be if you had 
just created a new instance.

For an example, see StandardAnalyzer
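
A small Java sketch of the reuse-and-reset idea described above, assuming a single hypothetical stateful component (OffsetTrackingSketch) in place of a real tokenizer and filter chain. None of these names are Lucene APIs, the input is a String rather than a Reader, and the per-thread storage a real analyzer would need is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stateful component: reports the start offset of each
// single-space-separated token. Not a Lucene class.
final class OffsetTrackingSketch {
    private String[] words = new String[0];
    private int index = 0;
    private int offset = 0; // accumulated state that must be cleared on reuse

    // Plays the role of reset(Reader) plus reset(): accept new input and
    // put every piece of state back to its freshly constructed value.
    void reset(String text) {
        words = text.isEmpty() ? new String[0] : text.split(" ");
        index = 0;
        offset = 0; // forgetting this line would leak offsets across documents
    }

    // Returns the start offset of the next token, or -1 when exhausted.
    int nextStartOffset() {
        if (index >= words.length) {
            return -1;
        }
        int start = offset;
        offset += words[index].length() + 1; // +1 for the separating space
        index++;
        return start;
    }
}

// Hypothetical analyzer-like holder that keeps one instance between documents.
final class ReuseSketch {
    private OffsetTrackingSketch saved; // kept between documents
    int instancesCreated = 0;

    List<Integer> offsetsFor(String document) {
        if (saved == null) {
            saved = new OffsetTrackingSketch(); // built only for the first document
            instancesCreated++;
        }
        saved.reset(document); // later documents only pay for a reset
        List<Integer> out = new ArrayList<>();
        for (int o = saved.nextStartOffset(); o >= 0; o = saved.nextStartOffset()) {
            out.add(o);
        }
        return out;
    }
}
```

In this sketch the second document's offsets start from zero again only because reset() clears the accumulated offset, which is the kind of state the comment above warns about.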


[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-23 Thread Simon Willnauer (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837465#action_12837465
 ] 

Simon Willnauer commented on LUCENE-2279:
-

I don't consider this an issue at all. Each analyzer creating StopFilter 
instances uses CharArraySet internally, and if you write your own you should do 
so too. The JavaDoc of StopFilter clearly describes what is going on if you use 
a plain Set instead of a CharArraySet.
You should really consider reusableTokenStream AND use a CharArraySet instance. 
You should have a look at the current trunk to see how all the analyzers handle 
stopwords. Once 3.1 is out you will also be able to subclass 
ReusableAnalyzerBase, which enables reusableTokenStream on the fly in 99% of 
the cases.

I tend to close this issue though, Robert?




[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-23 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837467#action_12837467
 ] 

Robert Muir commented on LUCENE-2279:
-

in my opinion the issue states one of my biggest gripes with analysis, this 
whole tokenstream/reusabletokenstream thing.

we go to all this trouble to have a reusable attributes-based api, only for 
this analyzer problem to trip up users.
maybe it's best to give 3.1's ReusableAnalyzerBase a chance, and see if it 
clears up the confusion for users.
but if it doesn't, in my opinion we should do a hard backwards break and make 
tokenstream reusable by default.


[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

2010-02-22 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837016#action_12837016
 ] 

Robert Muir commented on LUCENE-2279:
-

bq. this is because for each document, Analyzer.tokenStream() is called

have you considered implementing reusableTokenStream?

