[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837736#action_12837736 ]

Michael McCandless commented on LUCENE-2279:

Should we deprecate (and eventually remove) Analyzer.tokenStream?

Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

Or maybe now is an opportune time to create a separate standalone analyzers package (a subproject under the Lucene TLP)? We've broached this idea in the past, and I think it's compelling.

I think Lucene/Solr/Nutch need to eventually get to this point (where they share analyzers from a single source), so maybe now is the time. It'd be a single place where we would pull in all of Lucene's core/contrib analyzers, plus Solr's analyzers, plus the new analyzers Robert keeps making ;)

Robert's efforts to upgrade Solr's analyzers to 3.0 (currently a big patch waiting on SOLR-1657), plus his various other pending analyzer bug fixes, could be done in this new analyzers package. And we could immediately fix problems we have with the current analyzers API (like this reusable/tokenStream ambiguity).

eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

Key: LUCENE-2279
URL: https://issues.apache.org/jira/browse/LUCENE-2279
Project: Lucene - Java
Issue Type: Improvement
Reporter: thushara wijeratna
Priority: Minor

Passing a Set<String> to a StopFilter instead of a CharArraySet results in a very slow filter. This is because Analyzer.tokenStream() is called for each document, which ends up constructing a new StopFilter (if one is used). And if a regular Set<String> is passed to the StopFilter, all the elements of the set are copied to a CharArraySet, as we can see in its ctor:

    public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase) {
      super(input);
      if (stopWords instanceof CharArraySet) {
        this.stopWords = (CharArraySet)stopWords;
      } else {
        this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
        this.stopWords.addAll(stopWords);
      }
      this.enablePositionIncrements = enablePositionIncrements;
      init();
    }

I feel we should make the StopFilter signature specific, as in specifying CharArraySet rather than Set, and there should be a JavaDoc warning on the other variants of the StopFilter constructor, as they all result in a copy for each invocation of Analyzer.tokenStream().

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
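The per-document copy described in the issue text can be made visible with a small stand-alone model. This is only a sketch: FastStopSet, SketchStopFilter and StopSetCopyDemo are hypothetical stand-ins written for this illustration (they are not Lucene classes), and the only thing taken from the issue is the instanceof check in the quoted constructor. It assumes one filter is constructed per document, as happens when only tokenStream() is implemented.

```java
import java.util.AbstractSet;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/** Hypothetical stand-in for CharArraySet; counts how often a copy is made. */
final class FastStopSet extends AbstractSet<String> {
    static int copies = 0;               // number of full copies performed so far
    private final Set<String> words;

    FastStopSet(Collection<String> source) {
        this.words = new HashSet<>(source); // copies every element: O(n)
        copies++;
    }

    @Override public Iterator<String> iterator() { return words.iterator(); }
    @Override public int size() { return words.size(); }
    @Override public boolean contains(Object o) { return words.contains(o); }
}

/** Hypothetical stand-in for StopFilter, mirroring the instanceof check quoted in the issue. */
final class SketchStopFilter {
    private final FastStopSet stopWords;

    SketchStopFilter(Set<String> stopWords) {
        if (stopWords instanceof FastStopSet) {
            this.stopWords = (FastStopSet) stopWords;     // shared, no copy
        } else {
            this.stopWords = new FastStopSet(stopWords);  // copied on every construction
        }
    }

    boolean isStopWord(String term) { return stopWords.contains(term); }
}

final class StopSetCopyDemo {
    /** Builds one filter per "document" and reports how many set copies that caused. */
    static int copiesFor(Set<String> stopWords, int documents) {
        FastStopSet.copies = 0;
        for (int i = 0; i < documents; i++) {
            new SketchStopFilter(stopWords);
        }
        return FastStopSet.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>(Arrays.asList("a", "an", "the"));
        System.out.println(copiesFor(plain, 1000));     // prints 1000: one copy per document

        Set<String> prebuilt = new FastStopSet(plain);  // convert once, up front
        System.out.println(copiesFor(prebuilt, 1000));  // prints 0: the set is shared
    }
}
```

In this model the fix on the caller's side is simply to do the conversion once and hand the already-converted set to every filter.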
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837752#action_12837752 ]

Simon Willnauer commented on LUCENE-2279:

bq. Should we deprecate (eventually, remove) Analyzer.tokenStream?

I would totally agree with that, but I guess we cannot remove this method until Lucene 4.0, which will be, hmm, in 2020 :) - just joking

bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence, but the problem with ReusableAnalyzerBase is that it will break backwards compatibility if moved into Analyzer. It assumes both #reusableTokenStream and #tokenStream to be final and introduces a new factory method. Yet, as an analyzer developer you really want to use the new ReusableAnalyzerBase in favor of Analyzer in 99% of the cases: it requires you to write half the code and gives you reuse of the TokenStream.

bq. I think Lucene/Solr/Nutch need to eventually get to this point

Huge +1 from my side. This could also unify the factory pattern Solr uses to build TokenStreams. I would stop right here and ask to discuss it on the dev list. Thoughts, Mike?!
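The design Simon describes (both entry points final, plus a single factory method for subclasses) is a template-method arrangement. The sketch below models it with plain strings instead of Readers and token streams; SketchComponents, SketchReusableBase, createComponents and LowercaseWhitespaceAnalyzer are hypothetical names chosen for this illustration and are not claimed to match the actual ReusableAnalyzerBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a tokenizer-plus-filters chain that can be re-pointed at new text. */
interface SketchComponents {
    void reset(String text);
    List<String> tokens();
}

/** Hypothetical base class: subclasses only supply the factory method. */
abstract class SketchReusableBase {
    private SketchComponents cached;
    int componentsCreated = 0;   // exposed only so the reuse is observable

    /** The one method an analyzer author has to write. */
    protected abstract SketchComponents createComponents(String text);

    /** Final, so a subclass cannot accidentally bypass the reuse logic. */
    public final List<String> reusableTokenStream(String text) {
        if (cached == null) {
            cached = createComponents(text);
            componentsCreated++;
        } else {
            cached.reset(text);  // reuse the existing chain for the next document
        }
        return cached.tokens();
    }

    /** Final as well; always builds a fresh chain. */
    public final List<String> tokenStream(String text) {
        componentsCreated++;
        return createComponents(text).tokens();
    }
}

/** Example subclass: whitespace tokenization followed by lowercasing. */
final class LowercaseWhitespaceAnalyzer extends SketchReusableBase {
    @Override
    protected SketchComponents createComponents(String text) {
        return new SketchComponents() {
            private String current = text;

            @Override public void reset(String next) { current = next; }

            @Override public List<String> tokens() {
                List<String> out = new ArrayList<>();
                for (String t : current.trim().split("\\s+")) {
                    if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
                }
                return out;
            }
        };
    }

    public static void main(String[] args) {
        LowercaseWhitespaceAnalyzer analyzer = new LowercaseWhitespaceAnalyzer();
        for (String doc : Arrays.asList("Hello World", "Second Doc")) {
            System.out.println(analyzer.reusableTokenStream(doc));
        }
        System.out.println(analyzer.componentsCreated); // prints 1: the chain was built once
    }
}
```

Because the two public methods are final here, folding this behaviour into an existing non-final base class would break any subclass that already overrides them, which is the compatibility concern raised in the comment.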
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837759#action_12837759 ]

Robert Muir commented on LUCENE-2279:

bq. Yet, as an analyzer developer you really want to use the new ReusableAnalyzerBase in favor of Analyzer in 99% of the cases and it will require you writing half of the code plus gives you reusability of the tokenStream

and the 1% of extremely advanced cases that can't reuse can just use TokenStreams directly when indexing, i.e. the Analyzer class could be reusable by definition. we shouldn't let these obscure cases slow down everyone else.

bq. It assumes both #reusableTokenStream and #tokenStream to be final

in my opinion all the core analyzers (you already fixed contrib) should be final. this is another trap: if you subclass one of these analyzers and implement 'tokenStream', it's immediately slow due to the backwards-compatibility code.

bq. I think Lucene/Solr/Nutch need to eventually get to this point

if this is what we should do to remove the code duplication, then i am all for it. i still don't quite understand how it gives us more freedom to break/change the APIs. i mean, however we label this stuff, a break is a break to the user at the end of the day.
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837792#action_12837792 ]

Michael McCandless commented on LUCENE-2279:

bq. I would stop right here and ask to discuss it on the dev list, thoughts mike?!

Agreed... I'll start a thread.

{quote}
bq. Maybe we should absorb ReusableAnalyzerBase back into Analyzer?

That would be the logical consequence but the problem with ReusableAnalyzerBase is that it will break bw compat if moved to Analyzer.
{quote}

Right, this is why I was thinking that if we make a new analyzers package, it's a chance to break/improve things. We'd have a single abstract base class that only exposes the reuse API.

bq. in my opinion all the core analyzers (you already fixed contrib) should be final.

I agree, and we should consistently take this approach w/ the new analyzers package...

bq. i still don't quite understand how it gives us more freedom to break/change the APIs, i mean however we label this stuff, a break is a break to the user at the end of the day.

Because it'd be an entirely new package, we can create a new base Analyzer class (in that package) that breaks/fixes things when compared to Lucene's Analyzer class. We'd eventually deprecate the analyzers/tokenizers/token filters in Lucene/Solr/Nutch in favor of this new package, and users can switch over on their own schedule.
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837859#action_12837859 ]

Michael McCandless commented on LUCENE-2279:

{quote}
bq. I would stop right here and ask to discuss it on the dev list, thoughts mike?!

Agreed... I'll start a thread.
{quote}

OK, I just started a thread on general@
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837373#action_12837373 ]

thushara wijeratna commented on LUCENE-2279:

Isn't the reusableTokenStream created again for each new Document, even though there is no need to copy the list of stop words for a new document? Or did I miss something?
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837410#action_12837410 ]

Robert Muir commented on LUCENE-2279:

reusableTokenStream() is called again for each document. if you don't implement it, the default is to defer to tokenStream(), which must create new instances of StopFilter, LowerCaseFilter, and whatever else you have going on in your analyzer.

instead, if you implement reusableTokenStream(), you can keep a reference to these things, just reset() your token filters, and pass the reader to your tokenizer's reset(Reader) method.

of course, for this to work, you must implement reset() correctly in any custom filters you have: if they keep some state, such as accumulated offsets, then this state should be reset to what it would be if you had just created a new filter.

For an example, see StandardAnalyzer.
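The reuse pattern Robert describes (build the chain once, then reset it and hand it a new Reader for each document) can be sketched without Lucene on the classpath. Everything below is a hypothetical stand-in written for this illustration: SketchTokenizer, CountingStopFilter and SketchReusableAnalyzer only imitate the shape of the calls named in the comment (reset(), reset(Reader), reusableTokenStream()). The sketch is single-threaded and keeps the saved chain in plain fields; a real analyzer would also have to consider thread safety.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical whitespace tokenizer that can be pointed at a new Reader. */
final class SketchTokenizer {
    private Reader input;

    SketchTokenizer(Reader input) { this.input = input; }

    void reset(Reader input) { this.input = input; }

    /** Returns the next whitespace-delimited token, or null at end of input. */
    String next() {
        StringBuilder token = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (!Character.isWhitespace(c)) {
                    token.append((char) c);
                } else if (token.length() > 0) {
                    break; // whitespace after a token ends it
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return token.length() == 0 ? null : token.toString();
    }
}

/** Hypothetical stop filter with per-document state that reset() must clear. */
final class CountingStopFilter {
    private final SketchTokenizer input;
    private final Set<String> stopWords;
    private int skipped; // per-document state

    CountingStopFilter(SketchTokenizer input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    void reset() { skipped = 0; } // back to the state of a freshly created filter

    int skipped() { return skipped; }

    String next() {
        String token;
        while ((token = input.next()) != null) {
            if (!stopWords.contains(token)) return token;
            skipped++;
        }
        return null;
    }
}

/** Hypothetical analyzer that builds its chain once and reuses it afterwards. */
final class SketchReusableAnalyzer {
    int chainsBuilt = 0; // exposed only so the reuse is observable
    private final Set<String> stopWords;
    private SketchTokenizer tokenizer;
    private CountingStopFilter filter;

    SketchReusableAnalyzer(Set<String> stopWords) { this.stopWords = stopWords; }

    CountingStopFilter reusableTokenStream(Reader reader) {
        if (filter == null) {
            tokenizer = new SketchTokenizer(reader);
            filter = new CountingStopFilter(tokenizer, stopWords);
            chainsBuilt++;
        } else {
            tokenizer.reset(reader); // new document, same objects
            filter.reset();
        }
        return filter;
    }

    int lastSkipped() { return filter == null ? 0 : filter.skipped(); }

    List<String> tokens(String text) {
        CountingStopFilter stream = reusableTokenStream(new StringReader(text));
        List<String> out = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a"));
        SketchReusableAnalyzer analyzer = new SketchReusableAnalyzer(stops);
        System.out.println(analyzer.tokens("the a fox")); // prints [fox]
        System.out.println(analyzer.tokens("a dog"));     // prints [dog]
        System.out.println(analyzer.chainsBuilt);         // prints 1
    }
}
```

If CountingStopFilter.reset() did not clear its counter, the second document in this model would report skipped tokens left over from the first, which is the kind of stale state the comment warns about.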
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837465#action_12837465 ]

Simon Willnauer commented on LUCENE-2279:

I don't consider this an issue at all. Each analyzer creating StopFilter instances uses CharArraySet internally, and if you write your own you should do so too. The JavaDoc of StopFilter clearly describes what happens if you use a plain Set instead of a CharArraySet.

You should really consider implementing reusableTokenStream AND using a CharArraySet instance. Have a look at how all the analyzers in the current trunk handle stop words. Once 3.1 is out you will also be able to subclass ReusableAnalyzerBase, which enables reusableTokenStream on the fly in 99% of the cases.

I tend to close this issue though, Robert?
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837467#action_12837467 ]

Robert Muir commented on LUCENE-2279:

in my opinion the issue states one of my biggest gripes with analysis: this whole tokenStream/reusableTokenStream thing. we go to all this trouble to have a reusable attributes-based api, only for this analyzer problem to trip up users.

maybe it's best to give 3.1's ReusableAnalyzerBase a chance, and see if it clears up the confusion for users. but if it doesn't, in my opinion we should do a hard backwards break and make tokenStream reusable by default.
[jira] Commented: (LUCENE-2279) eliminate pathological performance on StopFilter when using a Set<String> instead of CharArraySet

[ https://issues.apache.org/jira/browse/LUCENE-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837016#action_12837016 ]

Robert Muir commented on LUCENE-2279:

bq. this is because for each document, Analyzer.tokenStream() is called

have you considered implementing reusableTokenStream?