[jira] Issue Comment Edited: (LUCENE-2034) Massive Code Duplication in Contrib Analyzers - unify the analyzer ctors

Robert Muir (JIRA) Sun, 08 Nov 2009 12:30:04 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774816#action_12774816
 ]


Robert Muir edited comment on LUCENE-2034 at 11/8/09 8:29 PM:
--------------------------------------------------------------

Simon, I started looking at this. The testStemExclusionTable() test for 
BrazilianAnalyzer is actually not related to stopwords and should not be 
changed.

BrazilianAnalyzer has a .setStemExclusionTable() method that allows you to 
supply a set of words that should not be stemmed. 

This test ensures that if you change the stem exclusion table with this 
method, reusableTokenStream() forces the creation of a new 
BrazilianStemFilter with the modified exclusion table, so that the change 
takes effect immediately, the way it did with .tokenStream() before this 
analyzer supported reusableTokenStream().
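
A minimal sketch of the behavior this test guards, in plain Java. The classes 
below (SketchStemFilter, SketchAnalyzer, reusableFilter()) are hypothetical 
stand-ins, not the real Lucene classes, and the "stemmer" is a toy that only 
strips a trailing "s". The point is only that changing the exclusion table 
drops the cached filter, so the next reuse call builds a new one:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a stem filter; NOT Lucene's BrazilianStemFilter.
class SketchStemFilter {
    private final Set<String> exclusions;

    SketchStemFilter(Set<String> exclusions) {
        this.exclusions = exclusions;
    }

    // Toy "stemmer": strips a trailing "s" unless the word is excluded.
    String stem(String word) {
        if (exclusions.contains(word)) {
            return word;
        }
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }
}

// Hypothetical stand-in for an analyzer that reuses its filter between calls.
class SketchAnalyzer {
    private Set<String> exclusions = Collections.emptySet();
    private SketchStemFilter saved; // cached filter; null means "rebuild"

    void setStemExclusionTable(Set<String> words) {
        exclusions = new HashSet<>(words);
        saved = null; // drop the cached filter so the next call rebuilds it
    }

    // Plays the role of reusableTokenStream(): returns the cached filter,
    // creating it first if there is none.
    SketchStemFilter reusableFilter() {
        if (saved == null) {
            saved = new SketchStemFilter(exclusions);
        }
        return saved;
    }
}

class ExclusionResetDemo {
    public static void main(String[] args) {
        SketchAnalyzer analyzer = new SketchAnalyzer();
        System.out.println(analyzer.reusableFilter().stem("words"));
        analyzer.setStemExclusionTable(Collections.singleton("words"));
        // A new filter is built here, so the exclusion applies right away.
        System.out.println(analyzer.reusableFilter().stem("words"));
    }
}
```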

<edit, addition>
Also, I think this setStemExclusionTable issue is really unrelated to your 
patch; it is a reuse challenge in at least this analyzer. One way to solve it 
would be to:
* add .setStemExclusionTable() to BrazilianStemFilter so the table can be 
changed without creating a new instance.
* in BrazilianAnalyzer's createComponents(), cache the BrazilianStemFilter and 
change .setStemExclusionTable() to pass the new value along to it.
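
The two steps above could be sketched roughly as follows. This is plain Java 
with hypothetical stand-in classes (MutableStemFilter, CachingAnalyzer), not 
the Lucene API; the createComponents() here only mimics the name from the 
patch, and the toy stemmer just strips a trailing "s". Thread safety is not 
addressed:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter whose exclusion table can be swapped in place.
class MutableStemFilter {
    private Set<String> exclusions = Collections.emptySet();

    void setStemExclusionTable(Set<String> words) {
        this.exclusions = new HashSet<>(words);
    }

    // Toy "stemmer": strips a trailing "s" unless the word is excluded.
    String stem(String word) {
        if (exclusions.contains(word)) {
            return word;
        }
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }
}

// Hypothetical analyzer that caches one filter and forwards table changes.
class CachingAnalyzer {
    private final MutableStemFilter cached = new MutableStemFilter();

    // Stand-in for createComponents(): always hands back the cached filter.
    MutableStemFilter createComponents() {
        return cached;
    }

    // Pass the new table along instead of forcing a new filter instance.
    void setStemExclusionTable(Set<String> words) {
        cached.setStemExclusionTable(words);
    }
}

class ForwardingDemo {
    public static void main(String[] args) {
        CachingAnalyzer analyzer = new CachingAnalyzer();
        MutableStemFilter filter = analyzer.createComponents();
        System.out.println(filter.stem("words"));
        analyzer.setStemExclusionTable(Collections.singleton("words"));
        // Same filter instance, but the new table is already in effect.
        System.out.println(filter.stem("words"));
    }
}
```

Compared with rebuilding the filter, this keeps the reuse path intact at the 
cost of making the filter mutable.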


  
> Massive Code Duplication in Contrib Analyzers - unify the analyzer ctors
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-2034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2034
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch, 
> LUCENE-2034.patch
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses 
> need to implement at least one of the methods returning a tokenStream. When 
> you look at the code, it appears to be almost identical if both are 
> implemented in the same analyzer. Each analyzer defines the same inner class 
> (SavedStreams), which is unnecessary.
> In contrib, almost every analyzer uses stopwords, and each of them creates its 
> own way of loading them or defines a large number of ctors to load stopwords 
> from a file, set, arrays, etc. Those ctors should be deprecated and 
> eventually removed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
