[
https://issues.apache.org/jira/browse/LUCENE-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774940#action_12774940
]
Simon Willnauer commented on LUCENE-2034:
-----------------------------------------
bq. the testStemExclusionTable( for BrazilianAnalyzer is actually not related
to stopwords and should not be changed.
I agree; I missed extending the test case and instead changed it to test the
constructor only. I will extend it instead.
This test case is actually a duplicate of testExclusionTableReuse(); it should
test tokenStream() instead of reusableTokenStream() - will fix this too.
bq. This test is to ensure that if you change the stem exclusion table with
this method, that reusableTokenStream will force the creation of a new
BrazilianStemFilter with this modified exclusion table so that it will take
effect immediately, the way it did with .tokenStream() before this analyzer
supported reusableTokenStream()
That is actually what testExclusionTableReuse() does.
bq. also, i think this setStemExclusionTable stuff is really unrelated to your
patch, but a reuse challenge in at least this analyzer. one way to solve it
would be to...
I agree with your first point that this is kind of unrelated. I guess we
should do that in a different issue, though I don't think it is a big deal
since it does not change any functionality.
I disagree about the reuse challenge: in my opinion analyzers should be
immutable, which is why I deprecated those methods and moved the set into the
constructor. The problem with those setters is that you have to be in the same
thread to change your set, as calling one only invalidates the cached version
of the token stream held in a ThreadLocal. The implementation is ambiguous and
should go away. The analyzer itself can be shared, but the behaviour is
unpredictable if you reset the set: if an instance of this analyzer is already
in use and you call the setter, you would expect the analyzer to use the new
set from that moment on, which is not always true.
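A minimal sketch of that cross-thread hazard, using simplified stand-in classes (the names and the cached-snapshot mechanics are illustrative, not the actual Lucene analyzer API): the setter can only clear the calling thread's ThreadLocal cache, so any other thread keeps serving its stale snapshot.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class ReuseHazard {
    // Mimics an analyzer that caches a per-thread "stream" like SavedStreams.
    static class MutableAnalyzer {
        private Set<String> exclusions = new HashSet<>();
        private final ThreadLocal<Set<String>> cached = new ThreadLocal<>();

        synchronized void setStemExclusionTable(Set<String> s) {
            exclusions = s;
            cached.remove(); // invalidates only the CALLING thread's cache
        }

        synchronized Set<String> reusableStream() {
            Set<String> snap = cached.get();
            if (snap == null) {
                snap = new HashSet<>(exclusions); // snapshot taken at creation
                cached.set(snap);
            }
            return snap;
        }
    }

    // Returns {setterThreadSeesUpdate, otherThreadSeesUpdate}.
    static boolean[] demo() throws Exception {
        MutableAnalyzer a = new MutableAnalyzer();
        CountDownLatch workerCached = new CountDownLatch(1);
        CountDownLatch setterDone = new CountDownLatch(1);
        AtomicBoolean workerSawUpdate = new AtomicBoolean();

        Thread worker = new Thread(() -> {
            a.reusableStream(); // worker caches an empty snapshot
            workerCached.countDown();
            try { setterDone.await(); } catch (InterruptedException ignored) {}
            // worker still reads its stale per-thread snapshot
            workerSawUpdate.set(a.reusableStream().contains("brasileiro"));
        });
        worker.start();

        workerCached.await();
        Set<String> excl = new HashSet<>();
        excl.add("brasileiro");
        a.setStemExclusionTable(excl); // clears only this thread's cache
        setterDone.countDown();
        worker.join();

        boolean mainSees = a.reusableStream().contains("brasileiro");
        return new boolean[] { mainSees, workerSawUpdate.get() };
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = demo();
        System.out.println("setter thread sees update: " + r[0]);
        System.out.println("other thread sees update:  " + r[1]);
    }
}
```

Passing the exclusion set to the constructor removes the hazard entirely: the snapshot every thread caches is built from state that can never change.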
> Massive Code Duplication in Contrib Analyzers - unifly the analyzer ctors
> -------------------------------------------------------------------------
>
> Key: LUCENE-2034
> URL: https://issues.apache.org/jira/browse/LUCENE-2034
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2034.patch, LUCENE-2034.patch, LUCENE-2034.patch,
> LUCENE-2034.patch
>
>
> Due to the various tokenStream APIs we had in Lucene, analyzer subclasses
> need to implement at least one of the methods returning a token stream. When
> you look at the code it appears to be almost identical if both are
> implemented in the same analyzer. Each analyzer defines the same inner class
> (SavedStreams), which is unnecessary.
> In contrib almost every analyzer uses stopwords, and each of them either
> creates its own way of loading them or defines a large number of ctors to
> load stopwords from a file, set, array, etc. Those ctors should be
> deprecated and eventually removed.
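The ctor consolidation described above can be sketched as a single shared loading utility so that every analyzer only needs one Set-based ctor. This is a hedged sketch; the class and method names are illustrative of the approach, not necessarily the committed API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Shared stopword-loading helper: files/readers and arrays all funnel
// into one Set representation, so analyzers keep a single Set-based ctor
// instead of one overload per source type.
public final class WordlistLoader {
    private WordlistLoader() {}

    // One word per line; blank lines and '#' comment lines are skipped.
    public static Set<String> getWordSet(Reader reader) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) {
                    words.add(line);
                }
            }
        }
        return words;
    }

    // Array and varargs sources reuse the same representation.
    public static Set<String> getWordSet(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }
}
```

With this in place a contrib analyzer would expose e.g. only `BrazilianAnalyzer(Set<?> stopwords)`, and callers convert whatever source they have via the loader.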