[jira] [Commented] (LUCENE-7444) Remove StopFilter from StandardAnalyzer in Lucene-Core

Michael McCandless (JIRA) Sun, 11 Sep 2016 10:09:29 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482066#comment-15482066
 ]


Michael McCandless commented on LUCENE-7444:
--------------------------------------------

+1 to not applying stop words by default with our default analyzer 
({{StandardAnalyzer}}).  I agree a secret English bias is no good.

But, replying to Uwe from LUCENE-7318:

bq. People that want to have stopwords can always define their own Analyzer 
using CustomAnalyzer.

Sorry, this is completely non-obvious to new users.

Sure, [~thetaphi] and perhaps 2 other people in the world would consider this 
the obvious way to add stop word filtering to Lucene.

But for everyone else, we must keep a simple core API ({{StandardAnalyzer}} 
ctor) taking an optional stop words set in core.

I wouldn't mind paring that API down, e.g. to {{Set<String>}} or 
{{CharArraySet}} passed to {{StandardAnalyzer}}.  This would let us move the 
word list loaders back out to the analyzers module?
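
As a rough illustration of the API shape suggested above, here is a minimal sketch in plain Java. It does not use Lucene classes: {{ToyAnalyzer}} and its {{analyze}} method are hypothetical stand-ins, and the tokenization (split on non-alphanumeric characters, then lowercase) is an assumption made only to keep the example self-contained. The point is the constructor pair: no stop words by default, and an optional set for users who want them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer; this is NOT a Lucene class.
final class ToyAnalyzer {
    private final Set<String> stopWords;

    // Default: no stop words at all, so no language is favored.
    ToyAnalyzer() {
        this(Collections.<String>emptySet());
    }

    // Opt-in: the caller supplies the stop words (expected in lowercase).
    ToyAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Splits on runs of non-letter/non-digit characters, lowercases each
    // token, and drops any token found in the stop word set.
    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With this shape, {{new ToyAnalyzer()}} keeps every token, while {{new ToyAnalyzer(Collections.singleton("the"))}} drops "the"; a new user only has to discover one extra constructor argument.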

Regardless of how you feel personally about whether stop words should be used 
in a search engine, there are legitimate users on both sides of the question.

> Remove StopFilter from StandardAnalyzer in Lucene-Core
> ------------------------------------------------------
>
>                 Key: LUCENE-7444
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7444
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/other, modules/analysis
>    Affects Versions: 6.2
>            Reporter: Uwe Schindler
>
> Yonik said on LUCENE-7318:
> {quote}
> bq. I think it would make a good default for most Lucene users, and we should 
> graduate it from the analyzers module into core, and make it the default for 
> IndexWriter.
> This "StandardAnalyzer" is specific to English, as it removes English 
> stopwords.
> That seems to be an odd choice now for a few reasons:
> - It was argued in the past (rather vehemently) that Solr should not prefer 
> English in its default "text" field
> - AFAIK, removing stopwords is no longer considered best practice.
> Given that removal of English stopwords is the only thing that really makes 
> this analyzer English-centric (and given the negative impact that can have on 
> other languages), it seems like the stopword filter should be removed from 
> StandardAnalyzer.
> {quote}
> When trying to fix the backwards incompatibility issues in LUCENE-7318, it 
> looks like most unrelated code moved from analysis module to core (and 
> changing package names!!!! :( ) was related to word list loading, 
> CharArraySets, and superclasses of StopFilter. If we follow Yonik's 
> suggestion, we can revert all those changes. I agree with him: a "universal" 
> analyzer should not have any language-specific stop words.
> The other thing is LowercaseFilter, but I'd suggest simply adding a clone of 
> it to Lucene core and leaving the analysis module self-contained.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

