[jira] [Commented] (SOLR-6468) Regression: StopFilterFactory doesn't work properly without deprecated enablePositionIncrements="false"

Steve Rowe (JIRA) Mon, 05 Aug 2019 05:50:09 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900057#comment-16900057
 ]


Steve Rowe commented on SOLR-6468:
----------------------------------

Hi Alexander,

I'm not sure about the performance impact; you'd have to test to see how it 
performs on your own data.

The only downside I know of: since you're removing content before 
tokenization, if the boundaries you use for MappingCharFilter are not the same 
as those used in tokenization, or if your replacement string affects 
tokenization, you may see some differences from the behavior of your analysis 
chain when using StopFilter.  My recommendation: test using some real-world 
data.
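
For illustration only, here is a rough, untested sketch of what the 
MappingCharFilter approach could look like for the field type quoted below. 
The mapping file name (url_prefix_mapping.txt) and its rules are assumptions 
made for this example, not something taken from the issue:

```xml
<!-- Hypothetical variant of the reporter's field type: URL prefixes are
     stripped by a char filter before tokenization, so the tokenizer never
     sees them and no stop word has to be removed afterwards. -->
<fieldType name="words_ngram" class="solr.TextField" omitNorms="false"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- url_prefix_mapping.txt is an assumed file name. Assumed contents,
         one rule per line, each mapping a prefix to the empty string:
           "https://" => ""
           "http://" => ""
           "ftp://" => ""
           "www." => ""
    -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="url_prefix_mapping.txt" />
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

With rules like these, a query for https://www.twitter.com/testuser should 
reach the tokenizer as twitter.com/testuser, so the phrase query would have 
no leading position gap.  The boundary caveat shows up directly here: the 
rules are case-sensitive and are not tied to token boundaries, so "HTTP://" 
is left alone unless it gets its own rule, and "www." is stripped wherever 
that character sequence occurs, not only at the start of a host name.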

> Regression: StopFilterFactory doesn't work properly without deprecated 
> enablePositionIncrements="false"
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6468
>                 URL: https://issues.apache.org/jira/browse/SOLR-6468
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.8.1, 4.9, 5.3.1, 6.6.2, 7.1
>            Reporter: Alexander S.
>            Priority: Major
>         Attachments: FieldValue.png
>
>
> Setup:
> * Schema version is 1.5
> * Field config:
> {code}
> <fieldType name="words_ngram" class="solr.TextField" omitNorms="false" 
> autoGeneratePhraseQueries="true">
>   <analyzer>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt" 
> ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
> * Stop words:
> {code}
> http 
> https 
> ftp 
> www
> {code}
> So very simple. In the index I have:
> * twitter.com/testuser
> All these queries do match:
> * twitter.com/testuser
> * com/testuser
> * testuser
> But none of these does:
> * https://twitter.com/testuser
> * https://www.twitter.com/testuser
> * www.twitter.com/testuser
> Debug output shows:
> "parsedquery_toString": "+(url_words_ngram:\"? twitter com testuser\")"
> But we need:
> "parsedquery_toString": "+(url_words_ngram:\"twitter com testuser\")"
> Complete debug outputs:
> * a valid search: 
> http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
> * an invalid search: 
> http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww
> The complete discussion and explanation of the problem is here: 
> http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-td4153839.html
> I didn't find a clear explanation of how we can upgrade Solr; there isn't 
> any replacement or workaround for this, so this is not just a major change 
> but a major disrespect to all existing Solr users who are using this feature.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
