[ 
https://issues.apache.org/jira/browse/SOLR-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514785#comment-15514785
 ] 

Roman Chyla commented on SOLR-6468:
-----------------------------------

Ha! :-)
I've found my own comment above. Two years later I'm facing this situation 
again; I had completely forgotten about it (and, truth be told, preferred 
running the old Solr 4.x).

This is how the new Solr sees things:

A 350-MHz GBT Survey of 50 Faint Fermi γ ray Sources for Radio Millisecond 
Pulsars

is indexed as
```
null_1
1       :350|350mhz
2       :mhz|syn::mhz
3       :acr::gbt|gbt|syn::gbt|syn::green bank telescope
4       :survey|syn::survey
null_1
6       :50
```

The 1st and 5th positions are gaps, so a search for "350-MHz GBT Survey of 50 
Faint" will fail: 'of' is a stopword, and the stop filter will always 
increment the position. (What is the purpose of a stop filter if it leaves 
gaps?)
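
To make the gaps concrete, here is a minimal Python sketch (a toy model, not 
Lucene code) of a stop filter that either keeps or drops the positions of 
removed stopwords. The tokenizer, the stop list and the `keep_gaps` flag are 
assumptions for illustration; `keep_gaps=False` is only meant to mimic what 
enablePositionIncrements="false" used to do, and synonyms are left out.

```python
import re

# Assumed stop list, for illustration only.
STOPWORDS = {"a", "of"}


def tokenize(text):
    """Toy tokenizer: lowercase, then split on anything that is not 0-9 or a-z."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def stop_filter(tokens, keep_gaps=True):
    """Return (position, token) pairs with stopwords removed.

    keep_gaps=True:  a removed stopword still consumes a position (a gap).
    keep_gaps=False: surviving tokens are numbered consecutively.
    """
    out = []
    position = -1
    for token in tokens:
        if token in STOPWORDS:
            if keep_gaps:
                position += 1  # the stopword is dropped, its position is not
            continue
        position += 1
        out.append((position, token))
    return out


title = "A 350-MHz GBT Survey of 50 Faint"
with_gaps = stop_filter(tokenize(title), keep_gaps=True)
without_gaps = stop_filter(tokenize(title), keep_gaps=False)

print(with_gaps)     # positions 0 and 5 are missing
print(without_gaps)  # positions are consecutive
```

With gaps kept, the surviving tokens land on positions 1, 2, 3, 4, 6 and 7, 
which is the same numbering as in the listing above.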

Anyway, the solution with CharFilterFactory cannot work for me. I have to do 
this:
 
 1. search for synonyms (they can contain stopwords)
 2. remove stopwords
 3. search for other synonyms (that don't have stopwords)
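
The three steps above can be sketched in Python as follows. This is a toy 
model under stated assumptions, not the actual Solr filter chain: the function 
names are made up, the phrase synonym "mass of the sun" is hypothetical, and 
stopwords are dropped without leaving gaps. Only the gbt entry is modelled on 
the listing above.

```python
# Assumed stop list, for illustration only.
STOPWORDS = {"a", "of", "the"}

# Hypothetical multi-token synonym; note that it contains stopwords.
PHRASE_SYNONYMS = {("mass", "of", "the", "sun"): "syn::solar mass"}

# Single-token synonym, modelled on the gbt entry in the listing above.
WORD_SYNONYMS = {"gbt": ["syn::green bank telescope"]}


def apply_phrase_synonyms(tokens):
    """Step 1: replace the longest matching phrase at each position by its synonym."""
    out, i = [], 0
    max_len = max(len(phrase) for phrase in PHRASE_SYNONYMS)
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in PHRASE_SYNONYMS:
                out.append(PHRASE_SYNONYMS[key])
                i += n
                break
        else:
            # No phrase starts here; pass the token through unchanged.
            out.append(tokens[i])
            i += 1
    return out


def remove_stopwords(tokens):
    """Step 2: drop the stopwords that survived step 1."""
    return [t for t in tokens if t not in STOPWORDS]


def apply_word_synonyms(tokens):
    """Step 3: stack single-token synonyms on the same position as the original."""
    return [[t] + WORD_SYNONYMS.get(t, []) for t in tokens]


def analyze(tokens):
    return apply_word_synonyms(remove_stopwords(apply_phrase_synonyms(tokens)))


tokens = "the gbt measured the mass of the sun".split()
result = analyze(tokens)

# Removing stopwords first destroys the phrase, so step 1 no longer matches.
stop_first = apply_phrase_synonyms(remove_stopwords(tokens))

print(result)
print(stop_first)
```

The second printout shows why the order matters: once 'of' and 'the' are gone, 
the phrase synonym can no longer be recognised.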

I'm afraid real life is a little bit more complex than it seems, but there is 
a logic to your choices, Solr devs; I'm afraid I can agree with you. 
People who understand the *why* will make it work again as it *should*. Others 
will happily keep using the 'simplified' version.

> Regression: StopFilterFactory doesn't work properly without 
> enablePositionIncrements="false"
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6468
>                 URL: https://issues.apache.org/jira/browse/SOLR-6468
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.8.1, 4.9
>            Reporter: Alexander S.
>
> Setup:
> * Schema version is 1.5
> * Field config:
> {code}
> <fieldType name="words_ngram" class="solr.TextField" omitNorms="false" 
> autoGeneratePhraseQueries="true">
>   <analyzer>
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
>     <filter class="solr.StopFilterFactory" words="url_stopwords.txt" 
> ignoreCase="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> {code}
> * Stop words:
> {code}
> http 
> https 
> ftp 
> www
> {code}
> So very simple. In the index I have:
> * twitter.com/testuser
> All these queries do match:
> * twitter.com/testuser
> * com/testuser
> * testuser
> But none of these do:
> * https://twitter.com/testuser
> * https://www.twitter.com/testuser
> * www.twitter.com/testuser
> Debug output shows:
> "parsedquery_toString": "+(url_words_ngram:\"? twitter com testuser\")"
> But we need:
> "parsedquery_toString": "+(url_words_ngram:\"twitter com testuser\")"
> Complete debug outputs:
> * a valid search: 
> http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
> * an invalid search: 
> http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww
> The complete discussion and explanation of the problem is here: 
> http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-td4153839.html
> I didn't find a clear explanation of how we can upgrade Solr; there is no 
> replacement or workaround for this, so this is not just a major change but 
> a major disrespect to all existing Solr users who are using this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
