Hi Andrea,

This is a long-standing issue: see
https://issues.apache.org/jira/browse/LUCENE-4065 and
https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.  I don’t
think we’ve reached a consensus on how to fix it yet, but more examples are
good.

Unfortunately I don’t think changing the StopFilter to ignore SYNONYM tokens 
will work, because then you’ll generate queries that always fail - they’ll 
search for ‘of’ in the middle of the phrase, but ‘of’ never gets indexed 
because it’s removed by the StopFilter at index time.
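
To make the failure mode concrete, here is a toy sketch in plain Python. It is
not the Lucene API: index_positions and phrase_matches are hypothetical helpers
invented for illustration, and the sketch assumes the stop filter also runs at
index time, which is the premise of the argument above.

```python
# Toy model only: plain Python, not Lucene. All names here are hypothetical.
STOPWORDS = {"of"}

def index_positions(text, stopwords=STOPWORDS):
    """Map each kept term to its positions; a removed stopword leaves a gap."""
    positions = {}
    for pos, term in enumerate(text.lower().split()):
        if term not in stopwords:
            positions.setdefault(term, []).append(pos)
    return positions

def phrase_matches(index, phrase):
    """Exact phrase match over (term, offset) pairs relative to the phrase start."""
    first_term, first_offset = phrase[0]
    for pos in index.get(first_term, []):
        start = pos - first_offset
        if all(start + offset in index.get(term, []) for term, offset in phrase):
            return True
    return False

# Assumption: "of" is dropped at index time, so it never reaches the index.
doc = index_positions("my tv went out of warranty")

with_of = [("out", 0), ("of", 1), ("warranty", 2)]   # phrase that keeps 'of'
with_gap = [("out", 0), ("warranty", 2)]             # phrase "out ? warranty"

print(phrase_matches(doc, with_of))    # False: 'of' is not in the index
print(phrase_matches(doc, with_gap))   # True: the gap lines up with the index
```

In this model the phrase that keeps 'of' can never match, while the phrase with
a gap in place of 'of' does.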

- Alan

> On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzar...@sease.io> wrote:
> 
> Hi, 
> I have the following field type definition: 
> <fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
>     <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
>     </analyzer>
> </fieldtype>
> Where synonyms and stopwords are defined as follows: 
> 
> synonyms = out of warranty,oow
> stopwords = of
> 
> Running the following query:
> 
> q=my tv went out of warranty something of
> 
> I get wrong results, with the following explain: 
> 
> title:my title:tv title:went (title:oow PhraseQuery(title:"out ? warranty 
> something"))
> 
> That is, the synonym is correctly detected and the graph information is
> correctly reported in the positionLength, but it seems to be wrongly
> interpreted by the QueryParser.
> I guess the reason is the StopFilter: it removes the "of" term within the
> phrase (which I wouldn't want) and so creates a "hole" in the span defined by
> the "oow" term. That term has been marked as a synonym with
> positionLength = 3, so its span ends up including the next available term
> (something).
> I tried to change the StopFilter in order to ignore stopwords that are marked 
> as SYNONYM or that are part of a previous synonym span, and it works: it 
> correctly produces the following query: 
> 
> title:my title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) 
> title:something
> 
> So I'd like to ask your opinion about this. Am I missing something? Do you 
> think it's better to open a JIRA issue? If the solution is a graph aware stop 
> filter, do you think it's better to change the existing filter or to subclass 
> it?
> 
> Best, 
> Andrea
> 
> 
