Hi Alan, thanks for the response and thank you very much for the pointers.
On 26/07/18 12:16, Alan Woodward wrote:
Hi Andrea,
This is a long-standing issue: see
https://issues.apache.org/jira/browse/LUCENE-4065 and
https://issues.apache.org/jira/browse/LUCENE-8250 for discussion. I
don’t think we’ve reached a consensus on how to fix it yet, but more
examples are good.
Unfortunately I don’t think changing the StopFilter to ignore SYNONYM
tokens will work, because then you’ll generate queries that always
fail - they’ll search for ‘of’ in the middle of the phrase, but ‘of’
never gets indexed because it’s removed by the StopFilter at index time.
- Alan
On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzar...@sease.io> wrote:
Hi,
I have the following field type definition:
<fieldtype name="text" class="solr.TextField"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="false"/>
  </analyzer>
</fieldtype>
Where synonyms and stopwords are defined as follows:
synonyms = out of warranty,oow
stopwords = of
Running the following query:
q=my tv went out *of* warranty something *of*
I get wrong results, with the following explain:
title:my title:tv title:went (title:oow *PhraseQuery(title:"out ?
warranty something"))*
That is, the synonym is correctly detected and I can see that the graph
information is correctly reported in the positionLength, but it seems
to be wrongly interpreted by the QueryParser.
I guess the reason is the removal of "of" by the StopFilter, which
* removes the "of" term within the phrase (I wouldn't want that)
* creates a "hole" in the span defined by the "oow" term, which has
been marked as a synonym with a positionLength = 3, so the span ends up
including the next available term (something).
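To make the "hole" effect concrete, here is a toy model in plain Python. It is not Lucene code: the Token class, the helper functions and the token stream itself are hypothetical, written by hand to mirror what the analysis chain above is described as producing. It also assumes a consumer that advances by at most one position per token when it rebuilds the graph, which is one way to reproduce the explain output shown above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int      # position increment relative to the previous token
    pos_len: int = 1  # number of positions the token spans


# Hand-written stand-in for the synonym filter output on
# "my tv went out of warranty something of" (assumed, not captured from Solr).
QUERY_TOKENS = [
    Token("my", 1), Token("tv", 1), Token("went", 1),
    Token("oow", 1, 3),  # synonym spanning "out of warranty"
    Token("out", 0), Token("of", 1), Token("warranty", 1),
    Token("something", 1), Token("of", 1),
]


def stop_filter(tokens, stopwords):
    """Drop stopwords, adding their increment to the next kept token."""
    kept, skipped = [], 0
    for tok in tokens:
        if tok.term in stopwords:
            skipped += tok.pos_inc
        else:
            kept.append(Token(tok.term, tok.pos_inc + skipped, tok.pos_len))
            skipped = 0
    return kept


def spans(tokens, collapse_holes):
    """Return (term, start, end) per token; optionally cap increments at 1."""
    out, pos = [], -1
    for tok in tokens:
        pos += min(1, tok.pos_inc) if collapse_holes else tok.pos_inc
        out.append((tok.term, pos, pos + tok.pos_len))
    return out


def covered_by(token_spans, term):
    """Terms whose span lies inside the span of `term` (excluding itself)."""
    _, start, end = next(s for s in token_spans if s[0] == term)
    return [t for t, s, e in token_spans if t != term and start <= s and e <= end]


filtered = stop_filter(QUERY_TOKENS, {"of"})
print(covered_by(spans(filtered, collapse_holes=True), "oow"))
print(covered_by(spans(filtered, collapse_holes=False), "oow"))
```

When the hole left by "of" is collapsed, the three positions covered by "oow" are filled by "out", "warranty" and "something". When the hole is preserved, only "out" and "warranty" fall inside the span.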
I tried to change the StopFilter in order to ignore stopwords that
are marked as SYNONYM or that are part of a previous synonym span,
and it works: it correctly produces the following query:
title:my title:tv title:went *(title:oow PhraseQuery(title:"out of
warranty"))* title:something
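The idea behind that change can be sketched with a small stand-alone Python model. This is not the actual patch and not Lucene's StopFilter API: the Token class and graph_aware_stop_filter are hypothetical names, the token stream is written by hand, and the sketch assumes the multi-position synonym token is emitted before the tokens it spans.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int      # position increment relative to the previous token
    pos_len: int = 1  # number of positions the token spans


def graph_aware_stop_filter(tokens, stopwords):
    """Drop stopwords unless they fall inside a multi-position span."""
    kept, skipped = [], 0
    pos, protected_until = -1, -1
    for tok in tokens:
        pos += tok.pos_inc
        if tok.pos_len > 1:
            # A multi-position token (e.g. a synonym) protects the positions it covers.
            protected_until = max(protected_until, pos + tok.pos_len)
        inside_span = pos < protected_until
        if tok.term in stopwords and not inside_span:
            skipped += tok.pos_inc  # carry the gap to the next kept token
        else:
            kept.append(Token(tok.term, tok.pos_inc + skipped, tok.pos_len))
            skipped = 0
    return kept


# Hand-written stand-in for the synonym filter output (assumed, not captured from Solr).
tokens = [
    Token("my", 1), Token("tv", 1), Token("went", 1),
    Token("oow", 1, 3), Token("out", 0), Token("of", 1), Token("warranty", 1),
    Token("something", 1), Token("of", 1),
]

result = graph_aware_stop_filter(tokens, {"of"})
print([t.term for t in result])
```

In this model the "of" inside the "out of warranty" span survives, while the trailing "of" is still removed.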
So I'd like to ask your opinion about this. Am I missing something?
Do you think it's better to open a JIRA issue? If the solution is a
graph-aware stop filter, do you think it's better to change the
existing filter or to subclass it?
Best,
Andrea