Re: SynonymGraphFilter followed by StopFilter

Alan Woodward Thu, 26 Jul 2018 08:40:39 -0700

> Also, phrase synonyms just don’t work at query time because the terms are 
> parsed into individual tokens by the query parser, not the tokenizer.


This is no longer the case.  In general I’d avoid index-time synonyms in Lucene 
because synonyms can create graphs (e.g. if a single term gets expanded to 
several terms), and we can’t index graphs correctly.
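
A minimal sketch of why that is, in Python rather than Lucene code. It assumes a simplified model in which each token carries a position increment and a position length, and in which the index keeps only the resulting position. The token layout for the "oow" expansion and the helper names (Token, indexed_positions, phrase_matches) are illustrative assumptions, not Lucene APIs.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # number of positions the token spans


def indexed_positions(tokens):
    """Return {term: [positions]}, keeping only the position of each token."""
    postings = {}
    pos = -1
    for tok in tokens:
        pos += tok.pos_inc
        # pos_len is dropped here: this model stores positions only
        postings.setdefault(tok.term, []).append(pos)
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase terms occur at consecutive positions."""
    first, *rest = phrase
    return any(
        all(start + i in postings.get(term, []) for i, term in enumerate(rest, 1))
        for start in postings.get(first, [])
    )


# Hypothetical document text "oow cover", with "oow" expanded at index time
# to "out of warranty". "oow" spans three positions in the graph.
tokens = [
    Token("oow", 1, 3),
    Token("out", 0, 1),
    Token("of", 1, 1),
    Token("warranty", 1, 1),
    Token("cover", 1, 1),
]
postings = indexed_positions(tokens)

print(phrase_matches(postings, ["warranty", "cover"]))
print(phrase_matches(postings, ["oow", "cover"]))
```

In this model "cover" lands three positions after "oow", so the phrase "oow cover" no longer matches the text it came from once the span length is gone, while "warranty cover" still does.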

I’d agree that removing stop words is generally unnecessary, but there are 
other reasons that you’d want to filter out terms from the TokenStream, and we 
should be able to handle those situations correctly.
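
To make the problem concrete, here is a rough Python sketch of the query from the original report, quoted further down. It assumes a simplified model: tokens are (term, position increment, position length) tuples, and a removed token hands its position increment to the next kept token. The function names (stop_filter, edges, dead_ends) are made up for illustration and are not Lucene APIs.

```python
def stop_filter(tokens, stopwords):
    """Drop stopwords, carrying their position increment to the next kept token."""
    kept, skipped = [], 0
    for term, pos_inc, pos_len in tokens:
        if term in stopwords:
            skipped += pos_inc
            continue
        kept.append((term, pos_inc + skipped, pos_len))
        skipped = 0
    return kept


def edges(tokens):
    """Turn tokens into graph edges (term, from_node, to_node)."""
    result, pos = [], -1
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        result.append((term, pos, pos + pos_len))
    return result


def dead_ends(tokens):
    """Nodes that an edge arrives at but none leaves, excluding the final node."""
    es = edges(tokens)
    starts = {frm for _, frm, _ in es}
    last = max(to for _, _, to in es)
    return sorted({to for _, _, to in es if to not in starts and to != last})


# "my tv went out of warranty something of", with "oow" added as a synonym
# spanning the three positions of "out of warranty" (assumed token layout).
query = [
    ("my", 1, 1), ("tv", 1, 1), ("went", 1, 1),
    ("oow", 1, 3), ("out", 0, 1), ("of", 1, 1), ("warranty", 1, 1),
    ("something", 1, 1), ("of", 1, 1),
]
filtered = stop_filter(query, {"of"})

print(dead_ends(query))     # graph before filtering
print(dead_ends(filtered))  # graph after "of" is removed
```

Before filtering, every node inside the synonym span has an edge leaving it. After "of" is removed, the edge for "out" arrives at a node that nothing leaves, so the side path of the graph is broken. Whatever consumes the graph then has to guess how to bridge that gap.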

> On 26 Jul 2018, at 15:59, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> Move the synonym filter to the index analyzer chain. That provides better 
> performance and avoids some surprising relevance behavior. With synonyms at 
> query time, you’ll see different idf for terms in the synonym set, with the 
> rare variant scoring higher. That is probably the opposite of what is 
> expected.
> 
> Also, phrase synonyms just don’t work at query time because the terms are 
> parsed into individual tokens by the query parser, not the tokenizer.
> 
> Don’t use stop words. Just remove that line. Removing stop words is a 
> performance and space hack that was useful in the 1960s, but causes problems 
> now. I’ve never used stop word removal and I started in search with Infoseek 
> in 1996. Stop word removal is like a binary idf, ignoring common words. Since 
> we have idf, we can give a lower score to common words and keep them in the 
> index. 
> 
> Do those two things and it should work as you expect. 
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <a.gazzar...@sease.io 
>> <mailto:a.gazzar...@sease.io>> wrote:
>> 
>> Hi Alan, thanks for the response and thank you very much for the pointers
>> 
>> On 26/07/18 12:16, Alan Woodward wrote:
>>> Hi Andrea,
>>> 
>>> This is a long-standing issue: see 
>>> https://issues.apache.org/jira/browse/LUCENE-4065 
>>> <https://issues.apache.org/jira/browse/LUCENE-4065> and 
>>> https://issues.apache.org/jira/browse/LUCENE-8250 
>>> <https://issues.apache.org/jira/browse/LUCENE-8250> for discussion.  I 
>>> don’t think we’ve reached a consensus on how to fix it yet, but more 
>>> examples are good.
>>> 
>>> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM 
>>> tokens will work, because then you’ll generate queries that always fail - 
>>> they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets 
>>> indexed because it’s removed by the StopFilter at index time.
>>> 
>>> - Alan
>>> 
>>>> On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzar...@sease.io 
>>>> <mailto:a.gazzar...@sease.io>> wrote:
>>>> 
>>>> Hi, 
>>>> I have the following field type definition: 
>>>> <fieldtype name="text" class="solr.TextField" 
>>>> autoGeneratePhraseQueries="true">
>>>>     <analyzer type="index">
>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>     </analyzer>
>>>>     <analyzer type="query">
>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.SynonymGraphFilterFactory" 
>>>> synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>>>         <filter class="solr.StopFilterFactory" words="stopwords.txt" 
>>>> ignoreCase="false"/>
>>>>     </analyzer>
>>>> </fieldtype>
>>>> Where synonyms and stopwords are defined as follows: 
>>>> 
>>>> synonyms = out of warranty,oow
>>>> stopwords = of
>>>> 
>>>> Running the following query:
>>>> 
>>>> q=my tv went out of warranty something of
>>>> 
>>>> I get wrong results, with the following explain: 
>>>> 
>>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out ? warranty 
>>>> something"))
>>>> 
>>>> That is, the synonym is correctly detected and the graph information is 
>>>> correctly reported in the positionLength, but it seems to be wrongly 
>>>> interpreted by the QueryParser. 
>>>> I guess the reason is the StopFilter: removing the "of" term within the 
>>>> phrase (which I wouldn't want) creates a "hole" in the span defined by 
>>>> the "oow" term, which has been marked as a synonym with 
>>>> positionLength = 3, so the span ends up including the next available 
>>>> term (something). 
>>>> I tried to change the StopFilter in order to ignore stopwords that are 
>>>> marked as SYNONYM or that are part of a previous synonym span, and it 
>>>> works: it correctly produces the following query: 
>>>> 
>>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out of 
>>>> warranty")) title:something
>>>> 
>>>> So I'd like to ask your opinion about this. Am I missing something? Do you 
>>>> think it's better to open a JIRA issue? If the solution is a graph aware 
>>>> stop filter, do you think it's better to change the existing filter or to 
>>>> subclass it?
>>>> 
>>>> Best, 
>>>> Andrea
>>>> 
>>>> 
>>> 
>> 
>
