Re: SynonymGraphFilter followed by StopFilter

Robert Muir Fri, 27 Jul 2018 04:28:53 -0700

No Solr patches are necessary: SynonymQuery fixed that IDF issue 3 years ago.
There is just extremely outdated advice on this thread.
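
To make the IDF point concrete, here is a minimal Python sketch (not Lucene
code). It assumes a BM25-style idf formula, and it assumes that SynonymQuery
scores a synonym set as a single pseudo-term whose document frequency is the
largest one among its members. The corpus size, the terms and the document
frequencies are made up for illustration.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """BM25-style inverse document frequency."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical corpus statistics for two single-term synonyms.
NUM_DOCS = 1_000_000
doc_freqs = {"tv": 50_000, "television": 2_000}

# Scoring each synonym as an independent term: the rare variant wins.
per_term_idf = {term: bm25_idf(df, NUM_DOCS) for term, df in doc_freqs.items()}

# Scoring the synonym set as one pseudo-term (assumption: it takes the
# largest document frequency of the set), so every variant gets the same idf.
shared_idf = bm25_idf(max(doc_freqs.values()), NUM_DOCS)

for term, idf in per_term_idf.items():
    print(f"{term}: per-term idf={idf:.3f}, shared idf={shared_idf:.3f}")
```

With independent scoring, "television" gets a higher idf than "tv" only
because it is rarer; with the shared statistic both variants score alike.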


On Fri, Jul 27, 2018 at 7:08 AM, Alessandro Benedetti <[email protected]>
wrote:

> Hi all,
> I just want to add that
> "With synonyms at query time, you’ll see different idf for terms in the
> synonym set, with the rare variant scoring higher. That is probably the
> opposite of what is expected."
> should be solved by : https://issues.apache.org/jira/browse/SOLR-11662
>
> At least that feature brings flexibility in.
>
> Cheers
>
> --------------------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
>
> On Fri, Jul 27, 2018 at 3:25 AM, Michael Sokolov <[email protected]>
> wrote:
>
>>  > In general I’d avoid index-time synonyms in lucene because synonyms
>> can create graphs (eg if a single term gets expanded to several terms), and
>> we can’t index graphs correctly.
>>
>> I wonder what it would take to address this. I guess the blast radius of
>> adding a token "width" could be pretty large. Is there an issue or any past
>> discussion about that?
>>
>> On Thu, Jul 26, 2018 at 11:42 AM Andrea Gazzarini <[email protected]>
>> wrote:
>>
>>> Hi Walter,
>>> many thanks for the response. If I had no constraints at all, I would
>>> agree with you, and from your message it's clear that your experience is
>>> greater than mine. My 2 cents inline below:
>>>
>>> > Move the synonym filter to the index analyzer chain. That provides
>>> better performance and avoids some surprising relevance behavior. With
>>> synonyms at query time, you’ll see different idf for terms in the synonym
>>> set, with the rare variant scoring higher. That is probably the opposite of
>>> what is expected.
>>>
>>> Unfortunately, moving the synonym filter to the index analyzer is not an
>>> option: the project I'm working on has a huge index, and the synonym
>>> list is something that (at least at this stage) changes frequently, so
>>> re-indexing everything from scratch each time a change occurs is a big
>>> problem. On the other hand, the IDF issue you mention hasn't produced
>>> many unwanted effects, at least so far. But I get the point, thanks for
>>> the hint.
>>>
>>> > Also, phrase synonyms just don’t work at query time because the terms
>>> are parsed into individual tokens by the query parser, not the tokenizer.
>>>
>>> Here I don't follow you: using the SynonymGraphFilter + SplitOnWhiteSpace
>>> = false + AutoGeneratePhraseQueries, I get synonym phrases working
>>> correctly (see the first example in my email).
>>>
>>> > Don’t use stop words. Just remove that line. Removing stop words is a
>>> performance and space hack that was useful in the 1960’s, but causes
>>> problems now. I’ve never used stop word removal and I started in search
>>> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
>>> common words. Since we have idf, we can give a lower score to common words
>>> and keep them in the index.
>>>
>>> And this, as I see it, is a topic that has fueled long discussions about
>>> using or avoiding stopwords. I will look into your suggestion and what
>>> applying it would mean for my project. In the meantime, also looking at
>>> the JIRA issues Alan pointed to in his answer, I think the issue is there
>>> and it's real: something doesn't work as expected. As far as I
>>> understand, my use case is just one example, because the problem is
>>> broader and relates to the FilteringTokenFilter.
>>>
>>> Thanks again,
>>> Andrea
>>>
>>> On 26/07/18 16:59, Walter Underwood wrote:
>>>
>>> Move the synonym filter to the index analyzer chain. That provides
>>> better performance and avoids some surprising relevance behavior. With
>>> synonyms at query time, you’ll see different idf for terms in the synonym
>>> set, with the rare variant scoring higher. That is probably the opposite of
>>> what is expected.
>>>
>>> Also, phrase synonyms just don’t work at query time because the terms
>>> are parsed into individual tokens by the query parser, not the tokenizer.
>>>
>>> Don’t use stop words. Just remove that line. Removing stop words is a
>>> performance and space hack that was useful in the 1960’s, but causes
>>> problems now. I’ve never used stop word removal and I started in search
>>> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
>>> common words. Since we have idf, we can give a lower score to common words
>>> and keep them in the index.
>>>
>>> Do those two things and it should work as you expect.
>>>
>>> wunder
>>> Walter Underwood
>>> [email protected]
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <[email protected]>
>>> wrote:
>>>
>>> Hi Alan, thanks for the response and thank you very much for the pointers
>>>
>>> On 26/07/18 12:16, Alan Woodward wrote:
>>>
>>> Hi Andrea,
>>>
>>> This is a long-standing issue: see
>>> https://issues.apache.org/jira/browse/LUCENE-4065 and
>>> https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.
>>> I don’t think we’ve reached a consensus on how to fix it yet, but more
>>> examples are good.
>>>
>>> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM
>>> tokens will work, because then you’ll generate queries that always fail -
>>> they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets
>>> indexed because it’s removed by the StopFilter at index time.
>>>
>>> - Alan
>>>
>>> On 26 Jul 2018, at 08:04, Andrea Gazzarini <[email protected]> wrote:
>>>
>>> Hi,
>>> I have the following field type definition:
>>>
>>> <fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
>>>     <analyzer type="index">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>>         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
>>>     </analyzer>
>>> </fieldtype>
>>>
>>> Where synonyms and stopwords are defined as follows:
>>>
>>> synonyms = out of warranty,oow
>>> stopwords = of
>>>
>>> Running the following query:
>>>
>>> q=my tv went out *of* warranty something *of*
>>>
>>> I get wrong results, with the following explain:
>>>
>>> title:my title:tv title:went (title:oow *PhraseQuery(title:"out ? warranty something"))*
>>>
>>> That is, the synonym is correctly detected and I can see that the graph
>>> information is correctly reported in the positionLength, but it seems to
>>> be wrongly interpreted by the QueryParser.
>>> I guess the reason is the "of" removal performed by the StopFilter, which
>>>
>>>    - removes the "of" term within the phrase (I wouldn't want that)
>>>    - creates a "hole" in the span defined by the "oow" term, which has
>>>    been marked as a synonym with a positionLength = 3, therefore
>>>    including the next available term (something).
>>>
>>> I tried to change the StopFilter in order to ignore stopwords that are
>>> marked as SYNONYM or that are part of a previous synonym span, and it
>>> works: it correctly produces the following query:
>>>
>>> title:my title:tv title:went *(title:oow PhraseQuery(title:"out of warranty"))* title:something
>>>
>>> So I'd like to ask your opinion about this. Am I missing something? Do
>>> you think it's better to open a JIRA issue? If the solution is a graph
>>> aware stop filter, do you think it's better to change the existing filter
>>> or to subclass it?
>>>
>>> Best,
>>> Andrea
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>
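
The "hole" described in the original message at the bottom of this thread can
be reproduced with a small, simplified Python model. This is not Lucene's
code: `Token`, `stop_filter` and `side_path` are hypothetical names, and the
`collapse_holes` flag encodes an assumption about the consumer, namely that
it clamps position increments to at most 1 while building the graph, which
would explain the explain output reported above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int = 1  # position increment relative to the previous token
    pos_len: int = 1  # how many positions this token spans


def stop_filter(tokens, stopwords):
    """Drop stopwords, adding their increments to the next kept token."""
    kept, skipped = [], 0
    for tok in tokens:
        if tok.term in stopwords:
            skipped += tok.pos_inc
        else:
            kept.append(Token(tok.term, tok.pos_inc + skipped, tok.pos_len))
            skipped = 0
    return kept


def side_path(tokens, synonym, collapse_holes=True):
    """Return the terms lying inside the span covered by `synonym`."""
    pos, placed = -1, []
    for tok in tokens:
        # Assumption: the consumer may clamp increments to at most 1.
        pos += min(1, tok.pos_inc) if collapse_holes else tok.pos_inc
        placed.append((tok.term, pos, pos + tok.pos_len))
    start, end = next((s, e) for term, s, e in placed if term == synonym)
    return [t for t, s, e in placed if t != synonym and s >= start and e <= end]


# "my tv went out of warranty something of" after the synonym graph filter:
# "oow" is stacked on "out" and spans three positions.
stream = [
    Token("my"), Token("tv"), Token("went"),
    Token("oow", pos_len=3), Token("out", pos_inc=0),
    Token("of"), Token("warranty"), Token("something"), Token("of"),
]

print(side_path(stream, "oow"))
filtered = stop_filter(stream, {"of"})
print(side_path(filtered, "oow"))
print(side_path(filtered, "oow", collapse_holes=False))
```

In this model the unfiltered stream yields the path out / of / warranty.
After the stop filter, collapsing the gap pulls "something" into the span,
while honouring the gap keeps only out / warranty (with a hole in between).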
