No Solr patches necessary: SynonymQuery fixed that IDF issue 3 years ago. The advice on this thread is just extremely outdated.
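
To make the IDF point concrete, here is a rough Python sketch (not Lucene code). It uses the BM25-style idf formula with made-up document frequencies for a hypothetical synonym set; the names `bm25_idf`, `DOC_COUNT` and `DOC_FREQ` are illustrative only. It assumes that scoring the set as one term means sharing a single document frequency, taken here as the maximum over the set. That is my reading of what SynonymQuery does and should be checked against the Lucene version in use.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style idf: rarer terms (smaller doc_freq) get a larger value."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical collection statistics, for illustration only.
DOC_COUNT = 1_000_000
DOC_FREQ = {"tv": 200_000, "television": 50_000, "telly": 300}

# Old behaviour: each synonym is scored as an independent term, so the
# rare variant gets a much higher idf than the common ones.
per_term_idf = {term: bm25_idf(df, DOC_COUNT) for term, df in DOC_FREQ.items()}

# Single-term scoring (assumption: one shared doc_freq, the max of the set):
# every variant now contributes with the same idf.
shared_idf = bm25_idf(max(DOC_FREQ.values()), DOC_COUNT)

if __name__ == "__main__":
    for term, idf in per_term_idf.items():
        print(f"{term:<11} separate idf={idf:.3f}  shared idf={shared_idf:.3f}")
```

With separate scoring the rare variant dominates; with one shared statistic the choice of variant no longer changes the idf part of the score.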

On Fri, Jul 27, 2018 at 7:08 AM, Alessandro Benedetti <[email protected]> wrote:

> Hi all,
> I just want to add that
> "With synonyms at query time, you’ll see different idf for terms in the
> synonym set, with the rare variant scoring higher. That is probably the
> opposite of what is expected."
> should be solved by : https://issues.apache.org/jira/browse/SOLR-11662
>
> At least that feature brings flexibility in.
>
> Cheers
>
> --------------------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
>
> On Fri, Jul 27, 2018 at 3:25 AM, Michael Sokolov <[email protected]>
> wrote:
>
>> > In general I’d avoid index-time synonyms in lucene because synonyms
>> can create graphs (eg if a single term gets expanded to several terms), and
>> we can’t index graphs correctly.
>>
>> I wonder what it would take to address this. I guess the blast radius of
>> adding a token "width" could be pretty large. Is there an issue or any past
>> discussion about that?
>>
>> On Thu, Jul 26, 2018 at 11:42 AM Andrea Gazzarini <[email protected]>
>> wrote:
>>
>>> Hi Walter,
>>> many thanks for the response and without any constraint at all, I would
>>> agree with you. From your message I clearly understand your experience is
>>> greater than mine. My 2 cents inline below:
>>>
>>> > Move the synonym filter to the index analyzer chain. That provides
>>> better performance and avoids some surprising relevance behavior. With
>>> synonyms at query time, you’ll see different idf for terms in the synonym
>>> set, with the rare variant scoring higher. That is probably the opposite of
>>> what is expected.
>>>
>>> Unfortunately moving the synonym filter to the index analyzer is not an
>>> option: the project where I'm working on has a huge index and the synonyms
>>> list is something that (at least in this stage) frequently changes;
>>> re-index everything from scratch each time a change occurs is a big
>>> problem.
>>> On the other side, the IDF issue you mention doesn't produce so
>>> many unwanted effects, at least until now. But I got the point, thanks for
>>> the hint.
>>>
>>> > Also, phrase synonyms just don’t work at query time because the terms
>>> are parsed into individual tokens by the query parser, not the tokenizer.
>>> Here I don't get you: using the SynonymGraph Filter + SplitOnWhiteSpace
>>> = false + AutoGeneratePhraseQueries I get the synonym phrasing correctly
>>> working (see the first example in my email).
>>>
>>> > Don’t use stop words. Just remove that line. Removing stop words is a
>>> performance and space hack that was useful in the 1960’s, but causes
>>> problems now. I’ve never used stop word removal and I started in search
>>> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
>>> common words. Since we have idf, we can give a lower score to common words
>>> and keep them in the index.
>>>
>>> And this is, as I see, something which animated long discussions around
>>> using / avoiding stopwords. I will check your suggestion, what it means to
>>> apply that approach to my project, but in meantime I think, also looking at
>>> the JIRA Alan pointed in his answer, the issue is there, and it's real; I
>>> mean, it is something that it doesn't work as expected (my use case, as far
>>> as I understand, is just an example because the thing is broader and it is
>>> related to the FilteredTokenFilter)
>>>
>>> Thanks again,
>>> Andrea
>>>
>>> On 26/07/18 16:59, Walter Underwood wrote:
>>>
>>> Move the synonym filter to the index analyzer chain. That provides
>>> better performance and avoids some surprising relevance behavior. With
>>> synonyms at query time, you’ll see different idf for terms in the synonym
>>> set, with the rare variant scoring higher. That is probably the opposite of
>>> what is expected.
>>>
>>> Also, phrase synonyms just don’t work at query time because the terms
>>> are parsed into individual tokens by the query parser, not the tokenizer.
>>>
>>> Don’t use stop words. Just remove that line. Removing stop words is a
>>> performance and space hack that was useful in the 1960’s, but causes
>>> problems now. I’ve never used stop word removal and I started in search
>>> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
>>> common words. Since we have idf, we can give a lower score to common words
>>> and keep them in the index.
>>>
>>> Do those two things and it should work as you expect.
>>>
>>> wunder
>>> Walter Underwood
>>> [email protected]
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <[email protected]>
>>> wrote:
>>>
>>> Hi Alan, thanks for the response and thank you very much for the pointers
>>>
>>> On 26/07/18 12:16, Alan Woodward wrote:
>>>
>>> Hi Andrea,
>>>
>>> This is a long-standing issue: see
>>> https://issues.apache.org/jira/browse/LUCENE-4065 and
>>> https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.
>>> I don’t think we’ve reached a consensus on how to fix it yet, but more
>>> examples are good.
>>>
>>> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM
>>> tokens will work, because then you’ll generate queries that always fail -
>>> they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets
>>> indexed because it’s removed by the StopFilter at index time.
>>>
>>> - Alan
>>>
>>> On 26 Jul 2018, at 08:04, Andrea Gazzarini <[email protected]> wrote:
>>>
>>> Hi,
>>> I have the following field type definition:
>>>
>>> <fieldtype name="text" class="solr.TextField"
>>>            autoGeneratePhraseQueries="true">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.SynonymGraphFilterFactory"
>>>             synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>>>             ignoreCase="false"/>
>>>   </analyzer>
>>> </fieldtype>
>>>
>>> Where synonyms and stopwords are defined as follows:
>>>
>>> synonyms = out of warranty,oow
>>> stopwords = of
>>>
>>> Running the following query:
>>>
>>> q=my tv went out *of* warranty something *of*
>>>
>>> I get wrong results, with the following explain:
>>>
>>> title:my title:tv title:went (title:oow *PhraseQuery(title:"out ?
>>> warranty something"))*
>>>
>>> That is, the synonyms is correctly detected, I see the graph information
>>> are correctly reported in the positionLength, it seems they are wrongly
>>> interpreted by the QueryParser.
>>> I guess the reason is the "of" removal operated by the StopFilter, which
>>>
>>> - removes the "of" term within the phrase (I wouldn't want that)
>>> - creates a "hole" in the span defined by the "oow" term, which has
>>>   been marked as a synonym with a positionLength = 3, therefore including
>>>   the next available term (something).
>>>
>>> I tried to change the StopFilter in order to ignore stopwords that are
>>> marked as SYNONYM or that are part of a previous synonym span, and it
>>> works: it correctly produces the following query:
>>>
>>> title:my title:tv title:went *(title:oow PhraseQuery(title:"out of
>>> warranty"))* title:something
>>>
>>> So I'd like to ask your opinion about this. Am I missing something? Do
>>> you think it's better to open a JIRA issue? If the solution is a graph
>>> aware stop filter, do you think it's better to change the existing filter
>>> or to subclass it?
>>>
>>> Best,
>>> Andrea
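
On the stop-word "hole" in Andrea's original message (quoted above), a toy Python model may help show the mechanism. It is a sketch under stated assumptions, not Lucene's actual graph handling: `Token`, `remove_stopwords` and `side_path` are hypothetical names, the token stream is written by hand, and the consumer is assumed to advance one graph step per surviving token under the synonym.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # number of positions the token spans


def remove_stopwords(tokens: List[Token], stopwords: Set[str]) -> List[Token]:
    """Drop stop words; the next surviving token absorbs the skipped increment."""
    kept, pending = [], 0
    for tok in tokens:
        if tok.term in stopwords:
            pending += tok.pos_inc
        else:
            kept.append(tok._replace(pos_inc=tok.pos_inc + pending))
            pending = 0
    return kept


def side_path(tokens: List[Token], synonym_term: str) -> List[str]:
    """Toy consumer (assumption): one surviving token per step of the synonym span."""
    idx = next(i for i, t in enumerate(tokens) if t.term == synonym_term)
    span = tokens[idx].pos_len
    return [t.term for t in tokens[idx + 1 : idx + 1 + span]]


# Hand-written stream for "out of warranty something" after synonym expansion:
# "oow" starts at the same position as "out" and spans three positions.
STREAM = [
    Token("oow", 1, 3),
    Token("out", 0, 1),
    Token("of", 1, 1),
    Token("warranty", 1, 1),
    Token("something", 1, 1),
]

if __name__ == "__main__":
    print(side_path(STREAM, "oow"))
    print(side_path(remove_stopwords(STREAM, {"of"}), "oow"))
```

Before stop-word removal the span under "oow" covers "out of warranty"; once "of" is dropped, the same span length reaches "something", which matches the phrase reported in the explain output.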
