> In general I’d avoid index-time synonyms in lucene because synonyms can create graphs (eg if a single term gets expanded to several terms), and we can’t index graphs correctly.
I wonder what it would take to address this. I guess the blast radius of adding a token "width" could be pretty large. Is there an issue or any past discussion about that?

On Thu, Jul 26, 2018 at 11:42 AM Andrea Gazzarini <[email protected]> wrote:

> Hi Walter,
> many thanks for the response. Without any constraint at all, I would
> agree with you. From your message I clearly understand your experience is
> greater than mine. My 2 cents inline below:
>
> > Move the synonym filter to the index analyzer chain. That provides
> > better performance and avoids some surprising relevance behavior. With
> > synonyms at query time, you’ll see different idf for terms in the synonym
> > set, with the rare variant scoring higher. That is probably the opposite of
> > what is expected.
>
> Unfortunately, moving the synonym filter to the index analyzer is not an
> option: the project I'm working on has a huge index, and the synonyms
> list is something that (at least at this stage) changes frequently;
> re-indexing everything from scratch each time a change occurs is a big
> problem. On the other hand, the IDF issue you mention hasn't produced
> many unwanted effects, at least until now. But I got the point, thanks for
> the hint.
>
> > Also, phrase synonyms just don’t work at query time because the terms
> > are parsed into individual tokens by the query parser, not the tokenizer.
>
> Here I don't get you: using the SynonymGraph Filter + SplitOnWhiteSpace =
> false + AutoGeneratePhraseQueries I get the synonym phrasing correctly
> working (see the first example in my email).
>
> > Don’t use stop words. Just remove that line. Removing stop words is a
> > performance and space hack that was useful in the 1960’s, but causes
> > problems now. I’ve never used stop word removal and I started in search
> > with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
> > common words. Since we have idf, we can give a lower score to common words
> > and keep them in the index.
>
> And this is, as I see it, something which has animated long discussions
> about using / avoiding stopwords. I will check your suggestion and what it
> would mean to apply that approach to my project, but in the meantime I
> think, also looking at the JIRA Alan pointed to in his answer, the issue is
> there, and it's real: it is something that doesn't work as expected (my use
> case, as far as I understand, is just an example, because the thing is
> broader and is related to the FilteringTokenFilter).
>
> Thanks again,
> Andrea
>
> On 26/07/18 16:59, Walter Underwood wrote:
>
> Move the synonym filter to the index analyzer chain. That provides better
> performance and avoids some surprising relevance behavior. With synonyms at
> query time, you’ll see different idf for terms in the synonym set, with the
> rare variant scoring higher. That is probably the opposite of what is
> expected.
>
> Also, phrase synonyms just don’t work at query time because the terms are
> parsed into individual tokens by the query parser, not the tokenizer.
>
> Don’t use stop words. Just remove that line. Removing stop words is a
> performance and space hack that was useful in the 1960’s, but causes
> problems now. I’ve never used stop word removal and I started in search
> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
> common words. Since we have idf, we can give a lower score to common words
> and keep them in the index.
>
> Do those two things and it should work as you expect.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/ (my blog)
>
> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <[email protected]> wrote:
>
> Hi Alan, thanks for the response and thank you very much for the pointers
>
> On 26/07/18 12:16, Alan Woodward wrote:
>
> Hi Andrea,
>
> This is a long-standing issue: see
> https://issues.apache.org/jira/browse/LUCENE-4065 and
> https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.
> I don’t think we’ve reached a consensus on how to fix it yet, but more
> examples are good.
>
> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM
> tokens will work, because then you’ll generate queries that always fail:
> they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets
> indexed because it’s removed by the StopFilter at index time.
>
> - Alan
>
> On 26 Jul 2018, at 08:04, Andrea Gazzarini <[email protected]> wrote:
>
> Hi,
> I have the following field type definition:
>
> <fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
>   </analyzer>
> </fieldtype>
>
> Where synonyms and stopwords are defined as follows:
>
> synonyms = out of warranty,oow
> stopwords = of
>
> Running the following query:
>
> q=my tv went out *of* warranty something *of*
>
> I get wrong results, with the following explain:
>
> title:my title:tv title:went (title:oow *PhraseQuery(title:"out ? warranty something"))*
>
> That is, the synonym is correctly detected and I see the graph information
> is correctly reported in the positionLength, but it seems to be wrongly
> interpreted by the QueryParser.
> I guess the reason is the "of" removal operated by the StopFilter, which
>
> - removes the "of" term within the phrase (I wouldn't want that)
> - creates a "hole" in the span defined by the "oow" term, which has
>   been marked as a synonym with a positionLength = 3, therefore including
>   the next available term (something).
>
> I tried to change the StopFilter in order to ignore stopwords that are
> marked as SYNONYM or that are part of a previous synonym span, and it
> works: it correctly produces the following query:
>
> title:my title:tv title:went *(title:oow PhraseQuery(title:"out of warranty"))* title:something
>
> So I'd like to ask your opinion about this. Am I missing something? Do you
> think it's better to open a JIRA issue? If the solution is a graph-aware
> stop filter, do you think it's better to change the existing filter or to
> subclass it?
>
> Best,
> Andrea
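
For anyone trying to picture the "hole" from the original report, here is a rough toy model in Python. It is not Lucene code: `Token`, `synonym_graph`, `stop_filter` and `side_path` are hypothetical names made up for this sketch. It also assumes that whatever consumes the token graph counts every position increment as at most 1 (i.e. it ignores holes), which is only my reading of why the reported query ends up containing "something".

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    pos_inc: int = 1  # position increment relative to the previous token
    pos_len: int = 1  # number of positions this token spans


def synonym_graph(terms, phrase, synonym):
    """Toy graph synonym step: emit `synonym` spanning the whole phrase,
    then the phrase's own tokens stacked at the same start position."""
    out, i, n = [], 0, len(phrase)
    while i < len(terms):
        if terms[i:i + n] == phrase:
            out.append(Token(synonym, 1, n))
            out.append(Token(phrase[0], 0, 1))
            out.extend(Token(t) for t in phrase[1:])
            i += n
        else:
            out.append(Token(terms[i]))
            i += 1
    return out


def stop_filter(tokens, stopwords):
    """Drop stopwords, folding their increment into the next kept token.
    Note that pos_len of earlier tokens is left untouched."""
    out, skipped = [], 0
    for tok in tokens:
        if tok.term in stopwords:
            skipped += tok.pos_inc
        else:
            out.append(Token(tok.term, tok.pos_inc + skipped, tok.pos_len))
            skipped = 0
    return out


def side_path(tokens, index):
    """Terms covered by tokens[index], ASSUMING the consumer ignores holes
    (each increment counts as at most 1)."""
    pos, start, end, covered = -1, None, None, []
    for i, tok in enumerate(tokens):
        pos += min(1, tok.pos_inc)
        if i == index:
            start, end = pos, pos + tok.pos_len
        elif start is not None and start <= pos < end:
            covered.append(tok.term)
    return covered


terms = "my tv went out of warranty something of".split()
graph = synonym_graph(terms, ["out", "of", "warranty"], "oow")
filtered = stop_filter(graph, {"of"})

oow = next(i for i, t in enumerate(graph) if t.term == "oow")
print(side_path(graph, oow))     # ['out', 'of', 'warranty']
print(side_path(filtered, oow))  # ['out', 'warranty', 'something']
```

In this toy, "oow" keeps `pos_len = 3` after "of" is removed, so the alternative path picks up "something", which matches the shape of the explain output in the report. A real fix would have to deal with the index-side problem Alan describes, which the sketch does not attempt.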
