> In general I’d avoid index-time synonyms in Lucene because synonyms can
> create graphs (e.g. if a single term gets expanded to several terms), and we
> can’t index graphs correctly.

I wonder what it would take to address this. I guess the blast radius of
adding a token "width" could be pretty large. Is there an issue or any past
discussion about that?
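
To make the problem concrete, here is a toy Python model of what I mean. It is only a sketch, not Lucene code: `Token`, `build_index`, `flat_phrase_matches` and `graph_phrase_matches` are names I made up for illustration, and the token positions assume "out of warranty" was expanded to "oow" at index time. The index keeps only term positions, so the length of the "oow" token is lost.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    position: int
    position_length: int = 1  # how many positions the token spans


# "my tv went out of warranty something", with "oow" added as a synonym
# that spans the three positions of "out of warranty".
TOKENS = [
    Token("my", 0), Token("tv", 1), Token("went", 2),
    Token("out", 3), Token("oow", 3, 3),
    Token("of", 4), Token("warranty", 5), Token("something", 6),
]


def build_index(tokens):
    """Toy postings: term -> set of positions. position_length is dropped."""
    postings = defaultdict(set)
    for tok in tokens:
        postings[tok.term].add(tok.position)
    return postings


def flat_phrase_matches(postings, terms):
    """Phrase match using positions only: term i must sit at start + i."""
    return any(
        all(start + i in postings.get(term, ()) for i, term in enumerate(terms))
        for start in postings.get(terms[0], ())
    )


def graph_phrase_matches(tokens, terms):
    """Phrase match that follows graph edges: the next term must start
    where the previous token ends (position + position_length)."""
    by_start = defaultdict(list)
    for tok in tokens:
        by_start[tok.position].append(tok)

    def walk(pos, remaining):
        if not remaining:
            return True
        for tok in by_start.get(pos, ()):
            if tok.term == remaining[0] and walk(
                tok.position + tok.position_length, remaining[1:]
            ):
                return True
        return False

    return any(walk(pos, terms) for pos in list(by_start))


index = build_index(TOKENS)
for phrase in (["oow", "something"], ["oow", "of"]):
    print(phrase,
          "flat:", flat_phrase_matches(index, phrase),
          "graph:", graph_phrase_matches(TOKENS, phrase))
```

In this model the positions-only index misses the phrase "oow something" (which the graph says should match) and wrongly matches "oow of" (which the graph says should not). Storing something like the token "width" alongside each position is what would let the second function run against the index.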

On Thu, Jul 26, 2018 at 11:42 AM Andrea Gazzarini <[email protected]>
wrote:

> Hi Walter,
> many thanks for the response. Without any constraints, I would agree with
> you. From your message I clearly understand that your experience is
> greater than mine. My 2 cents inline below:
>
> > Move the synonym filter to the index analyzer chain. That provides
> better performance and avoids some surprising relevance behavior. With
> synonyms at query time, you’ll see different idf for terms in the synonym
> set, with the rare variant scoring higher. That is probably the opposite of
> what is expected.
>
> Unfortunately, moving the synonym filter to the index analyzer is not an
> option: the project I'm working on has a huge index, and the synonyms
> list is something that (at least at this stage) changes frequently;
> re-indexing everything from scratch each time a change occurs is a big
> problem. On the other hand, the IDF issue you mention hasn't produced
> many unwanted effects, at least until now. But I got the point, thanks for
> the hint.
>
> > Also, phrase synonyms just don’t work at query time because the terms
> are parsed into individual tokens by the query parser, not the tokenizer.
>
> Here I don't follow you: using the SynonymGraphFilter + SplitOnWhiteSpace =
> false + AutoGeneratePhraseQueries, I get phrase synonyms working correctly
> (see the first example in my email).
>
> > Don’t use stop words. Just remove that line. Removing stop words is a
> performance and space hack that was useful in the 1960’s, but causes
> problems now. I’ve never used stop word removal and I started in search
> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
> common words. Since we have idf, we can give a lower score to common words
> and keep them in the index.
>
> This, as I see it, is something that has prompted long discussions about
> using or avoiding stopwords. I will look into your suggestion and what
> applying it would mean for my project. In the meantime, also looking at
> the JIRA issues Alan pointed to in his answer, I think the issue is there
> and it's real: something doesn't work as expected. My use case, as far as
> I understand, is just one example, because the problem is broader and
> relates to FilteringTokenFilter.
>
> Thanks again,
> Andrea
>
> On 26/07/18 16:59, Walter Underwood wrote:
>
> Move the synonym filter to the index analyzer chain. That provides better
> performance and avoids some surprising relevance behavior. With synonyms at
> query time, you’ll see different idf for terms in the synonym set, with the
> rare variant scoring higher. That is probably the opposite of what is
> expected.
>
> Also, phrase synonyms just don’t work at query time because the terms are
> parsed into individual tokens by the query parser, not the tokenizer.
>
> Don’t use stop words. Just remove that line. Removing stop words is a
> performance and space hack that was useful in the 1960’s, but causes
> problems now. I’ve never used stop word removal and I started in search
> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
> common words. Since we have idf, we can give a lower score to common words
> and keep them in the index.
>
> Do those two things and it should work as you expect.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/  (my blog)
>
> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <[email protected]>
> wrote:
>
> Hi Alan, thanks for the response and thank you very much for the pointers
>
> On 26/07/18 12:16, Alan Woodward wrote:
>
> Hi Andrea,
>
> This is a long-standing issue: see
> https://issues.apache.org/jira/browse/LUCENE-4065 and
> https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.  I
> don’t think we’ve reached a consensus on how to fix it yet, but more
> examples are good.
>
> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM
> tokens will work, because then you’ll generate queries that always fail -
> they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets
> indexed because it’s removed by the StopFilter at index time.
>
> - Alan
>
> On 26 Jul 2018, at 08:04, Andrea Gazzarini <[email protected]> wrote:
>
> Hi,
> I have the following field type definition:
>
> <fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
>     <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
>     </analyzer>
> </fieldtype>
>
> Where synonyms and stopwords are defined as follows:
>
> synonyms = out of warranty,oow
> stopwords = of
>
> Running the following query:
>
> q=my tv went out *of* warranty something *of*
>
> I get wrong results, with the following explain:
>
> title:my title:tv title:went (title:oow *PhraseQuery(title:"out ?
> warranty something"))*
>
> That is, the synonym is correctly detected and the graph information is
> correctly reported in the positionLength, but it seems to be wrongly
> interpreted by the QueryParser.
> I guess the reason is the removal of "of" by the StopFilter, which
>
>    - removes the "of" term within the phrase (I wouldn't want that)
>    - creates a "hole" in the span defined by the "oow" term, which has
>    been marked as a synonym with a positionLength = 3, therefore including the
>    next available term (something).
>
> I tried to change the StopFilter in order to ignore stopwords that are
> marked as SYNONYM or that are part of a previous synonym span, and it
> works: it correctly produces the following query:
>
> title:my title:tv title:went *(title:oow PhraseQuery(title:"out of
> warranty"))* title:something
>
> So I'd like to ask your opinion about this. Am I missing something? Do you
> think it's better to open a JIRA issue? If the solution is a graph aware
> stop filter, do you think it's better to change the existing filter or to
> subclass it?
>
> Best,
> Andrea
>