[ https://issues.apache.org/jira/browse/SOLR-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370883#comment-16370883 ]
Steve Rowe commented on SOLR-11968:
-----------------------------------

bq. I think the root cause is LUCENE-4065. I'll try to make a simple test demonstrating this.

Not so - LUCENE-4065 should probably be closed as won't-fix (I'll comment there in a sec).

Instead, this looks like the problem described in LUCENE-7848. I tracked the problem down to a bug in Lucene's QueryBuilder, which is dropping tokens in side paths with position gaps that are caused by StopFilter.

Below is a test that shows the problem - MockSynonymFilter has synonym "cavy" for "guinea pig", and the anonymous analyzer below has "pig" on its stopfilter's stoplist. QueryBuilder produces a query for only "cavy", even though the token stream also contains "guinea".

{code:java|title=TestQueryBuilder.java}
  public void testGraphStop() {
    // Expected query: both the original term "guinea" and the synonym "cavy".
    Query syn1 = new TermQuery(new Term("field", "guinea"));
    Query syn2 = new TermQuery(new Term("field", "cavy"));

    BooleanQuery synQuery = new BooleanQuery.Builder()
        .add(syn1, BooleanClause.Occur.SHOULD)
        .add(syn2, BooleanClause.Occur.SHOULD)
        .build();
    BooleanQuery expectedGraphQuery = new BooleanQuery.Builder()
        .add(synQuery, BooleanClause.Occur.SHOULD)
        .build();

    QueryBuilder queryBuilder = new QueryBuilder(new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        MockTokenizer tokenizer = new MockTokenizer();
        TokenStream stream = new MockSynonymFilter(tokenizer);
        // "pig" is on the stoplist, so removing it leaves a position gap.
        stream = new StopFilter(stream, CharArraySet.copy(Collections.singleton("pig")));
        return new TokenStreamComponents(tokenizer, stream);
      }
    });
    queryBuilder.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
    assertEquals(expectedGraphQuery,
        queryBuilder.createBooleanQuery("field", "guinea pig", BooleanClause.Occur.SHOULD));
  }
{code}

> Multi-words query time synonyms
> -------------------------------
>
>                 Key: SOLR-11968
>                 URL: https://issues.apache.org/jira/browse/SOLR-11968
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: query parsers, Schema and Analysis
>    Affects Versions: master (8.0), 6.6.2
>         Environment: Centos 7.x
>            Reporter: Dominique Béjean
>            Priority: Major
>
> I am trying multi-word query-time synonyms with Solr 6.6.2 and the
> SynonymGraphFilterFactory filter, as explained in this article:
> [https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/]
>
> My field type is:
> {code:java}
> <fieldType name="textSyn" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.FrenchMinimalStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.FrenchMinimalStemFilterFactory"/>
>   </analyzer>
> </fieldType>{code}
>
> synonyms.txt contains the line:
> {code:java}
> om, olympique de marseille{code}
>
> stopwords.txt contains the word
> {code:java}
> de{code}
>
> The order of words in my query has an impact on the query generated by edismax
> {code:java}
> q={!edismax qf='name_text_gp' v=$qq}
> &sow=false
> &qq=...{code}
> With "qq=om maillot" or "qq=olympique de marseille maillot", I can see the
> synonym expansion. It is working as expected.
> {code:java}
> "parsedquery_toString":"+(((+name_text_gp:olympiqu +name_text_gp:marseil +name_text_gp:maillot) name_text_gp:om))",
> "parsedquery_toString":"+((name_text_gp:om (+name_text_gp:olympiqu +name_text_gp:marseil +name_text_gp:maillot)))",{code}
> With "qq=maillot om" or "qq=maillot olympique de marseille", I get the same
> generated query for both:
> {code:java}
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",
> "parsedquery_toString":"+((name_text_gp:maillot) (name_text_gp:om))",{code}
> I don't understand these generated queries. The first one looks like the
> synonym expansion is ignored, but the second one shows it is not ignored and
> only the synonym term is used.
>
> When I test the analysis for the field type, the synonyms are correctly
> expanded for all four expressions
> {code:java}
> om maillot
> maillot om
> olympique de marseille maillot
> maillot olympique de marseille{code}
> The resulting outputs always include the following terms (obviously not always
> in the same order)
> {code:java}
> olympiqu om marseil maillot {code}
>
> So, I suspect an issue with the edismax query parser.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
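
As a postscript to the testGraphStop comment above, here is a minimal sketch of the token graph that the test's analyzer is expected to hand to QueryBuilder. It is a toy model in plain Java, not Lucene code: the `Tok` class, the `stopFilter` helper, and the specific position increments and lengths assigned to "guinea", "cavy", and "pig" are assumptions made for illustration. The point it shows is only that removing "pig" leaves both "guinea" and "cavy" in the stream, so a query containing only "cavy" has lost a token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class GraphStopSketch {

    /** Hypothetical stand-in for a token: term text plus position attributes. */
    static final class Tok {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int posLen; // how many positions the token spans

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }

        @Override
        public String toString() {
            return term + "(posInc=" + posInc + ",posLen=" + posLen + ")";
        }
    }

    /**
     * Toy stop filter: drops stopwords and adds each dropped token's
     * increment to the next kept token, which is what creates a position gap.
     */
    static List<Tok> stopFilter(List<Tok> in, Set<String> stopwords) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            if (stopwords.contains(t.term)) {
                skipped += t.posInc;
            } else {
                out.add(new Tok(t.term, t.posInc + skipped, t.posLen));
                skipped = 0;
            }
        }
        return out;
    }

    /** Terms left after removing "pig" from the assumed "guinea pig" graph. */
    static List<String> survivingTerms() {
        // Assumed graph: "cavy" is stacked on "guinea" and spans two positions.
        List<Tok> graph = Arrays.asList(
                new Tok("guinea", 1, 1),
                new Tok("cavy", 0, 2),
                new Tok("pig", 1, 1));
        List<String> terms = new ArrayList<>();
        for (Tok t : stopFilter(graph, Collections.singleton("pig"))) {
            terms.add(t.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(survivingTerms()); // prints [guinea, cavy]
    }
}
```

In this model "cavy" still claims two positions while "guinea" is followed by a hole where "pig" used to be. That is the kind of side path with a position gap the comment describes.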