Hello,

I have opened SOLR-13009, which describes the problem. The attached patch contains a unit test that demonstrates it, i.e. the test currently fails. Any help would be greatly appreciated.
Many thanks,
Markus

https://issues.apache.org/jira/browse/SOLR-13009

-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Sunday 18th November 2018 23:21
> To: solr-user@lucene.apache.org; solr-user <solr-user@lucene.apache.org>
> Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum
> should match (edismax)
>
> Hello,
>
> Apologies for bothering you all again, but I really need some help in this
> matter. How can we resolve this issue? Are we dealing with a bug here (then
> I'll open a ticket), or am I doing something wrong?
>
> Is there anyone who has had the same issue or understands the problem?
>
> Many thanks,
> Markus
>
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Tuesday 13th November 2018 9:52
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: KeywordRepeat, stemming, (single term) synonyms and minimum should
> > match (edismax)
> >
> > Hello, apologies for this long-winded e-mail.
> >
> > Our fields have KeywordRepeat and language-specific filters such as a
> > stemmer; the final filter at query time is SynonymGraph. We do not use
> > RemoveDuplicatesFilter, for those of you wondering why when you see the
> > parsed queries below; this is due to [1].
> >
> > We use a custom QParser extending edismax and also extend
> > ExtendedSolrQueryParser, so we are able to override newFieldQuery in case
> > we have to. The problem also directly applies to Solr's vanilla edismax.
> > The file synonyms.txt contains the stemmed versions of the original terms.
> >
> > Consider this example synonym set [bier,brouw], where bier means beer and
> > brouw is the stemmed version of brouwsel (brewage, concoction), and
> > consider these parameters on /select:
> >
> > qf=content_nl&defType=edismax&mm=2<-1 5<-2 6<90%25
> >
> > The queries q=bier and q=brouw both parse to the following query and give
> > the desired results (notice the missing RemoveDuplicates here):
> >
> > +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier content_nl:brouw))~2))
> >
> > However, for q=brouwsel something (partially) unexpected happens:
> >
> > +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))
> >
> > This results in a BooleanQuery where, due to mm=2, both clauses need to
> > match, giving very few matches. Removing KeywordRepeat or setting mm=1 of
> > course fixes the problem, but that is not what we want.
> >
> > What is also unexpected, and may be related to the problem, is that when
> > checking the analyzer output via the GUI, we see the position incrementing
> > when KeywordRepeat and SynonymGraph are combined. When these filters are
> > not combined, the positions are always 1, as expected. When combined we get
> > this for 'brouw':
> >
> > term: bier  brouw  bier  brouw
> > pos:  1     1      2     2
> >
> > or for 'brouwsel':
> >
> > term: brouwsel  bier  brouw
> > pos:  1         2     2
> >
> > ExtendedSolrQueryParser, and everything underneath, is a complicated piece
> > of code. In the end it extends Lucene's QueryBuilder, but it does not
> > always rely on its results, it seems. Edismax, for example, 'resets'
> > minShouldMatch in SolrPluginUtils.setMinShouldMatch(), so this is a
> > complicated web of code and I am a bit too deep in this unfamiliar area,
> > and I am in need of help here.
> >
> > So, my question is, how to solve this problem? Or how to approach it? What
> > is the actual problem? How can I get the same stable results for both
> > queries? Does the odd position increment have anything to do with it (it
> > seems Lucene's QueryBuilder does something with it)? What do I need to do?
> >
> > Many thanks,
> > Markus
> >
> > ps. this is on Solr 7.2.1 and 7.5.0.
> >
> > [1]
> > http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
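
For readers following the thread, here is a minimal sketch of how the mm value quoted in the original message (2<-1 5<-2 6<90%, where %25 is the URL-encoded percent sign) maps a number of optional clauses to a required minimum. It is an illustrative Python approximation of the documented mm semantics, not a port of Solr's code; the function name calculate_min_should_match and the final clamping are assumptions made for this example.

```python
import re


def calculate_min_should_match(optional_clause_count: int, spec: str) -> int:
    """Approximate the documented (e)dismax mm semantics (hypothetical helper)."""
    result = optional_clause_count
    spec = spec.strip()

    if "<" in spec:
        # Conditional spec such as "2<-1 5<-2 6<90%": each "n<expr" applies
        # only when there are MORE than n optional clauses.
        spec = re.sub(r"\s*<\s*", "<", spec)
        for part in spec.split():
            upper_bound, sub_spec = part.split("<", 1)
            if optional_clause_count <= int(upper_bound):
                return result
            result = calculate_min_should_match(optional_clause_count, sub_spec)
        return result

    if spec.endswith("%"):
        # Percentage: truncate toward zero; negative means "all but that share".
        calc = int(optional_clause_count * int(spec[:-1]) / 100)
    else:
        # Plain integer: negative means "all but that many".
        calc = int(spec)
    result = optional_clause_count + calc if calc < 0 else calc

    # Assumption for this sketch: clamp to the valid range.
    return max(0, min(result, optional_clause_count))


if __name__ == "__main__":
    mm = "2<-1 5<-2 6<90%"
    for clauses in (1, 2, 3, 6, 10):
        print(clauses, calculate_min_should_match(clauses, mm))
```

With this spec, a query of two optional clauses requires both to match, which is consistent with the ~2 in the parsed query for q=brouwsel: the original token and the synonym clause become two separate clauses, and both are then mandatory.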