RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

Markus Jelsma Thu, 22 Nov 2018 06:39:36 -0800

Hello,

I have opened SOLR-13009 describing the problem. The attached patch contains 
a unit test demonstrating the problem, i.e. the test fails. Any help would be 
greatly appreciated.


Many thanks,
Markus

https://issues.apache.org/jira/browse/SOLR-13009

 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Sunday 18th November 2018 23:21
> To: solr-user@lucene.apache.org; solr-user <solr-user@lucene.apache.org>
> Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum 
> should match (edismax)
> 
> Hello,
> 
> Apologies for bothering you all again, but I really need some help in this 
> matter. How can we resolve this issue? Are we dealing with a bug here (then 
> I'll open a ticket), or am I doing something wrong?
> 
> Is there anyone who has had the same issue or understands the problem?
> 
> Many thanks,
> Markus 
> 
>  
>  
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Tuesday 13th November 2018 9:52
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: KeywordRepeat, stemming, (single term) synonyms and minimum should 
> > match (edismax)
> > 
> > Hello, apologies for this long-winded e-mail.
> > 
> > Our fields have KeywordRepeat and language-specific filters such as a 
> > stemmer; the final filter at query time is SynonymGraph. In case you 
> > wonder when you see the parsed queries below: we do not use 
> > RemoveDuplicatesFilter, and the reason is described in [1]. 
> > 
> > We use a custom QParser extending edismax and also extend 
> > ExtendedSolrQueryParser, so we are able to override newFieldQuery in case 
> > we have to. The problem also directly applies to Solr's vanilla edismax. 
> > The file synonyms.txt contains the stemmed versions of the original terms.
> > 
> > Consider this example synonym set [bier,brouw] where bier means beer and 
> > brouw is the stemmed version of brouwsel (brewage, concoction), and 
> > consider these parameters on /select: qf=content_nl&defType=edismax&mm=2<-1 
> > 5<-2 6<90%25.
> > 
> > The queries q=bier and q=brouw both parse to the following query and give 
> > the desired results (notice the missing RemoveDuplicates here):
> > +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
> > content_nl:brouw))~2))
> > 
> > However, for q=brouwsel something (partially) unexpected happens:
> > +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))
> > 
> > This results in a BooleanQuery where, due to mm=2, both clauses need to 
> > match, giving very few matches. Removing KeywordRepeat or setting mm=1 of 
> > course fixes the problem, but that is not what we want.
> > 
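For readers unfamiliar with the mm syntax: the value used above is a conditional specification (2<-1 5<-2 6<90%, where %25 is the URL-encoded percent sign). The Python sketch below is a simplified model of how such a specification is evaluated, based on the documented behaviour of the mm parameter. The function name min_should_match is hypothetical, and this is not Solr's actual code; rounding details in particular are an assumption.

```python
import re


def min_should_match(clause_count: int, spec: str) -> int:
    """Return how many optional clauses must match under an mm spec.

    Simplified model of the documented mm semantics, not Solr's code.
    """
    spec = spec.strip()
    result = clause_count
    if "<" in spec:
        # Conditional spec: "N<expr" applies expr only when the query has
        # more than N optional clauses. Parts are checked left to right.
        spec = re.sub(r"\s*<\s*", "<", spec)
        for part in spec.split():
            bound, expr = part.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, expr)
        return result
    if spec.endswith("%"):
        # Percentage of the clause count, truncated toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # A negative value means "all clauses except this many".
    result = clause_count + calc if calc < 0 else calc
    return max(result, 0)


if __name__ == "__main__":
    mm = "2<-1 5<-2 6<90%"
    for n in (1, 2, 3, 6, 10):
        print(n, min_should_match(n, mm))
```

Under this model, two optional clauses give a minimum of 2, which corresponds to the ~2 in the parsed queries: both clauses have to match.
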
> > What is also unexpected, and may be related to the problem, is that when 
> > checking the analyzer output via the GUI, we see the position incrementing 
> > when KeywordRepeat and SynonymGraph are combined. When these filters are 
> > not combined, the positions are always 1, as expected. When combined we get 
> > this for 'brouw':
> > term: bier  brouw  bier  brouw
> > pos:  1     1      2     2
> > 
> > or for 'brouwsel':
> > term: brouwsel  bier  brouw
> > pos:  1         2     2
> > 
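To illustrate why the positions matter, here is a minimal Python sketch of a simplified model in which tokens sharing a position become one Synonym clause and each distinct position becomes its own boolean clause. This mirrors the shape of the parsed queries shown above, but it is an assumption about the mechanism, not Lucene's QueryBuilder itself; the function name clauses_from_tokens is hypothetical.

```python
from itertools import groupby


def clauses_from_tokens(field, tokens):
    """Group (term, position) tokens into one clause per position.

    Simplified model only: a single term at a position becomes a plain
    term clause, several terms at one position become a Synonym clause.
    """
    clauses = []
    for _, group in groupby(tokens, key=lambda token: token[1]):
        terms = [f"{field}:{term}" for term, _ in group]
        if len(terms) == 1:
            clauses.append(terms[0])
        else:
            clauses.append("Synonym(" + " ".join(terms) + ")")
    return clauses


if __name__ == "__main__":
    # Token streams as reported by the analysis GUI in this thread.
    brouw = [("bier", 1), ("brouw", 1), ("bier", 2), ("brouw", 2)]
    brouwsel = [("brouwsel", 1), ("bier", 2), ("brouw", 2)]
    print(clauses_from_tokens("content_nl", brouw))
    print(clauses_from_tokens("content_nl", brouwsel))
```

In this model 'brouw' yields two identical Synonym clauses, while 'brouwsel' yields a plain term clause plus a Synonym clause. With a minimum of two matching clauses, the second case requires a document to contain brouwsel as well as bier or brouw.
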
> > ExtendedSolrQueryParser, and everything underneath it, is a complicated 
> > piece of code. In the end it extends Lucene's QueryBuilder, but it does not 
> > always seem to rely on its results. Edismax, for example, 'resets' 
> > minShouldMatch in SolrPluginUtils.setMinShouldMatch(). This is a 
> > complicated web of code, I am a bit too deep in this unfamiliar area, and 
> > I need help here.
> > 
> > So, my question is: how do I solve this problem, or how do I approach it? 
> > What is the actual problem? How can I get the same stable results for both 
> > queries? Does the odd position increment have anything to do with it (it 
> > seems Lucene's QueryBuilder does something with it)? What do I need to do?
> > 
> > Many thanks,
> > Markus
> > 
> > ps. this is on Solr 7.2.1 and 7.5.0.
> > 
> > [1] 
> > http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
> > 
>
