RE: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-22 Thread Markus Jelsma
Hello Walter, Steve,

That is not going to be easy: we have many Germanic languages in the index, 
all with support for splitting compound words.

I also do not like the idea of adding all inflections to the synonyms file. It 
blows up our queries N-fold, and they are already very big because we search 
over many fields in many languages. I also believe it is counterintuitive: I 
have a stemmer for that.

Ideally I would want to fix this in mm, with something like what mm.autoRelax 
does.

Many thanks,
Markus


Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-21 Thread Walter Underwood
You can find all the inflected forms that are in your index. Search for the 
root form, use highlighting to pull out matches, and collect them. It is a 
bother, but not that hard for a program to do.
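
A rough sketch of the "collect them" step, assuming the search response is 
Solr's JSON format with the default <em>...</em> highlight tags. The function 
name and the sample response are made up for illustration, and the sample is 
hand-written, not real Solr output:

```python
import re

# Default Solr highlighting wraps each match in <em>...</em>.
# Adjust the pattern if hl.simple.pre / hl.simple.post are set differently.
EM = re.compile(r"<em>(.*?)</em>")


def collect_surface_forms(response, field):
    """Return the distinct highlighted strings for `field`, lowercased and sorted."""
    forms = set()
    for per_doc in response.get("highlighting", {}).values():
        for snippet in per_doc.get(field, []):
            forms.update(match.lower() for match in EM.findall(snippet))
    return sorted(forms)


# Hand-written sample in the shape of the "highlighting" section (illustrative only).
sample = {
    "highlighting": {
        "doc1": {"title_nl": ["een <em>traject</em> naar huis"]},
        "doc2": {"title_nl": ["alle <em>Trajecten</em> en een <em>traject</em>"]},
        "doc3": {},
    }
}

print(collect_surface_forms(sample, "title_nl"))
```

Running this once per root form in the synonym file would give the list of 
inflected forms that actually occur in the index.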

In the synonym file, you don’t need to list an inflected form of the synonym, 
because it will be stemmed. So:

traject => verbind
trajecten => verbind
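
Generating those lines from a collected list of forms could look roughly like 
this. It is only a sketch: explicit_rules is a hypothetical helper, and it 
assumes one target per group of forms:

```python
def explicit_rules(forms, target):
    """Return one 'form => target' synonym line per distinct surface form."""
    return [f"{form} => {target}" for form in sorted(set(forms))]


# Forms as they might come out of the highlighting step; duplicates are dropped.
for line in explicit_rules(["trajecten", "traject", "traject"], "verbind"):
    print(line)
```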

If you want an algorithmic solution, look for a “morphological generator”. That 
is the inverse of a morphological analyzer. In the olden days, query time 
generation was an alternative to stemming (analysis) at index time. But that 
makes the query much larger and much slower.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)





RE: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-21 Thread Markus Jelsma
Hello Steve,

This is an example of a query-time analyzer that has the problem:

  
  
  
  
  
  
  
  
  
  

Synonym file contains stemmed terms:  traject,verbind

A search for plural term 'trajecten' becomes 
+DisjunctionMaxQuery(((title_nl:trajecten Synonym(title_nl:traject 
title_nl:verbind

With mm=2 this means that a search for 'trajecten' will only match documents 
that contain that plural form; singulars are not matched, due to mm.
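
To make the arithmetic concrete, here is a small sketch of how an mm spec such 
as our default "2<-1 5<-2 6<90%" turns a clause count into a required number 
of matches. It follows the documented mm rules as I understand them and is not 
Solr's actual code; min_should_match is a made-up name, and the assumption 
that the single input term ends up counted as two optional clauses (the 
keyword copy plus the synonym group) comes from the parsed query above:

```python
MM = "2<-1 5<-2 6<90%"


def min_should_match(clause_count, spec):
    """Number of optional clauses that must match for a given mm spec."""
    spec = spec.strip()
    if "<" in spec:
        # Conditional specs: 'N<X' applies X only when there are more than N clauses.
        result = clause_count
        for part in spec.split():
            bound, _, sub_spec = part.partition("<")
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, sub_spec)
        return result
    if spec.endswith("%"):
        # Percentages are truncated toward zero; negative values count as "may be missing".
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    result = clause_count + calc if calc < 0 else calc
    return max(0, min(result, clause_count))


# One input term, but two optional clauses after KeywordRepeat + synonym expansion
# (assumption based on the parsed query shown in this thread).
print(min_should_match(1, MM))  # what I want: one term, one required match
print(min_should_match(2, MM))  # what happens: two clauses, both required
```

A document that contains only 'traject' satisfies just one of those two 
clauses, so it is rejected.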

I know this is a tricky problem; I hope I have conveyed it well enough.

Thanks!
Markus
 

Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-21 Thread Steve Rowe
Markus,

I’m confused about exactly what operations you’re performing - could you 
provide your field type?

In particular, I don’t understand why you can’t just rewrite the synonyms file 
entry

  word1 => word2

to:

  word1 => word1, word2

(Clearly I’m missing something about how stemming is involved.)
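
The rewrite I have in mind could be scripted along these lines. This is a 
minimal sketch, not a tested tool: include_trigger is a hypothetical helper, 
and it only touches rules with a single word on the left-hand side:

```python
def include_trigger(line):
    """Rewrite 'a => b' as 'a => a, b'; return every other line unchanged."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or "=>" not in stripped:
        return line  # blank line, comment, or equivalence rule such as 'a,b'
    lhs, rhs = (side.strip() for side in stripped.split("=>", 1))
    if "," in lhs:
        return line  # several triggers: leave for manual review
    targets = [term.strip() for term in rhs.split(",") if term.strip()]
    if lhs in targets:
        return line  # trigger is already part of the expansion
    return f"{lhs} => {', '.join([lhs] + targets)}"


print(include_trigger("word1 => word2"))
print(include_trigger("traject,verbind"))
```

Equivalence rules like 'traject,verbind' already keep the original term, so 
they pass through untouched.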

--
Steve
www.lucidworks.com




RE: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-21 Thread Markus Jelsma
Hello Steve,

Well, that is an interesting approach to the topic indeed. But I do not think 
it is possible to obtain a list of all inflected forms for all words that also 
have roots in some synonym file, because the stemmers are not reversible. 

Any other ideas?

Thanks,
Markus
 


Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-20 Thread Steve Rowe
Hi Markus,

My suggestion: rewrite your synonyms to include the triggering word in the 
expanded synonyms list.  That way you won’t need KeywordRepeat/RemoveDuplicates 
filters, and mm=100% will work as you expect.

I don’t think this situation is a bug, since mm applies to the built query, not 
to the original query terms.

--
Steve
www.lucidworks.com




RE: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-20 Thread Markus Jelsma
Hello,

Yes of course, index-time synonyms lessen the query-time complexity and will 
solve the mm problem. They also skew IDF and cost us the flexibility of adding 
synonyms on demand. The first we do not want; the second we cannot give up, 
because reindexing our very large main search index is not feasible.

We are looking for a solution with mm that takes KeywordRepeat, stemming and 
synonym expansion into consideration. To me the current behaviour of mm in this 
case is a bug: I input one term, so mm should treat it as one term, regardless 
of how many query terms it expands to.

Any query-time ideas to share? I am not well versed in the actual code dealing 
with this specific subject; the code doesn't like me. I am fine if someone 
points me to the code that tells mm about the number of original input terms, 
and what to do. If someone does, please also explain why the change I want to 
make is a bad one, and what to be aware of, beware of, or take into account.

Also, am I the only one who regards this behaviour as a bug or, more subtly, 
as weird and unexpected?

Many many thanks!
Markus



Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-20 Thread Shawn Heisey
On 12/19/2017 4:38 AM, Markus Jelsma wrote:
> I have an interesting issue with mm and SynonymQuery and KeywordRepeatFilter. 
> We do query time synonym expansion and use KeywordRepeat for not only finding 
> stemmed tokens. Our synonyms are already preprocessed and contain only 
> stemmed tokens. Synonym file contains: traject,verbind
>
> So, any non-root stem that ends up in a synonym is actually a search for 
> three terms: +DisjunctionMaxQuery(((title_nl:trajecten 
> Synonym(title_nl:traject title_nl:verbind
>
> But, our default mm requires that two terms must match if the input query 
> consists of two terms: 2<-1 5<-2 6<90%
>
> So, a simple query looking for a plural (trajecten) will not match a document 
> where the title contains only its singular form: q=trajecten will not match 
> document with title_nl:"een traject"

I would think that doing synonym expansion at index time would remove
any possible confusion about the number of terms at query time.  Queries
that involve synonyms will be slightly less complex, but the index would
be larger, so it's difficult to say whether those kinds of queries would
be any faster or not.

There is one clear disadvantage to index-time synonym expansion: If you
change your synonyms, you have to reindex.

Thanks,
Shawn



RE: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-20 Thread Markus Jelsma
Hello - any ideas to share on this topic?

Many thanks,
Markus

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Tuesday 19th December 2017 12:38
> To: Solr-user 
> Subject: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> 
> Hello,
> 
> I have an interesting issue with mm and SynonymQuery and KeywordRepeatFilter. 
> We do query time synonym expansion and use KeywordRepeat for not only finding 
> stemmed tokens. Our synonyms are already preprocessed and contain only 
> stemmed tokens. Synonym file contains: traject,verbind
> 
> So, any non-root stem that ends up in a synonym is actually a search for 
> three terms: +DisjunctionMaxQuery(((title_nl:trajecten 
> Synonym(title_nl:traject title_nl:verbind
> 
> But, our default mm requires that two terms must match if the input query 
> consists of two terms: 2<-1 5<-2 6<90%
> 
> So, a simple query looking for a plural (trajecten) will not match a document 
> where the title contains only its singular form: q=trajecten will not match 
> document with title_nl:"een traject"
> 
> Now, my question is, how to deal with this problem? I clearly do not want mm 
> to think i input two terms!
> 
> Many many thanks,
> Markus
>