Re: ngrams with position
Jack, Emir,

Thanks for your answers. Moving the ngram logic to the client side would be a fast and easy way to test the solution and compare it with the phonetic one.

Best regards,
Elisabeth

2016-03-11 10:52 GMT+01:00 Emir Arnautovic:

> Hi Elisabeth,
> In order to see if you will get better results, you can move the ngram logic
> outside of the analysis chain - the simplest solution is to move it to the
> client. In such a setup, you should be able to use pf2 and pf3 and see if
> that produces the desired result.
>
> Regards,
> Emir
Re: ngrams with position
Hi Elisabeth,

In order to see if you will get better results, you can move the ngram logic outside of the analysis chain - the simplest solution is to move it to the client. In such a setup, you should be able to use pf2 and pf3 and see if that produces the desired result.

Regards,
Emir

On 10.03.2016 13:47, elisabeth benoit wrote:

> oh yeah, now that you're saying it, yeah you're right, pf2 pf3 will boost
> proximity between words, not between ngrams.
>
> Thanks again,
> Elisabeth
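Emir's suggestion of moving the ngram logic to the client could be tried with a helper along these lines. This is only a sketch under assumptions that are not in the thread: the field name `name_grams`, the `mm` value, and the premise that the field is whitespace-tokenized and was indexed with the same client-generated grams are all hypothetical. The `_` padding follows the gram list Elisabeth gives for amsterdam; `defType`, `q`, `qf`, `pf2`, `pf3` and `mm` are standard edismax parameters.

```python
def char_ngrams(word, n=3, pad="_"):
    """Return the character n-grams of `word`, padded so that edge grams exist."""
    padded = pad * (n - 1) + word.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_params(user_input, field="name_grams"):
    """Build edismax parameters in which every gram is a whitespace-separated term.

    `field` is a hypothetical field name; it must not apply ngram analysis again.
    """
    grams = [g for word in user_input.split() for g in char_ngrams(word)]
    return {
        "defType": "edismax",
        "q": " ".join(grams),  # grams look like ordinary words to the query parser
        "qf": field,
        "pf2": field,          # boosts pairs of adjacent grams
        "pf3": field,          # boosts triples of adjacent grams
        "mm": "50%",           # assumed value: only half of the grams must match
    }


params = build_params("Amsterdam")
print(params["q"])
```

Because the grams reach the query parser as whitespace-separated terms, pf2 and pf3 would then build their phrases from adjacent grams rather than from adjacent words.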
Re: ngrams with position
I suspect that what you really want is analogous to PF2/PF3, but based on the ngram terms that come out of query token analysis, rather than on pairs/triples of source terms that are formed before analysis and then analyzed as phrases. With the latter, all of the ngrams for a PF2/PF3 phrase must be in order rather than potentially shuffled.

Also, a phrase query is an implicit AND, while you may really want something more like a SpanOr query, where the terms are ORed but must be within close proximity.

-- Jack Krupansky

On Thu, Mar 10, 2016 at 6:31 AM, Alessandro Benedetti wrote:

> The reason pf2 and pf3 do not seem a good solution to me is that the
> edismax query parser calculates those grams on top of word shingles.
> So it takes the input query and produces the shingles based on the
> whitespace separator.
>
> i.e. if you search:
> "white tiger jumping"
> with pf2 configured on field1,
> you are going to end up searching in field1:
> "white tiger", "tiger jumping".
> This is really useful in full-text search oriented to phrase and partial
> phrase matches.
> But it has nothing to do with the analysis type associated at query time
> at this moment.
> First the query parser tokenisation is used to build the grams, and then
> the query-time analysis is applied.
> This is according to my recollection;
> I will double check in the code and let you know.
>
> Cheers
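To make Jack's idea of a PF2 analog over ngram terms more concrete, here is a minimal pure-Python sketch. It does not use Lucene or Solr APIs and is not how `SpanOrQuery` is implemented; the function names and the sample misspelling `amstredam` are hypothetical and chosen only for illustration. Adjacent gram pairs are ORed: each pair that also occurs adjacently in the candidate adds to the score, and no pair is required.

```python
def char_ngrams(word, n=3, pad="_"):
    """Return the padded character n-grams of `word`."""
    padded = pad * (n - 1) + word.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def gram_pair_score(query, candidate, n=3):
    """Count adjacent query-gram pairs that also occur adjacently,
    in the same order, in the candidate. Pairs are ORed, not ANDed."""
    q = char_ngrams(query, n)
    c = char_ngrams(candidate, n)
    candidate_pairs = set(zip(c, c[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in candidate_pairs)


# Hypothetical misspelling, not taken from the thread.
print(gram_pair_score("amstredam", "amsterdam"))
print(gram_pair_score("amsterdam", "amsterdam"))
```

A strict phrase query over all grams would reject the misspelled input outright, whereas the ORed pairs still give partial credit for the stretches that are in the right order.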
Re: ngrams with position
oh yeah, now that you're saying it, yeah you're right, pf2 pf3 will boost proximity between words, not between ngrams.

Thanks again,
Elisabeth

2016-03-10 12:31 GMT+01:00 Alessandro Benedetti:

> The reason pf2 and pf3 do not seem a good solution to me is that the
> edismax query parser calculates those grams on top of word shingles.
> So it takes the input query and produces the shingles based on the
> whitespace separator.
>
> i.e. if you search:
> "white tiger jumping"
> with pf2 configured on field1,
> you are going to end up searching in field1:
> "white tiger", "tiger jumping".
> This is really useful in full-text search oriented to phrase and partial
> phrase matches.
> But it has nothing to do with the analysis type associated at query time
> at this moment.
> First the query parser tokenisation is used to build the grams, and then
> the query-time analysis is applied.
> This is according to my recollection;
> I will double check in the code and let you know.
>
> Cheers
Re: ngrams with position
The reason pf2 and pf3 do not seem a good solution to me is that the edismax query parser calculates those grams on top of word shingles. So it takes the input query and produces the shingles based on the whitespace separator.

i.e. if you search:
"white tiger jumping"
with pf2 configured on field1,
you are going to end up searching in field1:
"white tiger", "tiger jumping".

This is really useful in full-text search oriented to phrase and partial phrase matches. But it has nothing to do with the analysis type associated at query time at this moment. First the query parser tokenisation is used to build the grams, and then the query-time analysis is applied.

This is according to my recollection; I will double check in the code and let you know.

Cheers

On 10 March 2016 at 11:02, elisabeth benoit wrote:

> That's the use case, yes. Find Amsterdam with Asmtreadm.
>
> And yes, we're only doing approximate search if we get 0 results.
>
> I don't quite get why pf2 pf3 are not a good solution.
>
> We're actually testing a solution close to phonetic. Some kind of word
> reduction.
>
> Thanks for the suggestion (and the link), this makes me think maybe
> phonetic is the right solution.
>
> Thanks for your help,
> Elisabeth
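The shingling Alessandro describes can be mimicked in a few lines. This is a sketch of the behavior as stated in his message (whitespace split, then adjacent-word shingles), not the edismax source code; `word_shingles` is a hypothetical helper name.

```python
def word_shingles(query, size):
    """Return the phrases of `size` adjacent whitespace-separated words."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(word_shingles("white tiger jumping", 2))  # the pf2-style phrases
print(word_shingles("white tiger jumping", 3))  # the pf3-style phrase
print(word_shingles("asmtreadm", 2))            # a single word yields no shingles
```

The last call shows why pf2 and pf3 cannot help with a single misspelled word: there is only one whitespace-separated token, so no word pair or triple exists to boost.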
Re: ngrams with position
That's the use case, yes. Find Amsterdam with Asmtreadm.

And yes, we're only doing approximate search if we get 0 results.

I don't quite get why pf2 pf3 are not a good solution.

We're actually testing a solution close to phonetic. Some kind of word reduction.

Thanks for the suggestion (and the link), this makes me think maybe phonetic is the right solution.

Thanks for your help,
Elisabeth

2016-03-10 11:32 GMT+01:00 Alessandro Benedetti:

> If I followed, your use case is:
>
> I type Asmtreadm and I want documents matching Amsterdam (even if the edit
> distance is greater than 2).
> First of all, this is something I hope you do only if you get 0 results;
> if not, the overhead can be great and you are going to lose a lot of
> precision, causing confusion for the customer.
>
> Pf2 and Pf3 build ngrams of whitespace-separated tokens, so that partial
> phrase queries affect the scoring.
> Not a good fit for your problem.
>
> More than grams, have you considered using some sort of phonetic matching?
> Could this help:
> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
>
> Cheers
Re: ngrams with position
If I followed correctly, your use case is: I type Asmtreadm and I want documents matching Amsterdam (even if the edit distance is greater than 2).

First of all, I hope this is something you do only if you get 0 results; otherwise the overhead can be great, and you are going to lose a lot of precision, which will confuse the customer.

pf2 and pf3 build n-grams of whitespace-separated tokens, so that partial phrase matches affect the scoring. They are not a good fit for your problem.

Rather than grams, have you considered using some sort of phonetic matching? Could this help:
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching

Cheers

On 10 March 2016 at 08:47, elisabeth benoit wrote:

> I am trying to do approximate search with Solr. We've tried fuzzy search
> and spellcheck search; they work OK, but the edit distance is limited (to 2
> for DirectSolrSpellChecker in Solr 4.10.1). With the fuzzy operator we've
> had performance issues, and I don't think you can have an edit distance of
> more than 2.
>
> What we used to do with a database was more efficient: storing trigrams
> with their position, and then searching around that position (not precisely
> at that position, since it's an approximate search).
>
> The position is there to prevent a trigram like ams (amsterdam) from
> matching answers where the same trigram is, for instance, at the end of the
> word. I would like answers with the same relative position between trigrams
> to score higher. Maybe using edismax's pf2 and pf3 is a way to do this. I
> don't see any other way. Please tell me if you do.
>
> From your answer, I get that the position is stored, but I don't understand
> how I can preserve the relative order between trigrams, apart from using
> pf2 and pf3.
>
> Best regards,
> Elisabeth
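To make the point about pf2 concrete, here is a minimal Python sketch of word-level shingling. It is not edismax code: `word_shingles` is a hypothetical helper that only mimics splitting the query on whitespace and pairing adjacent words, which is the behaviour described in this thread.

```python
def word_shingles(query: str, size: int) -> list[str]:
    """Return every run of `size` consecutive whitespace-separated words."""
    words = query.split()
    # A query with fewer than `size` words yields an empty list.
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


print(word_shingles("white tiger jumping", 2))  # ['white tiger', 'tiger jumping']
print(word_shingles("Asmtreadm", 2))            # []
```

A single misspelled word produces no shingles at all, which is why word-level phrase boosting cannot help with character-level similarity.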
Re: ngrams with position
I am trying to do approximate search with Solr. We've tried fuzzy search and spellcheck search; they work OK, but the edit distance is limited (to 2 for DirectSolrSpellChecker in Solr 4.10.1). With the fuzzy operator we've had performance issues, and I don't think you can have an edit distance of more than 2.

What we used to do with a database was more efficient: storing trigrams with their position, and then searching around that position (not precisely at that position, since it's an approximate search).

The position is there to prevent a trigram like ams (amsterdam) from matching answers where the same trigram is, for instance, at the end of the word. I would like answers with the same relative position between trigrams to score higher. Maybe using edismax's pf2 and pf3 is a way to do this. I don't see any other way. Please tell me if you do.

From your answer, I get that the position is stored, but I don't understand how I can preserve the relative order between trigrams, apart from using pf2 and pf3.

Best regards,
Elisabeth

2016-03-10 0:02 GMT+01:00 Alessandro Benedetti :

> If you store the positions for your tokens (and they are stored by default
> if you don't omit them), you have the relative position in the index. [1]
> I attach a blog post of mine, describing the Lucene internals in a little
> more detail.
>
> Apart from that, can you explain the problem you are trying to solve?
> The high-level user experience?
> What kind of search/autocompletion/relevancy tuning are you trying to
> achieve?
> Maybe we can help better if we start from the problem :)
>
> Cheers
>
> [1]
> http://alexbenedetti.blogspot.co.uk/2015/07/exploring-solr-internals-lucene.html
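The database approach described above (trigrams stored with a position, matched "around" that position) can be sketched in a few lines of Python. This is an illustration under assumptions, not the original database code and not a Solr feature: the function names, the unpadded trigrams, and the default `window` of 1 are all hypothetical choices.

```python
def positional_trigrams(word: str) -> list[tuple[str, int]]:
    """Return (trigram, start position) pairs for a lowercased word."""
    word = word.lower()
    return [(word[i:i + 3], i) for i in range(len(word) - 2)]


def proximity_score(query: str, candidate: str, window: int = 1) -> int:
    """Count query trigrams found in the candidate within `window` positions."""
    candidate_grams = positional_trigrams(candidate)
    score = 0
    for gram, pos in positional_trigrams(query):
        # A trigram only counts if it sits near the same position in both words.
        if any(g == gram and abs(p - pos) <= window for g, p in candidate_grams):
            score += 1
    return score


print(proximity_score("amsterdm", "amsterdam"))  # 5
print(proximity_score("amsterdm", "rotterdam"))  # 2
print(proximity_score("ams", "siams"))           # 0, "ams" is too far from position 0
```

One caveat worth checking on real data: a heavily transposed input such as Asmtreadm shares no plain trigram with amsterdam, so trigram overlap alone would score it 0 whatever the window.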
Re: ngrams with position
If you store the positions for your tokens (and they are stored by default if you don't omit them), you have the relative position in the index. [1]
I attach a blog post of mine, describing the Lucene internals in a little more detail.

Apart from that, can you explain the problem you are trying to solve?
The high-level user experience?
What kind of search/autocompletion/relevancy tuning are you trying to achieve?
Maybe we can help better if we start from the problem :)

Cheers

[1] http://alexbenedetti.blogspot.co.uk/2015/07/exploring-solr-internals-lucene.html

On 9 March 2016 at 15:02, elisabeth benoit wrote:

> Hello Alessandro,
>
> You may be right. What would you use to keep the relative order between,
> for instance, the grams
>
> __a
> _am
> ams
> mst
> ste
> ter
> erd
> rda
> dam
> am_
>
> of amsterdam? pf2 and pf3? That's all I can think of. Please let me know
> if you have more insights.
>
> Best regards,
> Elisabeth

--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
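As a rough picture of what "positions stored in the index" means, here is a toy positional inverted index in Python. It is only a sketch: the nested-dictionary layout and the names `build_positional_index` and `appears_before` are assumptions made for illustration and say nothing about Lucene's actual postings format.

```python
from collections import defaultdict


def build_positional_index(docs: dict[str, list[str]]) -> dict:
    """Map each token to {doc_id: [positions]} for already-tokenized documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index


def appears_before(index: dict, first: str, second: str, doc_id: str) -> bool:
    """True if some occurrence of `first` precedes some occurrence of `second`."""
    first_positions = index.get(first, {}).get(doc_id, [])
    second_positions = index.get(second, {}).get(doc_id, [])
    if not first_positions or not second_positions:
        return False
    return min(first_positions) < max(second_positions)


index = build_positional_index({"doc1": ["ams", "mst", "ste"]})
print(index["ste"]["doc1"])                         # [2]
print(appears_before(index, "ams", "ste", "doc1"))  # True
print(appears_before(index, "ste", "ams", "doc1"))  # False
```

Because each token keeps its position list, relative order can be derived at query time without encoding the position into the token text.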
Re: ngrams with position
Hello Alessandro,

You may be right. What would you use to keep the relative order between, for instance, the grams

__a
_am
ams
mst
ste
ter
erd
rda
dam
am_

of amsterdam? pf2 and pf3? That's all I can think of. Please let me know if you have more insights.

Best regards,
Elisabeth

2016-03-08 17:46 GMT+01:00 Alessandro Benedetti :

> Elizabeth,
> out of curiosity, could we know what you are trying to solve with that
> complex way of tokenisation?
> Solr is really good at storing positions along with tokens, so I am curious
> to know why you are mixing things up.
>
> Cheers
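The gram list above can be generated on the client, which is the setup suggested elsewhere in this thread for trying pf2 and pf3 over grams. The Python below is a minimal sketch under assumptions: `padded_ngrams` and `gram_query` are hypothetical names, and the padding (two underscores in front, one at the end) is simply inferred from the list above.

```python
def padded_ngrams(word: str, n: int = 3, pad: str = "_") -> list[str]:
    """Pad with n-1 characters on the left and one on the right, then slice."""
    padded = pad * (n - 1) + word.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def gram_query(word: str) -> str:
    """Join the grams with spaces so they look like separate words to a parser."""
    return " ".join(padded_ngrams(word))


print(gram_query("Amsterdam"))
# __a _am ams mst ste ter erd rda dam am_
```

Sent as whitespace-separated tokens, adjacent grams become the units that word-level phrase boosting pairs up, so relative order between grams can influence the score.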
Re: ngrams with position
Elizabeth,
out of curiosity, could we know what you are trying to solve with that complex way of tokenisation?
Solr is really good at storing positions along with tokens, so I am curious to know why you are mixing things up.

Cheers

On 8 March 2016 at 10:08, elisabeth benoit wrote:

> Thanks for your answer Emir,
>
> I'll check that out.
>
> Best regards,
> Elisabeth
Re: ngrams with position
Thanks for your answer Emir,

I'll check that out.

Best regards,
Elisabeth

2016-03-08 10:24 GMT+01:00 Emir Arnautovic :

> Hi Elisabeth,
> I don't think there is such a token filter, so you would have to create
> your own token filter that takes a token and emits ngram tokens of a
> specific length. It should not be too hard to create such a filter - you
> can take a look at how the ngram filter is coded - yours should be simpler
> than that.
>
> Regards,
> Emir
Re: ngrams with position
Hi Elisabeth,
I don't think there is such a token filter, so you would have to create your own token filter that takes a token and emits ngram tokens of a specific length. It should not be too hard to create such a filter - you can take a look at how the ngram filter is coded - yours should be simpler than that.

Regards,
Emir

On 08.03.2016 08:52, elisabeth benoit wrote:

> Hello,
>
> I'm using Solr 4.10.1. I'd like to index words as ngrams of fixed length
> with a position at the end.
>
> For instance, with fixed length 3, Amsterdam would be something like:
>
> a0 (two spaces added at beginning)
> am1
> ams2
> mst3
> ste4
> ter5
> erd6
> rda7
> dam8
> am9 (one more space in the end)
>
> The number at the end being the position.
>
> Does anyone have a clue how to achieve this?
>
> Best regards,
> Elisabeth

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
ngrams with position
Hello,

I'm using Solr 4.10.1. I'd like to index words as ngrams of fixed length with a position at the end.

For instance, with fixed length 3, Amsterdam would be something like:

a0 (two spaces added at beginning)
am1
ams2
mst3
ste4
ter5
erd6
rda7
dam8
am9 (one more space in the end)

The number at the end being the position.

Does anyone have a clue how to achieve this?

Best regards,
Elisabeth
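For reference, the token scheme requested above is easy to prototype outside Solr. The following Python sketch is only an illustration of the desired output, not a Solr token filter: `positioned_ngrams` is a hypothetical name, and the padding (two spaces before the word, one after) follows the example in the message.

```python
def positioned_ngrams(word: str, n: int = 3) -> list[str]:
    """Return fixed-length ngrams of a space-padded word, each suffixed with its position."""
    padded = " " * (n - 1) + word.lower() + " "
    return [f"{padded[i:i + n]}{i}" for i in range(len(padded) - n + 1)]


for term in positioned_ngrams("Amsterdam"):
    print(repr(term))
# '  a0', ' am1', 'ams2', 'mst3', 'ste4', 'ter5', 'erd6', 'rda7', 'dam8', 'am 9'
```

Baking the position into the term means that only an identical gram at an identical position matches exactly, so any "search around that position" would have to be handled by the query side.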