Re: ngrams with position

elisabeth benoit Fri, 11 Mar 2016 03:53:29 -0800

Jack, Emir,

Thanks for your answers. Moving ngram logic to client side would be a fast
and easy way to test the solution and compare it with the phonetic one.
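
A minimal sketch of what moving the ngram logic to the client could look like, in Python. It splits each word into padded trigrams and joins them with spaces, so that every gram reaches Solr as a separate whitespace-delimited token and pf2/pf3 can act on neighbouring grams. The `_` padding character, the field name `name_grams`, and the assumption that this field is tokenized on whitespace only are illustrative choices, not something confirmed in this thread.

```python
def padded_trigrams(word, pad="_"):
    """Split a word into trigrams, with two pads in front and one behind."""
    padded = pad * 2 + word.lower() + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def edismax_params(user_input, field="name_grams"):
    """Build request parameters whose q holds the grams as separate tokens.

    `field` is a hypothetical field, assumed to hold the same grams at
    index time and to be tokenized on whitespace only.
    """
    grams = []
    for word in user_input.split():
        grams.extend(padded_trigrams(word))
    return {
        "defType": "edismax",
        "q": " ".join(grams),
        "qf": field,
        "pf2": field,  # phrase boost on pairs of neighbouring grams
        "pf3": field,  # phrase boost on triples of neighbouring grams
    }


if __name__ == "__main__":
    print(edismax_params("Amsterdam")["q"])
```

The same gram-splitting would have to be applied to the indexed values, otherwise the query grams have nothing to match.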


Best regards,
Elisabeth

2016-03-11 10:52 GMT+01:00 Emir Arnautovic <emir.arnauto...@sematext.com>:

> Hi Elisabeth,
> In order to see if you will get better results, you can move the ngram logic
> outside of the analysis chain - the simplest solution is to move it to the
> client. In such a setup, you should be able to use pf2 and pf3 and see if
> that produces the desired result.
>
> Regards,
> Emir
>
>
> On 10.03.2016 13:47, elisabeth benoit wrote:
>
>> Oh yeah, now that you say it, you're right: pf2 and pf3 will boost
>> proximity between words, not between ngrams.
>>
>> Thanks again,
>> Elisabeth
>>
>> 2016-03-10 12:31 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
>>
>>> The reason pf2 and pf3 do not seem a good solution to me is that the
>>> edismax query parser calculates those grams on top of word shingles.
>>> It takes the input query and produces the shingles using whitespace
>>> as the separator.
>>>
>>> E.g. if you search:
>>> "white tiger jumping"
>>> with pf2 configured on field1,
>>> you are going to end up searching in field1:
>>> "white tiger", "tiger jumping".
>>> This is really useful in full-text search oriented to phrase and partial
>>> phrase matching.
>>> But at the moment it has nothing to do with the analysis type applied at
>>> query time.
>>> First the query parser tokenisation is used to build the grams, and then
>>> the query-time analysis is applied.
>>> This is as far as I remember;
>>> I will double check in the code and let you know.
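
To make the point above concrete, here is a small Python sketch that mimics how word shingles are built from a query string. It only illustrates the behaviour described in this message; it is not the edismax code, and the function name `word_shingles` is made up for the example.

```python
def word_shingles(query, size):
    """Return phrases made of `size` consecutive whitespace-separated words."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


if __name__ == "__main__":
    # Two-word shingles, as pf2 would use them for phrase boosting.
    print(word_shingles("white tiger jumping", 2))
    # A single word leaves nothing to build a shingle from.
    print(word_shingles("Asmtreadm", 2))
```

A single misspelled word such as Asmtreadm yields no shingles at all, which is why pf2 and pf3 have nothing to boost unless the grams already arrive as separate words in the query.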
>>>
>>> Cheers
>>>
>>>
>>> On 10 March 2016 at 11:02, elisabeth benoit <elisaelisael...@gmail.com>
>>> wrote:
>>>
>>>> That's the use case, yes. Find Amsterdam with Asmtreadm.
>>>>
>>>> And yes, we're only doing approximate search if we get 0 results.
>>>>
>>>> I don't quite get why pf2 and pf3 are not a good solution.
>>>>
>>>> We're actually testing a solution close to phonetic. Some kind of word
>>>> reduction.
>>>>
>>>> Thanks for the suggestion (and the link), this makes me think maybe
>>>> phonetic is the right solution.
>>>>
>>>> Thanks for your help,
>>>> Elisabeth
>>>>
>>>> 2016-03-10 11:32 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
>>>>
>>>>> Mmmm, if I followed, your use case is:
>>>>>
>>>>> I type Asmtreadm and I want documents matching Amsterdam (even if the
>>>>> edit distance is greater than 2).
>>>>> First of all, this is something I hope you do only if you get 0 results;
>>>>> if not, the overhead can be great and you are going to lose a lot of
>>>>> precision, causing confusion for the customer.
>>>>>
>>>>> pf2 and pf3 are ngrams of whitespace-separated tokens, used to make
>>>>> partial phrase queries affect the scoring.
>>>>> Not a good fit for your problem.
>>>>>
>>>>> More than grams, have you considered using some sort of phonetic
>>>>> matching?
>>>>> Could this help:
>>>>> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 10 March 2016 at 08:47, elisabeth benoit <elisaelisael...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am trying to do approximate search with Solr. We've tried fuzzy
>>>>>> search and spellcheck search; it's working OK, but edit distance is
>>>>>> limited (to 2 for DirectSolrSpellChecker in Solr 4.10.1). With the fuzzy
>>>>>> operator, we've had performance issues, and I don't think you can have
>>>>>> an edit distance of more than 2.
>>>>>>
>>>>>> What we used to do with a database was more efficient: storing trigrams
>>>>>> with position, and then searching around that position (not precisely
>>>>>> at that position, since it's approximate search).
>>>>>>
>>>>>> Position is there so that a trigram like ams (amsterdam) does not get
>>>>>> answers where the same trigram is, for instance, at the end of the
>>>>>> word. I would like answers with the same relative position between
>>>>>> trigrams to score higher.
>>>>>>
>>>>>> Maybe using edismax's pf2 and pf3 is a way to do this. I don't see any
>>>>>> other way. Please tell me if you do.
>>>>>>
>>>>>> From your answer, I get that position is stored, but I don't understand
>>>>>> how I can preserve relative order between trigrams, apart from using
>>>>>> pf2 and pf3.
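
The "trigrams with position, searched around that position" idea described above can be sketched in plain Python as follows. This is only an illustration of the matching rule, not a Solr feature and not the database implementation mentioned in the message; the `window` parameter, the `_` padding and the function names are assumptions made for the example.

```python
def trigrams_with_position(word, pad="_"):
    """Return (trigram, position) pairs for a padded, lowercased word."""
    padded = pad * 2 + word.lower() + pad
    return [(padded[i:i + 3], i) for i in range(len(padded) - 2)]


def positional_overlap(query, candidate, window=2):
    """Count query trigrams found in the candidate near the same position."""
    positions = {}
    for gram, pos in trigrams_with_position(candidate):
        positions.setdefault(gram, []).append(pos)
    hits = 0
    for gram, pos in trigrams_with_position(query):
        # A hit requires the same trigram within `window` positions.
        if any(abs(pos - p) <= window for p in positions.get(gram, [])):
            hits += 1
    return hits


if __name__ == "__main__":
    print(positional_overlap("amsterdm", "amsterdam"))
    # "dam" shares trigrams with the end of "amsterdam", but far away.
    print(positional_overlap("dam", "amsterdam", window=2))
```

With a small window, a short word like "dam" no longer matches the tail of "amsterdam", which is the effect the position is meant to have.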
>>>>>>
>>>>>> Best regards,
>>>>>> Elisabeth
>>>>>>
>>>>>> 2016-03-10 0:02 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
>>>>>>
>>>>>>> If you store the positions for your tokens (and this is the default
>>>>>>> if you don't omit them), you have the relative position in the
>>>>>>> index. [1]
>>>>>>> I attach a blog post of mine, describing the Lucene internals in a
>>>>>>> little more detail.
>>>>>>>
>>>>>>> Apart from that, can you explain the problem you are trying to solve?
>>>>>>> The high-level user experience?
>>>>>>> What kind of search/autocompletion/relevancy tuning are you trying to
>>>>>>> achieve?
>>>>>>> Maybe we can help better if we start from the problem :)
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> [1]
>>>>>>> http://alexbenedetti.blogspot.co.uk/2015/07/exploring-solr-internals-lucene.html
>>>
>>>>>>> On 9 March 2016 at 15:02, elisabeth benoit <elisaelisael...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Alessandro,
>>>>>>>>
>>>>>>>> You may be right. What would you use to keep relative order between,
>>>>>>>> for instance, grams
>>>>>>>>
>>>>>>>> __a
>>>>>>>> _am
>>>>>>>> ams
>>>>>>>> mst
>>>>>>>> ste
>>>>>>>> ter
>>>>>>>> erd
>>>>>>>> rda
>>>>>>>> dam
>>>>>>>> am_
>>>>>>>>
>>>>>>>> of amsterdam? pf2 and pf3? That's all I can think of. Please let me
>>>>>>>> know if you have more insights.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Elisabeth
>>>>>>>>
>>>>>>>> 2016-03-08 17:46 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
>>>>>>>>
>>>>>>>>> Elisabeth,
>>>>>>>>> out of curiosity, could we know what you are trying to solve with
>>>>>>>>> that complex way of tokenisation?
>>>>>>>>> Solr is really good at storing positions along with tokens, so I am
>>>>>>>>> curious to know why you are mixing things up.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On 8 March 2016 at 10:08, elisabeth benoit <elisaelisael...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for your answer Emir,
>>>>>>>>>>
>>>>>>>>>> I'll check that out.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Elisabeth
>>>>>>>>>>
>>>>>>>>>> 2016-03-08 10:24 GMT+01:00 Emir Arnautovic <emir.arnauto...@sematext.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Elisabeth,
>>>>>>>>>>> I don't think there is such a token filter, so you would have to
>>>>>>>>>>> create your own token filter that takes a token and emits ngram
>>>>>>>>>>> tokens of a specific length.
>>>>>>>>>>> It should not be too hard to create such a filter - you can take a
>>>>>>>>>>> look at how the ngram filter is coded - yours should be simpler than
>>>>>>>>>>> that.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Emir
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 08.03.2016 08:52, elisabeth benoit wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using Solr 4.10.1. I'd like to index words with ngrams of fixed
>>>>>>>>>>>> length, with a position at the end.
>>>>>>>>>>>>
>>>>>>>>>>>> For instance, with fixed length 3, Amsterdam would be something like:
>>>>>>>>>>>>
>>>>>>>>>>>> a0 (two spaces added at beginning)
>>>>>>>>>>>> am1
>>>>>>>>>>>> ams2
>>>>>>>>>>>> mst3
>>>>>>>>>>>> ste4
>>>>>>>>>>>> ter5
>>>>>>>>>>>> erd6
>>>>>>>>>>>> rda7
>>>>>>>>>>>> dam8
>>>>>>>>>>>> am9 (one more space in the end)
>>>>>>>>>>>>
>>>>>>>>>>>> The number at the end being the position.
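
For reference, the token list above can be produced outside Solr with a few lines of Python. This is a sketch under stated assumptions: the function name `positioned_ngrams` is hypothetical, the input is lowercased, and `_` is used as a visible stand-in for the padding spaces (two in front, one behind).

```python
def positioned_ngrams(word, n=3, pad="_"):
    """Return fixed-length ngrams of a padded word, each suffixed with its position."""
    # n - 1 pads in front, a single pad behind, as in the example above.
    padded = pad * (n - 1) + word.lower() + pad
    return [f"{padded[i:i + n]}{i}" for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for token in positioned_ngrams("Amsterdam"):
        print(token)
```

Because the position is baked into the token text, `ams2` only matches a gram at exactly position 2; matching "around" a position would need extra tokens or a different comparison.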
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have a clue how to achieve this?
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Elisabeth
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> --------------------------
>>>>>>>>>
>>>>>>>>> Benedetti Alessandro
>>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>>
>>>>>>>>> "Tyger, tyger burning bright
>>>>>>>>> In the forests of the night,
>>>>>>>>> What immortal hand or eye
>>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>>
>>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>>>>
>>>>>>>>>
>>>>>>>
