Re: ngrams with position

Emir Arnautovic Fri, 11 Mar 2016 01:53:43 -0800

Hi Elizabeth,

In order to see if you will get better results, you can move the ngram logic outside of the analysis chain - the simplest solution is to move it to the client. In such a setup, you should be able to use pf2 and pf3 and see if that produces the desired result.
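
A minimal Python sketch of such a client-side setup, under stated assumptions: the `_` padding character, the field name `name_grams` and both helper names are hypothetical, and the field is assumed to use a plain whitespace tokenizer so that each gram is treated as a separate "word". `defType`, `q`, `qf`, `pf2` and `pf3` are the usual edismax request parameters.

```python
from urllib.parse import urlencode


def trigrams(word, pad="_"):
    """Split a word into trigrams, padded with two chars in front and one at the end."""
    padded = pad * 2 + word.lower() + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def edismax_params(word, field="name_grams"):
    """Build request parameters where each trigram is a whitespace-separated token.

    `field` is a hypothetical field name; pf2/pf3 then apply to pairs and
    triples of neighbouring grams instead of neighbouring words.
    """
    grams = " ".join(trigrams(word))
    return {
        "defType": "edismax",
        "q": grams,
        "qf": field,
        "pf2": field,
        "pf3": field,
    }


# URL-encoded query string that a client could append to a select request.
query_string = urlencode(edismax_params("Amsterdam"))
```

The same gram splitting would have to be applied on the client at index time, so that the indexed field contains the grams as plain tokens.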


Regards,
Emir

On 10.03.2016 13:47, elisabeth benoit wrote:

oh yeah, now that you're saying it, yeah you're right, pf2 pf3 will boost
proximity between words, not between ngrams.

Thanks again,
Elisabeth

2016-03-10 12:31 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

The reason pf2 and pf3 do not seem a good solution to me is that the
edismax query parser calculates those grams as shingles of words.
It takes the input query and produces the shingles by splitting on
whitespace.

i.e. if you search:
"white tiger jumping"
with pf2 configured on field1,
you are going to end up searching in field1:
"white tiger", "tiger jumping".
This is really useful in full-text search oriented to phrase and partial
phrase matching.
But it has nothing to do with the analysis type applied at query time.
First the query parser's tokenisation is used to build the grams, and then
the query-time analysis is applied.
This is as far as I remember;
I will double check in the code and let you know.
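
As a rough illustration of the behaviour described above (a simplified sketch, not the actual edismax code; the function name is hypothetical), word shingles built on whitespace look like this in Python:

```python
def word_shingles(query, size):
    """Return all runs of `size` consecutive whitespace-separated words."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


pairs = word_shingles("white tiger jumping", 2)    # what pf2 would phrase-match
triples = word_shingles("white tiger jumping", 3)  # what pf3 would phrase-match
```

A single word such as "amsterdam" yields no pairs or triples at all, which is why the character grams never come into play here.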

Cheers


On 10 March 2016 at 11:02, elisabeth benoit <elisaelisael...@gmail.com>
wrote:

That's the use case, yes. Find Amsterdam with Asmtreadm.

And yes, we're only doing approximate search if we get 0 results.

I don't quite get why pf2 and pf3 are not a good solution.

We're actually testing a solution close to phonetic. Some kind of word
reduction.

Thanks for the suggestion (and the link), this makes me think maybe
phonetic is the good solution.

Thanks for your help,
Elisabeth

2016-03-10 11:32 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

Mmmm, if I followed, your use case is:

I type Asmtreadm and I want documents matching Amsterdam (even if the edit
distance is greater than 2).
First of all, this is something I hope you do only if you get 0 results; if
not, the overhead can be great and you are going to lose a lot of precision,
causing confusion for the customer.
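
To make the Asmtreadm example concrete, here is a small Python sketch of the optimal string alignment distance (Levenshtein edits plus adjacent transpositions counted as one edit). Whether the fuzzy matching in a given Solr setup counts a transposition as a single edit is an assumption here, and the function name is hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Allowed edits: insert, delete, substitute, and swap of two adjacent
    characters, each with cost 1.
    """
    prev2 = None                      # row i - 2
    prev = list(range(len(b) + 1))    # row i - 1 (starts as row 0)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,          # deletion
                cur[j - 1] + 1,       # insertion
                prev[j - 1] + cost,   # match or substitution
            )
            # Adjacent transposition, e.g. "sm" <-> "ms".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


distance = osa_distance("asmtreadm", "amsterdam")
```

Under this measure the misspelling is three transpositions away from the target (sm/ms, re/er, ad/da), so it already falls outside a maximum edit distance of 2.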

pf2 and pf3 build n-grams of whitespace-separated tokens, so that partial
phrase matches affect the scoring.
Not a good fit for your problem.

More than grams, have you considered using some sort of phonetic matching?

Could this help:
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching

Cheers

On 10 March 2016 at 08:47, elisabeth benoit <elisaelisael...@gmail.com
wrote:

I am trying to do approximate search with Solr. We've tried fuzzy search
and spellcheck search; it's working OK, but edit distance is limited (to 2
for DirectSolrSpellChecker in Solr 4.10.1). With the fuzzy operator, we've had
performance issues, and I don't think you can have an edit distance of more
than 2.

What we used to do with a database was more efficient: storing trigrams
with position, and then searching around that position (not precisely at
that position, since it's approximate search).

Position is to avoid, for a trigram like ams (amsterdam), getting answers
where the same trigram is, for instance, at the end of the word. I would like
answers with the same relative position between trigrams to score higher.
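
The position-tolerant matching described above could be sketched in Python as follows. This is only an illustration of the idea, not the database implementation mentioned in this message: the `_` padding, the `window` tolerance and both function names are assumptions.

```python
def positional_trigrams(word, pad="_"):
    """Return (trigram, position) pairs for a padded, lower-cased word."""
    padded = pad * 2 + word.lower() + pad
    return [(padded[i:i + 3], i) for i in range(len(padded) - 2)]


def proximity_score(query, candidate, window=2):
    """Count query trigrams found in the candidate near the same position.

    A query trigram at position p counts only if the candidate contains the
    same trigram at a position within `window` of p.
    """
    positions = {}
    for gram, pos in positional_trigrams(candidate):
        positions.setdefault(gram, []).append(pos)
    score = 0
    for gram, pos in positional_trigrams(query):
        if any(abs(pos - p) <= window for p in positions.get(gram, [])):
            score += 1
    return score


typo_score = proximity_score("amsterdm", "amsterdam")
suffix_score = proximity_score("dam", "amsterdam")
```

With the default window, the trigrams of "dam" do not count against "amsterdam" because they sit at the end of the longer word, while a typo such as "amsterdm" still shares most of its trigrams at the same positions.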

Maybe using edismax's pf2 and pf3 is a way to do this. I don't see any
other way. Please tell me if you do.

From your answer, I get that position is stored, but I don't understand
how I can preserve relative order between trigrams, apart from using pf2
and pf3.

Best regards,
Elisabeth

2016-03-10 0:02 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

If you store the positions for your tokens (and it is the default if you
don't omit them), you have the relative position in the index. [1]
I attach a blog post of mine, describing the Lucene internals in a little
more detail.

Apart from that, can you explain the problem you are trying to solve?
The high-level user experience?
What kind of search/autocompletion/relevancy tuning are you trying to
achieve?
Maybe we can help better if we start from the problem :)

Cheers

[1] http://alexbenedetti.blogspot.co.uk/2015/07/exploring-solr-internals-lucene.html

On 9 March 2016 at 15:02, elisabeth benoit <elisaelisael...@gmail.com>
wrote:

Hello Alessandro,

You may be right. What would you use to keep relative order between, for
instance, grams

__a
_am
ams
mst
ste
ter
erd
rda
dam
am_

of amsterdam? pf2 and pf3? That's all I can think of. Please let me know
if you have more insights.

Best regards,
Elisabeth

2016-03-08 17:46 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

Elizabeth,
out of curiosity, could we know what you are trying to solve with that
complex way of tokenisation?
Solr is really good at storing positions along with tokens, so I am curious
to know why you are mixing things up.

Cheers

On 8 March 2016 at 10:08, elisabeth benoit <elisaelisael...@gmail.com>
wrote:

Thanks for your answer Emir,

I'll check that out.

Best regards,
Elisabeth

2016-03-08 10:24 GMT+01:00 Emir Arnautovic <emir.arnauto...@sematext.com>:

Hi Elisabeth,
I don't think there is such a token filter, so you would have to create your
own token filter that takes a token and emits ngram tokens of a specific
length.
It should not be too hard to create such a filter - you can take a look at how
the ngram filter is coded - yours should be simpler than that.

Regards,
Emir


On 08.03.2016 08:52, elisabeth benoit wrote:

Hello,

I'm using Solr 4.10.1. I'd like to index words with ngrams of fixed length
with a position at the end.

For instance, with fixed length 3, Amsterdam would be something like:


a0 (two spaces added at beginning)
am1
ams2
mst3
ste4
ter5
erd6
rda7
dam8
am9 (one more space in the end)

The number at the end being the position.
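
For illustration, the token layout above can be produced outside Solr with a few lines of Python. This is only a sketch of the desired output, not a Solr token filter; the function name is hypothetical, and `_` is used as a visible stand-in for the padding spaces.

```python
def positioned_trigrams(word, size=3, pad="_"):
    """Return fixed-length ngrams of a word, each suffixed with its position.

    The word is padded with size - 1 characters in front and one at the end,
    matching the "two spaces at beginning, one at the end" layout for size 3.
    """
    padded = pad * (size - 1) + word.lower() + pad
    return [f"{padded[i:i + size]}{i}" for i in range(len(padded) - size + 1)]


tokens = positioned_trigrams("Amsterdam")
```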

Does anyone have a clue how to achieve this?

Best regards,
Elisabeth

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England



