Levenshtein Distance

2012-06-06 Thread Gau
I have a list of synonyms that is expanded at query time. This yields a lot
of results (millions). My use case is name search.

I want to sort the results by Levenshtein distance. I know this can be done
with the strdist function, but sorting is already inefficient, and evaluating
a Solr function for every match makes it worse and kills performance. I want
the results to be returned as quickly as possible.

One way I think Levenshtein ordering could work is to apply strdist to the
synonym file ahead of time and get a score for each synonym, then use these
scores to boost the results accordingly. That should be equivalent to sorting
by Levenshtein distance. But I am not sure how to do this in Solr, or in fact
whether Solr supports it.
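
A minimal sketch of the precomputed-boost idea, assuming the expansion is done
on the client instead of in the synonym filter. The helper names
(levenshtein, similarity, boosted_query) and the first_name field are
illustrative, not part of Solr. The only Solr-side feature relied on is the
standard term^boost query syntax. Whether the final order matches a true
Levenshtein sort depends on how the remaining scoring factors have been
neutralised.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def boosted_query(field, term, synonyms):
    """Build an OR query in which every clause carries a term^boost weight."""
    clauses = []
    for word in [term, *synonyms]:
        # Small positive floor (an arbitrary choice) so distant synonyms keep a boost.
        boost = max(similarity(term.lower(), word.lower()), 0.01)
        clauses.append(f"{field}:{word}^{boost:.2f}")
    return " OR ".join(clauses)


print(boosted_query("first_name", "Jim", ["Jimmy", "James", "Jung"]))
```

The distances are computed once per synonym rather than once per matching
document, which is the point of the approach described above.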

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Levenstein-Distance-tp3988026.html
Sent from the Solr - User mailing list archive at Nabble.com.


Sorting performance

2012-06-04 Thread Gau
Here is the use case:
I am using synonym expansion at query time to get results. This is
essentially a name search, so a search for Jim may be expanded at query time
to James, Jung, Jimmy, etc.

Ranking factors like TF, IDF and norms do not mean anything to me, so I have
neutralised them, and all the results I get have the same score. I have used
a copy field to boost the weight of an exact match, so Jim is boosted to the
top.

However, I want the other results, like Jimmy, Jung and James, to be sorted
by Levenshtein distance with respect to the word Jim (the original query).
The number of results returned is quite large, so a general strdist sort
takes 6-7 seconds. Is there any option other than applying a sort= in the
query to achieve the same functionality? Is there a particular way to index
the data to achieve the same result? Any ideas to improve performance and
still get the intended functionality?
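
For reference, a sketch of the request parameters the slow strdist sort would
use, so it is clear what is being measured. The function name
strdist_sort_params and the field first_name are assumptions made for
illustration. The sketch also assumes that strdist(..., edit) returns a
similarity between 0 and 1 (higher means closer), which is why the sort
direction is desc, and that the sorted field is a single-valued string field.

```python
from urllib.parse import urlencode


def strdist_sort_params(field, term, synonyms, rows=20):
    """Return URL-encoded select parameters that sort matches by strdist."""
    # Client-side stand-in for the query-time synonym expansion.
    query = " OR ".join(f"{field}:{word}" for word in [term, *synonyms])
    params = {
        "q": query,
        # Closest names first: a higher strdist value is assumed to mean more similar.
        "sort": f'strdist("{term}",{field},edit) desc',
        "rows": rows,
    }
    return urlencode(params)


print(strdist_sort_params("first_name", "jim", ["james", "jimmy", "jung"]))
```

The function is evaluated for every matching document, which is consistent
with the sort getting slower as the result set grows.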

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-performance-tp3987633.html


Re: Difference between textfield and strfield

2012-06-01 Thread Gau
Is there any alternative to sorting? Sorting can hurt query performance. Is
there a way to build this ordering into Solr without taking a toll on the
system?

I tried boosting the scores based on strdist, but that seems to bring in
more results than expected.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Difference-between-textfield-and-strfield-tp3986916p3987338.html


Re: Difference between textfield and strfield

2012-05-30 Thread Gau
I cannot move from TextField to StrField, since I am using synonym expansion.
Is there anything we can do on the TextField itself?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Difference-between-textfield-and-strfield-tp3986916p3986938.html


Re: Difference between textfield and strfield

2012-05-30 Thread Gau
Well, I do not have phrases for synonym expansion, so it does work well.
The synonym expansion is done at query time. And since I am just searching
against the first name field, tf, idf and the other ranking parameters do not
make sense, so their weight has been set to 1. As a result, after applying
synonym expansion I get the results in no particular order.

The results are perfect, except that they are not ordered by Levenshtein
distance from the original query.

So the use case is:
if the user enters the query ab,
it gets expanded at query time to abc, abxy, aberfg,
and I get results for ab, abc, abxy, aberfg.
But I want them to be sorted by Levenshtein distance from the original query
(ab).
So the order should be
ab
abc
abxy
aberfg

Does TextField make this even more difficult? Any other suggestions?
Spellcheckers? N-grams?
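
The intended ordering can be checked outside Solr with a small sketch. It
assumes the matched names for one page of results are already available on
the client. The levenshtein helper is a plain dynamic-programming
implementation written for this example, not a Solr API.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


query = "ab"
matches = ["aberfg", "abxy", "ab", "abc"]  # arbitrary order, as returned today

# Sort by distance to the original query; ties fall back to alphabetical order.
ranked = sorted(matches, key=lambda name: (levenshtein(query, name), name))
print(ranked)  # ['ab', 'abc', 'abxy', 'aberfg']
```

The distances from ab are 0, 1, 2 and 4 respectively, which gives exactly the
order listed above.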


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Difference-between-textfield-and-strfield-tp3986916p3986928.html


Difference between textfield and strfield

2012-05-30 Thread Gau
Hi,

Can anyone explain the basic pros and cons of TextField versus StrField? I am
trying to use Levenshtein distance on a TextField, but it seems that it can
only be applied to a StrField. So my question is: what is the difference
between the two, and what are the main advantages of one over the other?

Currently I have a text field defined for first_name, and I apply synonym
expansion at query time to this field.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Difference-between-textfield-and-strfield-tp3986916.html


Relevancy ranking for synonym matches

2012-05-29 Thread Gau
I was wondering if there is any solution for this.
Currently I expand my query to match the synonyms at query time.

So if I enter James, I get results for Jim, Gomes, Game, etc., as the query
is expanded with the synonyms for James. But since this is just a one-word
match, tf, idf and the other parameters don't make sense. I have reset those
factors to 1, so the results I get all have an equal score.

What I really want to do is sort these results by Levenshtein distance
without using the ~ operator. The issue with the ~ operator is that if I have
a synonym which is radically different (say Greg for James) and I use
James~0, Greg would not match closely enough with James, and the number of
results returned would be less than the actual number of synonym matches.

So my use case is: without reducing the number of results, I want to sort
them by Levenshtein distance, or closest string match, to the original query.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Relevancy-ranking-for-synonym-matches-tp3986634.html


Re: Solr boost relevancy

2012-05-27 Thread Gau
Wait, I thought the fuzzy match is invoked with a ~. I am not using any ~,
but expanding my query terms with the synonyms at query time. So from what I
understand, when I query for James, Solr internally expands it via the
synonyms to James, Jim, Games, Jameson. So I guess the original information
about the query is lost, and it returns the results matched for Games,
Jameson, Jim and James in any order (since I normalized the scores). Using a
copy field for James would return the results for James at the top, but I
don't see the other three keywords being arranged by Levenshtein distance.
Or am I thinking in the wrong direction?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-boost-relevancy-tp3986200p3986283.html


Re: Solr boost relevancy

2012-05-26 Thread Gau
Hi Lori,

Yeah, I thought of exactly the same solution: use a copy field and boost the
relevancy of the exact match. But my question is broader than that.

For example, say I have Jim, Games, Jimmy and Jameson as synonyms for James.

If I normalize the tf, norm, etc. factors to 1, then on searching for James I
could get Jameson and Jim as my top matches, since the score of all the
documents is now 1. Having a copy field for James and then boosting the
relevancy of James would definitely put James as the top result.

But what comes after James? The order of results for the other synonyms is
still skewed. By Levenshtein distance, I would want Games to be the next set
of results and probably Jameson after that. How do I achieve that? That is
my bigger question.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-boost-relevancy-tp3986200p3986280.html


Solr boost relevancy

2012-05-25 Thread Gau
Consider a database of just names. If I use synonym expansion at query time,
I get a set of results.
(Background: I created a class that resets idf, tf, etc. all to 1, since
they don't matter to me anymore.) What really matters is how closely the
query matches the given name.

Currently I am getting all results with the same score (which makes sense,
since I reset all the factors to 1), but how do I now rank them by closeness
of match?

P.S.: the query is being expanded at query time to match all the documents
for the synonyms. I want to make sure that if I enter "Raj", I get Raj as the
topmost result and synonyms like "Raju" after that.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-boost-relevancy-tp3986200.html