Re: How best to compare tow sentences

Paul Taylor Wed, 03 Dec 2014 07:46:44 -0800

On 03/12/2014 15:14, Barry Coughlan wrote:

Hi Paul,
I don't have much expertise in this area so hopefully others willanswer, but maybe this is better than nothing.
I don't know many out-of-the-box solutions for this problem, but I'msure they exist. Mahout and Carrot2 might be worth investigating.
Similarity Metrics:
- Jaccard Index. Measures similarity between two sets, so word orderis not important.- Levenshtein distance. If you are dealing with user-inputted typos,the Damerau–Levenshtein distance can perform a bit better because ittakes into account swapping adjacent letters (e.g. teh -> the).
I worked with some code that did this for author names e.g. merge"Barack Obama", "Obama B." and "B. H. Obama". It used a combination ofDamerau–Levenshtein distance and Jaccard index. It worked very wellfor this problem, but unfortunately the code was sparse ondocumentation and full of magic numbers so I don't know the details.The approach was similar to the approach described in this answer:http://stackoverflow.com/a/11920867/281469
This is an O(n^2) pairwise comparison problem. As your data getsbigger you have to work around this limitation. This problem is knownin research literature as the "all-pairs" similarity problem. Thepaper linked from this repository is a good read on the subject:https://code.google.com/p/google-all-pairs-similarity-search/
One of the ways you can work around this is by using Lucene to limitthe amount of comparisons you need to do:
- Index your documents.
- For each song title do a fuzzy search of the words.
- Take the top N results, calculate their similarity with the songtitle using the metrics (or just use the Lucene score).
- Cluster similar titles by song title.
This is basically creating a sparse inverted index of your documentmatrix, so that you can find results that will produce non-zerosimilarities. This is the most effective way of optimizing performancethat I have encountered.
Again, I'm sure there are out-of-the-box solutions for this problem,but I don't know what they are.
Hope that helps,
Barry

Thankyou barry I wil spend some time going through your suggestions, inthe library Im looking at I dont seem to have Damerau–Levenshtein but Ido have jaccardSimilarity so if that understands words Ill will give ita try.


|BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex
|
One things, regaridng your lucene based solution I think you have missed an 
important point. I am only comparing TWO strings at any time, I dont have a 
dataset of thousands of sentences that I want to compare over time I literally 
have string a and string b and I just want to compare those, at a later date 
Ill have string c and d, but at no point do I have strings a,b,c,d. I'm not 
trying to find the best  matching string for a single title just is this String 
a good match for this song title.

Paul

Re: How best to compare tow sentences

Reply via email to