Re: Using Lucene to find duplicate/similar names

Grant Ingersoll Wed, 16 Apr 2008 09:11:42 -0700

I believe there were some posts on this about a year ago. Try searching in the archives for duplicate names, as well as "record linkage" or any other various synonyms that you can think of. The short answer is Lucene is reasonable to attempt this with, but you may need some help. The long answer is to dig into those archives and see the other recommendations.


-Grant


On Apr 16, 2008, at 12:37 PM, Andy DePue wrote:

I'm new to Lucene, and would like to use it to find duplicate (or similar) names in a contact list. Is Lucene a good fit?
We have a form where a user enters a company or person's name, and we want the system to warn them if there is already a company or person entered with the same or similar name.
Based on the little I know of Lucene, I'm thinking an NGram algorithm (based on characters, not words) would work best... but, I'm not sure if Lucene takes proximity or edit distances into account? For example, say you have these two names:
Andrew John
John Andrew
If a user enters Andy John, without proximity or edit distance, these two names will match about the same, while, obviously, the first name should be ranked higher.
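To make the concern concrete, here is a rough sketch in plain Java. It does not use any Lucene classes; the class name NameSimilaritySketch and its methods are made up for illustration, and it assumes bigrams are taken per word so that none spans the space between words. It scores both candidates with an order-insensitive character-bigram overlap (Dice coefficient) and then with a plain Levenshtein edit distance.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of Lucene.
class NameSimilaritySketch {

    // Collect character bigrams word by word, so no bigram spans a space.
    static Set<String> wordBigrams(String name) {
        Set<String> grams = new HashSet<>();
        for (String word : name.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int i = 0; i + 2 <= word.length(); i++) {
                grams.add(word.substring(i, i + 2));
            }
        }
        return grams;
    }

    // Dice coefficient over the two bigram sets: 2 * |common| / (|A| + |B|).
    // Word order is ignored because only set membership matters.
    static double dice(String a, String b) {
        Set<String> ga = wordBigrams(a);
        Set<String> gb = wordBigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    // Levenshtein edit distance (insert, delete, substitute) using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "andy john";
        for (String candidate : new String[] {"andrew john", "john andrew"}) {
            System.out.println(candidate
                    + "  dice=" + dice(query, candidate)
                    + "  editDistance=" + levenshtein(query, candidate));
        }
    }
}
```

Under those assumptions the bigram score is identical for both candidates (5 shared bigrams out of 6 and 8, so 10/14), while the edit distance is 3 for Andrew John and larger for John Andrew. So an order-insensitive n-gram match alone cannot separate the two, and some position-aware or edit-distance signal is needed to rank the first name higher.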
Thanks in advance for any help or advice.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






