Thanks for the pointer. I found the thread, and there is certainly some interesting information there. I'd like to stick to what Lucene has available today, mainly because I lack the time to implement anything more than that. I originally thought of Levenshtein distance, but then realized that Lucene would probably have to scan the whole index for that. I don't need anything too fancy, so I'm still wondering if NGrams with some sort of proximity ranking would do the trick. By proximity, I mean how closely the relative positions and order of the NGrams in the document field match those of the same NGrams in the search string. I'm hoping NGrams would avoid the need for a whole index scan. Does Lucene already factor this into its hit score, or would I need to do some custom work?

 - Andy
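
To make "proximity and order" concrete, here is a minimal sketch of one possible order-aware NGram score. It is plain Python, not Lucene code, and it does not claim to show how Lucene scores hits. The function names (char_ngrams, lcs_length, ordered_ngram_score) and the choice of a longest-common-subsequence measure are assumptions made for illustration only.

```python
def char_ngrams(name, n=2):
    """Return the character n-grams of each whitespace-separated word, in order."""
    grams = []
    for word in name.lower().split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ordered_ngram_score(query, candidate, n=2):
    """Score in [0, 1]; shared n-grams only count if they occur in the same order."""
    q, c = char_ngrams(query, n), char_ngrams(candidate, n)
    if not q or not c:
        return 0.0
    return 2.0 * lcs_length(q, c) / (len(q) + len(c))


for name in ("Andrew John", "John Andrew"):
    print(name, round(ordered_ngram_score("Andy John", name), 3))
```

With bigrams, the search string "Andy John" scores 0.714 against "Andrew John" and 0.429 against "John Andrew", so the name with the NGrams in the same order ranks first. This compares two strings directly; applying something similar to search hits would presumably mean re-scoring a short list of candidates rather than the whole index.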

Grant Ingersoll wrote:
I believe there were some posts on this about a year ago. Try searching the archives for duplicate names, as well as "record linkage" or any other synonyms you can think of. The short answer is that Lucene is a reasonable tool for this, but you may need some help. The long answer is to dig into those archives and see the other recommendations.

-Grant

On Apr 16, 2008, at 12:37 PM, Andy DePue wrote:

I'm new to Lucene, and would like to use it to find duplicate (or similar) names in a contact list. Is Lucene a good fit? We have a form where a user enters a company's or person's name, and we want the system to warn them if a company or person with the same or a similar name has already been entered. Based on the little I know of Lucene, I'm thinking an NGram algorithm (based on characters, not words) would work best, but I'm not sure whether Lucene takes proximity or edit distance into account. For example, say you have these two names:
Andrew John
John Andrew

If a user enters Andy John then, without proximity or edit distance, these two names will match about equally well, while the first one should obviously be ranked higher.
Thanks in advance for any help or advice.
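
The Andrew John / John Andrew example can be checked with a small sketch of order-blind character NGram matching. This is plain Python rather than Lucene code; the helper names (char_ngram_set, dice_overlap) and the use of a Dice coefficient over per-word bigrams are assumptions chosen for illustration, not a description of Lucene's scoring.

```python
def char_ngram_set(name, n=2):
    """Return the set of character n-grams taken from each word of the name."""
    grams = set()
    for word in name.lower().split():
        grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice_overlap(query, candidate, n=2):
    """Dice coefficient over n-gram sets; ignores where the n-grams occur."""
    q, c = char_ngram_set(query, n), char_ngram_set(candidate, n)
    if not q or not c:
        return 0.0
    return 2.0 * len(q & c) / (len(q) + len(c))


for name in ("Andrew John", "John Andrew"):
    print(name, round(dice_overlap("Andy John", name), 3))
```

Both names come out at 0.714 here, because a set of NGrams carries no information about word or character order. Telling the two apart needs some position-aware signal on top of the overlap.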

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






