On Dec 27, 2009, at 1:23 PM, Lie Ryan wrote:
> 
> IMHO, that's a poor example. Rather than writing a fuzzy search algorithm, 
> it's easier to write a normalizer before entering data to the index (or 
> before comparing the search string with the corpus' string).

It does seem that way at first, but it turns out that you can't reliably 
normalize this data, for many reasons.

With address data:
        one address may have suite data and the other might not
        the same city may have multiple zip codes
        incoming addresses may be missing information
        typos are common
        sometimes "Route 35" is the same road as "Convery Boulevard"
        etc. etc. etc.
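
To make the address case concrete, here is a rough sketch using the standard 
library's difflib. The similarity() helper and the sample addresses are made up 
for illustration; this is not how any production matcher works. It shows that 
light normalization handles case and spacing, a fuzzy score tolerates typos, 
and neither helps when the same road goes by two names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 score after folding case and collapsing whitespace.

    Hypothetical helper for illustration only.
    """
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    return SequenceMatcher(None, a, b).ratio()


# Same address, different case and spacing: normalization alone is enough.
exact = similarity("123 Convery Boulevard", "123  convery boulevard")

# Transposed letters: normalization cannot fix this, a fuzzy score can.
typo = similarity("123 Convery Boulevard", "123 Convrey Boulevrad")

# Same road under another name: string similarity is no help at all.
alias = similarity("123 Convery Boulevard", "123 Route 35")

print(exact, typo, alias)
```

The alias case scores far below the typo case, so a real system would also 
need reference data (for example a table of known road aliases) on top of any 
string comparison.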

With names:
        you have to compare with and without the middle name
        compare with and without the title (Mrs., Dr., Mr., Ms.)
        compare with and without the suffix (PhD., Sr., Junior, III, etc.)
        typos are VERY common
        what if John Henry Smith goes by "Henry Smith"?
        what if Xu Wang goes by "John Wang" (happens all the time)
        maiden name versus married name
        etc. etc. etc.
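
The "with and without" comparisons above can be sketched like this. The 
TITLES and SUFFIXES sets, the variants() helper and the sample names are all 
assumptions of mine for illustration, and the lists are nowhere near complete.

```python
from difflib import SequenceMatcher

# Illustrative, incomplete lists (assumptions, not a real reference set).
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"phd", "sr", "jr", "junior", "ii", "iii"}


def tokens(name):
    """Lowercase the name and drop punctuation, titles and suffixes."""
    words = [w.strip(".,").lower() for w in name.split()]
    return [w for w in words if w and w not in TITLES and w not in SUFFIXES]


def variants(name):
    """Return the plausible forms of a name to compare against."""
    parts = tokens(name)
    forms = {" ".join(parts)}
    if len(parts) > 2:
        forms.add(" ".join([parts[0], parts[-1]]))  # without the middle name
        forms.add(" ".join(parts[1:]))              # goes by the middle name
    return forms


def best_score(a, b):
    """Best 0..1 similarity over every pair of variants of the two names."""
    return max(
        SequenceMatcher(None, x, y).ratio()
        for x in variants(a)
        for y in variants(b)
    )


print(best_score("Dr. John Henry Smith, PhD.", "Henry Smith"))
print(best_score("John Smith", "Jon Smyth"))
print(best_score("Xu Wang", "John Wang"))
```

The first two pairs score high, but "Xu Wang" versus "John Wang" does not, and 
a maiden name versus a married name would not either. Those cases need extra 
evidence (date of birth, address, and so on) rather than better string 
matching.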

This is a major, real-world problem that remains unsolved, and companies that 
do a decent job at it make millions of dollars a year from their clients. One 
of my former employers made tens of millions a year (and growing FAST) doing 
this in the medical industry alone.

Shawn
        

-- 
http://mail.python.org/mailman/listinfo/python-list
