On 12/28/2009 11:59 PM, Shawn Milochik wrote:
With address data:
        one address may have suite data and the other might not
        the same city may have multiple zip codes

Why is that even a problem? You do put suite data and zip code into different database fields, right?

        incoming addresses may be missing information
        typos are common

See below.

        sometimes "Route 35" is the same road as "Convery Boulevard"

If you have access to a road-name index, you can in theory normalize "Route 35" into "Convery Boulevard" (or vice versa). But for most practical purposes that's a human issue: tell users to enter addresses in one particular form. Tell them that once they register using "Route 35", they have to refer to it as "Route 35" in the future.
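
A minimal sketch of that kind of normalization, assuming you have some alias table to work from. ROAD_ALIASES and normalize_road are made-up names for illustration, and a real table would be loaded from a road-name index rather than typed in by hand:

```python
# Hypothetical alias table; in practice this would come from a road-name index.
ROAD_ALIASES = {
    "route 35": "convery boulevard",
    "rt 35": "convery boulevard",
}


def normalize_road(name):
    """Return the canonical road name, or the cleaned-up input if unknown."""
    # Lowercase, drop periods, and collapse runs of whitespace.
    key = " ".join(name.lower().replace(".", "").split())
    return ROAD_ALIASES.get(key, key)
```

Run the same function over both the stored addresses and the incoming ones, so that "Route 35" and "Convery Boulevard" end up as the same key.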

With names:
        you have to compare with and without the middle name
        compare with and without the title (Mrs., Dr., Mr., Ms.)
        compare with and without the suffix (PhD., Sr., Junior, III, etc.)

These are never a problem. Names should be separated into at least two fields, firstname and lastname, and titles and suffixes should have their own fields:

Mrs. John Doe -> titles: (mrs,); first: john; last: doe
Doe, John -> titles: (); first: john; last: doe
John Doe -> titles: (); first: john; last: doe
John Foo Doe -> titles: (); first: john middle: foo; last: doe
         or: -> titles: (); first: john; middle: foo; last: doe
Doe, John Foo -> titles: (); first: john; middle: foo; last: doe
Prof. John Doe -> titles: (professor,); first: john; last: doe
dr. John Doe, PhD -> titles: (doctor, PhD); first: john; last: doe
Lady John Doe III -> titles: (lady, III); first: john; last: doe
Lady John Doe The Third -> titles: (lady, III); first: john; last: doe
John Doe Jr. -> titles: (junior,); first: john; last: doe

If both the "query" and the "index" are normalized with the same (or a similar) algorithm, that would significantly reduce the need for fuzzy search.
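
A rough sketch of such a normalizer, covering only the example names listed above. The TITLES and SUFFIXES tables and the normalize_name function are assumptions made for illustration, not a complete parser; a form like "Dr. Doe, John" would still trip it up:

```python
# Illustrative lookup tables; a real system would need far larger ones.
TITLES = {"mr": "mr", "mrs": "mrs", "ms": "ms", "lady": "lady",
          "dr": "doctor", "doctor": "doctor",
          "prof": "professor", "professor": "professor"}
SUFFIXES = {"phd": "PhD", "jr": "junior", "junior": "junior",
            "sr": "senior", "senior": "senior", "iii": "III"}


def normalize_name(raw):
    """Split a free-form name into titles, first, middle and last."""
    text = raw.lower().replace(".", "").replace("the third", "iii")
    if "," in text:
        head, tail = [part.strip() for part in text.split(",", 1)]
        tail_words = tail.split()
        if tail_words and all(w in SUFFIXES for w in tail_words):
            # "john doe, phd": the comma only introduces a suffix.
            words = head.split() + tail_words
        else:
            # "doe, john foo": last name first, so move it to the end.
            words = tail_words + head.split()
    else:
        words = text.split()

    titles = []
    while words and words[0] in TITLES:
        titles.append(TITLES[words.pop(0)])
    suffixes = []
    while words and words[-1] in SUFFIXES:
        suffixes.insert(0, SUFFIXES[words.pop()])

    return {
        "titles": tuple(titles + suffixes),
        "first": words[0] if words else "",
        "middle": " ".join(words[1:-1]),
        "last": words[-1] if len(words) > 1 else "",
    }
```

Apply the same function to the stored names and to whatever the user types in, and compare the resulting fields rather than the raw strings.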

        typos are VERY common

That's where fuzzy search comes in, but the database entries themselves should be normalized long before fuzzy search kicks in.
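
As a sketch of that ordering (normalize first, then fuzzy-match), the standard library's difflib can stand in for a real fuzzy search engine. The fuzzy_lookup helper and its 0.8 cutoff are arbitrary choices for illustration:

```python
import difflib


def fuzzy_lookup(query, index, cutoff=0.8):
    """Return normalized index entries that closely match the query."""
    def norm(s):
        # Same normalization for both sides: lowercase, collapse whitespace.
        return " ".join(s.lower().split())

    keys = [norm(entry) for entry in index]
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    return difflib.get_close_matches(norm(query), keys, n=5, cutoff=cutoff)
```

With an index of ["John Doe", "Jane Roe", "Xu Wang"], the typo "Jonh  Doe" should still come back as "john doe", while an unrelated query returns nothing.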

        what if John Henry Smith goes by "Henry Smith"?

What's wrong with that? Your "name search" algorithm can combine the firstname, middlename, and lastname fields into one "superview" for search purposes.
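
One possible shape for such a "superview", assuming each record is a dict with first, middle and last keys (that layout and the function names are made up for this example):

```python
def superview(record):
    """Combine the name fields into one set of lowercase words."""
    parts = (record.get("first", ""),
             record.get("middle", ""),
             record.get("last", ""))
    return set(" ".join(parts).lower().split())


def name_matches(query, record):
    """True if every word of the query appears in the record's name."""
    words = set(query.lower().split())
    return bool(words) and words <= superview(record)
```

With this, a record for John Henry Smith matches both "Henry Smith" and "John Smith", because the query only has to be a subset of the combined fields.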

        what if Xu Wang goes by "John Wang" (happens all the time)
        maiden name versus married name

Your search query should be normalized as well. Search for "Xu" first, then search for "Wang", then take the intersection. Show the intersection to the user; if they can't find the correct name in it, offer them the option to search for "Xu" only or "Wang" only. However, if John Wang goes by Jack Black, then it is indeed an unsolvable problem.
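
Roughly, in code, assuming the records are plain name strings. The search function and its two-list return value are only one way to arrange this:

```python
def search(query, records):
    """Return (exact, partial) matches for a multi-word name query.

    exact:   records containing every word of the query
    partial: records containing only some of the words
    """
    words = query.lower().split()
    # One set of matching record positions per query word.
    hits = [
        {i for i, name in enumerate(records) if word in name.lower().split()}
        for word in words
    ]
    if not hits:
        return [], []
    both = set.intersection(*hits)
    either = set.union(*hits) - both
    return ([records[i] for i in sorted(both)],
            [records[i] for i in sorted(either)])
```

Show the first list to the user, and fall back to the second one only if the right person isn't in it. Searching for "Xu Wang" would then offer "John Wang" as a partial match, but nothing here can connect "John Wang" to "Jack Black".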

        etc. etc. etc.

This is a major, real-world issue that remains unsolved, and companies that do 
a decent job at it make millions of dollars a year from their clients. One of 
my old jobs made tens of millions a year (and growing FAST) in the medical 
industry alone.

I agree that fuzzy search is indispensable in certain cases, but from the way you're describing the issue, it appears that half of your "unsolved" problems come from poor database design. I agree that the other half (e.g. typos, multiple names/addresses) are indeed unsolvable.
