Re: Simple distributed example for learning purposes?

Lie Ryan Mon, 28 Dec 2009 08:34:53 -0800

On 12/28/2009 11:59 PM, Shawn Milochik wrote:

With address data:
        one address may have suite data and the other might not
        the same city may have multiple zip codes

Why is that even a problem? You do put suite data and zip code into different database fields, right?

        incoming addresses may be missing information
        typos are common


see below

        sometimes "Route 35" is the same road as "Convery Boulevard"

If you have access to a road-name index, you can theoretically normalize "Route 35" into "Convery Boulevard" (or vice versa). But for most practical purposes, that's a human issue: tell the user to use a certain form of address for the form entry. Tell the user that once they register using "Route 35", they have to refer to it as "Route 35" in the future.
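A minimal sketch of that kind of normalization, assuming the road-name index is just a small alias table (the ROAD_ALIASES dict and the normalize_road name are made up for illustration; a real index would come from somewhere else):

```python
# Hypothetical alias table standing in for a real road-name index.
ROAD_ALIASES = {
    "route 35": "convery boulevard",
    "rt 35": "convery boulevard",
}

def normalize_road(name):
    """Lower-case, collapse whitespace, then map known aliases to one form."""
    key = " ".join(name.lower().split())
    # Unknown road names pass through unchanged (apart from case/whitespace).
    return ROAD_ALIASES.get(key, key)
```

Running the same function over both the stored addresses and the incoming ones means "Route 35" and "Convery Boulevard" end up comparing equal.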

With names:
        you have to compare with and without the middle name
        compare with and without the title (Mrs., Dr., Mr., Ms.)
        compare with and without the suffix (PhD., Sr., Junior, III, etc.)

They're never a problem. Names should be separated into at least two fields, firstname and lastname, and titles and suffixes should have their own fields:


Mrs. John Doe -> titles: (mrs,); first: john; last: doe
Doe, John -> titles: (); first: john; last: doe
John Doe -> titles: (); first: john; last: doe
John Foo Doe -> titles: (); first: john middle: foo; last: doe
         or: -> titles: (); first: john; middle: foo; last: doe
Doe, John Foo -> titles: (); first: john; middle: foo; last: doe
Prof. John Doe -> titles: (professor,); first: john; last: doe
dr. John Doe, PhD -> titles: (doctor, PhD); first: john; last: doe
Lady John Doe III -> titles: (lady, III); first: john; last: doe
Lady John Doe The Third -> titles: (lady, III); first: john; last: doe
John Doe Jr. -> titles: (junior,); first: john; last: doe

If both the "query" and the "index" are normalized with the same (or similar) algorithm, that would significantly reduce the need for fuzzy search.
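Something along these lines would produce the fields listed above. This is only a sketch: the TITLES and SUFFIXES tables and the normalize_name function are hypothetical, it does not handle spelled-out forms such as "The Third", and it will misfire on anyone whose actual first name happens to be a title word.

```python
import re

# Hypothetical lookup tables; extend them for real data.
TITLES = {"mr": "mr", "mrs": "mrs", "ms": "ms", "dr": "doctor",
          "prof": "professor", "lady": "lady"}
SUFFIXES = {"phd": "PhD", "jr": "junior", "junior": "junior",
            "sr": "senior", "iii": "III"}

def normalize_name(raw):
    """Split a free-form name into titles, first, middle and last fields."""
    raw = raw.strip()
    # "Doe, John Foo" -> "John Foo Doe"; a trailing suffix such as
    # "John Doe, PhD" is left where it is.
    if "," in raw:
        head, tail = [part.strip() for part in raw.split(",", 1)]
        if tail.lower().rstrip(".") not in SUFFIXES:
            raw = tail + " " + head
    tokens = [t.strip(".").lower() for t in re.split(r"[\s,]+", raw)]
    tokens = [t for t in tokens if t]

    titles, rest = [], []
    for tok in tokens:
        if tok in TITLES:
            titles.append(TITLES[tok])
        elif tok in SUFFIXES:
            titles.append(SUFFIXES[tok])
        else:
            rest.append(tok)

    return {
        "titles": tuple(titles),
        "first": rest[0] if rest else "",
        "middle": " ".join(rest[1:-1]),
        "last": rest[-1] if len(rest) > 1 else "",
    }
```

The point is that the same function is applied to the stored records and to whatever the user types, so "Doe, John" and "John Doe" become the same record before any comparison happens.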

        typos are VERY common

That's where fuzzy search comes in, but the database entries themselves should be normalized long before fuzzy search kicks in.
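For the typo case, the standard library's difflib is probably enough to get a feel for it. A rough sketch, assuming the index is a plain list of already-normalized names (fuzzy_lookup, the sample index and the 0.8 cutoff are arbitrary choices for illustration):

```python
import difflib

def fuzzy_lookup(query, index, cutoff=0.8):
    """Normalize the query first, then fall back to approximate matching."""
    normalized = " ".join(query.lower().split())
    # get_close_matches returns the best candidates, most similar first.
    return difflib.get_close_matches(normalized, index, n=5, cutoff=cutoff)

# Hypothetical index of names that were normalized when they were stored.
index = ["john doe", "jane doe", "henry smith"]
```

With this index, a transposed-letter query like "Jonh Doe" should still find "john doe", while an unrelated string finds nothing.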

        what if John Henry Smith goes by "Henry Smith"?

What's wrong with that? Your "name search" algorithm can combine the firstname, middlename, and lastname fields into one "superview" for searching purposes.
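One way to read "superview" is a set of every name token on the record, so a query matches when all of its words appear somewhere in that set. A small sketch under that assumption (the record layout and function names are invented here):

```python
def superview(record):
    """Combine the separate name fields into one set of searchable tokens."""
    parts = (record.get("first", ""),
             record.get("middle", ""),
             record.get("last", ""))
    return set(" ".join(parts).lower().split())

def matches(query, record):
    """True if every word of the query appears in the record's superview."""
    return set(query.lower().split()) <= superview(record)

record = {"first": "John", "middle": "Henry", "last": "Smith"}
```

Here both "Henry Smith" and "John Smith" match the record, because the search never cares which field a word came from.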

        what if Xu Wang goes by "John Wang" (happens all the time)

        maiden name versus married name

Your search query should be normalized as well. Search "Xu" first, then search "Wang", then find the intersection. Show the intersection to the user; if they can't find the correct name in it, then offer the person querying the option to search for "Xu" only or "Wang" only. However, if John Wang goes by Jack Black, then it is indeed an unsolvable problem.
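The intersection-then-fallback idea can be sketched like this, assuming the index maps a record id to the set of lower-case name tokens on that record (that layout, the sample data and the search function are all made up for the example):

```python
def search(query, index):
    """Return (ids matching every word, ids matching only some words)."""
    tokens = query.lower().split()
    # One result set per query word.
    per_token = [{rid for rid, names in index.items() if tok in names}
                 for tok in tokens]
    if not per_token:
        return set(), set()
    both = set.intersection(*per_token)
    either = set.union(*per_token)
    # The second set is the fallback offered when the first one fails.
    return both, either - both

# Hypothetical index: record id -> name tokens.
index = {
    1: {"xu", "wang"},
    2: {"john", "wang"},
    3: {"xu", "li"},
}
```

Searching for "Xu Wang" gives record 1 as the exact intersection and records 2 and 3 as the "Xu"-only or "Wang"-only fallbacks, so the John Wang entry is still reachable.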

        etc. etc. etc.

This is a major, real-world issue that remains unsolved, and companies that do 
a decent job at it make millions of dollars a year from their clients. One of 
my old jobs made tens of millions a year (and growing FAST) in the medical
industry alone.

I agree fuzzy search is indispensable in certain cases, but from the way you're describing the issue, it appears that half of your "unsolved" problems are due to the poor design of the database. I agree that the other half (e.g. typos, multiple names/addresses) is indeed unsolvable.

--
http://mail.python.org/mailman/listinfo/python-list
