On Dec 27, 2009, at 1:23 PM, Lie Ryan wrote:

> IMHO, that's a poor example. Rather than writing a fuzzy search algorithm,
> it's easier to write a normalizer before entering data to the index (or
> before comparing the search string with the corpus' string).

It does seem like that at first, but it turns out that you can't normalize this data, for many reasons.

With address data:

  - one address may have suite data and the other might not
  - the same city may have multiple zip codes
  - incoming addresses may be missing information
  - typos are common
  - sometimes "Route 35" is the same road as "Convery Boulevard"
  - etc. etc. etc.

With names:

  - you have to compare with and without the middle name
  - compare with and without the title (Mrs., Dr., Mr., Ms.)
  - compare with and without the suffix (PhD., Sr., Junior, III, etc.)
  - typos are VERY common
  - what if John Henry Smith goes by "Henry Smith"?
  - what if Xu Wang goes by "John Wang"? (happens all the time)
  - maiden name versus married name
  - etc. etc. etc.

This is a major, real-world issue that remains unsolved, and companies that do a decent job at it make millions of dollars a year from their clients. One of my old jobs made tens of millions a year (and growing FAST) in the medical industry alone.

Shawn

--
http://mail.python.org/mailman/listinfo/python-list
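
To make the name cases above concrete, here is a rough sketch using only the standard library's difflib. Everything in it is an assumption made for illustration: the TITLES and SUFFIXES sets are deliberately tiny, and name_tokens, name_variants, and name_similarity are hypothetical helpers, not anything a real matching product uses. It compares every "with and without" variant of two names and keeps the best score.

```python
import difflib

# Hypothetical, deliberately incomplete lists (illustration only).
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"phd", "sr", "jr", "junior", "ii", "iii"}


def name_tokens(name):
    """Lowercase the name, strip punctuation, drop titles and suffixes."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    return [t for t in tokens if t and t not in TITLES and t not in SUFFIXES]


def name_variants(name):
    """Return the cleaned full name plus variants without some given names."""
    tokens = name_tokens(name)
    variants = {" ".join(tokens)}
    if len(tokens) > 2:
        # First and last name only (middle names dropped).
        variants.add(" ".join([tokens[0], tokens[-1]]))
        # Person goes by a middle name (first name dropped).
        variants.add(" ".join(tokens[1:]))
    return variants


def name_similarity(a, b):
    """Best SequenceMatcher ratio (0.0 to 1.0) over all variant pairs."""
    return max(
        difflib.SequenceMatcher(None, x, y).ratio()
        for x in name_variants(a)
        for y in name_variants(b)
    )


if __name__ == "__main__":
    pairs = [
        ("Dr. John Henry Smith", "Henry Smith"),  # title plus middle-name usage
        ("John Smith Sr.", "Jon Smith"),          # suffix plus typo
        ("Xu Wang", "John Wang"),                 # adopted first name
    ]
    for a, b in pairs:
        print(f"{a!r} vs {b!r}: {name_similarity(a, b):.2f}")
```

The first two pairs score high, but "Xu Wang" versus "John Wang" does not, because no amount of string similarity knows that the two given names belong to the same person. Maiden versus married names fail the same way, which is why neither a normalizer nor a simple ratio like this one settles the problem.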