Nik Butler wrote:
<snip>
has anyone actually had experience (positive or negative) with such
tools in doing mail address deduplication of text files with SQL ....
<snip>

I doubt that there are many drop-in tools out there since the question of what constitutes a duplicate address/customer will depend on the nature of the business.

For instance, if you're BG then the chances that you have two customers, one named Jon Reades and the other named John Reades at the same address and postcode are very low. If you're a bookstore then the chances are rather higher.

I've heard of some interesting ways of tackling the problem. They generally try to correct for common input errors by hashing the name in one or more of the following ways, so that you can compare two records that you have reason to believe might be the same person:

1. Removing all duplicate vowels (so 'aa' becomes 'a', and so on)
2. Removing all duplicate consonants (so 'll' becomes 'l', and so on)
3. Replacing each run of vowels, whether single or multiple, with a single placeholder (s|[aeiou]+|-|g)
4. Lower-casing everything
5. You might apply some similar rules (very carefully) to some types of address/postcode information
6. You might also consider trying variations on commonly mis-heard letters -- D, E, G, P, etc.
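
As a rough illustration only, here is a minimal Python sketch of rules 1 to 4, plus a grouping step that flags records sharing a key and a postcode. The function names (name_key, candidate_duplicates), the (name, postcode) record layout, the choice of a, e, i, o, u as the vowels, and the decision to drop spaces and punctuation are all my assumptions for the example, not a recommendation for any particular business:

```python
import re
from collections import defaultdict


def name_key(name: str) -> str:
    """Build a loose comparison key from a name (hypothetical helper)."""
    key = name.lower()                        # rule 4: lower-case everything
    key = re.sub(r"[^a-z]", "", key)          # assumption: drop spaces, punctuation, non-ASCII
    key = re.sub(r"[aeiou]+", "-", key)       # rule 3: each vowel run -> one placeholder (covers rule 1)
    key = re.sub(r"([a-z])\1+", r"\1", key)   # rule 2: collapse repeated consonants
    return key


def candidate_duplicates(records):
    """Group (name, postcode) pairs whose name key and postcode both match.

    Returns only the groups with more than one record; these are
    candidates to verify, not confirmed duplicates.
    """
    groups = defaultdict(list)
    for name, postcode in records:
        # assumption: postcodes are compared ignoring spaces and case
        pc = postcode.replace(" ", "").lower()
        groups[(name_key(name), pc)].append((name, postcode))
    return [group for group in groups.values() if len(group) > 1]


records = [
    ("Allan Smith", "W1F 8BH"),
    ("Alan Smith", "w1f8bh"),
    ("Alan Smyth", "W1F 8BH"),
]
for group in candidate_duplicates(records):
    print(group)
```

With these rules 'Allan Smith' and 'Alan Smith' end up with the same key, but 'Jon Reades' and 'John Reades' do not (the 'h' survives), so that pair would need an extra rule of your own.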


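But basically, the rules are up to you: the looser the matching, the more duplicates you'll catch, but the more of those matches you'll need to verify. It's a question of what tradeoff between verification and deduplication you'll accept.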

HTH,

jon
--
jon reades
fulcrum analytics
t: 0870.366.9338
m: 0797.698.7392
f: 0870.888.8880

lower ground floor
2 sheraton street
london w1f 8bh



