"Andrew McLean" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
I have a problem that I suspect isn't unusual and I'm looking to see if there is any code available to help. I've Googled without success.
There isn't any publicly available code that I'm aware of. Companies that do a good job of address matching regard that code as a competitive advantage on a par with the crown jewels.
Basically, I have two databases containing lists of postal addresses and need to look for matching addresses in the two databases. More precisely, for each address in database A I want to find a single matching address in database B.
I'm 90% of the way there, in the sense that I have a simplistic approach that matches 90% of the addresses in database A. But the extra cases could be a pain to deal with!
From a purely pragmatic viewpoint, is this a one-off, and how many non-matches do you have to deal with? If the answers are yes, and not all that many, I'd do the rest by hand.
It's probably not relevant, but I'm using ZODB to store the databases.
I doubt if it's relevant.
The current approach is to loop over addresses in database A. I then identify all addresses in database B that share the same postal code (typically less than 50). The database has a mapping that lets me do this efficiently. Then I look for 'good' matches. If there is exactly one I declare a success. This isn't as efficient as it could be, it's O(n^2) for each postcode, because I end up comparing all possible pairs. But it's fast enough for my application.
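The loop described above might look roughly like the following. This is only a sketch of the idea, not the poster's actual code: the names match_by_postcode and is_good_match are made up for illustration, plain lists of (postcode, address) tuples stand in for the ZODB mapping, and the definition of a 'good' match is left to the caller.

```python
from collections import defaultdict


def match_by_postcode(addresses_a, addresses_b, is_good_match):
    """Return {index in A: index in B} for addresses with exactly one good match.

    addresses_a and addresses_b are lists of (postcode, address_text) tuples.
    is_good_match(text_a, text_b) is a caller-supplied predicate (hypothetical).
    """
    # Index database B by postcode so each lookup is a single dict access.
    by_postcode = defaultdict(list)
    for j, (postcode, text) in enumerate(addresses_b):
        by_postcode[postcode].append((j, text))

    matches = {}
    for i, (postcode, text) in enumerate(addresses_a):
        candidates = [j for j, other in by_postcode.get(postcode, [])
                      if is_good_match(text, other)]
        # Only declare success when the match is unambiguous.
        if len(candidates) == 1:
            matches[i] = candidates[0]
    return matches
```

Addresses with no candidate or with several candidates are simply left out of the result, so they can be reviewed separately.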
The problem is looking for good matches. I currently normalise the addresses to ignore some irrelevant issues like case and punctuation, but there are other issues.
I used to work on a system that had a reasonably decent address matching routine. The critical issue is, as you suspected, normalization. You're not going far enough. You've also got an issue here that doesn't exist in the States - named buildings.
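As a starting point, a first normalization pass could look something like this. It is a minimal sketch, assuming each address arrives as a single string; the function name normalise and the exact rules are illustrative only, and dropping "the" follows the suggestion made further down in this reply.

```python
import re


def normalise(address):
    """Return a tuple of upper-case tokens with punctuation and 'THE' removed."""
    text = address.upper()
    # Remove apostrophes outright so that ST JOHN'S becomes ST JOHNS.
    text = text.replace("'", "")
    # Every other run of non-alphanumeric characters becomes a single space.
    text = re.sub(r"[^A-Z0-9]+", " ", text)
    return tuple(tok for tok in text.split() if tok != "THE")
```

Comparing token tuples rather than raw strings makes differences in case, commas and a leading "The" disappear, but it does nothing yet for misplaced house numbers or spelling variations.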
Here are just some examples where the software didn't declare a match:
1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS
The first line is a street address, the second is a named building and a street without a house number. There's no way of matching this unless you know that The Beeches doesn't have flat (or room, etc.) numbers and can move the 1 to being the street address. On the other hand, this seems to be a consistent problem in your data base - in the US, the street address must be associated with the street name. No comma is allowed between the two.
Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP
The first is a flat, house name and street name, the second is a number and a house name. Assuming that UK postal standards don't allow more than one named building in a postal code, this is easily matchable if you do a good job of normalization.
Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU
The issue here is to use the words "flat" and "the" to split the flat name and the house name. Then the house number is in the wrong part - it should go with the street name. See the comment above.
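Moving a stray house number across the comma could be sketched along these lines. The function name move_house_numbers is made up for illustration, and the code assumes the address has already been upper-cased and split on commas. It cannot tell a flat number from a house number, so it would need the kind of extra knowledge about the building discussed above.

```python
import re


def move_house_numbers(components):
    """Move a number that ends one component to the start of the next one."""
    result = [c.strip() for c in components]
    # The last component is never inspected, so a trailing postcode is left alone.
    for i in range(len(result) - 1):
        # A component ending in digits, optionally followed by one letter (12A).
        m = re.fullmatch(r"(.*?)\s*(\d+[A-Z]?)", result[i])
        if m:
            result[i] = m.group(1)
            result[i + 1] = f"{m.group(2)} {result[i + 1]}"
    # Drop components that became empty, e.g. a bare "2" before a house name.
    return [c for c in result if c]
```

Applied to the second address above, split on commas, this gives "PENTHOUSE FLAT THE OLD VICARAGE" and "1 CLAY LANE" as the first two components.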
St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL
This one may not be resolvable, unless there is only one house name with "presbytery" in it within the postal code. Notice that "the" should probably be dropped when normalizing.
The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF
Spelling correction needed.
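One cheap way to tolerate this kind of variation, short of real spelling correction, is to compare street names with the spaces squeezed out and accept near matches. The sketch below uses difflib from the standard library; the function name similarity is made up, and any cut-off you pick (say 0.9) is an assumption that would need tuning against your data to avoid false positives.

```python
from difflib import SequenceMatcher


def _squash(text):
    """Upper-case the text and remove all whitespace."""
    return "".join(text.upper().split())


def similarity(a, b):
    """Return a score between 0 and 1, ignoring case and spacing."""
    return SequenceMatcher(None, _squash(a), _squash(b)).ratio()
```

With this, "White Sheet Hill" and "WHITESHEET HILL" compare as identical, while a genuinely different street in the same postcode scores much lower.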
The challenge is to fix some of the false negatives above without introducing false positives!
Any pointers gratefully received.
If, on the other hand, this is a repeating problem that's simply going to be an ongoing headache, I'd look into commercial address correction software. Here in the US, there are a number of vendors that have such software to correct addresses to the standards of the USPS. They also have data bases of all the legitimate addresses in each postal code. They're adjuncts of mass mailers, and they exist because the USPS gives a mass mailing discount based on the number of "good" addresses you give them.
I don't know what the situation is in the UK, but I'd be surprised if there wasn't some available address data base, either commercial or free, possibly as an adjunct of the postal service.
The latter, by the way, is probably the first place I'd look. The postal service has a major interest in having addresses that they can deliver without a lot of hassle.
Another place is Google. The first two pages of results for "Address Matching software" gave two UK references, and several Australian references.
John Roth
--
Andrew McLean
-- http://mail.python.org/mailman/listinfo/python-list