Andrew McLean wrote:
The problem is looking for good matches. I currently normalise the addresses to ignore some irrelevant issues like case and punctuation, but there are other issues.
I'd do somewhat more extensive normalization. First, strip off everything from the town through the postal code (e.g. 'Beaminster, Dorset, DT8 3SS' in your examples). In the remaining string, remove any punctuation and filler words like "the", "flat", etc.
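A rough sketch of that normalization in Python, not a tested solution. The function name `normalise`, the `NOISE_WORDS` set and the `town` parameter are all made up for illustration, and it assumes you know the town name to cut at (here "beaminster", taken from your examples). It also lower-cases everything, since you already ignore case.

```python
import re

# Assumption: filler words to drop; extend this set to suit your data.
NOISE_WORDS = {"the", "flat"}

def normalise(address, town="beaminster"):
    """Reduce an address to its distinguishing words, lower-cased."""
    text = address.lower()
    # Cut everything from the town name onwards (town, county, postcode).
    text = re.split(rf"\b{re.escape(town)}\b", text, maxsplit=1)[0]
    # Drop apostrophes so "john's" becomes "johns" rather than "john s".
    text = text.replace("'", "")
    # Turn all remaining punctuation into spaces.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Remove filler words and collapse whitespace.
    words = [w for w in text.split() if w not in NOISE_WORDS]
    return " ".join(words)

print(normalise("THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS"))
# prints: beeches 1 brantwood
```

If the data covers more than one town, the cut-off point would have to come from somewhere else, for instance a list of known town names or the position of the postcode.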
Here are just some examples where the software didn't declare a match:
And how they'd look after the transformation I suggest above:
1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

    1 Brantwood
    BEECHES 1 BRANTWOOD

Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

    2 Bethany House Broadwindsor Road
    2 BETHANY HOUSE

Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

    Penthouse Old Vicarage 1 Clay Lane
    PENTHOUSE OLD VICARAGE 1 CLAY LANE

St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

    St Johns Presbytery Shortmoor
    PRESBYTERY SHORTMOOR

The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

    Pinnacles White Sheet Hill
    PINNACLES WHITESHEET HILL
Obviously, this transformation is not perfect, but it gets the strings closer. At this point, you could perhaps say that if either string is a substring of the other (ignoring case), you have a match. That should work with all of these examples except the last one.

You could either do this munging for all address lookups, or do it only for those that don't find a match in the simplistic way. Either way, you can store Database B's pre-munged addresses so that you don't need to keep recomputing them. I can't say for certain how this will perform as far as false positives go, but I'd expect that it wouldn't be too bad.
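The substring test and the pre-munged lookup could be sketched roughly like this. It assumes the strings have already been munged (lower-cased, punctuation and filler words removed); the names `contains_words`, `find_matches` and `munged_b` are made up for illustration. One deliberate variation on a plain substring test: both strings are padded with spaces so that only whole words match, which keeps "1 brantwood" from matching "11 brantwood".

```python
def contains_words(a, b):
    """True if either munged string appears, as whole words, in the other."""
    if not a or not b:
        # An empty string is a substring of everything; never call that a match.
        return False
    padded_a, padded_b = f" {a} ", f" {b} "
    return padded_a in padded_b or padded_b in padded_a

def find_matches(query, munged_b):
    """Return the original Database B addresses whose munged form matches.

    munged_b is a list of (munged, original) pairs, computed once up front
    so the munging is not repeated on every lookup.
    """
    return [original for munged, original in munged_b
            if contains_words(query, munged)]

# Hypothetical pre-munged store for Database B.
munged_b = [
    ("beeches 1 brantwood", "THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS"),
    ("pinnacles whitesheet hill", "PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF"),
]

print(find_matches("1 brantwood", munged_b))
# prints: ['THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS']
```

A lookup for "pinnacles white sheet hill" returns an empty list here, which is the one case from the examples that the substring test misses.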
For more detailed matching, you might look into finding an algorithm to determine the "distance" between two strings and using that to score possible matches.
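One option that ships with Python is `difflib.SequenceMatcher`, whose `ratio()` gives a similarity score between 0.0 and 1.0. Strictly speaking that is a similarity rather than a distance, and it is only one of several possible measures. The sketch below assumes the strings are already munged; the `similarity` and `best_match` helpers and the 0.8 cutoff are illustrative guesses that would need tuning against real data.

```python
import difflib

def similarity(a, b):
    """Similarity score between two strings, from 0.0 (nothing shared) to 1.0 (identical)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def best_match(query, candidates, cutoff=0.8):
    """Return the highest-scoring candidate, or None if none reaches the cutoff.

    Assumption: 0.8 is an arbitrary starting threshold, not a recommended value.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: similarity(query, c))
    return best if similarity(query, best) >= cutoff else None

candidates = ["pinnacles whitesheet hill", "presbytery shortmoor", "2 bethany house"]
print(best_match("pinnacles white sheet hill", candidates))
# prints: pinnacles whitesheet hill
```

This picks up the "White Sheet" versus "WHITESHEET" case that the substring test cannot, at the cost of comparing the query against every candidate.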
Jeff Shannon
Technician/Programmer
Credit International
-- http://mail.python.org/mailman/listinfo/python-list