Andrew McLean wrote:

The problem is looking for good matches. I currently normalise the addresses to ignore some irrelevant issues like case and punctuation, but there are other issues.


I'd do a bit more extensive normalization. First, strip off everything from the city through the postal code (e.g. 'Beaminster, Dorset, DT8 3SS' in your examples). In the remaining string, remove any punctuation and words like "the", "flat", etc.
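A rough sketch of that normalization, in case it helps. The names here (normalise, NOISE_WORDS, the towns parameter) are my own inventions, and it assumes you already know which town names to cut at. It also lower-cases everything, since you're ignoring case anyway.

```python
import re

# Assumed list of words to drop; extend it to suit your data.
NOISE_WORDS = {"the", "flat"}


def normalise(address, towns=("beaminster",)):
    """Return a lower-cased, stripped-down form of a postal address.

    Everything from the last occurrence of a known town name onwards
    (town, county, postcode) is discarded. 'towns' is an assumption:
    you need some way of knowing where the locality part starts.
    """
    text = address.lower()
    for town in towns:
        cut = text.rfind(town)
        if cut != -1:
            text = text[:cut]
            break
    # Drop apostrophes outright so "john's" becomes "johns",
    # then turn every other run of punctuation into a single space.
    text = text.replace("'", "")
    words = re.sub(r"[^a-z0-9]+", " ", text).split()
    return " ".join(w for w in words if w not in NOISE_WORDS)
```

For example, normalise("THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS") gives "beeches 1 brantwood".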

Here are just some examples where the software didn't declare a match:

And how they'd look after the transformation I suggest above:

1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

1 Brantwood
BEECHES 1 BRANTWOOD

Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

2 Bethany House Broadwindsor Road
2 BETHANY HOUSE

Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

Penthouse Old Vicarage 1 Clay Lane
PENTHOUSE OLD VICARAGE 1 CLAY LANE

St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

St Johns Presbytery Shortmoor
PRESBYTERY SHORTMOOR

The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

Pinnacles White Sheet Hill
PINNACLES WHITESHEET HILL


Obviously, this is not perfect, but it's closer. At this point, you could perhaps say that if either string is a substring of the other (ignoring case), you have a match. That should work with all of these examples except the last one, where "White Sheet" and "WHITESHEET" differ by a space. You could either do this munging for all address lookups, or only for those that don't find a match in the simplistic way. Either way, you can store Database B's pre-munged addresses so that you don't need to constantly recompute them. I can't say for certain how many false positives this will produce, but I'd expect that it wouldn't be too bad.
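Something along these lines is what I have in mind for the substring test. It's only a sketch: is_match, find_matches and the dictionary of pre-munged Database B addresses are made-up names, and padding with spaces so that only whole words line up is my own tweak to stop "1 brantwood" matching "11 brantwood".

```python
def is_match(a, b):
    """True if either munged address is contained in the other.

    Both strings are lower-cased and padded with spaces so that the
    containment test only succeeds on whole-word boundaries.
    """
    a = " %s " % a.lower().strip()
    b = " %s " % b.lower().strip()
    return a in b or b in a


def find_matches(query, munged_b):
    """Return the keys of all Database B records matching 'query'.

    'munged_b' is assumed to map a record key to an address that was
    munged once, up front, so it is not recomputed for every lookup.
    """
    return [key for key, munged in munged_b.items() if is_match(query, munged)]
```

A lookup is then a linear scan over the stored munged strings, which should be fine unless Database B is very large.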


For more detailed matching, you might look into finding an algorithm to determine the "distance" between two strings and using that to score possible matches.
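One option that needs nothing outside the standard library is difflib. Note that SequenceMatcher gives a similarity ratio between 0 and 1 rather than a true edit distance, so treat this as one possible scoring scheme, not the only one. The function names and the 0.85 threshold are guesses of mine that you would need to tune against real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates, threshold=0.85):
    """Return (candidate, score) for the closest candidate, or None.

    'threshold' is an arbitrary cut-off chosen for illustration.
    """
    scored = [(similarity(query, c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored)
    if score < threshold:
        return None
    return candidate, score
```

With this, "pinnacles white sheet hill" and "pinnacles whitesheet hill" score very close to 1, so the last example above would be picked up.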

Jeff Shannon
Technician/Programmer
Credit International

