    Andrew> I'm 90% of the way there, in the sense that I have a simplistic
    Andrew> approach that matches 90% of the addresses in database A. But
    Andrew> the extra cases could be a pain to deal with!

Based upon the examples you gave, here are a couple of things you might try
to reduce the size of the difficult comparisons:

    * Remove "the" and commas as part of your normalization process

    * Split each address on white space and convert the resulting list to
      a set, then consider the size of the intersection with other
      addresses with the same postal code:

    >>> a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL".upper().replace(",", "")
    >>> a1
    "ST JOHN'S PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL"
    >>> a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL".upper().replace(",", "").replace("THE ", "")
    >>> a2
    'PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL'
    >>> a1 == a2
    False
    >>> sa1 = set(a1.split())
    >>> sa2 = set(a2.split())
    >>> len(sa1)
    8
    >>> len(sa2)
    6
    >>> len(sa1 & sa2)
    6

All six words of a2 also appear in a1, so the pair is a likely match even
though the normalized strings are not equal.

Skip
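
P.S. The two steps above could be wrapped up roughly as follows. This is only a sketch under a few assumptions of mine: the function names (normalize, postcode, overlap, best_match) are made up for illustration, the postal code is assumed to be the last two whitespace-separated tokens (as in "DT8 3EL"), and the score divides the intersection size by the size of the smaller set, which is one possible choice among several. Unlike the session above, it replaces commas with a space rather than deleting them, so "DORSET,DT8" would not fuse into one word, and it drops "THE" only as a whole word, so it cannot eat part of a longer word.

```python
def normalize(address):
    """Upper-case, treat commas as separators, drop the standalone word THE."""
    words = address.upper().replace(",", " ").split()
    return [w for w in words if w != "THE"]


def postcode(address):
    # Assumption: the postal code is the last two tokens, e.g. "DT8 3EL".
    return " ".join(normalize(address)[-2:])


def overlap(addr_a, addr_b):
    """Shared words divided by the size of the smaller word set (0.0 to 1.0)."""
    sa, sb = set(normalize(addr_a)), set(normalize(addr_b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def best_match(address, candidates):
    """Return (score, candidate) for the best same-postcode candidate, or None."""
    code = postcode(address)
    same_code = [c for c in candidates if postcode(c) == code]
    if not same_code:
        return None
    return max((overlap(address, c), c) for c in same_code)


if __name__ == "__main__":
    a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL"
    a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL"
    print(overlap(a1, a2))
    print(best_match(a1, [a2, "1 High Street, BEAMINSTER, DORSET, DT8 3EL"]))
```

A score near 1.0 means nearly every word of the shorter address appears in the longer one. You would still want a threshold and a manual look at the borderline cases, since two different buildings on the same street can share most of their words.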