    Andrew> I'm 90% of the way there, in the sense that I have a simplistic
    Andrew> approach that matches 90% of the addresses in database A. But
    Andrew> the extra cases could be a pain to deal with!

Based on the examples you gave, here are a couple of things you might try to
reduce the number of difficult comparisons:

    * Remove "the" and commas as part of your normalization process

    * Split each address on white space and convert the resulting list to a
      set, then consider the size of the intersection with other addresses
      with the same postal code:

    >>> a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL".upper().replace(",", "")
    >>> a1
    "ST JOHN'S PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL"
    >>> a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL".upper().replace(",", "").replace("THE ", "")
    >>> a2
    'PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL'
    >>> a1 == a2
    False
    >>> sa1 = set(a1.split())
    >>> sa2 = set(a2.split())
    >>> len(sa1)
    8
    >>> len(sa2)
    6
    >>> len(sa1 & sa2)
    6
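
If it helps, the two steps above could be wrapped up roughly like this. It is
only a sketch: normalize(), overlap() and best_match() are names I made up,
the score (shared words divided by the size of the smaller set) is just one
plausible choice, and best_match() assumes you have already narrowed the
candidates down to addresses with the same postal code. Filtering out "THE"
word by word, rather than with replace("THE ", ""), avoids touching words
that merely end in "THE".

```python
def normalize(address):
    """Upper-case, treat commas as separators, and drop the word THE."""
    words = address.upper().replace(",", " ").split()
    return [w for w in words if w != "THE"]


def overlap(addr_a, addr_b):
    """Fraction of the smaller word set that also appears in the other one."""
    sa, sb = set(normalize(addr_a)), set(normalize(addr_b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def best_match(address, candidates):
    """Return the candidate with the highest overlap score.

    Assumes candidates is non-empty and already restricted to addresses
    that share the postal code of address.
    """
    return max(candidates, key=lambda c: overlap(address, c))


a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL"
a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL"
print(overlap(a1, a2))
```

A score at or near 1.0 means nearly every word of the shorter address occurs
in the longer one, so those pairs are the ones worth accepting automatically
or checking first. Where to draw the cutoff is something you would have to
tune against your own data.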

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list
