"Andrew McLean" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
Thanks for all the suggestions. There were some really useful pointers.

A few random points:

1. Spending money is not an option; this is a 'volunteer' project. I'll try out some of the ideas over the weekend.

2. Someone commented that the data was suspiciously good quality. The data sources are both ones that you might expect to be authoritative. If you take having a correctly formatted and valid postcode as the metric, 100% of the records pass in one database and 99.96% in the other.

3. I've already noticed duplicate addresses in one of the databases.

4. You need to be careful doing an endswith search. It was actually my first approach to the house name issue. The problem is you end up matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".
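
To make the endswith pitfall in point 4 concrete, here is a minimal sketch. The function names are made up for illustration, and the boundary check is only one possible guard, not necessarily the one used in the project.

```python
def naive_match(short, full):
    # Plain suffix test: this is the approach that goes wrong.
    return full.endswith(short)


def boundary_match(short, full):
    # Accept the suffix only if it starts the string or follows a
    # character that is not a letter or digit (space, comma, ...).
    if not full.endswith(short):
        return False
    start = len(full) - len(short)
    return start == 0 or not full[start - 1].isalnum()


print(naive_match("2 Acacia Avenue", "12 Acacia Avenue"))     # True, a false positive
print(boundary_match("2 Acacia Avenue", "12 Acacia Avenue"))  # False
print(boundary_match("2 Acacia Avenue", "Rose Cottage, 2 Acacia Avenue"))  # True
```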

I am tempted to try an approach based on splitting the address into a sequence of normalised tokens, then working with a metric based on the differences between the sequences. The simple case would look at deleting tokens and perhaps concatenating tokens to make a match.
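
A rough sketch of the token idea, using difflib from the standard library as a stand-in for whatever metric ends up being used. The abbreviation table and the function names are assumptions for illustration; the concatenation case (joining adjacent tokens to force a match) is not handled here.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be much longer,
# and "st" is ambiguous (street or saint).
ABBREVIATIONS = {"ave": "avenue", "rd": "road", "st": "street"}


def tokens(address):
    # Lower-case, split on anything that is not a letter or digit,
    # then expand known abbreviations.
    words = re.findall(r"[a-z0-9]+", address.lower())
    return [ABBREVIATIONS.get(w, w) for w in words]


def similarity(a, b):
    # Ratio in [0, 1] computed over whole tokens, not characters.
    return SequenceMatcher(None, tokens(a), tokens(b), autojunk=False).ratio()


def token_edits(a, b):
    # List the non-matching steps (delete / insert / replace) needed
    # to turn the tokens of a into the tokens of b.
    ta, tb = tokens(a), tokens(b)
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    return [(tag, ta[i1:i2], tb[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


print(similarity("12 Acacia Ave.", "12 Acacia Avenue"))
print(token_edits("The Old Barn, 12 Acacia Avenue", "12 Acacia Avenue"))
```

Because the comparison is on whole tokens, "12" and "2" are simply different tokens, which sidesteps the endswith problem.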

It's been a while since I did this stuff. The trick in dealing with address normalization is to parse the string backwards, and try to slot the pieces into the general pattern.

In your case, the postal code, district (is that what it's called
in the UK?) and city seem to be fine; it's when you get to the
street (with or without house number), building name and flat
or room number that there's a difficulty.

We always had a list of keywords that could be trusted
to be delimiters. In your examples, "the" should be pretty
reliable in indicating a building name. Of course, that might
have some trouble with Tottering on the Brink.
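
Something along these lines, as a very rough sketch. It assumes comma-separated addresses with the postcode last and the city just before it, ignores the district, and uses made-up keyword lists and a simplified postcode pattern, so treat every name in it as hypothetical.

```python
import re

# Assumed, simplified UK postcode shape; not a full validator.
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")
# Hypothetical keyword lists, for illustration only.
STREET_WORDS = {"avenue", "road", "street", "lane", "close"}
UNIT_WORDS = {"flat", "room"}


def parse_address(address):
    parts = [p.strip() for p in address.split(",") if p.strip()]
    result = {}
    # Work from the end, where the format is most regular.
    if parts and POSTCODE.match(parts[-1].upper()):
        result["postcode"] = parts.pop()
    if parts:
        result["city"] = parts.pop()
    # Slot the remaining pieces into the pattern, still walking backwards.
    while parts:
        piece = parts.pop()
        words = piece.lower().split()
        if "street" not in result and words[-1] in STREET_WORDS:
            result["street"] = piece
        elif words[0] in UNIT_WORDS:
            result["unit"] = piece
        elif words[0] == "the":
            result["building"] = piece
        else:
            # Keep anything unrecognised, in its original order.
            result.setdefault("unparsed", []).insert(0, piece)
    return result


print(parse_address("Flat 2, The Old Mill, 12 Acacia Avenue, Tottering, ZZ9 9ZZ"))
```

Testing only the first word for "the" keeps a place name with "the" in the middle from being taken for a building, but a building name that doesn't start with "the" falls through to the unparsed pile.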

John Roth

--
Andrew McLean
