"Andrew McLean" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
Thanks for all the suggestions. There were some really useful pointers.
A few random points:
1. Spending money is not an option, this is a 'volunteer' project. I'll try out some of the ideas over the weekend.
2. Someone commented that the data was suspiciously good quality. The data sources are both ones that you might expect to be authoritative. If you take having a correctly formatted and valid postcode as the metric, 100% of the records in one database pass and 99.96% of the records in the other do.
3. I've already noticed duplicate addresses in one of the databases.
4. You need to be careful doing an endswith search. It was actually my first approach to the house name issue. The problem is you end up matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".
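To illustrate point 4, here is a minimal sketch of the endswith pitfall and one possible way around it, namely comparing whole tokens instead of raw characters. The function names and the house name "Rose Cottage" are made up for the example; this is not the code actually used in the project.

```python
def naive_suffix_match(full_address, fragment):
    # Plain string suffix test: it knows nothing about token boundaries.
    return full_address.endswith(fragment)


def token_suffix_match(full_address, fragment):
    # Compare whole whitespace-separated tokens, so "12" can never match "2".
    full_tokens = full_address.split()
    frag_tokens = fragment.split()
    if not frag_tokens:
        return False
    return full_tokens[-len(frag_tokens):] == frag_tokens


print(naive_suffix_match("12 Acacia Avenue", "2 Acacia Avenue"))   # True (the false match)
print(token_suffix_match("12 Acacia Avenue", "2 Acacia Avenue"))   # False
print(token_suffix_match("Rose Cottage, 2 Acacia Avenue", "2 Acacia Avenue"))  # True
```

The token version still treats "Avenue," and "Avenue" as different tokens, so punctuation would need to be normalised first.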
I am tempted to try an approach based on splitting the address into a sequence of normalised tokens. Then work with a metric based on the differences between the sequences. The simple case would look at deleting tokens and perhaps concatenating tokens to make a match.
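A rough sketch of what such a token-based metric might look like, using difflib.SequenceMatcher from the standard library over lists of tokens. The abbreviation table and the function names are assumptions for illustration only, and the sketch covers deleted or inserted tokens but not the concatenation case.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be much larger, and
# ambiguous entries (e.g. "st" for "street" or "saint") need more care.
ABBREVIATIONS = {"ave": "avenue", "rd": "road"}


def normalise_tokens(address):
    # Lowercase, drop punctuation, and expand a few known abbreviations.
    words = re.findall(r"[a-z0-9]+", address.lower())
    return [ABBREVIATIONS.get(w, w) for w in words]


def token_similarity(addr_a, addr_b):
    # Ratio in [0, 1], computed over token sequences rather than characters.
    a, b = normalise_tokens(addr_a), normalise_tokens(addr_b)
    return SequenceMatcher(None, a, b).ratio()


def token_edits(addr_a, addr_b):
    # List the token-level differences (deleted, inserted or replaced runs).
    a, b = normalise_tokens(addr_a), normalise_tokens(addr_b)
    ops = SequenceMatcher(None, a, b).get_opcodes()
    return [(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


print(token_similarity("12 Acacia Ave.", "12 Acacia Avenue"))
print(token_edits("Rose Cottage, 2 Acacia Avenue", "2 Acacia Avenue"))
```

Because whole tokens are compared, "12" and "2" count as a complete mismatch instead of a near match, which avoids the endswith problem from point 4.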
It's been a while since I did this stuff. The trick in dealing with address normalization is to parse the string backwards, and try to slot the pieces into the general pattern.
In your case, the postal code, district (is that what it's called in the UK?) and city seem to be fine; the difficulty starts when you get to the street (with or without house number), building name and flat or room number.
We always had a list of keywords that could be trusted to be delimiters. In your examples, "the" should be pretty reliable in indicating a building name. Of course, that might have some trouble with Tottering on the Brink.
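A minimal sketch of the backwards parse with keyword delimiters, assuming comma-separated fields with the postcode last and the city before it. The keyword lists, the loose postcode pattern, the field names and the sample address are all assumptions made up for the example, not anything from a real system.

```python
import re

# Hypothetical keyword lists; only "the" comes from the discussion above.
BUILDING_KEYWORDS = {"the"}
STREET_KEYWORDS = {"avenue", "road", "street", "lane"}

# Loose shape check for a UK-style postcode; not a full validator.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.I)


def parse_address(address):
    parts = [p.strip() for p in address.split(",") if p.strip()]
    result = {"postcode": None, "city": None, "street": None,
              "building": None, "other": []}
    # Work from the right-hand end, where the fields are most regular.
    if parts and POSTCODE_RE.match(parts[-1]):
        result["postcode"] = parts.pop()
    if parts:
        result["city"] = parts.pop()
    # Classify what is left by keyword, still walking backwards.
    while parts:
        part = parts.pop()
        words = part.lower().split()
        if result["street"] is None and words[-1] in STREET_KEYWORDS:
            result["street"] = part
        elif result["building"] is None and words[0] in BUILDING_KEYWORDS:
            result["building"] = part
        else:
            result["other"].insert(0, part)  # keep leftovers in original order
    return result


print(parse_address("The Old Mill, 2 Acacia Avenue, Anytown, AB1 2CD"))
```

Testing "the" only as the first word of a field, and taking the city by position before any keyword matching, keeps a name like Tottering on the Brink from being slotted in as a building.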
John Roth
--
Andrew McLean