Re: Fuzzy matching of postal addresses

2005-02-19 Thread Tim Churches
McBooCzech wrote: > Sorry for my "Ferbl typo". No offense was caused, I was just joking. > For the local anti-smoking campaign As a public health epidemiologist, that's the sort of application of our project I like to see! And judging by the reports by Martin Mcgee and colleagues at the European

Re: Fuzzy matching of postal addresses

2005-02-19 Thread McBooCzech
Sorry for my "Ferbl typo". For the local anti-smoking campaign I am trying to link some addresses which contain only the following "linkable" information (data fields): RECORD_ID, Street + No., City, Post code. All data are now w/o Unicode characters. Do you think it is possible to try to link it with

Re: Fuzzy matching of postal addresses

2005-02-17 Thread Tim Churches
McBooCzech wrote: > Tim, do you think Ferbel can parse properly with non English data-sets? The official name for the project is "Febrl" (freely-extensible biomedical record linkage) but perhaps "Furball" would be a better name, given its focus on fuzziness (if that is not a contradiction in terms).

Re: Fuzzy matching of postal addresses

2005-02-17 Thread McBooCzech
Tim, do you think Ferbel can parse properly with non-English data sets? I mean, do you think it will work properly with data that include non-English characters as well? As we live in Europe, we have to solve such problems here : If the software needs some changes, I am ready, according to yo

Re: Fuzzy matching of postal addresses

2005-01-23 Thread Joseph Turian
Andrew, > Basically, I have two databases containing lists of postal addresses and > need to look for matching addresses in the two databases. More > precisely, for each address in database A I want to find a single > matching address in database B. What percent of addresses in A have a unique co

Re: Fuzzy matching of postal addresses [1/1]

2005-01-23 Thread Andrew McLean
In article <[EMAIL PROTECTED]>, John Machin <[EMAIL PROTECTED]> writes
Andrew McLean wrote: In case anyone is interested, here is the latest.
def insCost(tokenList, indx, pos):
    """The cost of inserting a specific token at a specific normalised position along the sequence."""
    if containsNu

Re: Fuzzy matching of postal addresses [1/1]

2005-01-23 Thread John Machin
Andrew McLean wrote:
> In case anyone is interested, here is the latest.
> def insCost(tokenList, indx, pos):
>     """The cost of inserting a specific token at a specific normalised position along the sequence."""
>     if containsNumber(tokenList[indx]):
>         return INSERT_TOKEN_WITH_NUMBER

Re: Fuzzy matching of postal addresses

2005-01-19 Thread Aaron Bingham
Andrew McLean wrote: Thanks for all the suggestions. There were some really useful pointers. A few random points: [snip] 4. You need to be careful doing an endswith search. It was actually my first approach to the house name issue. The problem is you end up matching "12 Acacia Avenue, ..." with "
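The endswith caution in point 4 can be illustrated with a short sketch. The addresses here are hypothetical and are not the example from the truncated message. The function names are made up for illustration. The point is only that a raw string suffix test can pair different house numbers, while comparing whole tokens avoids that particular trap.

```python
def naive_suffix_match(a, b):
    """True if either normalised address string ends with the other."""
    return a.endswith(b) or b.endswith(a)


def token_suffix_match(a, b):
    """True only if the shorter address is a suffix of the longer one
    on whole-token boundaries (tokens are whitespace-separated)."""
    tokens_a, tokens_b = a.split(), b.split()
    if len(tokens_a) <= len(tokens_b):
        shorter, longer = tokens_a, tokens_b
    else:
        shorter, longer = tokens_b, tokens_a
    if not shorter:
        return False  # an empty address should never match anything
    return longer[len(longer) - len(shorter):] == shorter


# Hypothetical, already-normalised addresses.
print(naive_suffix_match("11 high street", "1 high street"))   # different houses
print(token_suffix_match("11 high street", "1 high street"))
print(token_suffix_match("rose cottage 11 high street", "11 high street"))
```

The token-based test still accepts a house name prefixed to an otherwise identical address, which seems to be the case the endswith idea was meant to catch.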

Re: Re: Fuzzy matching of postal addresses

2005-01-18 Thread Tim Churches
Andrew McLean <[EMAIL PROTECTED]> wrote: > > Thanks for all the suggestions. There were some really useful pointers. > > A few random points: > > 1. Spending money is not an option, this is a 'volunteer' project. I'll > try out some of the ideas over the weekend. > ... > I am tempted to try an

Re: Fuzzy matching of postal addresses

2005-01-18 Thread John Roth
"Andrew McLean" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] Thanks for all the suggestions. There were some really useful pointers. A few random points: 1. Spending money is not an option, this is a 'volunteer' project. I'll try out some of the ideas over the weekend. 2. Someone

Re: Fuzzy matching of postal addresses

2005-01-18 Thread Andrew McLean
Thanks for all the suggestions. There were some really useful pointers. A few random points: 1. Spending money is not an option, this is a 'volunteer' project. I'll try out some of the ideas over the weekend. 2. Someone commented that the data was suspiciously good quality. The data sources are

Re: Fuzzy matching of postal addresses

2005-01-18 Thread John Machin
John Machin wrote: > Ermmm ... only remove "the" when you are sure it is a whole word. Even > then it's a dodgy idea. In the first 1000 lines of the nearest address > file I had to hand, I found these: Catherine, Matthew, Rotherwood, > Weatherall, and "The Avenue". > Partial apologies: I wasn't r

Re: Fuzzy matching of postal addresses

2005-01-18 Thread [EMAIL PROTECTED]
I think you guys are missing the point. All you would need to do to get a 'probable match' is add another search that goes through the 10% that didn't get matched and does an "endswith" search on the data. From the example data you showed me, that would match a good 90% of the 10%, leaving you with a
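A minimal sketch of such a second pass, assuming the addresses have already been normalised to plain strings. The function name `endswith_pass` and the sample data are hypothetical, and the results are only candidate ("probable") matches that would still need checking.

```python
def endswith_pass(unmatched_a, addresses_b):
    """Second pass over addresses from A that found no exact match.

    Returns a dict mapping each unmatched A address to the list of B
    addresses where one string ends with the other.
    """
    probable = {}
    for a in unmatched_a:
        hits = [
            b for b in addresses_b
            # Skip empty strings: every string "ends with" the empty string.
            if a and b and (a.endswith(b) or b.endswith(a))
        ]
        if hits:
            probable[a] = hits
    return probable


# Hypothetical, already-normalised data.
unmatched_a = ["rose cottage 4 mill lane", "9 church road"]
addresses_b = ["4 mill lane", "7 church road"]
print(endswith_pass(unmatched_a, addresses_b))
```

An address that collects more than one hit is ambiguous and is probably best left for manual review.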

Re: Fuzzy matching of postal addresses

2005-01-18 Thread Simon Brunning
You might find these at least peripherally useful: They refer to address formatting rather than de-duping - but normalising sounds like a useful first step to me. -- Che

Re: Fuzzy matching of postal addresses

2005-01-18 Thread Aaron Bingham
Andrew McLean wrote: I have a problem that I suspect isn't unusual and I'm looking to see if there is any code available to help. I've Googled without success. Basically, I have two databases containing lists of postal addresses and need to look for matching addresses in the two databases. More

Re: Fuzzy matching of postal addresses

2005-01-17 Thread John Machin
You can't even get anywhere near 100% accuracy when comparing "authoritative sources" e.g. postal authority and the body charged with maintaining a database of which streets are in which electoral district -- no, not AUS, but close :-)

Re: Fuzzy matching of postal addresses

2005-01-17 Thread John Machin
Ermmm ... only remove "the" when you are sure it is a whole word. Even then it's a dodgy idea. In the first 1000 lines of the nearest address file I had to hand, I found these: Catherine, Matthew, Rotherwood, Weatherall, and "The Avenue". Ermmm... don't rip out commas (or other punctuation); repla
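The whole-word point can be shown with a small sketch using the names quoted above. The function names are made up for illustration, and the regular expression is just one way to express "whole word".

```python
import re

# Matches "the" only when it stands alone as a word, ignoring case.
WHOLE_WORD_THE = re.compile(r"\bthe\b", re.IGNORECASE)


def strip_the_naive(text):
    """Delete every occurrence of the letters 'the', even inside words."""
    return re.sub("the", "", text, flags=re.IGNORECASE)


def strip_the_whole_word(text):
    """Delete 'the' only as a whole word, then collapse whitespace."""
    return " ".join(WHOLE_WORD_THE.sub(" ", text).split())


for name in ["Catherine", "Matthew", "Rotherwood", "Weatherall", "The Avenue"]:
    print(name, "->", repr(strip_the_naive(name)), "/", repr(strip_the_whole_word(name)))
```

The naive version mangles "Catherine" into "Carine", while the whole-word version leaves it alone. Even the whole-word version turns "The Avenue" into "Avenue", which is presumably why the idea is called dodgy.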

Re: Fuzzy matching of postal addresses

2005-01-17 Thread Skip Montanaro
Andrew> I'm 90% of the way there, in the sense that I have a simplistic Andrew> approach that matches 90% of the addresses in database A. But Andrew> the extra cases could be a pain to deal with! Based upon the examples you gave, here are a couple things you might try to reduce the si

Re: Fuzzy matching of postal addresses

2005-01-17 Thread John Roth
"Andrew McLean" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] I have a problem that I suspect isn't unusual and I'm looking to see if there is any code available to help. I've Googled without success. There isn't any publicly available code that I'm aware of. Companies that do a

Re: Fuzzy matching of postal addresses

2005-01-17 Thread Tim Churches
Andrew McLean <[EMAIL PROTECTED]> wrote: > I have a problem that I suspect isn't unusual and I'm looking to > see if there is any code available to help. I've Googled without success. > > Basically, I have two databases containing lists of postal addresses and > need to look for matching addresses in the two databases. More > precisely, for each address in database A I want to find a single > matching addres

Re: Fuzzy matching of postal addresses

2005-01-17 Thread Jeff Shannon
Andrew McLean wrote: The problem is looking for good matches. I currently normalise the addresses to ignore some irrelevant issues like case and punctuation, but there are other issues. I'd do a bit more extensive normalization. First, strip off the city through postal code (e.g. 'Beaminster,
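A rough sketch of this kind of normalisation, under stated assumptions: the city name is known for each record, punctuation is replaced by spaces so that neighbouring words do not fuse, and the sample address is invented (only the town name comes from the message). The function names `normalise` and `strip_from_city` are hypothetical.

```python
import re


def normalise(address):
    """Lower-case, turn punctuation into spaces, collapse whitespace."""
    lowered = address.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def strip_from_city(address, city):
    """Normalise, then drop everything from the last whole-word
    occurrence of `city` onwards (city, county, postal code)."""
    norm = normalise(address)
    pattern = r"\b" + re.escape(normalise(city)) + r"\b"
    matches = list(re.finditer(pattern, norm))
    if not matches:
        return norm  # city not found: leave the address untouched
    return norm[:matches[-1].start()].strip()


# Invented address, for illustration only.
sample = "Flat 2, 5 Mill Lane, Beaminster, Dorset DT8 9ZZ"
print(normalise(sample))
print(strip_from_city(sample, "Beaminster"))
```

Cutting at the last occurrence of the city rather than the first is a guess at handling street names that happen to contain the town name.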

Re: Fuzzy matching of postal addresses

2005-01-17 Thread James Keasley
On 2005-01-18, Andrew McLean <[EMAIL PROTECTED]> wrote: > I have a problem that I suspect isn't unusual and I'm looking to see if > there is any code available to help. I've Googled without success. I have done something very similar (well, near as