Re: [Boston.pm] Postal address De-duping
> really, any perl programmer worth hiring should be able to do this while sleeping.

Really? Postal de-duping may be harder than you think. What's the regexp to convert or match FIELDS CORNER, BOSTON to DORCHESTER?

It's not just canonicalization of abbreviations and deciding whether to move apartments etc. into their own fields. The locality can be the city, the neighborhood, or the post office ... or any combination. The ZIP may be omitted, wrong, ZIP+4, or ZIP with the wrong +4.

Are Mrs. Williams-Smith and Mr. Smith the same household? If it's a single-family dwelling, or the apartment number is the same, yes. If no apartment numbers are given but the building is known to be multi-family, you can't assume the same family ... and not just because Smith is common: mother-in-law Cziwicz may be downstairs, as a separate mail patron, from daughter-in-law Smith-Cziwicz at the same address with no apartment numbers in use. The post office will know, as will the better-quality commercial de-duping services. But you won't.

Interesting if true: there may be TWO *correct* ZIP+4s for my address, one by block and one by carrier route conversion.

Bill
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
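The Fields Corner question has no regexp answer because the mapping is data, not pattern. A minimal sketch of the lookup-table approach, assuming a hand-maintained alias table: `%LOCALITY_ALIAS`, `canonical_locality` and `zip5` are hypothetical names, and the single table entry is only the example from the message above. Real alias data would have to come from the post office or a commercial vendor.

```perl
use strict;
use warnings;

# Hypothetical alias table: "LOCALITY, CITY" spelling => preferred name.
# Only the example from the thread is filled in.
my %LOCALITY_ALIAS = (
    'FIELDS CORNER, BOSTON' => 'DORCHESTER',
);

# Upper-case, tidy whitespace and commas, then look the result up.
# Unknown localities are returned in their tidied form.
sub canonical_locality {
    my ($locality) = @_;
    my $key = uc $locality;
    $key =~ s/^\s+|\s+$//g;     # trim both ends
    $key =~ s/\s+/ /g;          # collapse runs of whitespace
    $key =~ s/\s*,\s*/, /g;     # one comma style
    return exists $LOCALITY_ALIAS{$key} ? $LOCALITY_ALIAS{$key} : $key;
}

# Reduce a ZIP or ZIP+4 to its five-digit part, since the +4 may be wrong.
# Returns undef when the ZIP is missing or malformed.
sub zip5 {
    my ($zip) = @_;
    return undef unless defined $zip;
    return $zip =~ /^\s*(\d{5})(?:-?\d{4})?\s*$/ ? $1 : undef;
}
```

Neither sub helps with the household question; that needs knowledge of the building that is not in the address string at all.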
Re: [Boston.pm] Postal address De-duping
hi

( 03.08.04 17:12 -0400 ) Joel Gwynn:
> we're looking for a fast, customizable de-duping solution. I was thinking there might be some perl stuff out there,

really, any perl programmer worth hiring should be able to do this while sleeping.

--
\js
Re: [Boston.pm] Postal address De-duping
On Tuesday, August 5, 2003, at 09:07 AM, John Saylor wrote:
> hi
>
> ( 03.08.04 17:12 -0400 ) Joel Gwynn:
>> we're looking for a fast, customizable de-duping solution. I was thinking there might be some perl stuff out there,
>
> really, any perl programmer worth hiring should be able to do this while sleeping.

OT, but now you've got my interest piqued. What is de-duping?

Thanks,
Ricky
Re: [Boston.pm] Postal address De-duping
On 8/5/03 11:21 AM, Tolkin, Steve mentioned:
> The article in question can be found at http://www.foo.be/docs/tpj/issues/vol4_1/tpj0401-0002.html (I had a hard time finding it via tpj.com, but Google worked.)
>
> Unfortunately I think that the USPS site http://www.usps.com/cgi-bin/zip4/zip4inq needed to run this script is no more. A search there for zip4inq produced nothing. Does anyone know of a similar page, either by the USPS or another provider of (web) services?

FYI, the Terms of Service for the USPS's website prohibit screen scraping. I don't think they'd care (much) if you made a small number of requests, but if you're talking thousands of addresses you should be careful. They also have some tools for doing this sort of thing, but I don't believe they are free. Their response to my specific request (batch processing of addresses) is below, but I believe they have some other tools which may be of use to you.

At 10:58 AM 9/23/02 -0400, the USPS Web Tools team wrote:
> Address Information APIs are accessible with special permission. We must first understand how you'll be using the API. We need a commitment that the API will be used on a transactional basis (not batch processing or cleansing of a database, but as a customer enters the information into a form on a website). Also, you must state that you will use the output from this API solely in association with USPS services (mail or shipping).
>
> Please describe how you plan to use the APIs, include the URL of the site (development or production) and state your commitment to use the APIs in association with USPS services. Once this email is received, we will give you access to the documentation.
>
> I hope this helps,
> Jay Torg
> USPS Web Tools

Drew
--
Drew Taylor         * Web app development consulting
[EMAIL PROTECTED]   * Site implementation hosting
www.drewtaylor.com  * perl/mod_perl/DBI/mysql/postgres
Re: [Boston.pm] Postal address De-duping
You may find more useful information as a registered USPS developer: http://www.USPSPriorityMail.com/et_regcert.html

If you don't want to register before you get more answers, read carefully through their web tools documents, available here: http://www.uspswebtools.com/

I believe the most relevant information for your problem can be found here: http://www.usps.com/ncsc/addressinfo/addressinfomenu.htm

There is a lot of data, and it can be obtained on CD-ROM for a fee. I don't remember how large the fee is, but similar data is available through other sources. This information applies to US zip codes only. Similar information is available for other countries, but compiling postal code information and addressing rules for the whole world is a monumental task.

There are various modules on CPAN that take good stabs at standardizing postal addresses. However, you will find the results woefully inadequate for the purpose of deduping. I think you'll find that getting your project to even 95% certainty is practically impossible. Even companies like TransUnion and Experian, whose primary purpose is to give a potential creditor an accurate view of your credit based on easily identifiable information like your address and full name, have difficulty doing this. Recently, my local electric company charged me a security deposit because they could not verify my relationship to my social security number through a credit agency. The problem, it turned out, was that they could not match "Lipton, Daniel" to "Daniel M. Lipton".

My advice is to use DoubleTake if the licensing fee is not prohibitive. If you don't like it, you can always attempt something different in Perl. :-)

--
Daniel M. Lipton

Andrew Pimlott wrote:
> On Tue, Aug 05, 2003 at 11:21:25AM -0400, Tolkin, Steve wrote:
>> Unfortunately I think that the USPS site http://www.usps.com/cgi-bin/zip4/zip4inq needed to run this script is no more. A search there for zip4inq produced nothing. Does anyone know of a similar page, either by the USPS or another provider of (web) services?
>
> Just follow the Find a Zip Code link from http://www.usps.com/ ?
>
> Andrew
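The "Lipton, Daniel" versus "Daniel M. Lipton" mismatch is the kind of thing a crude name key can paper over. This is only a sketch under assumptions that are not from the thread: a comma means "Last, First", and everything between the first and last word can be ignored. `name_key` is a hypothetical helper, and it will happily merge two different people who share a first and last name.

```perl
use strict;
use warnings;

# Reduce a personal name to a lower-case "first last" comparison key.
# Assumption: "Last, First ..." when a comma is present; middle names and
# initials are discarded.
sub name_key {
    my ($name) = @_;
    $name = lc $name;
    if ( $name =~ /^\s*([^,]+?)\s*,\s*(.+?)\s*$/ ) {
        $name = "$2 $1";                # "lipton, daniel" -> "daniel lipton"
    }
    $name =~ s/[^a-z\s'-]//g;           # drop dots and other punctuation
    my @parts = grep { length } split /\s+/, $name;
    return '' unless @parts;
    return @parts > 1 ? "$parts[0] $parts[-1]" : $parts[0];
}

# Both spellings from the message reduce to the same key.
print name_key('Lipton, Daniel') eq name_key('Daniel M. Lipton')
    ? "same key\n"
    : "different keys\n";
```

A key like this is best used to propose candidate matches for review, not to merge records automatically.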
Re: [Boston.pm] Postal address De-duping
Actually, if I understand what Joel was asking about, removing duplicates by address is a non-trivial task -- address data is notoriously dirty. What makes the job interesting is that there are a wide variety of abbreviations used in addresses -- for example:

    22 Saint John Street
    22 St John St
    22 Saint John St
    22 St John Street

So, if you have addresses from multiple sources, and you want to find duplicates between (or within) those sources, you have to find a way to standardize the addresses (i.e. I believe the USPS standardizes on the abbreviated form rather than the expanded form). You also have to parse the first and second address lines apart -- sometimes these are included on the same line on one of the duplicates, while they might be separate entries (address1 and address2) on another potential duplicate. I have often seen datasets where city, state and zip are mistakenly included in the address1 or address2 line. All of these issues make it difficult to take addresses from separate datasets and compare them to find duplicates.

We have used the following method for standardizing addresses -- it catches the majority of the standardization issues (though it does not check for address2 in address1, or the presence of city, state and zip):

    sub standardize_address {
        my $address = shift;

        # Prefixes and directionals in the middle of the line.
        if ( $address =~ /(.*)\sMt\.?\s(.*)/i ) { $address = $1 . ' Mount ' . $2 }
        # Match N/Nth and S/Sth only, so that "St" is never read as South.
        if ( $address =~ /(.*)\sN(?:th)?\.?\s(.*)/i ) { $address = $1 . ' North ' . $2 }
        if ( $address =~ /(.*)\sS(?:th)?\.?\s(.*)/i ) { $address = $1 . ' South ' . $2 }
        if ( $address =~ /(.*)\sE\.?\s(.*)/i ) { $address = $1 . ' East ' . $2 }
        if ( $address =~ /(.*)\sW\.?\s(.*)/i ) { $address = $1 . ' West ' . $2 }
        if ( $address =~ /(.*)\sU\.?\s(.*)/i ) { $address = $1 . ' Upper ' . $2 }
        if ( $address =~ /(.*)\sL\.?\s(.*)/i ) { $address = $1 . ' Lower ' . $2 }
        if ( $address =~ /(.*)p\.?\s?o\.? box\s(.*)/i ) { $address = $1 . 'P.O. Box ' . $2 }

        # Street-type suffixes (the last occurrence on the line is expanded).
        if ( $address =~ /(.*)\sSt\b\.?(\s*.*)/i ) { $address = $1 . ' Street' . $2 }
        if ( $address =~ /(.*)\sRd\b\.?(\s*.*)/i ) { $address = $1 . ' Road' . $2 }
        if ( $address =~ /(.*)\sLa\b\.?(\s*.*)/i ) { $address = $1 . ' Lane' . $2 }
        if ( $address =~ /(.*)\sAve\b\.?(\s*.*)/i ) { $address = $1 . ' Avenue' . $2 }
        if ( $address =~ /(.*)\sHwy\b\.?(\s*.*)/i ) { $address = $1 . ' Highway' . $2 }
        $address =~ s/\bDr\.?\b/Drive/ig;
        $address =~ s/\bDrive\./Drive/g;    # clean up "Dr." -> "Drive."

        $address =~ s/#//g;                 # drop unit markers
        return $address;
    }

HTH.
-Chris

John Saylor wrote:
> hi
>
> ( 03.08.04 17:12 -0400 ) Joel Gwynn:
>> we're looking for a fast, customizable de-duping solution. I was thinking there might be some perl stuff out there,
>
> really, any perl programmer worth hiring should be able to do this while sleeping.

--
Chris Brooks
VP, Technology
carescout.com
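The four "22 St John" spellings above show why "St" is awkward: it abbreviates both Saint and Street. One way to collapse them to a single comparison key is a positional rule, sketched below. This is an illustration, not the method from the message: `%SUFFIX` and `street_key` are hypothetical names, and the rule (a suffix abbreviation is only expanded as the last word, any earlier "St" means Saint) is an assumption that breaks as soon as a unit or direction follows the street type.

```perl
use strict;
use warnings;

# Suffix abbreviations, applied only to the last word of the street line.
my %SUFFIX = (
    st  => 'Street',
    rd  => 'Road',
    la  => 'Lane',
    ave => 'Avenue',
    hwy => 'Highway',
    dr  => 'Drive',
);

# Build a lower-case comparison key for a street line.
sub street_key {
    my ($line) = @_;
    my @words = split ' ', $line;
    s/\.$// for @words;                        # "St." -> "St"
    if ( @words and exists $SUFFIX{ lc $words[-1] } ) {
        $words[-1] = $SUFFIX{ lc $words[-1] };
    }
    # Assumption: an "St" that is not the last word means "Saint".
    for my $i ( 0 .. $#words - 1 ) {
        $words[$i] = 'Saint' if lc $words[$i] eq 'st';
    }
    return lc( join ' ', @words );
}

# The four variants from the message, one key per line.
print street_key($_), "\n"
    for '22 Saint John Street', '22 St John St',
        '22 Saint John St',     '22 St John Street';
```

Comparing keys rather than rewriting the stored address keeps the original data intact, which matters when the rule guesses wrong.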
Re: [Boston.pm] Postal address De-duping
On Tue, Aug 05, 2003 at 11:21:25AM -0400, Tolkin, Steve wrote:
> Unfortunately I think that the USPS site http://www.usps.com/cgi-bin/zip4/zip4inq needed to run this script is no more. A search there for zip4inq produced nothing. Does anyone know of a similar page, either by the USPS or another provider of (web) services?

Just follow the Find a Zip Code link from http://www.usps.com/ ?

Andrew
Re: [Boston.pm] Postal address De-duping
On Tuesday, August 5, 2003, at 09:07 AM, John Saylor wrote:
> really, any perl programmer worth hiring should be able to do this while sleeping.

No, it's quite a hard problem. All of the following UK addresses are the same and are deliverable.

    2/11 CR7 8JH
    11b CR7 8JH
    Flat 2, 11 Beulah Road, CR7 8JH *
    11 Beulah Road, Flat 2, CR7 8JH
    11b Beulah Road, CR7 8JH *
    Cantrell, CR7 8JH

Now, they're not all recommended - I've asterisked the two that the post office like - but they do all work. Now factor in all the people who can't spell my name right, or the name of the road, or get a character wrong in the post code, and yet my mail still arrives.

The following are all the same too:

    1 London
    Apsley House, Duke of Wellington Place, W1
    W1J 7NT

--
David Cantrell | Degenerate | http://www.cantrell.org.uk/david

While researching this email, I was forced to carry out some investigative work which unfortunately involved a bucket of puppies and a belt sander
  -- after JoeB, in the Monastery
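Every CR7 8JH variant above carries the same post code, which suggests using the post code to group candidate duplicates before any closer comparison. A rough sketch, assuming a simplified pattern that is not the full Royal Mail format; `uk_postcode` is a hypothetical helper. It does nothing for a mistyped post code, or for an address like "1 London" that has none.

```perl
use strict;
use warnings;

# Pull a UK-style post code out of a free-form address line and normalise it
# to upper case with a single space. Simplified pattern (an assumption):
# one or two letters, a digit, an optional letter or digit, then
# digit-letter-letter. Returns undef when nothing post-code-like is found.
sub uk_postcode {
    my ($address) = @_;
    return $address =~ /\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b/i
        ? uc("$1 $2")
        : undef;
}

for my $address ( '2/11 CR7 8JH', 'Flat 2, 11 Beulah Road, CR7 8JH', '1 London' ) {
    my $code = uk_postcode($address);
    printf "%-35s %s\n", $address, defined $code ? $code : '(no post code)';
}
```

A post code covers several delivery points, so a shared key only says two records are worth comparing; the flat number and the name still have to agree.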
Re: [Boston.pm] Postal address De-duping
>> Unfortunately I think that the USPS site http://www.usps.com/cgi-bin/zip4/zip4inq needed to run this script is no more. A search there for zip4inq produced nothing. Does anyone know of a similar page, either by the USPS or another provider of (web) services?
>
> Just follow the Find a Zip Code link from http://www.usps.com/ ?

According to http://www.usps.com/zip4/zipfaq.htm, their ZIP+4 Lookup is intended for interactive use, not automated script processing.

However, it looks like the postal service does have electronic zip directories:

http://www.usps.com/ncsc/ziplookup/amsdev.htm

As for the general problem of cleaning postal addresses, it looks like there are commercial packages out there, and a whole certification process for them:

http://www.usps.com/ncsc/ziplookup/cam.htm

--
Steve Revilak
Re: [Boston.pm] Postal address De-duping
hi

On Tuesday, August 5, 2003, at 09:07 AM, John Saylor wrote:
> really, any perl programmer worth hiring should be able to do this while sleeping.

( 03.08.05 19:21 +0100 ) David Cantrell:
> No, it's quite a hard problem.

i guess it depends on the way the problem is defined by the client. as you might have guessed, the problem i was thinking of was considerably simpler than the example you outlined in your response.

--
\js
RE: [Boston.pm] Postal address De-duping
The article in question can be found at http://www.foo.be/docs/tpj/issues/vol4_1/tpj0401-0002.html (I had a hard time finding it via tpj.com, but Google worked.)

Unfortunately I think that the USPS site http://www.usps.com/cgi-bin/zip4/zip4inq needed to run this script is no more. A search there for zip4inq produced nothing. Does anyone know of a similar page, either by the USPS or another provider of (web) services?

Hopefully helpfully yours,
Steve
--
Steven Tolkin    steve . tolkin at fmr dot com    617-563-0516
Fidelity Investments    82 Devonshire St. V4D    Boston MA 02109
There is nothing so practical as a good theory. Comments are by me, not Fidelity Investments, its subsidiaries or affiliates.

-----Original Message-----
From: Jon Orwant [mailto:[EMAIL PROTECTED]
Sent: Monday, August 04, 2003 6:15 PM
To: Joel Gwynn
Cc: [EMAIL PROTECTED]
Subject: Re: [Boston.pm] Postal address De-duping

On Monday, August 4, 2003, at 05:12 PM, Joel Gwynn wrote:
> Hey, all. We do lots of (snail) mailings, and we're looking for a fast, customizable de-duping solution. We're currently taking a look at doubletake from http://peoplesmith.com/, which is not too expensive, but I was thinking there might be some perl stuff out there, given perl's text-processing powers.

There's a wee script I wrote for TPJ a while back that scrapes the U.S. Postal Service's address canonicalizer. The script is on tpj.com; look under Archives for the article called Five Quick Hacks.

The canonicalizer (well, they call it a zip code locator or something like that) will transform variants on the same address into the One True Address that the USPS recognizes, so de-duping then becomes a matter of simple string matching. Won't help you for foreign addresses, obviously.

-Jon
Re: [Boston.pm] Postal address De-duping
On Monday, August 4, 2003, at 05:12 PM, Joel Gwynn wrote:
> Hey, all. We do lots of (snail) mailings, and we're looking for a fast, customizable de-duping solution. We're currently taking a look at doubletake from http://peoplesmith.com/, which is not too expensive, but I was thinking there might be some perl stuff out there, given perl's text-processing powers.

There's a wee script I wrote for TPJ a while back that scrapes the U.S. Postal Service's address canonicalizer. The script is on tpj.com; look under Archives for the article called Five Quick Hacks.

The canonicalizer (well, they call it a zip code locator or something like that) will transform variants on the same address into the One True Address that the USPS recognizes, so de-duping then becomes a matter of simple string matching. Won't help you for foreign addresses, obviously.

-Jon
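Once every address has been reduced to one canonical string, the "simple string matching" step is a hash lookup. A minimal sketch of that step only: `dedupe` is a hypothetical helper, and the `$canon` code ref is a placeholder for whatever canonicalizer is available. It is not the TPJ script and it does not talk to any USPS service; the stand-in used below merely lower-cases and collapses whitespace.

```perl
use strict;
use warnings;

# Keep the first record seen for each canonical address string.
# $canon is a code ref that maps a record to its canonical form.
sub dedupe {
    my ( $canon, @records ) = @_;
    my ( %seen, @unique );
    for my $record (@records) {
        my $key = $canon->($record);
        push @unique, $record unless $seen{$key}++;
    }
    return @unique;
}

# Stand-in canonicalizer (an assumption for the demo, far weaker than a
# real one): lower-case and collapse whitespace.
my $canon = sub { lc( join ' ', split ' ', $_[0] ) };

my @unique = dedupe( $canon, '22 Saint John St', '22  saint john st', '5 Main St' );
print "$_\n" for @unique;
```

All the difficulty raised elsewhere in the thread lives inside `$canon`; the de-duplication itself stays this small.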