Re: [Boston.pm] Postal address De-duping

2003-08-14 Thread Bill N1VUX
> really, any perl programmer worth hiring should be able to do this while
> sleeping.

Really?  Postal de-duping may be harder than you think.  


What's the regexp to convert or match
  FIELDS CORNER, BOSTON 
to 
  DORCHESTER 
?

It's not just canonicalization of abbreviations, or deciding whether to move
apartments etc. to their own fields.  The locality can be the city, the
neighborhood, or the post office ... or any combination.  The Zip may be
omitted, wrong, Zip+4, or Zip-with-wrong-+4.  Are Mrs. Williams-Smith and Mr.
Smith the same household?  If it's a single-family dwelling, or the same
apartment number, yes; if no apartment numbers are given but it's known to be
a multi-family building, you can't assume the same family ... and not just
because Smith is common: mother-in-law Cziwicz may be downstairs, as a
separate mail patron, from daughter-in-law Smith-Cziwicz at the same address
with no apartment numbers in use.  The post office will know, as will the
better commercial de-duping services.  But you won't.
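
A locality alias like the one above is a lookup problem rather than a
pattern-matching one. A minimal sketch, assuming a hand-built alias table
(the %locality_alias hash and canonical_locality() are hypothetical names;
only the Fields Corner entry comes from the example above, and a real table
would have to come from postal data):

```perl
use strict;
use warnings;

# Hypothetical alias table: locality as written => preferred post-office name.
# Only this one entry is taken from the example in the message.
my %locality_alias = (
    'FIELDS CORNER, BOSTON' => 'DORCHESTER',
);

# Uppercase, trim, and tidy spacing so the lookup ignores case and stray
# whitespace, then map through the alias table when an entry exists.
sub canonical_locality {
    my ($locality) = @_;
    my $key = uc $locality;
    $key =~ s/^\s+|\s+$//g;    # trim both ends
    $key =~ s/\s*,\s*/, /g;    # exactly one space after each comma
    $key =~ s/\s+/ /g;         # collapse runs of whitespace
    return exists $locality_alias{$key} ? $locality_alias{$key} : $key;
}

print canonical_locality('Fields Corner,  Boston'), "\n";
print canonical_locality('dorchester'), "\n";
```

Anything not in the table passes through uppercased, so the table can grow
one alias at a time without breaking existing keys.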

Interesting if true, there may be TWO *correct* Zip+4's for my address, one by
block and one by carrier route conversion.
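
If the +4 can be wrong, or can legitimately take more than one value, one
conservative option is to compare only the five-digit base. A rough sketch
(zip5() is a hypothetical helper, and it only checks the shape of the value,
not whether the ZIP actually exists):

```perl
use strict;
use warnings;

# Return the five-digit base of a ZIP or ZIP+4 string.
# Returns undef (in scalar context) when the value is missing or does not
# look like a ZIP code at all.
sub zip5 {
    my ($zip) = @_;
    return unless defined $zip;
    return $1 if $zip =~ /^\s*(\d{5})(?:[-\s]?\d{4})?\s*$/;
    return;
}

for my $raw ( '02122', '02122-1234', '02122 1234', '2122' ) {
    my $base = zip5($raw);
    printf "%-12s => %s\n", $raw, defined $base ? $base : '(unusable)';
}
```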

Bill
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Postal address De-duping

2003-08-14 Thread John Saylor
hi

( 03.08.04 17:12 -0400 ) Joel Gwynn:
> we're looking for a fast, customizable de-duping solution.
> I was thinking there might be some perl stuff out there,

really, any perl programmer worth hiring should be able to do this while
sleeping.

-- 
\js



Re: [Boston.pm] Postal address De-duping

2003-08-14 Thread Richard Morse
On Tuesday, August 5, 2003, at 09:07 AM, John Saylor wrote:

> hi
>
> ( 03.08.04 17:12 -0400 ) Joel Gwynn:
>> we're looking for a fast, customizable de-duping solution.
>> I was thinking there might be some perl stuff out there,
>
> really, any perl programmer worth hiring should be able to do this
> while sleeping.

OT, but now you've got my interest piqued.  What is de-duping?

Thanks,
Ricky


Re: [Boston.pm] Postal address De-duping

2003-08-14 Thread Drew Taylor
Tolkin, Steve mentioned on 8/5/03 11:21 AM:

> The article in question can be found at
> http://www.foo.be/docs/tpj/issues/vol4_1/tpj0401-0002.html
> (I had a hard time finding it via tpj.com, but Google worked.)
> Unfortunately I think that the USPS site 
> http://www.usps.com/cgi-bin/zip4/zip4inq
> needed to run this script is no more.  
> A search there for zip4inq produced nothing.
>
> Does anyone know of a similar page, either by the USPS or
> another provider of (web) services?

FYI, the Terms of Service for the USPS's website prohibit use of screen 
scraping. I don't think they'd care (much) if you made a small number of 
requests, but if you're talking thousands of addresses you should be 
careful. They also have some tools for doing this sort of thing, but I 
don't believe they are free. Their response to my specific request 
(batch processing of addresses) is below, but I believe they have some 
other tools which may be of use to you.

At 10:58 AM 9/23/02 -0400, the USPS Web Tools team wrote:

> Address Information APIs are accessible with special permission. We
> must first understand how you'll be using the API.  We need a commitment
> that the API will be used on a transactional basis (not batch
> processing or cleansing of a database, but as a customer enters the
> information into a form on a website).  Also, you must state that you
> will use the output from this API solely in association with USPS
> services (mail or shipping).
> Please describe how you plan to use the APIs, include the URL of the
> site (development or production) and state your commitment to use the
> APIs in association with USPS services. Once this email is received, we
> will give you access to the documentation.
>
> I hope this helps,
>
> Jay Torg
> USPS Web Tools

Drew
--
-
Drew Taylor  *  Web app development  consulting
[EMAIL PROTECTED]  *  Site implementation  hosting
www.drewtaylor.com   *  perl/mod_perl/DBI/mysql/postgres
-


Re: [Boston.pm] Postal address De-duping

2003-08-14 Thread Daniel M. Lipton
You may find more useful information as a registered USPS developer:

http://www.USPSPriorityMail.com/et_regcert.html

If you don't want to register before you get more answers, read 
carefully through their web tools documents available here:

http://www.uspswebtools.com/

I believe the most relevant information to your problem can be found here:

http://www.usps.com/ncsc/addressinfo/addressinfomenu.htm

There is a lot of data, and it can be obtained on CD-ROM for a fee.  I don't 
remember how large the fee is, but similar data is available through 
other sources.

This information applies to US zip codes only.

Similar information is available for other countries, but compiling 
postal code information and addressing rules for the whole world is a 
monumental task.

There are various modules on CPAN that take good stabs at standardizing 
postal addresses.  However, you will find the results will be woefully 
inadequate for the purpose of deduping.

I think you'll find that getting your project to even 95% certainty is 
practically impossible.  Even companies like TransUnion and Experian, 
whose primary purpose is to give a potential creditor an accurate view 
of your credit based on easily identifiable information like your 
address and full name, have difficulty doing this.  Recently, my local 
electric company charged me a security deposit because they could not 
verify my relationship to my social security number through a credit 
agency.  The problem, it turned out, was they could not match Lipton, 
Daniel to Daniel M. Lipton.
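
The mismatch described above ("Lipton, Daniel" against "Daniel M. Lipton") is
the kind of thing a crude name key can catch. A sketch only, assuming Western
name order; name_key() is a hypothetical helper, and by throwing away
everything between the first and last token it will also merge people who
really are different (Daniel M. and Daniel J., for instance):

```perl
use strict;
use warnings;

# Reduce a personal name to a lowercase "first last" key.
# Handles "Last, First ..." order, strips periods, and ignores any middle
# names or initials. Suffixes such as "Jr" are NOT handled.
sub name_key {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;

    # "Lipton, Daniel" => "Daniel Lipton"
    if ( $name =~ /^([^,]+),\s*(.+)$/ ) {
        $name = "$2 $1";
    }

    $name =~ tr/.//d;                   # "M." => "M"
    my @parts = split ' ', lc $name;    # split on any whitespace
    return ''        unless @parts;
    return $parts[0] if @parts == 1;
    return "$parts[0] $parts[-1]";      # first and last token only
}

print name_key('Lipton, Daniel'),   "\n";
print name_key('Daniel M. Lipton'), "\n";
```

Both calls produce the same key, so a comparison on name_key() output would
treat the two spellings as one person.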

My advice is to use DoubleTake if the licensing fee is not prohibitive.  
If you don't like it, you can always attempt something different in 
Perl.  :-)

--
Daniel M. Lipton




Re: [Boston.pm] Postal address De-duping

2003-08-14 Thread Chris Brooks
Actually, if I understand what Joel was asking about, removing 
duplicates by address is a non-trivial task -- address data is 
notoriously dirty.  What makes the job interesting is that there are a 
wide variety of abbreviations used in addresses -- for example:

22 Saint John Street
22 St John St
22 Saint John St
22 St John Street

So, if you have addresses from multiple sources, and you want to find 
duplicates between (or within) those sources, you have to find a way to 
standardize the addresses (I believe the USPS standardizes on the 
abbreviated form rather than the expanded form).

You also have to parse the first and second address lines apart -- sometimes 
these are included on the same line on one of the duplicates, while 
they might be separate entries (address1 and address2) on another 
potential duplicate.  I have often seen datasets where city, state and 
zip are mistakenly included in the address1 or address2 line.

All of these issues make it difficult to take addresses from separate 
datasets and compare them to find duplicates.
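
The address1/address2 problem described above can be partly handled by
peeling a trailing unit designator off the street line. A minimal sketch;
split_unit() and its short list of designators (Apt, Apartment, Unit, Suite,
Ste, #) are assumptions for illustration, not a complete list, and a unit
that appears anywhere other than the end of the line is left alone:

```perl
use strict;
use warnings;

# Split a trailing secondary unit off a street line.
# Returns (address1, address2); address2 is '' when no unit is found.
sub split_unit {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;

    # Lazy prefix, then a separator, then a designator plus its identifier
    # anchored at the end of the line. The \b keeps "Ste" from matching the
    # start of a word such as "Stevens".
    if ( $line =~ /^(.*?)[\s,]+((?:apt|apartment|unit|suite|ste)\b\.?\s*\w+|#\s*\w+)$/i ) {
        return ( $1, $2 );
    }
    return ( $line, '' );
}

for my $raw ( '22 Saint John Street, Apt 3B', '10 Elm St #4', '5 Stevens Rd' ) {
    my ( $street, $unit ) = split_unit($raw);
    print "[$street] [$unit]\n";
}
```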

We have used the following method for standardizing addresses -- it 
catches the majority of the standardization issues (though it does not 
check for address2 in address1, or the presence of city, state and zip):

sub standardize_address {
   my $address = shift;

   # Expand directional and prefix abbreviations.  The leading (.*) is
   # greedy, so each rule rewrites only the last match in the string.
   if ( $address =~ /(.*)\sMt\.?\s(.*)/i ) { $address = $1 . ' Mount ' . $2 }
   # N/Nth and S/Sth only: letting "St" match here would turn
   # "22 St John St" into "22 South John St".
   if ( $address =~ /(.*)\sN(?:th)?\.?\s(.*)/i ) { $address = $1 . ' North ' . $2 }
   if ( $address =~ /(.*)\sS(?:th)?\.?\s(.*)/i ) { $address = $1 . ' South ' . $2 }
   if ( $address =~ /(.*)\sE\.?\s(.*)/i ) { $address = $1 . ' East ' . $2 }
   if ( $address =~ /(.*)\sW\.?\s(.*)/i ) { $address = $1 . ' West ' . $2 }
   if ( $address =~ /(.*)\sU\.?\s(.*)/i ) { $address = $1 . ' Upper ' . $2 }
   if ( $address =~ /(.*)\sL\.?\s(.*)/i ) { $address = $1 . ' Lower ' . $2 }
   if ( $address =~ /(.*)p\.?\s?o\.? box\s(.*)/i ) { $address = $1 . 'P.O. Box ' . $2 }

   # Expand street-type suffixes (again, last match only).
   if ( $address =~ /(.*)\sSt\b\.?(\s*.*)/i ) { $address = $1 . ' Street' . $2 }
   if ( $address =~ /(.*)\sRd\b\.?(\s*.*)/i ) { $address = $1 . ' Road' . $2 }
   if ( $address =~ /(.*)\sLa\b\.?(\s*.*)/i ) { $address = $1 . ' Lane' . $2 }
   if ( $address =~ /(.*)\sAve\b\.?(\s*.*)/i ) { $address = $1 . ' Avenue' . $2 }
   if ( $address =~ /(.*)\sHwy\b\.?(\s*.*)/i ) { $address = $1 . ' Highway' . $2 }

   # "Dr" and "Dr." both become "Drive"; the second substitution removes
   # the period that the first one leaves behind.
   $address =~ s/\bDr\.?\b/Drive/ig;
   $address =~ s/\bDrive\./Drive/g;
   $address =~ s/#//g;    # drop unit markers such as "#4" => "4"

   return $address;
}

HTH.

-Chris


--
Chris Brooks
VP, Technology
carescout.com


Re: [Boston.pm] Postal address De-duping

2003-08-09 Thread Andrew Pimlott
On Tue, Aug 05, 2003 at 11:21:25AM -0400, Tolkin, Steve wrote:
> Unfortunately I think that the USPS site 
> http://www.usps.com/cgi-bin/zip4/zip4inq
> needed to run this script is no more.  
>
> A search there for zip4inq produced nothing.
>
> Does anyone know of a similar page, either by the USPS or
> another provider of (web) services?

Just follow the Find a Zip Code link from http://www.usps.com/ ?

Andrew


Re: [Boston.pm] Postal address De-duping

2003-08-09 Thread David Cantrell
On Tuesday, August 5, 2003, at 09:07 AM, John Saylor wrote:
> really, any perl programmer worth hiring should be able to do this while
> sleeping.

No, it's quite a hard problem.  All of the following UK addresses are
the same and are deliverable.

  2/11 CR7 8JH
  11b CR7 8JH
  Flat 2, 11 Beulah Road, CR7 8JH *
  11 Beulah Road, Flat 2, CR7 8JH
  11b Beulah Road, CR7 8JH *
  Cantrell, CR7 8JH

Now, they're not all recommended - I've asterisked the two that the post
office like - but they do all work.  Now factor in all the people who
can't spell my name right, or the name of the road, or get a character
wrong in the post code, and yet my mail still arrives.
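
One thing all six variants do share is the postcode at the end, which can at
least be pulled out and normalized. A rough sketch; uk_postcode() is a
hypothetical helper and its pattern only approximates the shape of a UK
postcode, it is not a validator. A shared postcode narrows the candidates but
does not prove two records are the same address, since one postcode covers
several delivery points:

```perl
use strict;
use warnings;

# Pull a UK-style postcode off the end of an address and normalize it to
# uppercase with a single space before the final three characters.
# Returns undef (in scalar context) when nothing postcode-shaped is found.
sub uk_postcode {
    my ($address) = @_;
    return unless $address =~ /\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\s*$/i;
    return uc("$1 $2");
}

my @variants = (
    '2/11 CR7 8JH',
    '11b cr78jh',
    'Flat 2, 11 Beulah Road, CR7 8JH',
    'Cantrell, CR7  8JH',
);
print uk_postcode($_), "\n" for @variants;
```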
The following are all the same too:

1 London
Apsley House, Duke of Wellington Place, W1
W1J 7NT
--
David Cantrell |  Degenerate  | http://www.cantrell.org.uk/david
  While researching this email, I was forced to carry out some
  investigative work which unfortunately involved a bucket of
  puppies and a belt sander
-- after JoeB, in the Monastery




Re: [Boston.pm] Postal address De-duping

2003-08-08 Thread Steve Revilak
>> Unfortunately I think that the USPS site
>> http://www.usps.com/cgi-bin/zip4/zip4inq needed to run this script
>> is no more.
>>
>> A search there for zip4inq produced nothing.
>>
>> Does anyone know of a similar page, either by the USPS or another
>> provider of (web) services?
>
> Just follow the Find a Zip Code link from http://www.usps.com/ ?

According to http://www.usps.com/zip4/zipfaq.htm, their ZIP+4 Lookup
is intended for interactive use, not automated script processing.

However, it looks like the postal service does have electronic zip
directories

  http://www.usps.com/ncsc/ziplookup/amsdev.htm

As far as the general problem of cleaning postal addresses goes, it
looks like there are commercial packages out there, and a whole
certification process for them.

  http://www.usps.com/ncsc/ziplookup/cam.htm

-- 
Steve Revilak



Re: [Boston.pm] Postal address De-duping

2003-08-06 Thread John Saylor
hi

> On Tuesday, August 5, 2003, at 09:07 AM, John Saylor wrote:
>> really, any perl programmer worth hiring should be able to do this while
>> sleeping.

( 03.08.05 19:21 +0100 ) David Cantrell:
> No, it's quite a hard problem.

i guess it depends on the way the problem is defined by the client. as
you might have guessed, the problem i was thinking of was considerably
simpler than the example you outlined in your response.

-- 
\js



RE: [Boston.pm] Postal address De-duping

2003-08-06 Thread Tolkin, Steve
The article in question can be found at
http://www.foo.be/docs/tpj/issues/vol4_1/tpj0401-0002.html
(I had a hard time finding it via tpj.com, but Google worked.)

Unfortunately I think that the USPS site 
http://www.usps.com/cgi-bin/zip4/zip4inq
needed to run this script is no more.  
A search there for zip4inq produced nothing.

Does anyone know of a similar page, either by the USPS or
another provider of (web) services?

Hopefully helpfully yours,
Steve
-- 
Steven Tolkin    steve . tolkin at fmr dot com   617-563-0516 
Fidelity Investments   82 Devonshire St. V4D Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.





Re: [Boston.pm] Postal address De-duping

2003-08-04 Thread Jon Orwant
On Monday, August 4, 2003, at 05:12  PM, Joel Gwynn wrote:

> Hey, all.  We do lots of (snail) mailings, and we're looking for a fast,
> customizable de-duping solution.  We're currently taking a look at
> doubletake from http://peoplesmith.com/, which is not too expensive, but
> I was thinking there might be some perl stuff out there, given perl's
> text-processing powers.

There's a wee script I wrote for TPJ a while back that scrapes the U.S. 
Postal Service's address canonicalizer.  The script is on tpj.com; look 
under Archives for the article called "Five Quick Hacks".  The 
canonicalizer (well, they call it a "zip code locator" or something 
like that) will transform variants on the same address into the One 
True Address that the USPS recognizes, so de-duping then becomes a 
matter of simple string matching.
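
Once every address has a canonical form, the string-matching step is a hash
keyed on that form. A minimal sketch: find_duplicates() is a hypothetical
helper, and $toy_canon is a toy stand-in (uppercase, strip some punctuation,
collapse spaces) for whatever canonicalizer is actually available, not the
USPS one:

```perl
use strict;
use warnings;

# Group raw addresses by canonical key and return only the groups that
# contain more than one record. $canon is a code ref mapping a raw address
# to its canonical string.
sub find_duplicates {
    my ( $canon, @addresses ) = @_;
    my %groups;
    push @{ $groups{ $canon->($_) } }, $_ for @addresses;
    return grep { @$_ > 1 } values %groups;
}

# Toy canonicalizer, for illustration only.
my $toy_canon = sub {
    my ($s) = @_;
    $s = uc $s;
    $s =~ s/[.,#]//g;     # drop periods, commas, and "#"
    $s =~ s/\s+/ /g;      # collapse whitespace
    $s =~ s/^ | $//g;     # trim
    return $s;
};

my @dupes = find_duplicates( $toy_canon,
    '22 St. John St', '22 ST JOHN ST', '5 Elm Rd' );
print join( ' == ', @$_ ), "\n" for @dupes;
# prints: 22 St. John St == 22 ST JOHN ST
```

Passing the canonicalizer in as a code ref keeps the grouping logic separate
from the question of how good the canonical form is.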

Won't help you for foreign addresses, obviously.

-Jon
