On Jun 17, 2005, at 9:08 AM, Ricker, William wrote:
>
>> So I have about 75k addresses that need to be redone.
>
> That's enough to consider a commercial service with value-added data
> corrections and warranty.  Both Eagle and Geocoder.US offer that.

I redid my math last night, and Eagle offers 100k addresses for $1550, 
which works out to 1.5 cents each. The price per address is similar in 
smaller batches too. I was thinking Eagle was much more expensive, but 
it's still nowhere near as cheap as geocoder.

>  If there's a business value to accurate data, either should
> be well worth it.

Exactly! In our case there is definite business value.

>> One option we've used in the past is just doing zipcode centroid
>> matching. You can get this information for ~$100.
>
> Can't you do that for free with the USPO web service?  But $100 is less
> than the coding cost to scrape USPO, so ... yeah.

Sure you can scrape, but have you read the USPO's TOS? Like Google's, 
it forbids commercial use of the free service. As always, if you want 
to make money off it, you've got to pay!

> Yes, the question is how short the radius is, what the density of 
> datums
> are, what industry you're cataloging is.  If the normal radius is say 25
> miles, grouping all items in zipcode 12345 to the same lat/lon at 
> either
> the zipcode centroid or the post-office lat-lon should be just fine. If
> you have sufficient density that returning the closest 10 hits would 
> all
> be within a mile, this would increase the error dramatically.

The choices for radius are 5, 10, 15, and 25 miles. The problem is that 
we have a fair number of non-urban addresses, and non-urban zipcodes 
tend to be rather large... So I don't think centroid matching would 
give us very accurate results overall.
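For what it's worth, the centroid approach itself is simple enough to sketch. Here's a minimal illustration (Python rather than Perl, and the centroid table is made up for the example; a real one would come from a purchased or free zipcode centroid dataset):

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical zipcode -> (lat, lon) centroid table; values here are
# rough illustrations, not authoritative coordinates.
CENTROIDS = {
    "01930": (42.6159, -70.6620),  # Gloucester, MA
    "02134": (42.3581, -71.1290),  # Allston, MA
    "12345": (42.8142, -73.9396),  # Schenectady, NY
}

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in statute miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # mean Earth radius ~3958.8 mi

def zips_within_radius(origin_zip, radius_miles):
    """All zipcodes whose centroid lies within radius_miles of origin's."""
    olat, olon = CENTROIDS[origin_zip]
    return [z for z, (lat, lon) in CENTROIDS.items()
            if haversine_miles(olat, olon, lat, lon) <= radius_miles]
```

The catch is exactly the one above: the error of a centroid match is bounded roughly by the sum of the two zipcodes' own extents, so big rural zipcodes can blow right past a 5-mile radius.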

> You've got 75k data items.  What is the clustering?  Of the 99999
> possible zipcodes, how many are represented in your 75k items, and how
> many have high counts?  If spread evenly over the (back of the 
> envelope,
> 3k*1k=) 3M sq miles of the lower 48, that's (75k/3kk = 25/k = 1/40) one
> datum per 40 square miles, would be ok.

I just checked our largest database, and there are 1600 zipcodes which 
have 3+ addresses in them, and just 3 zipcodes w/ 10+ addresses. This 
quick calculation tells me that the addresses are spread pretty evenly 
across the country.
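Those tallies came from a quick database query; in code the same clustering check is a one-liner per threshold. A sketch with stand-in data (the list here substitutes for our real address table):

```python
from collections import Counter

# Stand-in for the real address table: just the zipcode of each record.
zipcodes = ["01930"] * 12 + ["02134"] * 4 + ["12345"] * 2 + ["60601"]

counts = Counter(zipcodes)
three_plus = sum(1 for c in counts.values() if c >= 3)
ten_plus = sum(1 for c in counts.values() if c >= 10)
print(three_plus, ten_plus)  # with this fake data: 2 and 1
```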

>> In the case of using the TIGER/Line dataset, how
>> accurate is it?
>
> Not sure how different GeoCoder's public webservice dataset is from raw
> Tiger/line set. I'm assuming they're showing their value add, but don't
> know that.

That is the main question holding me back from trying geocoder. How 
much enhancement have they done from the public dataset? The big 
mapping/geocoding companies have spent tons of $$$ to keep their 
databases up-to-date. And they are million/billion-dollar businesses, 
so they have an incentive to make their databases as accurate as 
possible.

> I've found their web service is touchy about giving the right
> abbreviations and is helped by giving it zipcodes.  It took me several
> tries to get "1 Stanwood Street, Gloucester, MA", not sure why, works
> now.  If you use Geo::Coder::Us and TIGER/Line, you'll want to put a
> USPO or your own cleanup routine in front of it to force Street to St,
> and strip the .'s off abbrevs and add ,'s where it wants 'em (or parse
> 'em yourself into fields).

This is one benefit of Eagle's service - they take the address given 
and first standardize it according to USPS standards, and the 
standardized address is sent back as part of the result set. So they 
are liberal in their input and strict in their output - a good thing!
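Rolling your own pre-cleanup along the lines Bill describes isn't hard for the common cases. A toy version (Python, with a deliberately tiny abbreviation table; USPS Publication 28 has the full suffix list):

```python
import re

# Small sample of USPS street-suffix abbreviations (Pub. 28 has many more).
SUFFIXES = {
    "STREET": "St", "AVENUE": "Ave", "ROAD": "Rd",
    "BOULEVARD": "Blvd", "DRIVE": "Dr", "LANE": "Ln",
}

def normalize(addr):
    """Strip periods, collapse whitespace, abbreviate street suffixes."""
    addr = addr.replace(".", "")              # "St." -> "St"
    addr = re.sub(r"\s+", " ", addr).strip()  # collapse runs of spaces
    words = []
    for w in addr.split(" "):
        trailing = "," if w.endswith(",") else ""
        bare = w.rstrip(",")
        words.append(SUFFIXES.get(bare.upper(), bare) + trailing)
    return " ".join(words)

print(normalize("1 Stanwood  Street, Gloucester, MA"))
# -> "1 Stanwood St, Gloucester, MA"
```

This is only the easy 80%; directionals (N/S/E/W), unit designators, and misspellings are where a commercial standardizer earns its keep.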

There's a second part to the address standardization equation too. We 
get our data primarily from the government. It is often inaccurate or 
just plain incorrect, usually because a phone number has changed (not 
relevant here) or the business moved. So we have the additional problem 
of how to layer our updated (presumably more accurate) data - gotten 
from Eagle and from phone calls to the business - on top of the 
original data. People are working on this issue, but it's rather 
thorny. You basically have to come up with a set of rules that takes 
into account the source and date of each record to decide which is 
most likely correct. Interesting problem, isn't it?
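To make the thorny part concrete, the rules often boil down to: rank each candidate record by how much you trust its source, then break ties by date. A sketch with invented source names and an invented trust ranking (the real precedence would come from experience with each feed):

```python
from datetime import date

# Invented trust ranking: direct phone verification beats a paid
# geocoding vendor, which beats the raw government feed.
SOURCE_RANK = {"phone_call": 3, "eagle": 2, "government": 1}

def best_record(candidates):
    """Pick the record most likely to be correct.

    Each candidate is a dict with 'source' and 'updated' keys.
    Higher-trust source wins; within a source, newest wins.
    """
    return max(candidates,
               key=lambda r: (SOURCE_RANK[r["source"]], r["updated"]))

records = [
    {"source": "government", "updated": date(2005, 5, 1), "addr": "old"},
    {"source": "eagle",      "updated": date(2005, 6, 1), "addr": "fixed"},
    {"source": "phone_call", "updated": date(2004, 1, 1), "addr": "verified"},
]
print(best_record(records)["addr"])  # -> "verified"
```

Note the wrinkle this exposes: the phone-verified record wins even though it's the oldest. Whether trust should always trump recency is exactly the kind of judgment call that makes the problem thorny.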

This is a great discussion! Anyone else have opinions or suggestions on 
the subject?

Drew

 
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
