On Jun 17, 2005, at 9:08 AM, Ricker, William wrote:

>> So I have about 75k addresses that need to be redone.
>
> That's enough to consider a commercial service with value-added data
> corrections and warranty. Both Eagle and Geocoder.US offer that.
I redid my math last night, and Eagle offers 100k addresses for $1550, which works out to about 1.5 cents each. The price per address is similar in smaller batches too. I was thinking Eagle was much more expensive, but it's still nowhere near as cheap as geocoder.

> If there's a business value to accurate data, either should
> be well worth it.

Exactly! In our case there is definite business value.

>> One option we've used in the past is just doing zipcode centroid
>> matching. You can get this information for ~$100.
>
> Can't you do that for free with the USPO web service? But $100 is less
> than the coding cost to scrape USPO, so ... yeah.

Sure you can scrape, but have you read the USPO's TOS? It, like Google, forbids commercial use of the free service. As always, if you want to make money off it, you gotta pay!

> Yes, the question is how short the radius is, what the density of
> datums is, and what industry you're cataloging. If the normal radius is
> say 25 miles, grouping all items in zipcode 12345 to the same lat/lon
> at either the zipcode centroid or the post-office lat/lon should be
> just fine. If you have sufficient density that the closest 10 hits
> would all be within a mile, this would increase the error dramatically.

The choices for radius are 5, 10, 15, and 25 miles. The problem is that we have a fair number of non-urban addresses, and non-urban zipcodes tend to be rather large, so I don't think centroid matching would give us very accurate results overall.

> You've got 75k data items. What is the clustering? Of the 99999
> possible zipcodes, how many are represented in your 75k items, and how
> many have high counts? If spread evenly over the (back of the envelope,
> 3k*1k=) 3M sq miles of the lower 48, that's (75k/3kk = 25/k = 1/40) one
> datum per 40 square miles, which would be OK.

I just checked our largest database, and there are 1600 zipcodes with 3+ addresses in them, and just 3 zipcodes with 10+ addresses.
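Bill's back-of-the-envelope density math, and the per-zipcode count I ran against our database, can be sketched like this (a Python sketch; the real query was against our database, and the sample records here are made up for illustration):

```python
from collections import Counter

# Back-of-the-envelope density: 75k addresses over the lower 48.
addresses_total = 75_000
lower48_sq_mi = 3_000 * 1_000          # rough 3k x 1k mile bounding box, ~3M sq mi
sq_mi_per_address = lower48_sq_mi / addresses_total   # -> 40.0, one datum per 40 sq mi

# Clustering check: how many zipcodes hold 3+ (or 10+) addresses?
# `records` stands in for rows pulled from the real database.
records = [
    {"name": "Acme Hardware", "zip": "01930"},
    {"name": "Bob's Diner",   "zip": "01930"},
    {"name": "C&D Auto",      "zip": "01930"},
    {"name": "Elm St Books",  "zip": "12345"},
]

per_zip = Counter(r["zip"] for r in records)
dense_3 = sum(1 for n in per_zip.values() if n >= 3)
dense_10 = sum(1 for n in per_zip.values() if n >= 10)
print(sq_mi_per_address, dense_3, dense_10)
```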
This quick calculation tells me that the addresses are spread pretty evenly across the country.

>> In the case of using the TIGER/Line dataset, how
>> accurate is it?
>
> Not sure how different GeoCoder's public webservice dataset is from the
> raw TIGER/Line set. I'm assuming they're showing their value-add, but I
> don't know that.

That is the main question holding me back from trying geocoder. How much enhancement have they done on top of the public dataset? The big mapping/geocoding companies have spent tons of $$$ to keep their databases up to date, and as million/billion dollar businesses they have every incentive to make their databases as accurate as possible.

> I've found their web service is touchy about getting the right
> abbreviations and is helped by giving it zipcodes. It took me several
> tries to get "1 Stanwood Street, Gloucester, MA"; not sure why, but it
> works now. If you use Geo::Coder::Us and TIGER/Line, you'll want to put
> a USPO or your own cleanup routine in front of it to force Street to
> St, and strip the .'s off abbrevs and add ,'s where it wants 'em (or
> parse 'em yourself into fields).

This is one benefit of Eagle's service: they take the address given and first standardize it according to USPS standards, and the standardized form is sent back as part of the result set. So they are liberal in their input and strict in their output, a good thing!

There's a second part to the address standardization equation too. We get our data primarily from the government, and it is often inaccurate or just plain incorrect, usually because a phone number has changed (not relevant here) or the business moved. So we have the additional problem of how to layer our updated (presumably more accurate) data, gotten from Eagle and from phone calls to the business, on top of the original data. People are working on this issue, but it's rather thorny.
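A minimal version of the cleanup routine Bill describes (forcing "Street" to "St" and stripping periods off abbreviations) might look like this. It's a Python sketch; a Perl front-end for Geo::Coder::Us would be analogous, and the suffix table here is a tiny illustrative subset, not the full USPS Publication 28 list:

```python
# Illustrative subset of USPS street-suffix abbreviations (Pub. 28 has the full list).
SUFFIXES = {
    "street": "St", "avenue": "Ave", "road": "Rd",
    "boulevard": "Blvd", "drive": "Dr", "lane": "Ln",
}

def clean_address(addr: str) -> str:
    """Normalize an address line before handing it to a geocoder."""
    addr = addr.replace(".", "")              # strip periods from abbreviations
    out = []
    for word in addr.split():
        trail = "," if word.endswith(",") else ""   # keep trailing commas intact
        key = word.rstrip(",").lower()
        out.append(SUFFIXES.get(key, word.rstrip(",")) + trail)
    return " ".join(out)

print(clean_address("1 Stanwood Street, Gloucester, MA"))
# -> "1 Stanwood St, Gloucester, MA"
```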
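For the data-layering problem, one hedged sketch of the kind of rule set involved: pick the record from the most trusted source, breaking ties by date. The source ranking and record shape here are entirely made up for illustration, not our actual rules:

```python
from datetime import date

# Hypothetical trust ranking: higher wins. Illustrative only.
SOURCE_RANK = {"phone_call": 3, "eagle": 2, "government": 1}

def best_record(records):
    """Pick the record most likely to be correct: most trusted source
    first, then most recent date as the tiebreaker."""
    return max(records, key=lambda r: (SOURCE_RANK[r["source"]], r["date"]))

records = [
    {"source": "government", "date": date(2004, 1, 15), "address": "12 Main Street"},
    {"source": "eagle",      "date": date(2005, 6, 1),  "address": "12 Main St"},
    {"source": "phone_call", "date": date(2005, 5, 20), "address": "14 Main St"},
]

print(best_record(records)["address"])
# -> "14 Main St" (a direct call to the business outranks the other sources)
```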
You basically have to get down to a set of rules that takes into account the source and date of each record to decide which is most likely correct. Interesting problem, isn't it?

This is a great discussion! Anyone else have opinions or suggestions on the subject?

Drew

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

