On Jan 27, 6:35 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Brian D <brianden...@gmail.com> writes:
> > I've tackled this kind of problem before by looping through a patterns
> > dictionary, but there must be a smarter approach.
> >
> > Two addresses. Note that the first has incorrectly transposed the
> > direction and street name. ....
>
> If you're really serious about it (e.g. you are the post office trying
> to program automatic mail sorting machines) there is no simple regex
> trick anything like what you want. A lot of addresses will be
> ambiguous. You have to use all the info you have about your entire address
> corpus (e.g. you need a complete street directory of the whole US) and
> do a bunch of Bayesian inference. As a very simple example, for an
> address like "1000 RAMPART S ST" you'd use the zip code to identify the
> address's geographic neighborhood, and then use your street directory to
> find candidate correct addresses within that zip code. The USPS does
> an amazing job of delivering mail to completely mangled addresses
> based on methods like that.

Paul,

That's a sound methodology. I actually have a routine that compares an address to a list of all streets in the city using a Short Distance function. I have used that when the addresses have a lot of problems.

In this case, however, the streets are actually structured very well -- except for the transposed street directions. I was really hoping to see if there's a regex that handles one-, two-, and three-word strings, followed by an occasional single character, and then a two-character suffix. I'm still hoping for that kind of solution if it exists.

The reason? Only a very small number of addresses aren't being captured by the current regex. I don't see the need for overkill, and I'm always stretching to learn something I haven't already succeeded at accomplishing.

I may just make a second pass at the data with a different regex.
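
For what it's worth, here is a rough sketch of the kind of single pattern I have in mind. It is only an illustration under some assumptions that are mine, not facts about the real data: the address starts with a house number, the direction is a single letter from N, S, E, W, the suffix is exactly two letters, and the input can be uppercased first. The names `ADDRESS_RE` and `normalize` are made up for this example.

```python
import re

# Hypothetical pattern: number, optional leading direction, a street name of
# one to three words, optional trailing (transposed) direction, 2-letter suffix.
ADDRESS_RE = re.compile(
    r"""
    ^\s*
    (?P<number>\d+)\s+                            # house number
    (?:(?P<pre_dir>[NSEW])\s+)?                   # direction in the usual place
    (?P<street>[A-Z0-9]+(?:\s+[A-Z0-9]+){0,2}?)   # 1-3 words, as few as possible
    (?:\s+(?P<post_dir>[NSEW]))?                  # direction transposed after name
    \s+(?P<suffix>[A-Z]{2})                       # two-character suffix, e.g. ST
    \s*$
    """,
    re.VERBOSE,
)


def normalize(address):
    """Return 'NUMBER [DIR] STREET SUFFIX', or None if the pattern fails."""
    m = ADDRESS_RE.match(address.upper())
    if m is None:
        return None
    # If both directions are present, this sketch keeps the leading one.
    direction = m.group("pre_dir") or m.group("post_dir")
    parts = [m.group("number"), direction, m.group("street"), m.group("suffix")]
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    for raw in ("1000 RAMPART S ST", "1000 S RAMPART ST",
                "200 MARTIN LUTHER KING E AV"):
        print(raw, "->", normalize(raw))
```

The street-name group is non-greedy so that a trailing single letter is tried as a direction before it gets absorbed into the name. A street whose last word really is a single N, S, E, or W would be misread by this, so anything the pattern rewrites is probably still worth checking against the street list.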