On Wed, 2007-11-28 at 12:58 -0500, Olivier Boudry wrote:
> Hi all,
>
> This e-mail may be a bit off topic. My question is more about methods
> and algorithms than Haskell. I'm looking for links to methods or tools
> for parsing unstructured data.
>
> I'm currently working on data cleaning of a Customer Addresses
> database. Addresses are stored as 3 lines of text without any
> structure, and people used a lot of imagination to create the data
> (20 years of data using no rules at all). Postal code, street, city,
> state, region, country and other details such as suite, building, dock,
> doors, PO box, etc. are all stored in free form in those 3 lines.
>
> I already wrote a Haskell program to do the job. It correctly parses
> about 2/3 of the addresses and parses much of the rest, but with
> unrecognized parts left. The current program works by trying to
> recognize words used to tag parts like STE, SUITE, BLDG, street words
> (STR, AVE, CIRCLE, etc.) and countries from a list (including typos).
> It uses regular expressions to recognize variations of those words,
> lookup tables for countries and street words, regular expression rules
> for postal codes, etc. The most difficult task is splitting the address
> parts. There is no clearly defined separator for the fields. It can be
> a dot, space, comma, dash, slash, or anything you can imagine using as
> a separator, and this separator can of course also be found inside an
> address part.
>
> In the current application, when part of an address is recognized it
> will not be parsed again by the other rules. A system trying all rules
> and tagging them with probabilities would probably give better
> results.

Have you looked at the Java Rule Engine (I believe JSR 94) and in
particular Jess? http://herzberg.ca.sandia.gov/
I have no experience with it myself, though, just heard of it.

Regards,

Hans van Thiel

> Any link to tools or methods that could help me in that task would be
> greatly appreciated. I already searched for fuzzy, probabilistic or
> statistical parsing but without much success.
>
> Thanks,
>
> Olivier.
>
> PS: just in case someone's interested I attached the code and partial
> data to this e-mail.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
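
The idea raised in the quoted message (try every rule on every token and score the results, instead of letting the first match win) could be sketched roughly as follows. This is only an illustration, not the attached program: the `Tag` type, the rule names, the keyword lists beyond the words quoted above, and all the weights are made up for the example, and the five-digit postal code check is an assumption about US-style ZIP codes. Only `base` is used.

```haskell
import Data.Char (isDigit, toUpper)
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Hypothetical tags for address parts (not from the original program).
data Tag = Suite | Building | StreetWord | PostalCode | Unknown
  deriving (Eq, Show)

-- A rule inspects one token and may propose a tag with a score in [0,1].
type Rule = String -> Maybe (Tag, Double)

-- Build a rule from a keyword list; the scores are invented for illustration.
keywordRule :: Tag -> Double -> [String] -> Rule
keywordRule tag score keywords tok
  | map toUpper tok `elem` keywords = Just (tag, score)
  | otherwise                       = Nothing

-- Assumption: five digits look like a US ZIP code, but they could also be
-- a house number, so the score is lower than for a keyword match.
zipRule :: Rule
zipRule tok
  | length tok == 5 && all isDigit tok = Just (PostalCode, 0.7)
  | otherwise                          = Nothing

rules :: [Rule]
rules =
  [ keywordRule Suite      0.9 ["STE", "SUITE"]
  , keywordRule Building   0.9 ["BLDG", "BUILDING"]
  , keywordRule StreetWord 0.8 ["STR", "AVE", "CIRCLE"]
  , zipRule
  ]

-- Apply every rule to a token and keep all candidates, best score first.
candidates :: String -> [(Tag, Double)]
candidates tok =
  sortBy (comparing (Down . snd)) [c | Just c <- map ($ tok) rules]

-- Pick the highest-scoring candidate, falling back to Unknown.
bestTag :: String -> (Tag, Double)
bestTag tok = case candidates tok of
  (c : _) -> c
  []      -> (Unknown, 0)

-- Tag each whitespace-separated token of one address line.
tagLine :: String -> [(String, (Tag, Double))]
tagLine line = [(tok, bestTag tok) | tok <- words line]
```

Loaded in GHCi, `tagLine "STE 200 Main AVE 12345"` tags the tokens as Suite, Unknown, Unknown, StreetWord and PostalCode. Since `candidates` keeps every proposal rather than only the winner, a later pass could look at the whole line and, for instance, prefer the reading in which each address has at most one postal code. Splitting on whitespace alone sidesteps the separator problem described above, so a real tokenizer would need more care.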