Olivier Boudry wrote: > > Hi all, > > This e-mail may be a bit off topic. My question is more about methods and > algorithms than Haskell. I'm looking for links to methods or tools for > parsing unstructured data. > > I'm currently working on data cleaning of a Customer Addresses database. > Addresses are stored as 3 lines of text without any structure and people > made used lots of imagination to create the data (20 years of data using > no > rules at all). Postal code, street, city, state, region, country and other > details as suite, building, dock, doors, PO box, etc... are all stored in > free form in those 3 lines. > > I already wrote a haskell program to do the job. It correctly parses about > 2/3 addresses and parses much of the rest but with unrecognized parts > left. > The current program works by trying to recognize words used to tag parts > like STE, SUITE, BLDG, street words (STR, AVE, CIRCLE, etc...) and > countries > from a list (including typos). It uses regular expressions to recognize > variation of those words, lookup tables for countries, street words, > regular > expression rules for postal codes, etc... The most difficult task is > splitting the address parts. There is no clearly defined separator for the > fields. It can be dot, space, comma, dash, slash, or anything you can > imagine using as a separator and this separator can of course also be > found > inside an address part. > > In the current application when part of an address is recognized it will > not > be parsed again by the other rules. A system trying all rules and tagging > them with probabilities would probably give better results. > > Any link to tools or methods that could help me in that task would be > greatly appreciated. I already searched for fuzzy, probabilistic or > statistical parsing but without much success. > >
You may have better luck checking out methods used in parsing natural language. In order to use statistical parsing techniques such as Probabilistic Context Free Grammars ([1],[2] ) the standard approach is to extract rule probabilities from an annotated corpus, that is collection of strings with associated parse trees. Maybe you could use your 2/3 of addresses that you know are correctly parsed as your training material. A PCFG parser can output all (or n-best) parses ordered according to probabilities so that would seem to be fit your requirements. [1] http://en.wikipedia.org/wiki/Stochastic_context-free_grammar [2] http://www.cs.colorado.edu/~martin/slp2.html#Chapter14 -- Best, Grzegorz -- View this message in context: http://www.nabble.com/Parsing-unstructured-data-tf4892274.html#a14011334 Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe