Hi all, I have a list of strings with company names embedded in them. I would like to 1) extract these names without the junk around them, and 2) normalize the company names.
For 1), the name is often followed by its address, e.g.

BOSTON PROBES, INC 75 E WIGGINS AVENUE BEDFORD, ...
PAUL F PRESTIA RATNER & PRESTIA SUITE 301 ...
NIKAIDO MARMELSTEIN MURRAY AND ORAM METROPOLITAN SQUARE 655 15TH STREET NW ...
Ratner & Prestia P.O. Box 980 Valley Forge ...

These are just a few examples. I noticed I can probably use a number as an indication that the address is starting, along with some other keywords: PO, ONE, SUITE, etc. However, sometimes the company is followed directly by a city name, which makes it harder (e.g. RICE UNIVERSITY HOUSTON TX).

2) Normalize the company names. Often, one company name has many variations:

MILLENNIUM PHARMACEUTICAL
MILLENNIUM PHARMACEUTICALS
MILLENNIUM PHARMACEUTICAL CAMBRIDGE
MILLENNIUM PHARMA

Also there are abbreviations like INC., CORP., S.A., GMBH, UNIV., RES. vs. Research, Inst., etc.

QUESTION: are there modules/algorithms written to handle this kind of name scrubbing problem?

Jun

----
Jun Wan
Bioinformatics Scientist
Gene-IT, Inc.
_______________________________________________
Boston-pm mailing list
http://mail.pm.org/mailman/listinfo/boston-pm
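
A minimal sketch of the two heuristics described in the post, written in Python for illustration only. It is not an existing module: the function names (extract_name, normalize, similar), the keyword set, the suffix list, the abbreviation expansions, and the 0.8 similarity threshold are all assumptions chosen for this example. It cuts the string at the first token that starts with a digit or matches an address keyword, then upper-cases the name, strips punctuation, drops legal suffixes, and expands a couple of abbreviations. Near-duplicate spellings are compared with the standard library's difflib.

```python
import re
from difflib import SequenceMatcher

# Assumed address-start keywords, taken from the examples in the post.
ADDRESS_KEYWORDS = {"PO", "P.O.", "ONE", "SUITE"}

# Assumed legal suffixes to drop and abbreviations to expand (illustrative only).
LEGAL_SUFFIXES = {"INC", "CORP", "SA", "GMBH"}
EXPANSIONS = {"UNIV": "UNIVERSITY", "RES": "RESEARCH"}


def extract_name(text):
    """Keep the tokens before the first one that looks like an address start."""
    kept = []
    for token in text.split():
        upper = token.upper()
        # A leading digit or a known keyword is treated as the start of the address.
        if upper[0].isdigit() or upper in ADDRESS_KEYWORDS:
            break
        kept.append(token)
    return " ".join(kept).rstrip(",")


def normalize(name):
    """Upper-case, strip punctuation, drop legal suffixes, expand abbreviations."""
    cleaned = re.sub(r"[.,;]", "", name.upper())
    tokens = [EXPANSIONS.get(t, t) for t in cleaned.split()]
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def similar(a, b, threshold=0.8):
    """Treat two normalized names as variants if their similarity ratio is high."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


if __name__ == "__main__":
    raw = "BOSTON PROBES, INC 75 E WIGGINS AVENUE BEDFORD,"
    print(normalize(extract_name(raw)))
    print(similar("MILLENNIUM PHARMACEUTICAL", "MILLENNIUM PHARMACEUTICALS"))
```

This sketch does not solve the hard cases raised in the post. A name followed directly by a city (RICE UNIVERSITY HOUSTON TX) passes through unchanged, a person's name in front of the firm (PAUL F PRESTIA RATNER & PRESTIA) is kept, and a building name before the street number (METROPOLITAN SQUARE) stays attached. Those would likely need a gazetteer of city and state names or a list of known companies. The similarity threshold would also need tuning against real data, since shorter variants such as MILLENNIUM PHARMA score lower against the longer spellings.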

