Hi all,
I have a list of strings with company names embedded in them. I would like to
1) extract these names without the junk around them, and
2) normalize the company names.

For 1), the name is often followed by its address, e.g.
BOSTON PROBES, INC  75 E WIGGINS AVENUE  BEDFORD, ...
PAUL F PRESTIA  RATNER & PRESTIA  SUITE 301 ...
NIKAIDO MARMELSTEIN MURRAY AND ORAM  METROPOLITAN SQUARE  655 15TH STREET NW ...
Ratner & Prestia  P.O. Box 980  Valley Forge ...

These are just a few examples. I noticed I can probably use a number as
an indication that the address is starting, along with some other keywords:
PO, ONE, SUITE, etc. However, sometimes the company is followed by
a city name, which makes it harder (e.g. RICE UNIVERSITY HOUSTON TX).
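
A minimal sketch of that heuristic, in Python purely for illustration. The
function name strip_address and the keyword list are assumptions, not an
existing module: it keeps tokens until the first one that contains a digit
or matches an address keyword.

```python
import re

# Hypothetical keyword list; extend it as more address patterns turn up.
ADDRESS_KEYWORDS = {"PO", "P.O.", "ONE", "SUITE"}

def strip_address(line):
    """Keep the leading tokens up to the first one that looks like an address start."""
    kept = []
    for token in line.split():
        # A digit or an address keyword is taken as the start of the address.
        if re.search(r"\d", token) or token.upper() in ADDRESS_KEYWORDS:
            break
        kept.append(token)
    return " ".join(kept)

print(strip_address("BOSTON PROBES, INC  75 E WIGGINS AVENUE  BEDFORD, ..."))
print(strip_address("Ratner & Prestia  P.O. Box 980  Valley Forge ..."))
print(strip_address("RICE UNIVERSITY HOUSTON TX"))
```

The first two calls give "BOSTON PROBES, INC" and "Ratner & Prestia". The
third returns the whole string unchanged, which is exactly the city-name case
that the keyword approach cannot handle; it would likely need a list of city
and state names on top.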

2) normalize the company names
Often, one company name has many variations:
MILLENNIUM PHARMACEUTICAL
MILLENNIUM PHARMACEUTICALS
MILLENNIUM PHARMACEUTICAL CAMBRIDGE
MILLENNIUM PHARMA
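
One possible way to group such variants, sketched here as an assumption
rather than a known solution, is approximate string matching. The example
uses difflib.SequenceMatcher from the Python standard library; the function
name group_variants and the 0.75 threshold are hypothetical and would need
tuning on real data.

```python
from difflib import SequenceMatcher

def group_variants(names, threshold=0.75):
    """Greedily group names whose similarity to a group's first member meets the threshold."""
    groups = []
    for name in names:
        for group in groups:
            # Compare against the group's first member only (a simplification).
            if SequenceMatcher(None, group[0], name).ratio() >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([name])
    return groups

names = [
    "MILLENNIUM PHARMACEUTICAL",
    "MILLENNIUM PHARMACEUTICALS",
    "MILLENNIUM PHARMACEUTICAL CAMBRIDGE",
    "MILLENNIUM PHARMA",
    "RICE UNIVERSITY",
]
for group in group_variants(names):
    print(group)
```

With this input the four MILLENNIUM variants end up in one group and
RICE UNIVERSITY in its own. The result depends on input order and on the
threshold, so it is only a starting point.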

Also, there are abbreviations like INC., CORP., S.A., GMBH, UNIV., RES.
vs. Research, Inst., etc.
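
A rough sketch of how such abbreviations might be handled, again in Python
for illustration only. The lookup tables and the function name normalize are
assumptions; in particular, expanding INST to INSTITUTE is a guess that will
be wrong for some names.

```python
import re

# Hypothetical tables; real data would need many more entries.
ABBREVIATIONS = {"UNIV": "UNIVERSITY", "RES": "RESEARCH", "INST": "INSTITUTE"}
LEGAL_SUFFIXES = {"INC", "CORP", "SA", "GMBH"}

def normalize(name):
    """Uppercase, strip punctuation, expand abbreviations, drop trailing legal suffixes."""
    # Remove dots first so "S.A." collapses to "SA" instead of "S A".
    cleaned = name.upper().replace(".", "")
    # Replace everything except letters, digits, "&" and spaces with a space.
    cleaned = re.sub(r"[^A-Z0-9& ]+", " ", cleaned)
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

print(normalize("Boston Probes, Inc."))
print(normalize("RICE UNIV."))
print(normalize("Ratner & Prestia"))
```

These print "BOSTON PROBES", "RICE UNIVERSITY" and "RATNER & PRESTIA". Note
that the character filter discards non-ASCII letters, so names with accents
would need different handling.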

QUESTION: are there modules/algorithms written to handle this kind of
name-scrubbing problem?

Jun

----
Jun Wan
Bioinformatics Scientist
Gene-IT, Inc.
 
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm