Ken, 

A group in Chicago has been working for a few years now on a deduplication 
toolkit that might do what you are looking for. They also have a couple of 
versions that work with an Excel file or .csv file. 

https://github.com/datamade/dedupe
https://github.com/datamade/dedupe-web
https://github.com/datamade/csvdedupe

I have not worked with them extensively, but I have heard others find these 
very useful for entity recognition and resolution.
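
For a quick first pass before setting up dedupe, here is a rough sketch of the
kind of clustering Ken describes. It uses only Python's standard library
(difflib), not dedupe's API. The function names, the normalization rules, and
the 0.7 threshold are my own assumptions and would need tuning on real data.

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    term = re.sub(r"[^a-z0-9 ]+", " ", term.lower())
    return " ".join(term.split())


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_terms(terms, threshold=0.7):
    """Greedy clustering (hypothetical helper): a term joins the first
    cluster containing any member at or above the threshold; otherwise
    it starts a new cluster."""
    clusters = []
    for term in terms:
        key = normalize(term)
        for cluster in clusters:
            if any(similarity(key, normalize(m)) >= threshold for m in cluster):
                cluster.append(term)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([term])
    return clusters


if __name__ == "__main__":
    terms = [
        "Irwin, Ken", "Irwin, Kenneth", "Irwin, Mr. Kenneth",
        "Irwin, Kenneth R.", "Basketball - Women", "Basketball - Women's",
        "Basketball-Women", "Basketball-Women's",
    ]
    for cluster in cluster_terms(terms):
        print(cluster)
```

This compares every term against every existing cluster member, so it is
quadratic in the worst case and would be slow but probably workable on about
16,000 terms. The clusters are only candidates for a person to review, not
decisions.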


-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ken Irwin
Sent: Friday, March 21, 2014 2:25 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] tool for finding close matches in vocabular list

Hi folks,

I'm looking for a tool that can go through the full list of subject terms in a 
poorly-controlled index and flag possible candidates for term consolidation. Our 
student newspaper index has about 16,000 subject terms, and they include a lot 
of meaningless typographical and nomenclatural differences, e.g.:

Irwin, Ken
Irwin, Kenneth
Irwin, Mr. Kenneth
Irwin, Kenneth R.

Basketball - Women
Basketball - Women's
Basketball-Women
Basketball-Women's

I would love to have some sort of pattern-matching tool that's smart about this 
sort of thing that could go through the list of terms (as a text list, 
database, xml file, or whatever structure it wants to ingest) and spit out some 
clusters of possible matches.

Does anyone know of a tool that's good for that sort of thing?

The index is just a bunch of MySQL tables - there is no real controlled-vocab 
system, though I've recently built some systems to suggest known SH's to reduce 
this sort of redundancy.

Any ideas?

Thanks!
Ken
