As Roy suggests, OpenRefine is designed for this type of work and could easily deal with the volume you are talking about here. It can cluster terms using a variety of algorithms and easily apply a set of standard transformations.
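
To give a feel for what key-collision clustering involves, here is a rough Python sketch loosely modelled on the "fingerprint" keying idea: normalise each term to a key and group terms that share one. This is not Refine's actual code. The function names `fingerprint` and `cluster_terms` are made up for illustration, and replacing punctuation with spaces is my own assumption, chosen so that hyphenated variants collide.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(term):
    """Build a normalised key for a term (hypothetical helper, not Refine's code)."""
    # Fold accented characters to plain ASCII where possible.
    text = unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode("ascii")
    text = text.strip().lower()
    # Assumption: treat punctuation as a word separator rather than deleting it.
    text = re.sub(r"[^\w\s]", " ", text)
    # Deduplicate and sort tokens so that word order does not matter.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster_terms(terms):
    """Group terms sharing a fingerprint; return only groups with 2+ distinct terms."""
    groups = defaultdict(set)
    for term in terms:
        groups[fingerprint(term)].add(term)
    return [sorted(members) for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    sample = [
        "Basketball - Women",
        "Basketball-Women",
        "Basketball - Women's",
        "Basketball-Women's",
        "Irwin, Ken",
        "Irwin, Kenneth",
    ]
    for cluster in cluster_terms(sample):
        print(cluster)
```

A simple key like this only catches spacing, punctuation, case and word-order differences. Pairs such as "Irwin, Ken" and "Irwin, Kenneth" would need one of the fuzzier methods (nearest-neighbour or phonetic matching) followed by human review.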

The screencasts and info at http://freeyourmetadata.org/cleanup/ might be a good starting point if you want to see what Refine can do.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 21 Mar 2014, at 18:24, Ken Irwin <kir...@wittenberg.edu> wrote:

> Hi folks,
>
> I'm looking for a tool that can look at a list of all of subject terms in a
> poorly-controlled index as possible candidates for term consolidation. Our
> student newspaper index has about 16,000 subject terms and they include a lot
> of meaningless typographical and nomenclatural difference, e.g.:
>
> Irwin, Ken
> Irwin, Kenneth
> Irwin, Mr. Kenneth
> Irwin, Kenneth R.
>
> Basketball - Women
> Basketball - Women's
> Basketball-Women
> Basketball-Women's
>
> I would love to have some sort of pattern-matching tool that's smart about
> this sort of thing that could go through the list of terms (as a text list,
> database, xml file, or whatever structure it wants to ingest) and spit out
> some clusters of possible matches.
>
> Does anyone know of a tool that's good for that sort of thing?
>
> The index is just a bunch of MySQL tables - there is no real controlled-vocab
> system, though I've recently built some systems to suggest known SH's to
> reduce this sort of redundancy.
>
> Any ideas?
>
> Thanks!
> Ken