Hi Eric,

I am planning to work on detecting such anomalies. So far I have thought about the following approaches:

- n-gram analysis
- basket analysis
- similarity detection with Solr
- finite state automata

The tools I will use are Apache Solr and Apache Spark; I haven't started the implementation yet. Two rough sketches of what I have in mind follow below.
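For the n-gram analysis, my plan (roughly) is to build a character n-gram frequency profile over all values of a field and flag values whose n-grams are rare in the corpus. A minimal plain-Python sketch of the idea — the publisher values, the n-gram size, and any cutoff are made-up placeholders, and the real version would run as a Spark job:

    import math
    from collections import Counter

    def char_ngrams(value, n=3):
        # pad with spaces so word boundaries show up as n-grams too
        padded = " " + value.lower() + " "
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def train(values, n=3):
        counts = Counter()
        for value in values:
            counts.update(char_ngrams(value, n))
        return counts, sum(counts.values())

    def score(value, counts, total, n=3):
        # mean log-probability of the value's n-grams, with add-one
        # smoothing so unseen n-grams do not zero the score out
        grams = char_ngrams(value, n)
        if not grams:
            return float("-inf")
        logs = [math.log((counts[g] + 1.0) / (total + len(counts)))
                for g in grams]
        return sum(logs) / len(logs)

    publishers = ["South Bend, IN", "South Bend, Ind.",
                  "South Bend", "S0uth B3nd,,;"]
    counts, total = train(publishers)
    for p in publishers:
        print(round(score(p, counts, total), 2), p)

Values with unusually low scores (like the garbled last one) would become the candidates for review.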
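Regarding OpenRefine: its fingerprint keying is simple enough to reimplement, which would take care of the "not really programmable" part. As far as I understand it from their documentation, the key function folds each value to ASCII, lowercases it, strips punctuation, and sorts the unique tokens, so punctuation and word-order variants collapse onto the same key. A sketch, again with made-up sample values:

    import re
    import unicodedata
    from collections import defaultdict

    def fingerprint(value):
        # fold accented characters to plain ASCII,
        # e.g. "Göttingen" -> "gottingen"
        folded = unicodedata.normalize("NFKD", value)
        folded = folded.encode("ascii", "ignore").decode("ascii")
        folded = folded.strip().lower()
        folded = re.sub(r"[^\w\s]", " ", folded)  # punctuation -> space
        tokens = sorted(set(folded.split()))      # unique, order-independent
        return " ".join(tokens)

    def clusters(values):
        # group values that share a fingerprint key
        groups = defaultdict(list)
        for value in values:
            groups[fingerprint(value)].append(value)
        return [group for group in groups.values() if len(group) > 1]

    places = ["South Bend;", "South Bend, IN",
              "south bend", "South Bend, Ind."]
    print(clusters(places))  # [['South Bend;', 'south bend']]

Note that "South Bend, IN" and "South Bend, Ind." still get different keys, so variants like those would need the nearest-neighbor techniques from the page you linked.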
Best,
Péter

2017-10-25 17:57 GMT+02:00 Eric Lease Morgan <emor...@nd.edu>:

> Has anybody here played with any clustering techniques for normalizing
> bibliographic data?
>
> My bibliographic data is fraught with inconsistencies. For example, a
> publisher’s name may be recorded one way, another way, or a third way.
> The same goes for things like publisher place: South Bend; South Bend,
> IN; South Bend, Ind. And then there is the ISBD punctuation that is
> sometimes applied and sometimes not. All of these inconsistencies make
> indexing & faceted browsing more difficult than it needs to be.
>
> OpenRefine is a really good program for finding these inconsistencies
> and then normalizing them. OpenRefine calls this process “clustering”,
> and it points to a nice page describing the various clustering
> processes. [1] Some of the techniques included “fingerprinting” and
> calculating “nearest neighbors”. Unfortunately, OpenRefine is not
> really programmable, and I’d like to automate much of this process.
>
> Does anybody here have any experience automating the process of
> normalizing bibliographic (MARC) data?
>
> [1] about clustering - http://bit.ly/2izQarE
>
> —
> Eric Morgan

--
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly