Hi Eric,

I am planning to work on detecting such anomalies. So far I have thought about the following approaches:

- n-gram analysis
- basket analysis
- similarity detection with Solr
- finite state automata

The tools I will use are Apache Solr and Apache Spark; I haven't started the implementation yet. Two rough sketches of what I have in mind follow below.
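For the n-gram analysis, my plan (roughly) is to build a character n-gram frequency profile over all values of a field and flag values whose n-grams are rare in the corpus. A minimal plain-Python sketch of the idea — the publisher values, the n-gram size, and any cutoff are made-up placeholders, and the real version would run as a Spark job:

    import math
    from collections import Counter

    def char_ngrams(value, n=3):
        # pad with spaces so word boundaries show up as n-grams too
        padded = " " + value.lower() + " "
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def train(values, n=3):
        counts = Counter()
        for value in values:
            counts.update(char_ngrams(value, n))
        return counts, sum(counts.values())

    def score(value, counts, total, n=3):
        # mean log-probability of the value's n-grams, with add-one
        # smoothing so unseen n-grams do not zero the score out
        grams = char_ngrams(value, n)
        if not grams:
            return float("-inf")
        logs = [math.log((counts[g] + 1.0) / (total + len(counts)))
                for g in grams]
        return sum(logs) / len(logs)

    publishers = ["South Bend, IN", "South Bend, Ind.",
                  "South Bend", "S0uth B3nd,,;"]
    counts, total = train(publishers)
    for p in publishers:
        print(round(score(p, counts, total), 2), p)

Values with unusually low scores (like the garbled last one) would become the candidates for review.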
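Regarding OpenRefine: its fingerprint keying is simple enough to reimplement, which would take care of the "not really programmable" part. As far as I understand it from their documentation, the key function folds each value to ASCII, lowercases it, strips punctuation, and sorts the unique tokens, so punctuation and word-order variants collapse onto the same key. A sketch, again with made-up sample values:

    import re
    import unicodedata
    from collections import defaultdict

    def fingerprint(value):
        # fold accented characters to plain ASCII,
        # e.g. "Göttingen" -> "gottingen"
        folded = unicodedata.normalize("NFKD", value)
        folded = folded.encode("ascii", "ignore").decode("ascii")
        folded = folded.strip().lower()
        folded = re.sub(r"[^\w\s]", " ", folded)  # punctuation -> space
        tokens = sorted(set(folded.split()))      # unique, order-independent
        return " ".join(tokens)

    def clusters(values):
        # group values that share a fingerprint key
        groups = defaultdict(list)
        for value in values:
            groups[fingerprint(value)].append(value)
        return [group for group in groups.values() if len(group) > 1]

    places = ["South Bend;", "South Bend, IN",
              "south bend", "South Bend, Ind."]
    print(clusters(places))  # [['South Bend;', 'south bend']]

Note that "South Bend, IN" and "South Bend, Ind." still get different keys, so variants like those would need the nearest-neighbor techniques from the page you linked.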
Best,
Péter

2017-10-25 17:57 GMT+02:00 Eric Lease Morgan <emor...@nd.edu>:

> Has anybody here played with any clustering techniques for normalizing
> bibliographic data?
>
> My bibliographic data is fraught with inconsistencies. For example, a
> publisher’s name may be recorded one way, another way, or a third way.
> The same goes for things like publisher place: South Bend; South Bend,
> IN; South Bend, Ind. And then there is the ISBD punctuation that is
> sometimes applied and sometimes not. All of these inconsistencies make
> indexing & faceted browsing more difficult than it needs to be.
>
> OpenRefine is a really good program for finding these inconsistencies
> and then normalizing them. OpenRefine calls this process “clustering”,
> and it points to a nice page describing the various clustering
> processes. [1] Some of the techniques included “fingerprinting” and
> calculating “nearest neighbors”. Unfortunately, OpenRefine is not
> really programmable, and I’d like to automate much of this process.
>
> Does anybody here have any experience automating the process of
> normalizing bibliographic (MARC) data?
>
> [1] about clustering - http://bit.ly/2izQarE
>
> —
> Eric Morgan

--
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly