Eric Morgan wrote:

>> [I also put this on AUTOCAT. Apologies if you also follow that. This
>> falls at the intersection of hand-cataloging, data processing and
>> simple AI.]...
> 
> Tim, yours is a perfect example of a supervised machine learning 
> classification process. The process works very much like your computer's spam 
> filter. Here's how:
> 
>   1. collect a set of data that you know is
>      library-written
> 
>   2. collect a set of data that you know is
>      publisher-sourced
> 
>   3. count, tabulate, and vectorize the
>      features of your data -- measure the data's
>      characteristics and associate them with
>      a collection
> 
>   4. model the data -- use any one of a number
>      of classification algorithms to associate
>      the data with one collection or another,
>      such as Naive Bayes
> 
>   5. optionally, test the accuracy of the model
> 
>   6. save the model


The crucial part of a supervised machine learning process is the training step, 
and each sub-step can (and probably should) be tweaked given one's particular 
situation. There are a number of things to consider:

  * Identifying correct & accurate sets of training data is difficult. First, 
data often does not fall neatly into distinct categories. While a book may be 
written by a single individual, the book may fall into a number of different 
subjects or genres. Second, the distinction between one category and another 
may be so subtle that even a computer, given a very large set of sample data, 
may not be able to choose between them consistently. Third, binary 
classification is easy (spam versus ham). Classification into a flat list of 
categories is not too difficult. But hierarchical classification is very 
difficult.

  * Measuring the data -- counting, tabulating, and vectorizing -- is fraught 
with nuance. For example, what are you going to count? Individual words? 
Phrases? Numbers? Will you exclude stop words? Are you going to stem the 
features? Maybe you will lemmatize the words? Maybe you will do neither. Will 
you merely count and tabulate the words, or will you use something like the 
TF-IDF algorithm to create a more "relevant" list of words and scores? To what 
degree will you test the accuracy of the data, and if to a high degree, then 
what technique will you use? (A sketch of this step, combined with the 
modeling step, follows this list.)

  * Modeling the data -- This is the "magic happens here" step. What algorithm 
are you going to use, and how are you going to parameterize it? Your choices 
will depend on many things, such as the size & scope of the data, whether the 
data is numeric or not, the desire for a true/false classification or a degree 
of certainty, the size & scope of your computer(s), the degree of real 
distinctiveness of the different data sets, etc. Entire dissertations are 
written on this topic.
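
For what it's worth, here is a minimal sketch of the measuring and modeling 
steps, assuming Python and the scikit-learn library. The directory names and 
the choice of TF-IDF plus Naive Bayes are merely illustrative; the actual 
train.py in the zip file may very well differ:

  # train.py -- a sketch: learn to distinguish two directories of text files
  from pathlib import Path
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import accuracy_score
  import joblib

  # read each file and remember which collection it came from
  texts, labels = [], []
  for label in ('library', 'publisher'):
      for path in Path(label).glob('*.txt'):
          texts.append(path.read_text(encoding='utf-8', errors='ignore'))
          labels.append(label)

  # count & weigh the words (TF-IDF), then model the data with Naive Bayes
  vectorizer = TfidfVectorizer(stop_words='english')
  features = vectorizer.fit_transform(texts)
  x_train, x_test, y_train, y_test = train_test_split(
      features, labels, test_size=0.2)
  model = MultinomialNB().fit(x_train, y_train)

  # optionally test the accuracy of the model, and then save it
  print('accuracy:', accuracy_score(y_test, model.predict(x_test)))
  joblib.dump((vectorizer, model), 'model.bin')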

Not ironically, there are computer processes that help with the writing of 
these sorts of computer programs; there are techniques used to determine which 
of the various combinations -- "turning the knobs" -- are the most efficient. 
Computer programs used to create... machine learning programs. Yikes!!
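
To make that concrete, a grid search is one such technique. Here is a hedged 
sketch, again assuming scikit-learn; the particular knobs and values are only 
examples, and the texts & labels lists are the ones built in the sketch above:

  # a sketch of automated knob turning: grid search over pipeline parameters
  from sklearn.pipeline import Pipeline
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.model_selection import GridSearchCV

  pipeline = Pipeline([('vectorize', TfidfVectorizer()),
                       ('model', MultinomialNB())])
  knobs = {
      'vectorize__stop_words': [None, 'english'],  # keep or drop stop words
      'vectorize__ngram_range': [(1, 1), (1, 2)],  # words versus phrases
      'model__alpha': [0.1, 1.0],                  # smoothing for Naive Bayes
  }

  # texts and labels are the lists built in the previous sketch
  search = GridSearchCV(pipeline, knobs, cv=5).fit(texts, labels)
  print(search.best_params_, search.best_score_)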

When it comes to the use case alluded to in the original posting, this is what 
I would do:

  1) Identify a "large" set of library-written MARC
     records, at least 50.

  2) Identify a similarly large set of publisher-
     sourced MARC records.

  3) Loop through each MARC record, read the 520
     field, and save the result as a file in a
     directory named "library" or a directory named
     "publisher", accordingly.

  4) Run train.py against the directories.

  5) Identify a set of MARC records which contain
     values in the 520 field.

  6) Loop through each of these additional records,
     read the 520 field, and save the result as a
     file in a directory called, say, "unclassified".

  7) Run classify.py against the unclassified
     directory.

  8) The result will be a list of labels/filenames
     -- classifications. (A sketch of these steps
     follows this list.)
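
As a point of reference, steps 3 through 8 might look something like the 
sketch below. It assumes Python, the pymarc library, and the model saved by 
the earlier sketch; the file names are purely illustrative, and the real 
train.py and classify.py in the zip file may do things differently:

  # a sketch of steps 3 and 6: harvest 520 fields into per-label directories
  from pathlib import Path
  from pymarc import MARCReader

  def harvest(marc_file, directory):
      Path(directory).mkdir(exist_ok=True)
      with open(marc_file, 'rb') as handle:
          for index, record in enumerate(MARCReader(handle)):
              notes = [field.value() for field in record.get_fields('520')]
              if notes:
                  Path(directory, f'{index:06d}.txt').write_text(' '.join(notes))

  harvest('library.mrc', 'library')
  harvest('publisher.mrc', 'publisher')
  harvest('unclassified.mrc', 'unclassified')

  # a sketch of steps 7 and 8: classify the "unclassified" directory and
  # output a list of labels/filenames
  import joblib
  vectorizer, model = joblib.load('model.bin')
  for path in sorted(Path('unclassified').glob('*.txt')):
      label = model.predict(vectorizer.transform([path.read_text()]))[0]
      print(label, path.name)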

You will then want to repeat the whole process for the purposes of "turning the 
knobs". For example:

  * increase the size of your datasets but keep
    them similarly sized; not as easy as you might
    think

  * use different techniques to measure your data

  * use different modeling algorithms (see the
    sketch below)
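
For example, turning the second and third of those knobs might look something 
like this, again assuming scikit-learn and the texts & labels lists from the 
first sketch:

  # a sketch of comparing different measurements and different models
  from sklearn.pipeline import make_pipeline
  from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
  from sklearn.naive_bayes import MultinomialNB
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  # texts and labels are the lists built in the first sketch
  for vectorizer in (CountVectorizer(), TfidfVectorizer()):
      for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
          pipeline = make_pipeline(vectorizer, model)
          score = cross_val_score(pipeline, texts, labels, cv=5).mean()
          print(type(vectorizer).__name__, type(model).__name__, round(score, 3))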

What is really cool about this whole process is that it is immensely scalable. 
For example, one could classify a whole set of documents, and one could feel 
okay about the result. Then, a year later, given more expertise and additional 
sets of data, the process could be tweaked, and the whole lot could be 
re-classified. The computer doesn't care about touching each item more than 
once; it will touch each item as many times as you tell it to. Yes, there is a 
lot of work up front, and the work requires additional skills, but the result 
can definitely supplement & enhance the work that is already being done. 

We, as a profession, need to go beyond the use of computers to merely automate 
things. We need -- ought -- to learn how to exploit computers to really & truly 
take advantage of their ability to store vast amounts of data, organize it into 
information, widely share the information, consume ("read") the information, 
analyze the information, and output knowledge which is then verified by a 
person as true, useful, relevant, understandable, etc.

(Again, the whole lot of this posting has been saved in a zip file temporarily 
accessible at http://dh.crc.nd.edu/tmp/classification.zip)

--
Eric Lease Morgan, Librarian
