On Jan 4, 2024, at 11:26 AM, Alison Clemens <alison.clem...@gmail.com> wrote:

> Has anyone here done text analysis-type work on MARC data, particularly on
> topical subject headings? I work closely with my library's digital
> collections, and I am interested in seeing what kinds of topics (as
> indicated in our descriptive data) are represented in our
> digital collections. So, I have the corresponding MARCXML for the
> materials and have extracted the 650s as a string (e.g., *650 $a World War,
> 1914-1918 $x Territorial questions $v Maps*), but I'm a little stuck on how
> to meaningfully analyze the data. I tried feeding the data into Voyant, but
> I think it's too large of a corpus to run properly there, and regardless,
> the MARC data is (of course) delimited in a specific way.
> 
> Any / all perspectives or experience would be welcome -- please do get in
> touch directly (at alison.clem...@gmail.com), if you'd like.
> 
> --
> Alison Clemens
> Beinecke Rare Book and Manuscript Library, Yale University


The amount of available content, relative to the size of the values in 6xx,
is kinda small; the number of records might be large, but the number of
resulting words is small. That said, I can think of a number of ways such an
analysis can be done. The process can be boiled down to four very broad
steps:

  1) articulating more thoroughly what questions you want to ask of the MARC
  2) distilling the MARC into one or more formats amenable to a given 
modeling/analysis process
  3) modeling/analyzing the data
  4) evaluating the results

For example, suppose you simply wanted to know the frequency of each FAST
subject heading. I would loop through each 6xx field in each MARC record,
extract the given subjects, parse the values into FAST headings, and output
the result to a file. (A sketch of this process follows the example below.)
You will then have a file looking something like this:

  United States
  World War, 1914-1918
  Directories
  Science, Ancient
  Maps
  Librarians
  Origami
  Science, Ancient
  Origami
  Maps
  Philosophy
  Dickens, Charles
  World War, 1914-1918
  Territorial questions
  Maps
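
By way of illustration, the extraction might look something like the Python
sketch below. It assumes the pymarc library, a file of MARCXML named
records.xml, and that the FAST headings are coded with a subfield $2 whose
value is "fast"; all of these names are assumptions, so adjust accordingly:

  #!/usr/bin/env python

  # extract.py - a sketch; loop through MARCXML and output FAST headings
  # usage: python extract.py > headings.txt

  from pymarc import parse_xml_to_array

  # process each record in the given MARCXML file
  for record in parse_xml_to_array('records.xml'):

      # process each topical subject (650) field; broaden the list of
      # tags (600, 651, 655, etc.) as desired
      for field in record.get_fields('650'):

          # keep only the FAST headings
          if 'fast' not in field.get_subfields('2'):
              continue

          # output the heading and each of its subdivisions on its own line
          for value in field.get_subfields('a', 'x', 'v', 'y', 'z'):
              print(value)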

Suppose the file is named headings.txt. You can then sort the list and use
the Linux uniq command to count and tabulate each heading. Pipe the result to
the sort command again, and you will end up with a groovy frequency list. The
command will look something like this:

  cat headings.txt | sort | uniq -c | sort -rn

Here is the result:

   3 Maps
   2 Science, Ancient
   2 Origami
   1 World War, 1914-1918
   1 Territorial questions
   1 Philosophy
   1 Librarians
   1 Directories
   1 Dickens, Charles

Such a process will give you one view of your data. Relatively quick and easy.

Suppose you wanted to extract latent themes from the content of MARC 6xx
fields. This is sometimes called "topic modeling", and MALLET is the
granddaddy of topic modeling tools. Loop through each 6xx field of your MARC
records, extract the headings, and for each record, create a plain text file
containing the data. In the end you will have thousands of tiny plain text
files. You can then turn MALLET against the files, and the result will be a
set of weighted themes -- "topics". For extra credit, consider adding the
values of the 245, 1xx, and 5xx fields to your output. If each plain text
file is associated with a metadata value (such as date, collection, format,
etc.), then the resulting topic model can be pivoted, and you will be able
to observe how the topics compare to the metadata values. For example, you
could answer the question, "For items in these formats, what are the more
frequent topics?" or "How have our subjects ebbed & flowed over time?"
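
Here is a sketch of the file-creation step, again assuming pymarc and a file
of MARCXML named records.xml; the directory name (corpus) is a placeholder,
and the MALLET commands in the trailing comment are only meant to convey the
shape of the work:

  #!/usr/bin/env python

  # corpus.py - a sketch; create one tiny plain text file per MARC record

  import os
  from pymarc import parse_xml_to_array

  # create a directory to hold the corpus
  os.makedirs('corpus', exist_ok=True)

  # process each record in the given MARCXML file
  for i, record in enumerate(parse_xml_to_array('records.xml')):

      # harvest the subject values; for extra credit, harvest the
      # values of 245, 1xx, and 5xx here too
      values = []
      for field in record.get_fields('600', '610', '650', '651', '655'):
          values.extend(field.get_subfields('a', 'x', 'v', 'y', 'z'))

      # save the result to a file of its own
      with open(os.path.join('corpus', '%06d.txt' % i), 'w') as handle:
          handle.write('\n'.join(values))

  # the resulting directory can then be fed to MALLET with commands along
  # these lines:
  #
  #   bin/mallet import-dir --input corpus --output corpus.mallet \
  #     --keep-sequence
  #   bin/mallet train-topics --input corpus.mallet --num-topics 8 \
  #     --output-topic-keys keys.txt --output-doc-topics topics.txt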
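
And here is the sort of pivoting I have in mind, this time using the pandas
library. It assumes you have already joined MALLET's document-topic output
with your metadata into a CSV file; the file name and the column names
(topic, format, weight) are entirely hypothetical:

  #!/usr/bin/env python

  # pivot.py - a sketch; compare topic weights across a metadata value

  import pandas as pd

  # read the joined topic/metadata data
  data = pd.read_csv('topics-with-metadata.csv')

  # compute the average weight of each topic for each format
  pivot = data.pivot_table(values='weight',
                           index='topic',
                           columns='format',
                           aggfunc='mean')
  print(pivot)

The result is a small table addressing the first question above.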

I do this sort of work all the time; what you are describing is a very large
part of my job. Here in our scholarship center people bring me lots o'
content, and I use processes very much like the ones outlined above to help
them use & understand it.

Fun!

--
Eric Morgan <emor...@nd.edu>
Navari Family Center for Digital Scholarship
University of Notre Dame

574/631-8604
