On Nov 21, 2022, at 5:12 PM, RD B <[email protected]> wrote:
> We (Kelvin Smith Library, Case Western Reserve University) are considering
> the ProQuest TDM Studio:
>
> https://about.proquest.com/en/products-services/TDM-Studio/
>
> I was curious if anyone here had any direct experience with the system they
> could share, or if there were alternatives that the community recommends
> and why.
>
> --
> R. David Beales - [email protected] - 732-299-0390
> Library, Earth, Sol System, Orion-Cygnus Arm of the Milky Way Galaxy,
> Laniakea Supercluster
A couple of years ago I experimented with TDM Studio, and I can report that it
worked as advertised.
More specifically, Studio worked like the handful of similar services: one from
LexisNexis, one from JSTOR, and one from the HathiTrust. What does that
mean? It means a person:
1. searches the given collection
2. subsets the results to a secure location
3. analyzes the results using tools and APIs
provided by the vendor
4. exports the results of the analysis
5. repeats until done
Many times the tools and APIs require a working knowledge of the Python
programming language, and then there is the curve of learning the specific
tools. The tools usually include a number of modeling techniques: bibliography
creation, ngram analysis, topic modeling, and full text searching. After
working in this area for more than a few years now, I believe these techniques
ought to be considered rudimentary, and additional techniques such as the
application of grammars, semantic indexing, and collocations ought to be included.
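To give a sense of what "rudimentary" means here, ngram analysis can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the code of any vendor's toolkit; the function names (tokenize, ngrams, top_ngrams) and the sample sentence are made up for the example, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (naive)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(text, n=2, k=5):
    """Count the n-grams in a text and return the k most frequent."""
    return Counter(ngrams(tokenize(text), n)).most_common(k)


if __name__ == "__main__":
    # Hypothetical sample text, used only for illustration.
    sample = "the quick brown fox jumps over the lazy dog and the quick brown cat"
    for gram, count in top_ngrams(sample, n=2, k=3):
        print(" ".join(gram), count)
```

Raw counts like these are only a first step. Collocation measures, for example those in NLTK's collocations module, go further by scoring how much more often words co-occur than chance would predict.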
All of the vendors have their hands tied by contract and copyright. Each vendor
has made agreements with publishers not to freely share content, but it is not
possible to do text mining, natural language processing, or data science with
words sans the content. Consequently, each vendor implements a variation on Step
#2, above. The process would be a h3ll of a lot easier if the student,
researcher, or scholar could:
1. search content
2. select items of interest
3. download selected items sans click,
save, click, save, click, save, etc.
4. use a wide variety of GUI tools,
command-line tools, or programming
languages to do the analysis
Here licensing is probably the limiting factor, not copyright.
Do I know of open source alternatives? No, not really, but I hope my Reader
addresses some of these problems. Given a set of files of just about any number
and just about any ilk and saved in a local folder/directory, the Reader:
* converts the files to plain text
* does all sorts of feature extraction against the result
* distills the features into a data set (a "study carrel")
* provides the means to compute against the data set, and
the computing could be done with GUI tools (like OpenRefine
or AntConc), command-line tools (like grep or jq), or
programming libraries (like Python's NLTK or spaCy)
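The general idea of distilling a folder of files into a data set can be sketched as follows. This is an illustration only, not the Reader's actual code or the real layout of a study carrel; it assumes the files have already been converted to plain text with a .txt extension, and the function names (extract_features, distill) and the output file name are hypothetical.

```python
import json
import re
from collections import Counter
from pathlib import Path


def extract_features(text, k=3):
    """Return a few simple features of a text: size, vocabulary, top words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {
        "words": len(tokens),
        "unique": len(counts),
        "top": [word for word, _ in counts.most_common(k)],
    }


def distill(directory):
    """Build one row of features for each .txt file in a directory."""
    rows = []
    for path in sorted(Path(directory).glob("*.txt")):
        row = {"file": path.name}
        row.update(extract_features(path.read_text(encoding="utf-8")))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Write the rows as JSON so other tools (jq, OpenRefine) can read them.
    rows = distill(".")
    Path("features.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")
```

Once the features are in a plain, structured file, the analysis is no longer tied to any one tool, which is the point of the last bullet above.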
In the end the Reader supports all of the modeling techniques alluded to above
as well as a few others. Consequently, a person can search any vendor for
content of interest, download the results (through click, click, click), and do
analysis against the result.
Like all software, the Reader is never done and ought to be considered
beta-ware. See:
https://distantreader.org
HTH
P.S. David, nice signature.
--
Eric Lease Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame
574/631-8604
https://cds.library.nd.edu