[CODE4LIB] pdf2txt

Eric Lease Morgan Fri, 11 Oct 2013 08:17:20 -0700

For a limited period of time I am making publicly available a Web-based program 
called PDF2TXT -- http://bit.ly/1bJRyh8


PDF2TXT extracts the text from an OCRed PDF document and then does some 
rudimentary "distant reading" against the text in the form of word clouds, 
readability scores, concordance features, and "maps" (histograms) illustrating 
where terms appear in a text.
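To make those features concrete, here is a minimal sketch of three of them (word counts of the kind that feed a word cloud, a concordance, and a term "map") run against already-extracted plain text. It is not the code behind PDF2TXT; the function names, the tokenizing rule, and the default bin count are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(tokens, top=10):
    """Return the `top` most frequent tokens, the raw material for a word cloud."""
    return Counter(tokens).most_common(top)


def concordance(tokens, term, width=3):
    """Return each occurrence of `term` with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def term_map(tokens, term, bins=10):
    """Count occurrences of `term` in each of `bins` equal slices of the text."""
    counts = [0] * bins
    for i, tok in enumerate(tokens):
        if tok == term:
            # Integer arithmetic keeps the slice index in the range 0..bins-1.
            counts[i * bins // len(tokens)] += 1
    return counts


if __name__ == "__main__":
    sample = "The whale swam. The whale dived; the ship sank."
    tokens = tokenize(sample)
    print(word_counts(tokens, top=3))
    for line in concordance(tokens, "whale", width=2):
        print(line)
    # One row of hash marks per slice of the text: a plain-text histogram.
    for count in term_map(tokens, "whale", bins=3):
        print("#" * count)
```

Each row printed by the last loop stands for one slice of the document, so a term that clusters near the beginning or end of a text shows up as longer rows at the top or bottom.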

Here is the idea behind the application:

  1. In the Libraries I see people scanning, scanning, and
     scanning. I suppose these people then go home and read the
     document. They might even print it. These documents are long.
     Moreover, I'll bet they have multiple documents.

  2. Text mining requires digitized text, but PDF documents are
     frequently full of formatting. At the same time, they often
     have the text underneath. Our scanning software does OCR.

  3. By extracting the text from PDF documents, I can facilitate
     a different -- additional -- type of analysis against sets of
     one or more documents. PDF2TXT is the first step in this
     process.

What is really cool is that PDF2TXT works for many of the articles downloadable 
from the Libraries' article indexes. Search an article index. Download a 
full-text PDF version of the article. Feed it to PDF2TXT. Get more out of your 
article.

PDF2TXT currently has "creeping featuritis" -- meaning that it is growing in 
weird directions. Your feedback is more than welcome. (I know. The output is 
ugly.) Also, please be gentle with it because it does not process things the 
size of the Bible.

--

Eric Lease Morgan
Digital Initiatives Librarian

University of Notre Dame
Room 131, Hesburgh Libraries
Notre Dame, IN 46556
o: 574-631-8604
e: [email protected]
