Eric,
Very interesting. I have been working with some existing PDF
utilities, with the goal of automatically extracting the abstract from
technical reports, articles, and dissertations that are to be bulk
uploaded to our institutional repository. I ran two of our documents
through your system, and the first one worked great.
The second tech report I tried, however, generated this error message:
Software error:
No words from which to create a cloud - see add(...). at
/usr/local/share/perl5/HTML/TagCloud/Centred.pm line 229.
For help, please send mail to the webmaster (root@localhost), giving
this error message and the time and date of the error.
However, based on some subsequent messages where you mention Tesseract,
maybe I misunderstood, and your tool only handles PDFs that have already
been OCR'ed. That would explain why the second document (which contains
only page images) fails.
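
For what it's worth, a check along these lines might catch the image-only
case before the cloud step. This is only a sketch in Python, not how
PDF2TXT works internally. The function names and the word pattern are my
own assumptions, and it assumes the text has already been pulled out of
the PDF by some other utility.

```python
import re

# Hypothetical helpers, not part of PDF2TXT: decide whether text extracted
# from a PDF contains any words before trying to build a word cloud.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def extract_words(text):
    """Return the lowercase word tokens found in the extracted text."""
    return [w.lower() for w in WORD_RE.findall(text)]


def has_text_layer(text, min_words=1):
    """True if the extracted text holds at least min_words word tokens."""
    return len(extract_words(text)) >= min_words


def describe(text):
    """Return a short, readable status message instead of a raw software error."""
    if not has_text_layer(text):
        return ("No extractable text; the PDF may contain only page "
                "images and need OCR first.")
    return "Found %d words." % len(extract_words(text))
```

An image-only PDF typically yields an empty string or just whitespace and
form feeds, so describe("\x0c\x0c \n") reports the missing text layer
rather than failing later in the cloud code.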
-Bob Haschart
On 10/11/2013 11:16 AM, Eric Lease Morgan wrote:
For a limited period of time I am making publicly available a Web-based program
called PDF2TXT -- http://bit.ly/1bJRyh8
PDF2TXT extracts the text from an OCRed PDF document and then does some rudimentary "distant
reading" against the text in the form of word clouds, readability scores, concordance
features, and "maps" (histograms) illustrating where terms appear in a text.
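
As a rough illustration of two of these features, word frequencies and a
term "map", here is a minimal Python sketch. It is not the PDF2TXT code.
The function names, the simple tokenizer, and the number of bins are
assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(text, n=3):
    """Return the n most frequent tokens with their counts (word cloud input)."""
    return Counter(tokenize(text)).most_common(n)


def term_map(text, term, bins=4):
    """Count occurrences of term in each of `bins` equal slices of the text.

    The resulting list is a small histogram showing where the term appears.
    """
    tokens = tokenize(text)
    counts = [0] * bins
    if not tokens:
        return counts
    target = term.lower()
    for i, tok in enumerate(tokens):
        if tok == target:
            # Map the token position onto a bin index in 0..bins-1.
            counts[i * bins // len(tokens)] += 1
    return counts


sample = "cat dog cat bird dog cat fish cat"
print(top_words(sample, 2))        # [('cat', 4), ('dog', 2)]
print(term_map(sample, "dog", 4))  # [1, 0, 1, 0]
```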
Here is the idea behind the application:
1. In the Libraries I see people scanning, scanning, and
scanning. I suppose these people then go home and read the
document. They might even print it. These documents are long.
Moreover, I'll bet they have multiple documents.
2. Text mining requires digitized text, but PDF documents are
frequently full of formatting. At the same time, they often
have the text underneath. Our scanning software does OCR.
3. By extracting the text from PDF documents, I can facilitate
a different -- additional -- type of analysis against sets of
one or more documents. PDF2TXT is the first step in this
process.
What is really cool is that PDF2TXT works for many of the articles downloadable
from the Libraries' article indexes. Search an article index. Download a
full-text PDF version of the article. Feed it to PDF2TXT. Get more out of your
article.
PDF2TXT currently has "creeping featuritis" -- meaning that it is growing in
weird directions. Your feedback is more than welcome. (I know. The output is ugly.) Also,
please be gentle with it because it does not process things the size of the Bible.
--
Eric Lease Morgan
Digital Initiatives Librarian
University of Notre Dame
Room 131, Hesburgh Libraries
Notre Dame, IN 46556
o: 574-631-8604
e: [email protected]