Very neat. I couldn't get the 'network diagram' link to work (from http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=search&id=1381506693&query=public%20library). How hard do you think it would be to do stemming before some of the subsequent processing? The bi-grams "public libraries" and "public library" are usually the same thing.
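To illustrate the stemming idea, here is a minimal Python sketch. It assumes nothing about how PDF2TXT is actually implemented. The names crude_stem and stemmed_bigrams are hypothetical, and the plural-stripping rule is a deliberately rough stand-in for a real stemmer such as Porter's.

```python
import re
from collections import Counter


def crude_stem(word):
    """Rough plural stripping; a hypothetical stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"  # libraries -> library
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]  # books -> book
    return word


def stemmed_bigrams(text):
    """Lowercase and tokenize the text, stem each token, then count adjacent pairs."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    return Counter(zip(tokens, tokens[1:]))


sample = "The public library is open. Public libraries are everywhere."
counts = stemmed_bigrams(sample)
print(counts[("public", "library")])  # both variants fall under one bi-gram
```

Stemming the tokens before pairing them is what lets "public libraries" and "public library" land in the same bucket. A production stemmer would handle many more suffixes than this sketch does.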

Peter

On Oct 11, 2013, at 11:16 AM, Eric Lease Morgan <emor...@nd.edu> wrote:

> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT -- http://bit.ly/1bJRyh8
>
> PDF2TXT extracts the text from an OCRed PDF document and then does some
> rudimentary "distant reading" against the text in the form of word clouds,
> readability scores, concordance features, and "maps" (histograms)
> illustrating where terms appear in a text.
>
> Here is the idea behind the application:
>
>   1. In the Libraries I see people scanning, scanning, and
>      scanning. I suppose these people then go home and read the
>      document. They might even print it. These documents are long.
>      Moreover, I'll bet they have multiple documents.
>
>   2. Text mining requires digitized text, but PDF documents are
>      frequently full of formatting. At the same time, they often
>      have the text underneath. Our scanning software does OCR.
>
>   3. By extracting the text from PDF documents, I can facilitate
>      a different -- additional -- type of analysis against sets of
>      one or more documents. PDF2TXT is the first step in this
>      process.
>
> What is really cool is that PDF2TXT works for many of the articles
> downloadable from the Libraries's article indexes. Search an article index.
> Download a full text, PDF version of the article. Feed it to PDF2TXT. Get
> more out of your article.
>
> PDF2TXT currently has "creeping featuritis" -- meaning that it is growing in
> weird directions. Your feedback is more than welcome. (I know. The output is
> ugly.) Also, please be gentle with it because it does not process things the
> size of the Bible.
>
> --
> Eric Lease Morgan
> Digital Initiatives Librarian
>
> University of Notre Dame
> Room 131, Hesburgh Libraries
> Notre Dame, IN 46556
> o: 574-631-8604
> e: emor...@nd.edu

--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
peter.mur...@lyrasis.org
+1 678-235-2955
800.999.8558 x2955