Re: [CODE4LIB] pdf2txt

Peter Murray Fri, 11 Oct 2013 08:57:28 -0700

Very neat.  I couldn't get the 'network diagram' link to work (from 
http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=search&id=1381506693&query=public%20library).
  How hard do you think it would be to do stemming before some of the 
subsequent processing?  The bi-grams "public libraries" and "public library" 
usually mean the same thing.
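
A minimal sketch of the idea in Python, assuming nothing about how PDF2TXT 
is actually implemented. The names naive_stem and stemmed_bigrams are 
hypothetical, and the stemmer is a toy plural-folder rather than a real 
algorithm such as Porter's; it only shows that stemming before counting 
makes the two bi-grams collapse into one.

```python
import re
from collections import Counter


def naive_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm): folds simple plurals."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"  # libraries -> library
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]  # books -> book
    return word


def stemmed_bigrams(text):
    """Lowercase, tokenize on letters, stem each token, then count adjacent pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(token) for token in tokens]
    return Counter(zip(stems, stems[1:]))


# Made-up sample text, not output from PDF2TXT.
sample = (
    "The public library is open. Public libraries lend books; "
    "a public library lends ideas."
)
counts = stemmed_bigrams(sample)
print(counts[("public", "library")])  # singular and plural forms counted together
```

A real stemmer would handle far more cases (and make its own mistakes, such 
as conflating unrelated words), so the toy rules here are only a stand-in.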



Peter

On Oct 11, 2013, at 11:16 AM, Eric Lease Morgan <[email protected]> wrote:

> 
> For a limited period of time I am making publicly available a Web-based 
> program called PDF2TXT -- http://bit.ly/1bJRyh8
> 
> PDF2TXT extracts the text from an OCRed PDF document and then does some 
> rudimentary "distant reading" against the text in the form of word clouds, 
> readability scores, concordance features, and "maps" (histograms) 
> illustrating where terms appear in a text.
> 
> Here is the idea behind the application:
> 
>  1. In the Libraries I see people scanning, scanning, and
>     scanning. I suppose these people then go home and read the
>     document. They might even print it. These documents are long.
>     Moreover, I'll bet they have multiple documents.
> 
>  2. Text mining requires digitized text, but PDF documents are
>     frequently full of formatting. At the same time, they often
>     have the text underneath. Our scanning software does OCR.
> 
>  3. By extracting the text from PDF documents, I can facilitate
>     a different -- additional -- type of analysis against sets of
>     one or more documents. PDF2TXT is the first step in this
>     process.
> 
> What is really cool is that PDF2TXT works for many of the articles 
> downloadable from the Libraries' article indexes. Search an article index. 
> Download a full-text PDF version of the article. Feed it to PDF2TXT. Get 
> more out of your article.
> 
> PDF2TXT currently has "creeping featuritis" -- meaning that it is growing in 
> weird directions. Your feedback is more than welcome. (I know. The output is 
> ugly.) Also, please be gentle with it because it does not process things the 
> size of the Bible.
> 
> --
> 
> Eric Lease Morgan
> Digital Initiatives Librarian
> 
> University of Notre Dame
> Room 131, Hesburgh Libraries
> Notre Dame, IN 46556
> o: 574-631-8604
> e: [email protected]
> 

--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
[email protected]
+1 678-235-2955
800.999.8558 x2955
