Re: Script to extract text from PDF files

Scott Werner Fri, 06 Nov 2015 14:32:18 -0800

On Tuesday, September 25, 2007 at 1:41:56 PM UTC-4, brad wrote:
> I have a very crude Python script that extracts text from some (and I 
> emphasize some) PDF documents. On many PDF docs, I cannot extract text, 
> but this is because I'm doing something wrong. The PDF spec is large and 
> complex and there are various ways in which to store and encode text. I 
> wanted to post here and ask if anyone is interested in helping make the 
> script better which means it should accurately extract text from most 
> any pdf file... not just some.
> 
> I know the topic of reading/extracting the text from a PDF document 
> natively in Python comes up every now and then on comp.lang.python... 
> I've posted about it in the past myself. After searching for other 
> solutions, I've resorted to attempting this on my own in my spare time. 
> Using apps external to Python (pdftotext, etc.) is not really an option 
> for me. If someone knows of a free native Python app that does this now, 
> let me know and I'll use that instead!
> 
> So, if other more experienced programmers are interested in helping make 
> the script better, please let me know. I can host a website and the 
> latest revision and do all of the grunt work.
> 
> Thanks,
> 
> Brad


As mentioned before, extracting plain text from a PDF document can be hit or 
miss. I have tried all of the following free/open-source applications on Arch 
Linux. In each case I ran the command with subprocess and either captured 
stdout or read the plain-text file the application created.

* textract (uses pdftotext)
- https://github.com/deanmalmgren/textract

* pdftotext 
- http://poppler.freedesktop.org/
- cmd: pdftotext -layout "/path/to/document.pdf" -
- cmd: pdftotext "/path/to/document.pdf" -

* Calibre
- http://calibre-ebook.com/
- cmd: ebook-convert "/path/to/document.pdf" "/path/to/plain.txt" --no-chapters-in-toc

* AbiWord
- http://www.abiword.org/
- cmd: abiword --to-name=fd://1 --to-TXT "/path/to/document.pdf"

* Apache Tika
- https://tika.apache.org/
- cmd: "/usr/bin/java" -jar "/path/to/standalone/tika-app-1.10.jar" --text-main "/path/to/document.pdf"
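
A minimal sketch of how the commands above might be driven from Python with 
subprocess, capturing stdout. It assumes Python 3.5+ and that the tool is on 
PATH. The helper names (build_pdftotext_cmd, run_and_capture), the 60-second 
timeout, and the choice to decode as UTF-8 are illustrative assumptions, not 
the code I actually used.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, layout=False):
    """Build the pdftotext argument list; the trailing '-' sends text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def run_and_capture(cmd, timeout=60):
    """Run an external extractor and return whatever it wrote to stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the tool hangs (timeout is an assumed value).
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    # Assumption: the tool emits UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

For example, run_and_capture(build_pdftotext_cmd("/path/to/document.pdf", 
layout=True)) would return the extracted text as a string. Passing the command 
as a list avoids shell quoting problems with paths that contain spaces.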

For my application, I saw the best results with Apache Tika. However, I still 
encounter strange encoding or extraction issues, e.g. "S P A C E D  O U T  
H E A D E R S" and "\nBroken \nHeader\n". I ended up writing a lot of 
repairing/cleaning methods.
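
To give an idea of what such cleaning can look like, here is a rough sketch 
for the two symptoms quoted above. These are not my actual methods. The 
function names and both heuristics are assumptions: a run of three or more 
single letters separated by single spaces is treated as one spaced-out word, 
and a trailing space before a newline is treated as a soft line break. Either 
rule can misfire on legitimate text, so treat it as a starting point only.

```python
import re

# Three or more single letters separated by single spaces, e.g. "O U T".
_SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] ){2,}[^\W\d_](?!\S)")
_MULTI_SPACE = re.compile(r"[ \t]{2,}")
# Trailing whitespace before a newline that is followed by more text.
_SOFT_BREAK = re.compile(r"[ \t]+\n(?=\S)")


def collapse_spaced_letters(text):
    """Turn 'S P A C E D  O U T' into 'SPACED OUT', line by line."""
    fixed_lines = []
    for line in text.split("\n"):
        collapsed, count = _SPACED_RUN.subn(
            lambda m: m.group(0).replace(" ", ""), line
        )
        if count:
            # Only squeeze the wide gaps on lines that were actually repaired.
            collapsed = _MULTI_SPACE.sub(" ", collapsed)
        fixed_lines.append(collapsed)
    return "\n".join(fixed_lines)


def join_soft_breaks(text):
    """Join 'Broken \\nHeader' into 'Broken Header' (heuristic, see above)."""
    return _SOFT_BREAK.sub(" ", text)


def clean_text(text):
    """Apply both repairs in sequence."""
    return join_soft_breaks(collapse_spaced_letters(text))
```

The multiple-space squeeze is limited to repaired lines so that column 
alignment from tools such as pdftotext -layout is left alone elsewhere.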

I would welcome an improved solution with some intelligence, such as comparing 
the order of the extracted plain text against an OCR snapshot of the PDF page.
-- 
https://mail.python.org/mailman/listinfo/python-list
