Re: [CODE4LIB] pdf2txt

Kevin Hawkins Wed, 16 Oct 2013 09:49:06 -0700

On 10/15/13 11:45 AM, Eric Lease Morgan wrote:

On Oct 14, 2013, at 7:56 AM, Nicolas Franck <[email protected]> wrote:

Could this also be done by Apache Tika? Or am I missing a crucial point?

http://tika.apache.org/1.4/gettingstarted.html



Nicolas, this looks VERY promising! It seemingly can extract the OCR from a PDF 
document as well as extract the text from a Word document. More experimenting
is needed, but thank you. code4lib++  --Eric Morgan

In case they are of use to anyone, here are links I've collected over the years (some may be dead) to other tools that include the capability to extract text from a vector PDF (not a raster one that still needs to be OCR'd):


* pdfx: http://pdfx.cs.man.ac.uk/

* LA-PDFText: https://code.google.com/p/lapdftext/

* pdf2htmlEX: https://github.com/coolwanglu/pdf2htmlEX

* Apache PDFBox: http://pdfbox.apache.org/

* pdf2txt.py, part of PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/


* pdftotext (part of xpdf)
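
For anyone who wants to script one of these, here is a minimal sketch of calling pdftotext from Python. It assumes the pdftotext binary is installed and on the PATH, and it relies on the documented convention that passing "-" as the output file sends the text to stdout. The function names and the example file name are made up for illustration; they are not part of xpdf.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the argument list for pdftotext, sending the text to stdout ("-")."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout
        command.append("-layout")
    command.extend([pdf_path, "-"])
    return command


def extract_text(pdf_path: str, keep_layout: bool = False) -> Optional[str]:
    """Return the extracted text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext reports a failure
    )
    # Decode leniently, since the output encoding depends on local configuration
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "example.pdf" is a hypothetical file name
    text = extract_text("example.pdf")
    print(text if text is not None else "pdftotext not found on PATH")
```

As noted above, this only helps with vector PDFs; a raster scan with no text layer will yield little or nothing until it has been OCR'd.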

See also the list at http://scholrev.org/hackathon/ and this discussion of using Jade, Gemini, and Adobe Acrobat to extract text from a PDF: http://www.ncbi.nlm.nih.gov/books/NBK61837/ .


--Kevin
