On 10/15/13 11:45 AM, Eric Lease Morgan wrote:
On Oct 14, 2013, at 7:56 AM, Nicolas Franck <[email protected]> wrote:

Could this also be done by Apache Tika? Or am I missing a crucial point?

http://tika.apache.org/1.4/gettingstarted.html


Nicolas, this looks VERY promising! It can seemingly extract the OCR text from a PDF
document as well as the text from a Word document. More experimenting to do,
but thank you. code4lib++  --Eric Morgan

In case they are of use to anyone, here are links I've collected over the years (some may be dead) to other tools that can extract text from a vector PDF (as opposed to a raster one that still needs to be OCR'd):

* pdfx: http://pdfx.cs.man.ac.uk/

* LA-PDFText: https://code.google.com/p/lapdftext/

* pdf2htmlEX: https://github.com/coolwanglu/pdf2htmlEX

* Apache PDFBox: http://pdfbox.apache.org/

* pdf2txt.py, part of PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/

* pdftotext (part of xpdf)

See also the list at http://scholrev.org/hackathon/ and this discussion of using Jade, Gemini, and Adobe Acrobat to extract text from a PDF: http://www.ncbi.nlm.nih.gov/books/NBK61837/ .
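
For anyone who wants to script one of the tools above, here is a minimal sketch in Python that shells out to pdftotext. It assumes pdftotext is installed and on the PATH. The function names and the 20-character threshold are made up for illustration, and the "almost no text came back, so it is probably a raster scan" check is only a rough heuristic, not something pdftotext itself reports.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for a pdftotext call that writes to stdout.

    "-enc UTF-8" requests UTF-8 output, "-layout" tries to preserve the
    physical page layout, and a text-file argument of "-" sends the
    extracted text to stdout instead of a file.
    """
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        command.append("-layout")
    command += [str(pdf_path), "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Return the text pdftotext extracts, or None if the tool is missing."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def probably_needs_ocr(text, min_chars=20):
    """Heuristic (an assumption): a PDF that yields almost no text is
    likely a raster scan that still has to be OCR'd."""
    return len(text.strip()) < min_chars
```

Calling extract_text("paper.pdf") (a hypothetical file name) returns the text as one string. If probably_needs_ocr() is true for that string, the file is likely a scan and none of the tools in the list will get much out of it without an OCR step first.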

--Kevin
