Re: [CODE4LIB] PDF->text extraction

Andreas Walker Tue, 21 Jun 2011 07:35:38 -0700

I'm using Docsplit (http://documentcloud.github.com/docsplit/), due to its Ruby bindings. It falls back to OCR if it fails to extract the text, but it also requires you to install a bunch of other (open source) software. Results seem fine to me so far.
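
For anyone who wants the same "extract first, OCR only if that fails" behaviour without Docsplit, here is a rough sketch in Python. It is not how Docsplit is implemented; the function names, the injected `ocr` callable and the `min_chars` threshold are all made up for illustration. The only external tool it touches is the `pdftotext` command-line program, which is assumed to be installed and on the PATH.

```python
import shutil
import subprocess
from typing import Callable, Tuple


def pdftotext_extract(pdf_path: str) -> str:
    """Extract embedded text with the `pdftotext` CLI (assumed to be installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_with_ocr_fallback(
    pdf_path: str,
    extract: Callable[[str], str],
    ocr: Callable[[str], str],
    min_chars: int = 20,  # hypothetical threshold for "extraction failed"
) -> Tuple[str, str]:
    """Return (text, method), where method is "direct" or "ocr".

    OCR is only attempted when direct extraction raises an error or
    returns fewer than `min_chars` non-whitespace-trimmed characters.
    """
    try:
        text = extract(pdf_path)
    except (RuntimeError, OSError, subprocess.CalledProcessError):
        text = ""
    if len(text.strip()) >= min_chars:
        return text, "direct"
    return ocr(pdf_path), "ocr"
```

You would call it as `extract_with_ocr_fallback(path, pdftotext_extract, my_ocr)`, where `my_ocr` is whatever OCR wrapper you have available (left out here on purpose). Keeping the two steps as plain callables makes it easy to swap extractors and compare quality or speed on the same set of files.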


Best,
Andreas


On 21.06.2011 at 16:23, Owen Stephens wrote:

The CORE project at The Open University in the UK is doing some work on finding 
similarity between papers in institutional repositories (see 
http://core-project.kmi.open.ac.uk/ for more info).  The first step in the 
process is extracting text from the (mainly) PDF documents harvested from 
repositories.

We've tried iText but had issues with quality.
We moved to PDFBox but are having performance issues.

Any other suggestions/experience?

Thanks,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936
