I'm using Docsplit (http://documentcloud.github.com/docsplit/) because of its Ruby bindings. It falls back to OCR when it cannot extract the text directly, but it also requires you to install several other (open source) packages. Results seem fine to me so far.
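
For reference, a minimal sketch of how I call it from Ruby. It assumes the docsplit gem and its external dependencies are installed; the helper names (expected_text_path, extract_text) and the file paths are only illustrative and not part of Docsplit itself.

```ruby
# Sketch only: assumes the docsplit gem and its dependencies are installed.
require "fileutils"

# Docsplit writes the text next to the PDF's basename,
# e.g. paper.pdf -> <out_dir>/paper.txt (helper name is hypothetical).
def expected_text_path(pdf_path, out_dir)
  File.join(out_dir, File.basename(pdf_path, ".*") + ".txt")
end

# Extract the text of one PDF and return it as a string.
def extract_text(pdf_path, out_dir)
  require "docsplit" # loaded lazily so the helper above works without the gem
  FileUtils.mkdir_p(out_dir)
  # By default Docsplit tries direct extraction first and uses OCR
  # only if no text is found; pass :ocr => false to disable OCR.
  Docsplit.extract_text(pdf_path, :output => out_dir)
  File.read(expected_text_path(pdf_path, out_dir))
end

# Example call (hypothetical paths):
#   text = extract_text("harvested/paper.pdf", "storage/text")
```

Disabling OCR should help with throughput on large batches, at the cost of getting nothing back for scanned documents.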

Best,
Andreas

On 21.06.2011 16:23, Owen Stephens wrote:
The CORE project at The Open University in the UK is doing some work on finding 
similarity between papers in institutional repositories (see 
http://core-project.kmi.open.ac.uk/ for more info).  The first step in the 
process is extracting text from the (mainly) pdf documents harvested from 
repositories.

We've tried iText but had issues with quality.
We moved to PDFBox but are having performance issues.

Any other suggestions/experience?

Thanks,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936
