I'm using Docsplit (http://documentcloud.github.com/docsplit/) because of
its Ruby bindings. It falls back to OCR when it fails to extract the text,
but it also requires you to install a number of other (open source)
packages. The results have seemed fine to me so far.
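
In case it helps, here is a minimal sketch of how I would call it from Ruby. It assumes the docsplit gem and its dependencies are installed. The helper names (text_path_for, extract_text_from) and the example paths are made up for illustration, and the assumption that the output file is named after the PDF with a .txt extension should be checked against your Docsplit version.

```ruby
require 'fileutils'

# Hypothetical helper: the path where Docsplit is assumed to write the
# extracted text for a given PDF (<output_dir>/<pdf basename>.txt).
def text_path_for(pdf_path, output_dir)
  File.join(output_dir, File.basename(pdf_path, '.*') + '.txt')
end

# Hypothetical helper: extract the text of one PDF and return it as a string.
def extract_text_from(pdf_path, output_dir)
  # Loaded lazily so the rest of the script works without the gem installed.
  require 'docsplit'
  FileUtils.mkdir_p(output_dir)
  # With no :ocr option given, Docsplit tries plain text extraction first
  # and only falls back to OCR if that yields nothing.
  Docsplit.extract_text(pdf_path, :output => output_dir)
  File.read(text_path_for(pdf_path, output_dir))
end
```

Something like extract_text_from('harvest/paper-01.pdf', 'text') would then return the text of that one paper. Passing :ocr => false should skip the OCR fallback, which may matter if speed is the main concern.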
Best,
Andreas
On 21.06.2011 at 16:23, Owen Stephens wrote:
The CORE project at The Open University in the UK is doing some work on finding
similarity between papers in institutional repositories (see
http://core-project.kmi.open.ac.uk/ for more info). The first step in the
process is extracting text from the (mainly) PDF documents harvested from
repositories.
We've tried iText but had issues with quality.
We moved to PDFBox but are having performance issues.
Any other suggestions/experience?
Thanks,
Owen
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936