Re: [CODE4LIB] PDF->text extraction

Boheemen, Peter van Tue, 21 Jun 2011 10:38:42 -0700

The most widely used open-source software for this (and for many other MIME 
types) is Apache Tika: http://tika.apache.org/
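
In case it helps, here is a rough sketch of driving the Tika command-line app
from Python to get plain text out of a PDF. The jar location (tika-app.jar)
and the function names are my own placeholders, not something from the CORE
setup; adjust them to wherever the downloaded tika-app jar lives. It assumes
a Java runtime is on the PATH.

```python
import subprocess


def tika_command(pdf_path, tika_jar="tika-app.jar"):
    """Build the argument list for one Tika text-extraction call.

    tika_jar is a hypothetical path; point it at your downloaded tika-app jar.
    """
    # --text asks the Tika app to print the plain-text content to stdout.
    return ["java", "-jar", str(tika_jar), "--text", str(pdf_path)]


def tika_extract_text(pdf_path, tika_jar="tika-app.jar", timeout=120):
    """Return the extracted text of one document, or None on failure."""
    try:
        result = subprocess.run(
            tika_command(pdf_path, tika_jar),
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Treat a crash or a hung document as "no text" and move on.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Starting a JVM per document is slow, so for a large harvest it is probably
worth timing this on a sample before committing to it.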
________________________________________
From: Code for Libraries [[email protected]] on behalf of Bill Janssen 
[[email protected]]
Sent: Tuesday, 21 June 2011 19:19
To: [email protected]
Subject: Re: [CODE4LIB] PDF->text extraction


Owen Stephens <[email protected]> wrote:

> The CORE project at The Open University in the UK is doing some work on 
> finding similarity between papers in institutional repositories (see 
> http://core-project.kmi.open.ac.uk/ for more info).  The first step in the 
> process is extracting text from the (mainly) PDF documents harvested from 
> repositories.
>
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
>
> Any other suggestions/experience?

UpLib uses xpdf's pdftotext, which works well.  There's also code in
UpLib to find similarities between papers :-).
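
For anyone who wants to try pdftotext from a script, a minimal sketch in
Python follows. This is not UpLib's actual code; the function names and the
timeout value are made up for illustration. It assumes the pdftotext binary
from xpdf is installed and on the PATH.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for one pdftotext call."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    # Giving "-" as the text-file argument sends the output to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False, timeout=60):
    """Return the text of one PDF, or None if extraction fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    try:
        result = subprocess.run(
            pdftotext_command(pdf_path, keep_layout),
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Skip damaged or pathological PDFs rather than stalling the batch.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

The timeout is there because a harvested collection usually contains a few
PDFs that make any extractor hang; returning None lets the batch carry on.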

Bill
