Re: Nutch Parsing PDFs, and general PDF extraction

John X Tue, 28 Feb 2006 23:19:29 -0800

On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote:
> Thanks for the help. I don't know what happened, but it is working now.
> Did any other contributors read what I sent about parsing PDFs?
> I don't think Nutch is capable of this, based on the text stripper code
> in parse-pdf.
>  
> http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pdf/f1040.pdf+irs+1040+pdf&hl=en&gl=us&ct=clnk&cd=1
>  
>  
> It's time to implement some real PDF parsing technology.
> Any other takers?


Nutch is about search, and it relies on third-party libraries
to extract text from various MIME types, including application/pdf.
Whether Nutch can correctly extract text from a PDF file largely
depends on the PDF parsing library it uses, currently PDFBox.
It would not be very difficult to switch to another library.
However, it seems hard to find a free/open implementation
that can parse every PDF file in the wild. There is an alternative:
use Nutch's parse-ext with a command-line PDF parser/converter,
which can be any executable.
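
To illustrate the general idea of delegating extraction to an external
executable, here is a minimal Java sketch. It is not parse-ext's actual code
or configuration. The class name ExternalPdfText, the "{in}" placeholder, and
the choice of pdftotext (from Xpdf/Poppler) as the converter are assumptions
made for the example; any converter that takes an input path and prints text
to standard output should fit the same shape.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Nutch: hands raw bytes to an external
// command-line converter and returns whatever text it prints.
final class ExternalPdfText {

    // commandTemplate is the converter's argv; every "{in}" in it is replaced
    // with the path of a temp file holding the content (an assumed convention).
    public static String extract(byte[] content, List<String> commandTemplate,
                                 long timeoutSeconds)
            throws IOException, InterruptedException {
        Path in = Files.createTempFile("extract-", ".pdf");
        Path out = Files.createTempFile("extract-", ".txt");
        try {
            Files.write(in, content);

            List<String> command = new ArrayList<>();
            for (String arg : commandTemplate) {
                command.add(arg.replace("{in}", in.toString()));
            }

            // Send stdout to a file so a chatty converter cannot fill a pipe
            // and block while we wait for it to exit.
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();

            // Kill converters that hang on malformed input.
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("converter timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("converter exited with code " + p.exitValue());
            }

            // Assumes the converter emits UTF-8; adjust for other encodings.
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        // "pdftotext <file> -" writes the extracted text to standard output.
        System.out.print(extract(pdf, Arrays.asList("pdftotext", "{in}", "-"), 30));
    }
}
```

Running the converter in a separate process with a timeout means a PDF that
crashes or hangs it only costs that one document, and the converter can be
swapped without touching the Java side.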

John
