On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote:
> thanks for the help. I don't know what happened, but it is working now.
> Did any other contributors read what I sent about parsing PDFs?
> I don't think nutch is capable of this based on the text stripper code
> in parse pdf
>
> http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pdf/f1040.pdf+irs+1040+pdf&hl=en&gl=us&ct=clnk&cd=1
>
> It's time to implement some real pdf parsing technology.
> any other takers?
Nutch is about search; it relies on third-party libraries to extract text from various MIME types, including application/pdf. Whether Nutch can correctly extract text from a PDF file depends largely on the PDF parsing library it uses, currently PDFBox. It would not be very difficult to switch to another library, but it seems hard to find a free/open implementation that can parse every PDF file in the wild.

There is an alternative: use Nutch's parse-ext plugin with a command-line PDF parser/converter, which can be any executable.

John
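
P.S. A rough sketch of what such an external command could look like, in Python. It assumes that parse-ext pipes the fetched content to the command's stdin and takes the extracted text from its stdout (please check the plugin's own configuration before relying on that), and that the pdftotext tool is installed. The names convert, main and DEFAULT_COMMAND are made up for this example.

```python
import os
import subprocess
import sys
import tempfile

# Assumption: pdftotext is on the PATH. Given "-" as the output file,
# it writes the extracted text to stdout.
DEFAULT_COMMAND = ("pdftotext", "-enc", "UTF-8")


def convert(data, command=DEFAULT_COMMAND, timeout=30):
    """Write PDF bytes to a temp file, run the converter, return its text."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The converter is invoked as: <command...> <input-file> -
        result = subprocess.run(
            list(command) + [path, "-"],
            stdout=subprocess.PIPE,
            check=True,        # raise if the converter exits non-zero
            timeout=timeout,   # do not let one bad PDF hang the parse
        )
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.remove(path)


def main():
    """Read a PDF from stdin and write the extracted text to stdout."""
    data = sys.stdin.buffer.read()
    if not data:
        return  # nothing was piped in
    sys.stdout.buffer.write(convert(data).encode("utf-8"))


if __name__ == "__main__":
    main()
```

The converter is passed in as a parameter so that another tool can be swapped in without touching the rest, provided it accepts an input file followed by "-" for stdout.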