Rakesh,
What development has been done so far to enable Nutch to parse PDFs?
Have you read through Tamir's white paper?
Rich
PS. Here are some comments from Ben Litchfield, developer of the open source
PDFBox (Java), followed by some comments from Tamir, who wrote the PDF
extraction algorithm:
I noticed that Nutch seems to have some problems parsing PDFs.
060226 131210 fetch okay, but can't parse
http://www.irs.gov/pub/irs-pdf/p1828.pdf, reason: failed(2,203):
Content-Type not text/html: application/pdf
I am actually working on PDF parsing technology, and have posted the
following me