RE: FW: Good reading/research on PDF text extraction

2006-02-26 Thread Richard Braman
Rakesh, what work has been done so far to enable Nutch to parse PDFs? Have you read through Tamir's white paper? Rich. PS: Here are some comments from Ben Litchfield, developer of the open source PDFBox (Java), followed by some comments from Tamir, who wrote the PDF extraction algorithm:

FW: Good reading/research on PDF text extraction

2006-02-26 Thread Richard Braman
I noticed that Nutch seems to have some problems parsing PDFs: 060226 131210 fetch okay, but can't parse http://www.irs.gov/pub/irs-pdf/p1828.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf. I am actually working on PDF parsing technology, and have posted the following me