Crawling PDFs

paddz Thu, 10 Jan 2013 04:19:15 -0800

Hi there,

I am having some problems parsing PDFs. I have a website to crawl that
includes some links to PDF files. My problem is that Nutch is not
recognizing these links as PDF files.


The links are just simple output links (http://XYZ/output/4366) with no
file extension, and this seems to be the problem. If I rebuild the links
with a .pdf extension, Nutch crawls them, but that is not really an option
for me.
Is there another solution, or do I just have an error in my config
elsewhere? I would have bet that Nutch can detect PDFs whether or not they
have a file extension.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Crawling-PDFs-tp4032174.html
Sent from the Nutch - User mailing list archive at Nabble.com.
