Hi there, I am having some problems parsing PDFs. I have a website to crawl that includes some links to PDF files. My problem is that Nutch does not recognize these links as PDF files.
The links are plain output links (http://XYZ/output/4366) with no file extension, and this seems to be the problem: if I rebuild the links with a .pdf extension, Nutch crawls them, but that is not really an option for me. Is there another solution, or do I just have an error elsewhere in my config? I would have bet that Nutch can detect PDFs whether or not they have a file extension.
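To make clear what I mean by extension-independent detection, here is a minimal sketch in Python. It is not Nutch code and says nothing about how Nutch actually decides. The function names are hypothetical, and it assumes only that a PDF body starts with the bytes `%PDF-` and that the server may send a Content-Type header.

```python
from typing import Optional

# Signature found at the very start of a PDF file.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the fetched bytes begin with the PDF signature."""
    return content.startswith(PDF_MAGIC)


def guess_type(url: str, content: bytes,
               content_type_header: Optional[str] = None) -> str:
    """Guess a MIME type without relying on the URL extension alone.

    Order of checks (an assumption made for this sketch):
    1. magic bytes in the content,
    2. the Content-Type header, if any,
    3. the URL extension as a last resort.
    """
    if looks_like_pdf(content):
        return "application/pdf"
    if content_type_header:
        # Drop parameters such as "; charset=utf-8".
        return content_type_header.split(";")[0].strip().lower()
    if url.lower().endswith(".pdf"):
        return "application/pdf"
    return "application/octet-stream"
```

With this kind of check, a link like http://XYZ/output/4366 would be treated as a PDF as soon as the fetched bytes start with `%PDF-`, whatever the URL looks like.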

