Hi, I tried to crawl with the PDF plugin included, but it doesn't seem to work. Does anyone know what the problem could be?
nutch-site.xml is

..
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
..

seems to be included:

060313 134732 parsing: /home/../plugins/parse-pdf/plugin.xml
060313 134732 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.pdf.PdfParser

but:

060313 134822 fetch okay, but can't parse http://www.uni-koeln.de/uni/map.html, reason: failed(2,203): Content-Type not application/pdf:
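As a sanity check on the plugin.includes expression quoted above, here is a minimal sketch in Python (not Nutch code). It assumes the expression has to match the whole plugin directory name, as the property description suggests; the helper name is_included and the sample directory names other than parse-pdf are made up for illustration.

```python
import re

# Value of plugin.includes, copied from the nutch-site.xml in this post.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|"
    "language-identifier"
)


def is_included(plugin_dir: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole directory name matches the expression.

    Assumption: the name must match in full, not merely contain a match.
    """
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    # Hypothetical directory names, used only to exercise the expression.
    for name in ("parse-pdf", "parse-html", "parse-msword", "protocol-http"):
        print(name, is_included(name))
```

Under that assumption parse-pdf is matched, which agrees with the log lines showing the plugin being loaded, so the include expression itself does not look like the cause of the parse failure.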
