Hi,
I tried to crawl with the PDF plugin (parse-pdf) included, but it doesn't seem to work.
Does anyone know what the problem could be?

My nutch-site.xml contains:
..
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
..
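
To double-check the expression itself, here is a minimal sketch of how it would select plugin directory names. It assumes the whole directory name has to match the regular expression; the class and method names are made up for illustration and are not Nutch code.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Value copied from the plugin.includes property above.
    static final String PLUGIN_INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|"
        + "parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|"
        + "language-identifier";

    // Assumption: a plugin is included only if its full directory name
    // matches the expression (matches(), not find()).
    static boolean isIncluded(String pluginDir) {
        return Pattern.compile(PLUGIN_INCLUDES).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] dirs = {"parse-pdf", "parse-html", "parse-msword", "protocol-http"};
        for (String dir : dirs) {
            System.out.println(dir + " -> " + isIncluded(dir));
        }
    }
}
```

Under that assumption parse-pdf is selected by the expression, so the value itself does not look like the cause.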

The plugin seems to be loaded:
060313 134732 parsing: /home/../plugins/parse-pdf/plugin.xml
060313 134732 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.pdf.PdfParser

But parsing then fails:
060313 134822 fetch okay, but can't parse http://www.uni-koeln.de/uni/map.html, reason: failed(2,203): Content-Type not application/pdf:

