Hi, I am trying to create an index from a few ODT files, but Nutch identifies the ODT files as the ZIP content type. Can someone help me find what is wrong with my configuration XML?
Thanks,
Alexandre Haguiar

Error parsing: http://localhost/arquivos/testOO.sxw:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/testOO.sxw
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/softwarelivre.odt:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/softwarelivre.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/ODF.odt:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/ODF.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

crawl-urlfilter.txt:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/arquivos

nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file.
-->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>SERPRO Busca</value>
    <description>Sistema de Busca do SERPRO</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>SERPRO Spiderman</value>
    <description>SERPRO spiderman</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://localhost/nutch</value>
    <description>http://localhost/nutch</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>Email</value>
    <description>[EMAIL PROTECTED]</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|pdf|xml|msword|odt)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
  </property>
</configuration>

--
Alexandre Haguiar
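For context on the symptom above: an ODT (and the older .sxw) file is physically a ZIP archive, so byte-based MIME sniffing that only looks at magic numbers reports application/zip. The sketch below is a hedged illustration of that, not Nutch code: it builds a minimal in-memory ZIP (a stand-in for a real ODT; the class name OdtMagic is mine) and prints the archive's first bytes, which are the ZIP local-file-header magic `PK 03 04`.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class OdtMagic {
    public static void main(String[] args) throws Exception {
        // Build a minimal ZIP in memory; a real ODT starts the same way,
        // with a "mimetype" entry naming the OpenDocument type.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry("mimetype"));
            zip.write("application/vnd.oasis.opendocument.text".getBytes("US-ASCII"));
            zip.closeEntry();
        }
        byte[] bytes = buf.toByteArray();
        // ZIP local-file-header magic: 'P' 'K' 0x03 0x04 — this is all a
        // magic-number sniffer sees, hence contentType=application/zip.
        System.out.printf("%c%c %02x %02x%n", bytes[0], bytes[1], bytes[2], bytes[3]);
    }
}
```

So a content-type detector needs a filename/extension or container-aware mapping (not just magic bytes) to tell an ODT apart from a plain ZIP.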

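As a sanity check of the crawl-urlfilter.txt rules quoted above, the two key expressions can be tried directly with java.util.regex (Nutch's urlfilter-regex plugin applies one pattern per line, with a leading + to accept and - to reject). The class name and the repeated-segment sample URL here are mine, for illustration only; the patterns are copied verbatim minus the +/- action character.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    public static void main(String[] args) {
        // "+" rule: accept hosts under localhost/arquivos.
        Pattern accept = Pattern.compile("^http://([a-z0-9]*\\.)*localhost/arquivos");
        // "-" rule: skip URLs with a slash-delimited segment repeated 3+ times.
        Pattern loop = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        System.out.println(accept.matcher("http://localhost/arquivos/ODF.odt").find()); // true
        System.out.println(loop.matcher("http://host/a/b/a/b/a/b/").find());            // true
    }
}
```

Both print true, so the ODT URLs are accepted by the filter; the rejection happens later, at parse time, which matches the stack traces above.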