OpenOffice documents are in fact zip archives. If you unzip one, you
get a set of XML files that hold the content, plus some files that
define the styles. That is most likely why Nutch reports your ODT
files as application/zip.
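To see the zip structure for yourself, here is a minimal Python sketch. It does not touch Nutch at all. It builds a tiny ODF-like package in memory (the placeholder XML bodies and the function names build_sample_package and describe_package are made up for illustration, and the result is not a valid document) and then lists what is inside, the same way you could inspect a real .odt file.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny ODF-like zip in memory (illustrative only, not a valid document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # ODF packages carry their media type in a plain, uncompressed 'mimetype' entry.
        zf.writestr(
            "mimetype",
            "application/vnd.oasis.opendocument.text",
            compress_type=zipfile.ZIP_STORED,
        )
        # Placeholder bodies: a real document has full XML trees here.
        zf.writestr("content.xml", "<placeholder-content/>")
        zf.writestr("styles.xml", "<placeholder-styles/>")
    return buf.getvalue()


def describe_package(data: bytes):
    """Return the entry names and the declared media type (None if absent)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        mimetype = zf.read("mimetype").decode("ascii") if "mimetype" in names else None
    return names, mimetype


sample = build_sample_package()
names, mimetype = describe_package(sample)
print(names)       # entry names inside the package
print(mimetype)    # media type declared by the package itself
print(sample[:2])  # zip files start with the bytes b"PK"
```

A detector that only looks at the leading bytes sees a zip signature, so something has to look further (at the file extension or inside the archive) to recognize the OpenDocument type.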
You need to enable the "parse-oo" plugin in the plugin.includes
property of your conf/nutch-site.xml. Your current value lists "odt"
where it should list "oo":
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|oo)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|recommended</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
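If you want to check which plugin names a given plugin.includes value lets through, here is a rough Python sketch. It assumes that Nutch matches each plugin directory name in full against the expression, and that Python's re module behaves like Java's regex engine for patterns this simple. Both are assumptions on my part, and is_included is just a helper name made up for this example.

```python
import re

# Value from the reply above (joined onto one line).
REPLY_VALUE = (
    "protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|oo)"
    "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)|recommended"
)

# Value from the nutch-site.xml quoted in the question (joined onto one line).
QUESTION_VALUE = (
    "protocol-(httpclient|file)|urlfilter-(regex)"
    "|parse-(text|html|pdf|xml|msword|odt)|index-(basic)"
    "|query-(basic|site|url)|summary-basic|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id: str, includes: str) -> bool:
    """Assumed semantics: the whole plugin name must match the expression."""
    return re.fullmatch(includes, plugin_id) is not None


oo_in_reply = is_included("parse-oo", REPLY_VALUE)
oo_in_question = is_included("parse-oo", QUESTION_VALUE)
odt_in_question = is_included("parse-odt", QUESTION_VALUE)

print("parse-oo enabled by reply value:", oo_in_reply)
print("parse-oo enabled by question value:", oo_in_question)
print("parse-odt named by question value:", odt_in_question)
```

Under those assumptions, the value from the question names "parse-odt" but never enables "parse-oo", so no OpenOffice parser gets loaded.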
On Aug 20, 2008, at 12:50 AM, Alexandre Haguiar wrote:
Hi,
I am trying to create an index with a few ODT files, but Nutch
identifies the ODT files as ZIP content type. Can someone help me find
what is wrong with my configuration XML?
Thanks
Alexandre Haguiar
Error parsing: http://localhost/arquivos/testOO.sxw:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/testOO.sxw
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
Error parsing: http://localhost/arquivos/softwarelivre.odt:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/softwarelivre.odt
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
Error parsing: http://localhost/arquivos/ODF.odt:
org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://localhost/arquivos/ODF.odt
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
crawl-urlfilter.txt
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/arquivos
nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>SERPRO Busca</value>
<description>Sistema de Busca do SERPRO
</description>
</property>
<property>
<name>http.agent.description</name>
<value>SERPRO Spiderman</value>
<description>SERPRO spiderman
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://localhost/nutch </value>
<description>http://localhost/nutch
</description>
</property>
<property>
<name>http.agent.email</name>
<value>Email</value>
<description>[EMAIL PROTECTED]
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|pdf|xml|msword|odt)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
</description>
</property>
</configuration>
--
Alexandre Haguiar