OpenOffice documents are in fact zip files. If you unzip one, you get a bunch of XML files that contain the content, along with some files defining the styles.
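For example, opening one of your files with plain java.util.zip shows the XML parts directly (just a quick sketch; the file name is taken from your URLs below, adjust as needed):

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Lists the entries inside an OpenDocument file: you should see
// content.xml, styles.xml, meta.xml, etc.
public class ListOdtEntries {
  public static void main(String[] args) throws Exception {
    ZipFile zip = new ZipFile("softwarelivre.odt");
    try {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        System.out.println(entries.nextElement().getName());
      }
    } finally {
      zip.close();
    }
  }
}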

You need to enable the "parse-oo" plugin in the plugin.includes property of your conf/nutch-site.xml, e.g.:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|oo)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|recommended</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
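If Nutch still reports the content as application/zip after that, it may also be worth checking that conf/parse-plugins.xml maps the OpenOffice MIME types to parse-oo. A sketch of what such entries look like (your file probably already contains most of this, and the exact MIME types your Nutch version detects may differ):

<!-- conf/parse-plugins.xml (excerpt) -->
<parse-plugins>
  <!-- OpenDocument text (.odt) -->
  <mimeType name="application/vnd.oasis.opendocument.text">
    <plugin id="parse-oo" />
  </mimeType>
  <!-- Legacy OpenOffice.org 1.x / StarOffice Writer (.sxw) -->
  <mimeType name="application/vnd.sun.xml.writer">
    <plugin id="parse-oo" />
  </mimeType>
</parse-plugins>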

On Aug 20, 2008, at 12:50 AM, Alexandre Haguiar wrote:

Hi,

I am trying to create an index of a few ODT files, but Nutch identifies the ODT files as ZIP content type. Can someone help me figure out what is wrong with my
configuration XML?

Thanks

Alexandre Haguiar


Error parsing: http://localhost/arquivos/testOO.sxw:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/testOO.sxw
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/softwarelivre.odt:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/softwarelivre.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/ODF.odt:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/ODF.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

crawl-urlfilter.txt
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/arquivos

nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>SERPRO Busca</value>
  <description>Sistema de Busca do SERPRO
  </description>
</property>
 <property>
  <name>http.agent.description</name>
  <value>SERPRO Spiderman</value>
  <description>SERPRO spiderman
  </description>
</property>
 <property>
  <name>http.agent.url</name>
  <value>http://localhost/nutch</value>
  <description>http://localhost/nutch
  </description>
</property>
 <property>
  <name>http.agent.email</name>
  <value>Email</value>
  <description>[EMAIL PROTECTED]
  </description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|pdf|xml|msword|odt)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
</configuration>

--
Alexandre Haguiar
