Hi,

I am trying to create an index with a few ODT files, but Nutch identifies the ODT
files as ZIP content type. Can someone help me find what is wrong with my
configuration XML?

Thanks

Alexandre Haguiar


Error parsing: http://localhost/arquivos/testOO.sxw:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/testOO.sxw
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/softwarelivre.odt:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/softwarelivre.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/ODF.odt:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/ODF.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

crawl-urlfilter.txt
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break
loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/arquivos

nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>SERPRO Busca</value>
  <description>Sistema de Busca do SERPRO
  </description>
</property>
 <property>
  <name>http.agent.description</name>
  <value>SERPRO Spiderman</value>
  <description>SERPRO spiderman
  </description>
</property>
 <property>
  <name>http.agent.url</name>
  <value>http://localhost/nutch </value>
  <description>http://localhost/nutch
  </description>
</property>
 <property>
  <name>http.agent.email</name>
  <value>Email</value>
  <description>[EMAIL PROTECTED]
  </description>
</property>
<property>
        <name>plugin.includes</name>
 
<value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|pdf|xml|msword|odt)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

</property>
  <property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
truncated;
  otherwise, no truncation at all.
  </description>
</property>
</configuration>

-- 
Alexandre Haguiar

Reply via email to