OpenOffice documents are in fact zip files. If you unzip one, you get a bunch of XML files that contain the content, along with some files defining the styles.
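For example, opening one of your files with plain java.util.zip shows the XML parts directly (just a quick sketch; the file name is taken from your URLs below, adjust as needed):

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Lists the entries inside an OpenDocument file: you should see
// content.xml, styles.xml, meta.xml, etc.
public class ListOdtEntries {
  public static void main(String[] args) throws Exception {
    ZipFile zip = new ZipFile("softwarelivre.odt");
    try {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        System.out.println(entries.nextElement().getName());
      }
    } finally {
      zip.close();
    }
  }
}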

You need to enable the "parse-oo" plugin in the plugin.includes property of your conf/nutch-site.xml, e.g.:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|oo)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|recommended</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
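If Nutch still reports the content as application/zip after that, it may also be worth checking that conf/parse-plugins.xml maps the OpenOffice MIME types to parse-oo. A sketch of what such entries look like (your file probably already contains most of this, and the exact MIME types your Nutch version detects may differ):

<!-- conf/parse-plugins.xml (excerpt) -->
<parse-plugins>
  <!-- OpenDocument text (.odt) -->
  <mimeType name="application/vnd.oasis.opendocument.text">
    <plugin id="parse-oo" />
  </mimeType>
  <!-- Legacy OpenOffice.org 1.x / StarOffice Writer (.sxw) -->
  <mimeType name="application/vnd.sun.xml.writer">
    <plugin id="parse-oo" />
  </mimeType>
</parse-plugins>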

On Aug 20, 2008, at 12:50 AM, Alexandre Haguiar wrote:

Hi,

I am trying to create an index of a few ODT files, but Nutch identifies the ODT files as ZIP content type. Can someone help me figure out what is wrong with my
configuration XML?

Thanks

Alexandre Haguiar


Error parsing: http://localhost/arquivos/testOO.sxw:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/testOO.sxw
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/softwarelivre.odt:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/softwarelivre.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

Error parsing: http://localhost/arquivos/ODF.odt:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/zip url=http://localhost/arquivos/ODF.odt
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)

crawl-urlfilter.txt
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/arquivos

nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>SERPRO Busca</value>
  <description>Sistema de Busca do SERPRO
  </description>
</property>
 <property>
  <name>http.agent.description</name>
  <value>SERPRO Spiderman</value>
  <description>SERPRO spiderman
  </description>
</property>
 <property>
  <name>http.agent.url</name>
  <value>http://localhost/nutch</value>
  <description>http://localhost/nutch
  </description>
</property>
 <property>
  <name>http.agent.email</name>
  <value>Email</value>
  <description>[EMAIL PROTECTED]
  </description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|pdf|xml|msword|odt)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
</configuration>

--
Alexandre Haguiar
