Take a look at Apache ManifoldCF (incubating, close to 0.1 release):

http://incubator.apache.org/connectors/

In addition to a fairly sophisticated general web crawler that maintains the state of crawled web pages, it has a file system crawler and crawlers for a variety of document repositories. It also has an output connector that sends documents and delete requests to Solr Cell.

-- Jack Krupansky

-----Original Message----- From: Erlend Garåsen
Sent: Wednesday, January 19, 2011 4:29 AM
To: solr-user@lucene.apache.org
Subject: How to keep a maintained index with crawled data


We need a crawler for all web pages outside our CMS, but one crucial
feature seems to be missing from many of them - a way to detect changes
in these documents. Say you have run a daily crawler job for two months,
looking for new web pages to crawl in order to keep the Solr index
updated. But suddenly a lot of pages were either changed or deleted, and
now you have an outdated Solr index.

In other words, we need to detect removed web pages and trigger a delete
command to Solr. We also need to detect web pages which have been
modified in order to update the Solr index.
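(To make the delete side concrete: assuming Solr's standard XML update handler at its default URL, the trigger can be as small as this Python sketch. The endpoint, function names, and document ids are illustrative, not taken from any particular crawler.)

```python
# Minimal sketch (assumed endpoint, illustrative ids) of sending a
# delete command to Solr's XML update handler once a crawler reports
# that a page has disappeared.
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # assumed default URL

def delete_command(doc_id):
    """Build an XML delete-by-id command for a removed page."""
    return ("<delete><id>%s</id></delete>" % escape(doc_id)).encode("utf-8")

def post_update(body):
    """POST one update command (delete, commit, ...) to Solr."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL, data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"})
    urllib.request.urlopen(req).read()

# Usage (requires a running Solr instance):
#   post_update(delete_command("http://example.org/removed-page"))
#   post_update(b"<commit/>")
```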

It seems to me that the Aperture web crawler is the only one with such
features. Its crawler handler has methods for modified and removed documents:
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
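(The shape of such a handler is roughly the following - a hypothetical Python sketch, not Aperture's actual Java API; the class and method names are assumptions made up for illustration.)

```python
# Hypothetical sketch of a crawler handler with callbacks for modified
# and removed documents, in the spirit of Aperture's crawler handler.
# Not Aperture's real (Java) API; all names here are illustrative.

class SolrSyncHandler:
    """Collects index operations to replay against Solr after a crawl."""

    def __init__(self):
        self.pending = []  # list of (operation, url) tuples

    def object_changed(self, url, content):
        # A known page changed: queue a re-index of the new content.
        self.pending.append(("update", url))

    def object_removed(self, url):
        # A known page vanished: queue a delete so it leaves the index.
        self.pending.append(("delete", url))

handler = SolrSyncHandler()
handler.object_changed("http://example.org/a", "<html>new</html>")
handler.object_removed("http://example.org/b")
```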

Or is it possible to do similar things with the other crawlers such as
Nutch?

Many thanks in advance for all kinds of suggestions!

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
