We need a crawler for all web pages outside our CMS, but one crucial feature seems to be missing from many of them: a way to detect changes in the crawled documents. Say you have run a daily crawler job for two months, looking for new web pages to crawl in order to keep the Solr index updated. Then suddenly a lot of pages were changed or deleted, and now you have an outdated Solr index.

In other words, we need to detect removed web pages and trigger a delete command to Solr. We also need to detect web pages which have been modified in order to update the Solr index.
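To make it concrete, here is roughly what those two operations look like with SolrJ (just a sketch; the Solr URL, core setup and field names below are made up, and I'm assuming the page URL is used as the uniqueKey):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrSync {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL; adjust to the actual setup.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // A page was removed: delete it from the index by its uniqueKey.
        solr.deleteById("http://www.uio.no/removed/page.html");

        // A page was modified: re-adding a document with the same
        // uniqueKey overwrites the old version in the index.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://www.uio.no/changed/page.html");
        doc.addField("content", "re-extracted page text goes here");
        solr.add(doc);

        solr.commit();
    }
}

The hard part is not the Solr side, but getting the crawler to tell us when these two events happen.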

It seems to me that the Aperture web crawler is the only one with such features. Its crawler handler has methods for modified and removed documents:
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
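If Aperture is the way to go, I imagine the wiring would look something like the sketch below. I haven't verified the exact CrawlerHandler signatures, so the callback names and the DataObject accessor are my assumptions based on that wiki page; a real handler would of course implement the full interface, not just these two methods:

import org.apache.solr.client.solrj.SolrServer;
import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.crawler.Crawler;

public class SolrSyncHandler {

    private final SolrServer solr;

    public SolrSyncHandler(SolrServer solr) {
        this.solr = solr;
    }

    // Assumed callback: Aperture reports a previously crawled page as changed.
    public void objectChanged(Crawler crawler, DataObject object) throws Exception {
        String url = object.getID().toString(); // assuming getID() returns the page URI
        // Re-extract the text and re-add it with the same uniqueKey, as in
        // the SolrJ snippet above: build a SolrInputDocument for url, then
        // call solr.add(doc) and solr.commit() so the old version is replaced.
    }

    // Assumed callback: Aperture reports a previously crawled page as gone.
    public void objectRemoved(Crawler crawler, String url) throws Exception {
        solr.deleteById(url);
        solr.commit();
    }
}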

Or is it possible to do similar things with other crawlers, such as Nutch?

Many thanks in advance for any suggestions!

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
