We need a crawler for all web pages outside our CMS, but one crucial feature seems to be missing from many of them: a way to detect changes in the crawled documents. Say you have run a daily crawler job for two months, looking for new web pages to crawl in order to keep the Solr index updated. Then suddenly a lot of pages were changed or deleted, and now you have an outdated Solr index.

In other words, we need to detect removed web pages and trigger a delete command to Solr. We also need to detect web pages which have been modified in order to update the Solr index.
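To make it concrete, here is roughly what those two operations look like with SolrJ (just a sketch; the Solr URL, core setup and field names below are made up, and I'm assuming the page URL is used as the uniqueKey):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrSync {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL; adjust to the actual setup.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // A page was removed: delete it from the index by its uniqueKey.
        solr.deleteById("http://www.uio.no/removed/page.html");

        // A page was modified: re-adding a document with the same
        // uniqueKey overwrites the old version in the index.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://www.uio.no/changed/page.html");
        doc.addField("content", "re-extracted page text goes here");
        solr.add(doc);

        solr.commit();
    }
}

The hard part is not the Solr side, but getting the crawler to tell us when these two events happen.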

It seems to me that the Aperture web crawler is the only one with such features. Its crawler handler has methods for modified and removed documents:
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
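If Aperture is the way to go, I imagine the wiring would look something like the sketch below. I haven't verified the exact CrawlerHandler signatures, so the callback names and the DataObject accessor are my assumptions based on that wiki page; a real handler would of course implement the full interface, not just these two methods:

import org.apache.solr.client.solrj.SolrServer;
import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.crawler.Crawler;

public class SolrSyncHandler {

    private final SolrServer solr;

    public SolrSyncHandler(SolrServer solr) {
        this.solr = solr;
    }

    // Assumed callback: Aperture reports a previously crawled page as changed.
    public void objectChanged(Crawler crawler, DataObject object) throws Exception {
        String url = object.getID().toString(); // assuming getID() returns the page URI
        // Re-extract the text and re-add it with the same uniqueKey, as in
        // the SolrJ snippet above: build a SolrInputDocument for url, then
        // call solr.add(doc) and solr.commit() so the old version is replaced.
    }

    // Assumed callback: Aperture reports a previously crawled page as gone.
    public void objectRemoved(Crawler crawler, String url) throws Exception {
        solr.deleteById(url);
        solr.commit();
    }
}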

Or is it possible to do similar things with other crawlers, such as Nutch?

Many thanks in advance for any suggestions!

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
