We need a crawler for all the web pages outside our CMS, but one
crucial feature seems to be missing from many of them: a way to detect
changes in these documents. Say you have run a daily crawler job for
two months, looking for new web pages to crawl in order to keep the
Solr index updated. Then suddenly a lot of pages are changed or
deleted, and now you have an outdated Solr index.
In other words, we need to detect removed web pages and send a delete
command to Solr, and we need to detect web pages that have been
modified in order to update the Solr index.
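On the Solr side the operations themselves are straightforward. Here is
a minimal SolrJ sketch of what we would trigger; the "id" and "content"
field names are just assumptions about the schema, and committing after
every document is only for brevity:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexUpdater {

    private final SolrServer server;

    public SolrIndexUpdater(String solrUrl) throws Exception {
        this.server = new CommonsHttpSolrServer(solrUrl);
    }

    // Re-index a modified page, assuming the URL is the uniqueKey.
    public void update(String url, String content) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url);
        doc.addField("content", content);
        server.add(doc);
        server.commit();
    }

    // Delete a page that no longer exists on the web server.
    public void delete(String url) throws Exception {
        server.deleteById(url);
        server.commit();
    }
}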
It seems to me that the Aperture web crawler is the only one with such
features. Its crawler handler has methods for modified and removed documents:
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
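To make it concrete, this is roughly how I picture wiring that handler
to the Solr calls above. The method signatures are taken from my reading
of the Aperture javadocs and may not be exact, so treat this as an
untested sketch:

import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.crawler.Crawler;
import org.semanticdesktop.aperture.crawler.CrawlerHandler;

// Abstract so that the remaining CrawlerHandler methods (crawlStarted,
// crawlStopped, accessingObject, the clear* methods, etc.) can be
// filled in elsewhere.
public abstract class SolrCrawlerHandler implements CrawlerHandler {

    private final SolrIndexUpdater updater; // the SolrJ helper above

    public SolrCrawlerHandler(SolrIndexUpdater updater) {
        this.updater = updater;
    }

    public void objectNew(Crawler crawler, DataObject object) {
        // A page we have not seen before: extract its text
        // (e.g. with Aperture's extractors) and index it.
    }

    public void objectChanged(Crawler crawler, DataObject object) {
        // The page exists but its content has changed: re-index it.
    }

    public void objectNotModified(Crawler crawler, String url) {
        // Nothing to do; the Solr document is still up to date.
    }

    public void objectRemoved(Crawler crawler, String url) {
        // The page is gone from the web server: delete it from Solr.
        try {
            updater.delete(url);
        } catch (Exception e) {
            // Log and keep crawling; one failed delete should not
            // abort the whole job.
        }
    }
}

As far as I understand, Aperture persists its crawl state (AccessData)
between runs, which is what makes the changed/removed detection
possible in the first place.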
Or is it possible to do similar things with other crawlers, such as
Nutch?
Many thanks in advance for all kinds of suggestions!
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050