As far as I know, Nutch can only discover new URLs to crawl and send the parsed content to Solr. But what about maintaining the index? Say you have a daily Nutch script that fetches/parses the web and updates the Solr index. After one month, several of those web pages have been modified and some have been deleted. In other words, the Solr index is out of sync.

Is it possible to detect such changes in order to send update/delete commands to Solr?
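
The Solr side seems straightforward once the changes are known. A rough SolrJ sketch of the delete part (untested; the server URL is made up, and it assumes the index uses the page URL as its uniqueKey):

import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrIndexCleanup {

    // Delete every URL the crawler reported as gone (e.g. 404 on recrawl).
    public static void deleteGoneUrls(List<String> goneUrls) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        for (String url : goneUrls) {
            server.deleteById(url); // uniqueKey is assumed to be the page URL
        }
        server.commit(); // make the deletes visible to searchers
    }
}

The hard part is producing that list of gone/changed URLs from the crawl in the first place.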

It looks like the Aperture crawler has a workaround for this, since its crawler handler has methods such as objectChanged(...):
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
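
If I read that wiki page correctly, a handler could push those events straight to Solr. Roughly like this (a sketch only; the callback signatures are from memory of the Aperture javadoc, so double-check them):

import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.crawler.Crawler;
import org.semanticdesktop.aperture.crawler.CrawlerHandler;

// Abstract so the remaining CrawlerHandler callbacks
// (crawlStarted, crawlStopped, etc.) can be omitted here.
public abstract class SolrSyncHandler implements CrawlerHandler {

    public void objectNew(Crawler crawler, DataObject object) {
        // first crawl of this URL -> add the document to Solr
    }

    public void objectChanged(Crawler crawler, DataObject object) {
        // page modified since the last crawl -> re-index it in Solr
    }

    public void objectNotModified(Crawler crawler, String url) {
        // unchanged -> nothing to do
    }

    public void objectRemoved(Crawler crawler, String url) {
        // page gone -> send a delete, e.g. server.deleteById(url)
    }
}

Something along those lines is what I am missing in Nutch.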

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
