As far as I know, Nutch can only discover new URLs to crawl and send the parsed content to Solr. But what about maintaining the index? Say you have a daily Nutch script that fetches/parses the web and updates the Solr index. After one month, several of those web pages have been modified and some have been deleted. In other words, the Solr index is out of sync.

Is it possible to detect such changes in order to send update/delete commands to Solr?
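
The Solr side seems straightforward once the changes are known. A rough SolrJ sketch of the delete part (untested; the server URL is made up, and it assumes the index uses the page URL as its uniqueKey):

import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrIndexCleanup {

    // Delete every URL the crawler reported as gone (e.g. 404 on recrawl).
    public static void deleteGoneUrls(List<String> goneUrls) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        for (String url : goneUrls) {
            server.deleteById(url); // uniqueKey is assumed to be the page URL
        }
        server.commit(); // make the deletes visible to searchers
    }
}

The hard part is producing that list of gone/changed URLs from the crawl in the first place.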

It looks like the Aperture crawler has a workaround for this, since its crawler handler has methods such as objectChanged(...):
http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
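
If I read that wiki page correctly, a handler could push those events straight to Solr. Roughly like this (a sketch only; the callback signatures are from memory of the Aperture javadoc, so double-check them):

import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.crawler.Crawler;
import org.semanticdesktop.aperture.crawler.CrawlerHandler;

// Abstract so the remaining CrawlerHandler callbacks
// (crawlStarted, crawlStopped, etc.) can be omitted here.
public abstract class SolrSyncHandler implements CrawlerHandler {

    public void objectNew(Crawler crawler, DataObject object) {
        // first crawl of this URL -> add the document to Solr
    }

    public void objectChanged(Crawler crawler, DataObject object) {
        // page modified since the last crawl -> re-index it in Solr
    }

    public void objectNotModified(Crawler crawler, String url) {
        // unchanged -> nothing to do
    }

    public void objectRemoved(Crawler crawler, String url) {
        // page gone -> send a delete, e.g. server.deleteById(url)
    }
}

Something along those lines is what I am missing in Nutch.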

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
