Hi,

Please open a ticket, I'll test it.
Cheers,

On Wednesday 26 January 2011 18:12:51 Claudio Martella wrote:
> Today I had a look at the code and wrote this class. It works here on my
> test cluster.
>
> It scans the crawldb for entries carrying the STATUS_DB_GONE and it
> issues a delete to Solr for those entries.
>
> Is that what you guys have in mind? Should I file a JIRA?
>
> On 1/24/11 10:26 AM, Markus Jelsma wrote:
> > Each item in the CrawlDB carries a status field. Reading the CrawlDB will
> > return this information as well; the same goes for a complete dump, with
> > which you could create the appropriate delete statements for your Solr
> > instance.
> >
> >     /** Page no longer exists. */
> >     public static final byte STATUS_DB_GONE = 0x03;
> >
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> >
> >> Where is that information stored? It could then easily be used to issue
> >> deletes on Solr.
> >>
> >> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> >>> Nutch can detect 404s by recrawling existing URLs. The mutation,
> >>> however, is not pushed to Solr at the moment.
> >>>
> >>>> As far as I know, Nutch can only discover new URLs to crawl and send
> >>>> the parsed content to Solr. But what about maintaining the index? Say
> >>>> you have a daily Nutch script that fetches/parses the web and
> >>>> updates the Solr index. After one month, several web pages have been
> >>>> modified and some have also been deleted. In other words, the Solr
> >>>> index is out of sync.
> >>>>
> >>>> Is it possible to detect such changes in order to send update/delete
> >>>> commands to Solr?
> >>>>
> >>>> It looks like the Aperture crawler has a workaround for this, since the
> >>>> crawler handler has methods such as objectChanged(...):
> >>>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> >>>>
> >>>> Erlend

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
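P.S. The approach discussed above (scan the crawldb for STATUS_DB_GONE entries and delete the matching documents from Solr) can be sketched roughly as follows. This is only an illustration: the class and method names (GoneDetector, collectGoneUrls) are hypothetical, and the crawldb is modeled here as a plain in-memory Map of URL to status byte, whereas Claudio's actual class would read CrawlDatum records from the crawldb and push the deletes through the Solr client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of the idea from this thread: filter crawldb entries
 * whose status is STATUS_DB_GONE and collect the URLs whose documents
 * should be removed from the Solr index. The crawldb is modeled as a
 * Map of URL -> status byte so the filtering logic runs standalone.
 */
public class GoneDetector {

    /** Page no longer exists (value from CrawlDatum in branch-1.3). */
    public static final byte STATUS_DB_GONE = 0x03;

    /** Return the URLs whose crawl status marks the page as gone. */
    public static List<String> collectGoneUrls(Map<String, Byte> crawlDb) {
        List<String> gone = new ArrayList<>();
        for (Map.Entry<String, Byte> entry : crawlDb.entrySet()) {
            if (entry.getValue() == STATUS_DB_GONE) {
                gone.add(entry.getKey());
            }
        }
        return gone;
    }

    public static void main(String[] args) {
        Map<String, Byte> crawlDb = Map.of(
                "http://example.org/alive", (byte) 0x02,
                "http://example.org/removed", STATUS_DB_GONE);
        // Each collected URL would become a delete request against the
        // Solr instance (e.g. a delete-by-id on the document's key).
        for (String url : collectGoneUrls(crawlDb)) {
            System.out.println("delete: " + url);
        }
    }
}
```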