Hi,

Please open a ticket, I'll test it.
Cheers,

On Wednesday 26 January 2011 18:12:51 Claudio Martella wrote:
> Today I had a look at the code and wrote this class. It works here on my
> test cluster.
>
> It scans the crawldb for entries carrying the STATUS_DB_GONE and it
> issues a delete to Solr for those entries.
>
> Is that what you guys have in mind? Should I file a JIRA?
>
> On 1/24/11 10:26 AM, Markus Jelsma wrote:
> > Each item in the CrawlDB carries a status field. Reading the CrawlDB will
> > return this information as well; the same goes for a complete dump, with
> > which you could create the appropriate delete statements for your Solr
> > instance.
> >
> >     /** Page no longer exists. */
> >     public static final byte STATUS_DB_GONE = 0x03;
> >
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
> >
> >> Where is that information stored? It could then easily be used to issue
> >> deletes on Solr.
> >>
> >> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> >>> Nutch can detect 404s by recrawling existing URLs. The mutation,
> >>> however, is not pushed to Solr at the moment.
> >>>
> >>>> As far as I know, Nutch can only discover new URLs to crawl and send
> >>>> the parsed content to Solr. But what about maintaining the index? Say
> >>>> you have a daily Nutch script that fetches/parses the web and
> >>>> updates the Solr index. After one month, several web pages have been
> >>>> modified and some have also been deleted. In other words, the Solr
> >>>> index is out of sync.
> >>>>
> >>>> Is it possible to detect such changes in order to send update/delete
> >>>> commands to Solr?
> >>>>
> >>>> It looks like the Aperture crawler has a workaround for this, since the
> >>>> crawler handler has methods such as objectChanged(...):
> >>>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> >>>>
> >>>> Erlend

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
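P.S. The approach discussed above (scan the crawldb for STATUS_DB_GONE entries and delete the matching documents from Solr) can be sketched roughly as follows. This is only an illustration: the class and method names (GoneDetector, collectGoneUrls) are hypothetical, and the crawldb is modeled here as a plain in-memory Map of URL to status byte, whereas Claudio's actual class would read CrawlDatum records from the crawldb and push the deletes through the Solr client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of the idea from this thread: filter crawldb entries
 * whose status is STATUS_DB_GONE and collect the URLs whose documents
 * should be removed from the Solr index. The crawldb is modeled as a
 * Map of URL -> status byte so the filtering logic runs standalone.
 */
public class GoneDetector {

    /** Page no longer exists (value from CrawlDatum in branch-1.3). */
    public static final byte STATUS_DB_GONE = 0x03;

    /** Return the URLs whose crawl status marks the page as gone. */
    public static List<String> collectGoneUrls(Map<String, Byte> crawlDb) {
        List<String> gone = new ArrayList<>();
        for (Map.Entry<String, Byte> entry : crawlDb.entrySet()) {
            if (entry.getValue() == STATUS_DB_GONE) {
                gone.add(entry.getKey());
            }
        }
        return gone;
    }

    public static void main(String[] args) {
        Map<String, Byte> crawlDb = Map.of(
                "http://example.org/alive", (byte) 0x02,
                "http://example.org/removed", STATUS_DB_GONE);
        // Each collected URL would become a delete request against the
        // Solr instance (e.g. a delete-by-id on the document's key).
        for (String url : collectGoneUrls(crawlDb)) {
            System.out.println("delete: " + url);
        }
    }
}
```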