Re: Can Nucth detect modified and deleted URLs?

Markus Jelsma Wed, 26 Jan 2011 12:21:37 -0800

See this post in a recent thread:
http://search.lucidimagination.com/search/document/5b7ba8a6fc5e0305/few_questions_from_a_newbie


> This is default behaviour. If pages are scheduled for fetching they will
> show up in the next segment. If you index that segment the old document in
> Solr is overwritten.
> 
> > But we also need to detect modified documents in order to trigger an
> > update command to Solr (an improvement of SolrIndexer). I was planning
> > to open a Jira issue on this missing functionality this week.
> > 
> > Erlend
> > 
> > On 26.01.11 18.12, Claudio Martella wrote:
> > > Today I had a look at the code and wrote this class. It works here on
> > > my test cluster.
> > > 
> > > It scans the crawldb for entries carrying the STATUS_DB_GONE and it
> > > issues a delete to solr for those entries.
> > > 
> > > Is that what you guys have in mind? Should i file a JIRA?
> > > 
> > > On 1/24/11 10:26 AM, Markus Jelsma wrote:
> > >> Each item in the CrawlDB carries a status field. Reading the CrawlDB
> > >> will return this information as well, the same goes for a complete
> > >> dump with which you could create the appropriate delete statements
> > >> for your Solr instance.
> > >> 
> > >> 51       /** Page no longer exists. */
> > >> 52       public static final byte STATUS_DB_GONE = 0x03;
> > >> 
> > >> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/ap
> > >> ac he/nutch/crawl/CrawlDatum.java?view=markup
> > >> 
> > >>> Where is that information stored? it could be then easily used to
> > >>> issue deletes on solr.
> > >>> 
> > >>> On 1/23/11 10:32 PM, Markus Jelsma wrote:
> > >>>> Nutch can detect 404's by recrawling existing URL's. The mutation,
> > >>>> however, is not pushed to Solr at the moment.
> > >>>> 
> > >>>>> As far as I know, Nutch can only discover new URLs to crawl and
> > >>>>> send the parsed content to Solr. But what about maintaining the
> > >>>>> index? Say that you have a daily Nutch script that fetches/parses
> > >>>>> the web and updates the Solr index. After one month, several web
> > >>>>> pages have been modified and some have also been deleted. In other
> > >>>>> words, the Solr index is out of sync.
> > >>>>> 
> > >>>>> Is it possible to detect such changes in order to send
> > >>>>> update/delete commands to Solr?
> > >>>>> 
> > >>>>> It looks like the Aperture crawler has a workaround for this since
> > >>>>> the crawler handler have methods such as objectChanged(...):
> > >>>>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
> > >>>>> 
> > >>>>> Erlend

Re: Can Nucth detect modified and deleted URLs?

Reply via email to