> Hi everyone,
> I am kind of a n00b to nutch. So here are a few questions for you to answer
> (or your amusement)
> 
> 1. During a nutch crawl and subsequent crawls, does the crawler always pick
> up new links on a page, or does it just check for old ones?
> 
> For e.g., if I set 20 as the limit on the number of links per page and 5 as
> the depth, the first crawl gets me 20 links on a page. What does a subsequent
> crawl of the same page get me? Does it just check the first 20 links
> and see if they have been crawled, or does it get me new links?

That depends on your settings. By default Nutch will pick up new outlinks; 
check your config for db.update.additions.allowed. If you run the crawl cycle 
enough times, everything should be crawled at some point.
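For reference, here is a minimal sketch of that property as you would set it in conf/nutch-site.xml (the value shown is the shipped default; the surrounding configuration element is omitted):

```xml
<!-- Controls whether the updatedb step adds newly discovered
     outlinks to the CrawlDb. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, the updatedb job adds links newly discovered
  during fetching to the CrawlDb; if false, only already-known URLs
  are updated and no new ones are injected.</description>
</property>
```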

> 
> 2. I know you cannot re-index a page that has once been crawled. Yet I
> cannot figure out why links that were crawled earlier, and have since had
> certain changes in their meta-data, do not show any change in the index (I
> am picking up content, description and title). I have set the max time
> between subsequent re-fetches to 1 day.

I don't know about Nutch as a search server, but re-indexing should not be a 
problem, right?
> 
> 3. I am using patch 963 for deleting 404 pages, yet only a few get deleted
> from the index. Is it because the pages were initially picked up through a
> normal crawl, while I am forcing the links that need to be deleted into
> url.txt?

This patch only applies when using a Solr backend (at the moment). So you're 
using Solr? Then you can re-index as much as you like. The solrclean job will 
simply delete ALL documents with a GONE status (404). A document can only get 
that status if the fetcher downloads the URL and finds a 404. Only then is the 
status updated, and only then can you clean them.
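For what it's worth, on a 1.x checkout the clean job can be invoked roughly like this (the CrawlDb path and Solr URL are placeholders for your own setup, so adjust accordingly):

```shell
# Run the solrclean job: removes from the Solr index all documents
# whose CrawlDb status is GONE (e.g. pages re-fetched as 404).
bin/nutch solrclean crawl/crawldb http://localhost:8983/solr/
```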

If you have a list of URLs to be deleted, you'd simply do a Solr delete (if 
you're actually using Solr, which is not really clear).
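Assuming you are on Solr, a manual delete is a single update request. A sketch (the Solr URL and the example document id are placeholders; with the default Nutch schema the document id is the page URL):

```shell
# Delete one document by id and commit immediately.
curl 'http://localhost:8983/solr/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<delete><id>http://example.com/gone-page.html</id></delete>'
```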

> 
> 
> Thanks and Regards,
> Tamanjit Bindra
