Hi everyone,
I'm kind of a n00b to Nutch, so here are a few questions for you to answer (or for
your amusement).

1. During a Nutch crawl and subsequent re-crawls, does the crawler always pick up
new links on a page, or does it only check the ones it already knows about?

For example, if I set 20 as the limit on the number of links per page and 5 as the
depth, the first crawl gets me 20 links from a page. What does a subsequent crawl
of the same page get me? Does it just check whether the first 20 links have been
crawled, or does it also pick up new links?
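For reference, here is the kind of invocation I mean (a sketch using the one-shot crawl tool; the exact flags may differ by Nutch version, and `urls` is the directory holding my seed file):

```shell
# Hypothetical example of the crawl described above:
# depth 5, at most 20 pages fetched per round (-topN)
bin/nutch crawl urls -dir crawldb -depth 5 -topN 20
```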

2. I know you cannot re-index a page that has already been crawled. Yet I cannot
figure out why links that were crawled earlier, and that I then put back in with
certain changes to their metadata, show changes in the index (I am picking up the
content, description, and title). I have set the maximum time between subsequent
re-fetches to 1 day.
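In case it matters, this is how I understand the re-fetch interval would be set in `nutch-site.xml` (an assumption on my part; in some versions the property is named `db.default.fetch.interval` and is given in days rather than seconds):

```xml
<!-- Sketch: re-fetch pages after 1 day (86400 seconds) -->
<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
</property>
```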

3. I am using patch 963 for deleting 404 pages, yet only a few get deleted from
the index. Is it because the pages were initially picked up through a normal
crawl, while I am forcing the links that need to be deleted into url.txt?


Thanks and Regards,
Tamanjit Bindra
