Hi, I am quite new to Nutch. I know that I can give the crawler a list of URLs to crawl, but does it also crawl external links referenced by the pages on that list?
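(One thing I have seen mentioned is a db.ignore.external.links property, which sounds like it controls exactly this. I am not certain it exists in the version I am running, so the following override for conf/nutch-site.xml is only an untested sketch of the behavior I am asking about:

    <!-- Untested sketch for conf/nutch-site.xml. Whether the property
         db.ignore.external.links exists in my Nutch release is an
         assumption; if it does, setting it true should drop outlinks
         that point to hosts outside the seed pages. -->
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>

If someone can confirm whether this is available, that would already answer part of my question.)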
Can I limit the crawler to only the pages under the URLs I give it? If so, will that hurt the quality of the search results, since Nutch can no longer use external links for link analysis? Is setting the depth the only way to limit how far the crawler goes? I know I can use a URL filter as well, but I have an undetermined number of URLs to crawl (I enter them manually into a text file, and the list will keep growing), so I can't write a filter rule for each individual URL the way people do for intranet crawling (see the P.S. for the workaround I have been considering).

In summary: I have an undetermined, growing number of URLs that I add by hand to a text file for the crawler. I am not crawling the whole internet like Google, and I want to restrict the crawl in a way that does not hurt the quality of my search results. How should I do this? Many thanks.
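P.S. To make the filter point concrete: rather than hand-editing a rule per URL, I was thinking of regenerating the filter file from my seed list with a small script whenever the list grows. This is only a sketch; the file locations (urls/urls.txt, conf/crawl-urlfilter.txt) and the +pattern / -. rule syntax from the intranet-crawl tutorial are assumptions about my own setup:

    # Sketch only: regenerate a Nutch URL-filter file from a seed list.
    # Assumed layout: seeds one-per-line in urls/urls.txt; filter rules
    # in conf/crawl-urlfilter.txt using the intranet-crawl syntax
    # (+pattern includes a URL, -. rejects everything else).
    import re
    from urllib.parse import urlparse

    SEED_FILE = "urls/urls.txt"               # assumed seed-list path
    FILTER_FILE = "conf/crawl-urlfilter.txt"  # assumed filter-file path

    # Collect the distinct hosts from the seed URLs.
    hosts = set()
    with open(SEED_FILE) as seeds:
        for line in seeds:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comment lines
            host = urlparse(line).netloc
            if host:
                hosts.add(host)

    # Rewrite the filter file: skip binary suffixes, allow any page on
    # each seed host, then reject everything else.
    with open(FILTER_FILE, "w") as out:
        out.write(r"-\.(gif|jpg|png|ico|css|zip|gz|pdf)$" + "\n")
        for host in sorted(hosts):
            out.write("+^http://%s/\n" % re.escape(host))
        out.write("-.\n")

My understanding is that the first matching rule wins in the filter file, so the suffix exclusions come first, then one include rule per seed host, then a catch-all reject. Does this approach make sense, or is there a cleaner way?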
