Hi, I am quite new to Nutch. I know that I can give the crawler a list of URLs to crawl, but does it also crawl external links referenced by the pages on that list?
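(One thing I have seen mentioned is a db.ignore.external.links property, which sounds like it controls exactly this. I am not certain it exists in the version I am running, so the following override for conf/nutch-site.xml is only an untested sketch of the behavior I am asking about:

    <!-- Untested sketch for conf/nutch-site.xml. Whether the property
         db.ignore.external.links exists in my Nutch release is an
         assumption; if it does, setting it true should drop outlinks
         that point to hosts outside the seed pages. -->
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>

If someone can confirm whether this is available, that would already answer part of my question.)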
Can I limit the crawler to only the pages under the URLs I give it? If so, will that hurt the quality of the search results, since Nutch can no longer use external links for link analysis? Is setting the depth the only way to limit how far the crawler goes? I know I can use a URL filter as well, but I have an undetermined number of URLs to crawl (I enter them manually into a text file, and the list will keep growing), so I can't write a filter rule for each individual URL the way people do for intranet crawling (see the P.S. for the workaround I have been considering).

In summary: I have an undetermined, growing number of URLs that I add by hand to a text file for the crawler. I am not crawling the whole internet like Google, and I want to restrict the crawl in a way that does not hurt the quality of my search results. How should I do this? Many thanks.
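P.S. To make the filter point concrete: rather than hand-editing a rule per URL, I was thinking of regenerating the filter file from my seed list with a small script whenever the list grows. This is only a sketch; the file locations (urls/urls.txt, conf/crawl-urlfilter.txt) and the +pattern / -. rule syntax from the intranet-crawl tutorial are assumptions about my own setup:

    # Sketch only: regenerate a Nutch URL-filter file from a seed list.
    # Assumed layout: seeds one-per-line in urls/urls.txt; filter rules
    # in conf/crawl-urlfilter.txt using the intranet-crawl syntax
    # (+pattern includes a URL, -. rejects everything else).
    import re
    from urllib.parse import urlparse

    SEED_FILE = "urls/urls.txt"               # assumed seed-list path
    FILTER_FILE = "conf/crawl-urlfilter.txt"  # assumed filter-file path

    # Collect the distinct hosts from the seed URLs.
    hosts = set()
    with open(SEED_FILE) as seeds:
        for line in seeds:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comment lines
            host = urlparse(line).netloc
            if host:
                hosts.add(host)

    # Rewrite the filter file: skip binary suffixes, allow any page on
    # each seed host, then reject everything else.
    with open(FILTER_FILE, "w") as out:
        out.write(r"-\.(gif|jpg|png|ico|css|zip|gz|pdf)$" + "\n")
        for host in sorted(hosts):
            out.write("+^http://%s/\n" % re.escape(host))
        out.write("-.\n")

My understanding is that the first matching rule wins in the filter file, so the suffix exclusions come first, then one include rule per seed host, then a catch-all reject. Does this approach make sense, or is there a cleaner way?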
