Hi Daniel,

One solution is the one you pointed out: put the exclude rules in
crawl-urlfilter.txt.
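
As a rough sketch, the rules could look like this, assuming the regex
syntax of the default crawl-urlfilter.txt (domain1.com and domain2.com
stand in for your whitelisted hosts; patterns are tested top to bottom and
the first match decides):

  # never fetch paypal.com, even if a crawled page links to it
  -^http://([a-z0-9]*\.)*paypal\.com/

  # accept only the whitelisted hosts
  +^http://([a-z0-9]*\.)*domain1\.com/
  +^http://([a-z0-9]*\.)*domain2\.com/

  # skip everything else
  -.

With a strict whitelist like this, the final "-." already rejects
paypal.com, so the explicit "-" line is mostly documentation of intent. If
you would rather not list all 400 domains, you can instead keep a
permissive "+." at the end and add one "-" line per unwanted domain.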

Another solution is to write a URLFilter plugin that checks the URLs at
fetch time and discards the ones that don't match your whitelist.
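
Here is a minimal sketch of such a plugin, assuming the Nutch 0.9
URLFilter extension point, where filter() returns the URL to keep it and
null to discard it; the class name and the hard-coded host list are made
up for illustration:

  package org.example.nutch.urlfilter;

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  /** Keeps a URL only if its host is on the whitelist. */
  public class WhitelistURLFilter implements URLFilter {

    private Configuration conf;
    private final Set<String> allowedHosts = new HashSet<String>();

    public WhitelistURLFilter() {
      // A real plugin would load these from a config file; they are
      // hard-coded here only to keep the sketch short.
      allowedHosts.add("www.domain1.com");
      allowedHosts.add("www.domain2.com");
    }

    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost().toLowerCase();
        // Returning null tells Nutch to discard the URL.
        return allowedHosts.contains(host) ? urlString : null;
      } catch (MalformedURLException e) {
        return null; // drop anything we cannot parse
      }
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }

You would still need the usual plugin.xml declaring an extension of
org.apache.nutch.net.URLFilter, plus an entry for the plugin in the
plugin.includes property of nutch-site.xml, before Nutch picks it up.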

----- Original Message -----
From: Daniel Fai <[EMAIL PROTECTED]>
Date: 21 Aug 08 00:00:58
Subject: Newbie: How to exclude domains from crawling websites?
To: [EMAIL PROTECTED]

Hi,

I successfully got Nutch 0.9 running and I am really satisfied with it.
But now I am unable to find the specific information I need (I also
googled and searched this mail archive, but no answer really satisfied
me).

First I will explain what I have done: I crawled around 400 web pages that
contain the specific content/topic I am searching for. I have a text file
"urls" which includes all 400 pages:

http://www.domain1.com
http://www.domain2.com
http://www.domain3.com
http://www.domain4.com
and so on...

The crawl result is as expected, but it also found links to other domains
which I don't want to have in my search results. For example, one domain
contains a link to www.paypal.com, and I don't want that domain to become
part of my Nutch results: http://www.domain2.com has a link to
www.paypal.com. Domain2 should be indexed, but not the link to
www.paypal.com.

How and where do I exclude such a domain to avoid fetching and indexing
it? I have some more domains which I don't want to be indexed.

I know only one possibility: to enter each domain below this row:

# accept hosts in MY.DOMAIN.NAME

But shall I add all 400 domains there? Is there no way to exclude or avoid
named domains?

I would be happy to get replies from you experts. It would be great if you
could add an example for me.

Daniel
