Hi Daniel,

One solution is the one you pointed out: put the exclude URLs in crawl-urlfilter.txt. Another solution is to write a plugin which checks the URLs at fetch time and discards the ones that don't match your whitelist.
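For the first approach: crawl-urlfilter.txt (the filter file the one-step "crawl" command reads in Nutch 0.9) takes one regex rule per line, "+" to accept and "-" to reject, and the first matching rule decides. A sketch using your example domains (the host names and the paypal.com line are just illustrations):

# reject hosts you never want fetched
-^http://([a-z0-9]*\.)*paypal\.com/

# accept only your whitelisted hosts, one + line per domain
+^http://([a-z0-9]*\.)*domain1\.com/
+^http://([a-z0-9]*\.)*domain2\.com/

# skip everything else
-.

With the final "-." catch-all, anything not matched by a + line is rejected anyway, so yes, listing all 400 domains as + lines is the intended way; you can generate those lines from your "urls" file with a one-line script.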
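For the second approach, here is a minimal sketch of a whitelist filter, assuming Nutch 0.9's URLFilter extension point (org.apache.nutch.net.URLFilter: filter() returns the URL to keep it, or null to discard it). The class name, package, and the "whitelist.hosts" property are made up for this example:

// Minimal sketch of a whitelist URLFilter plugin for Nutch 0.9.
// Assumptions: the "whitelist.hosts" property name and the exact
// host-matching rule are illustrative, not part of Nutch itself.
package org.example.nutch;  // hypothetical package

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class WhitelistURLFilter implements URLFilter {

  private Configuration conf;
  private Set<String> allowedHosts = new HashSet<String>();

  /** Return the URL unchanged to keep it, or null to discard it. */
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      return allowedHosts.contains(host) ? urlString : null;
    } catch (MalformedURLException e) {
      return null;  // discard anything we cannot parse
    }
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // Hypothetical property: comma-separated hosts set in nutch-site.xml.
    String hosts = conf.get("whitelist.hosts", "");
    for (String h : hosts.split(",")) {
      h = h.trim().toLowerCase();
      if (h.length() > 0) allowedHosts.add(h);
    }
  }

  public Configuration getConf() {
    return conf;
  }
}

You would still need the usual plugin packaging (a plugin.xml declaring the org.apache.nutch.net.URLFilter extension point, plus adding the plugin id to "plugin.includes" in nutch-site.xml) before Nutch picks it up.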
Original message
From: Daniel Fai < [EMAIL PROTECTED] >
Date: 21 Aug 08 00:00:58
Subject: Newbie: How to exclude domains from crawling websites?
To: [EMAIL PROTECTED]

Hi, I successfully got Nutch 0.9 running and I am really
satisfied with it. But now I am unable to find the specific information I need (I also googled and searched this mail archive, but no answer really satisfied me).

First I will explain what I have done: I crawled around 400 internet webpages which contain the specific content/topic I am searching for. I have a text file "urls" which includes all 400 pages:

http://www.domain1.com
http://www.domain2.com
http://www.domain3.com
http://www.domain4.com

and so on. The crawl result is as expected, but it also found links to other domains which I don't want to have in my search results.
For example, one domain includes a link to www.paypal.com, and I don't want that domain to be part of my Nutch results: http://www.domain2.com has a link to www.paypal.com. Domain2 should be indexed, but not the link to www.paypal.com.

How and where do I exclude this domain to avoid fetching and indexing it? I have some more domains which I don't want to become indexed. I know
only one possibility: to enter each domain below this row:

# accept hosts in MY.DOMAIN.NAME

But shall I add all 400 domains here? Is there no way to exclude or avoid named domains?

I would be happy to get replies from you experts. It would be great if you could add an example for me.

Daniel