Hello, I successfully got Nutch 0.9 running and I am really satisfied with it. But now I am unable to find one specific piece of information I need (I also googled and searched this mail archive, but no answer really satisfied me).
First, let me explain what I have done. I crawled around 400 web pages that cover the specific topic I am interested in. I have a text file "urls" that lists all 400 pages:

http://www.domain1.com
http://www.domain2.com
http://www.domain3.com
http://www.domain4.com
and so on.

The crawl result is as expected, but it also picked up links to other domains that I don't want in my search results. For example, http://www.domain2.com has a link to www.paypal.com. Domain2 should be indexed, but the linked www.paypal.com should not.

How and where do I exclude such a domain so that it is neither fetched nor indexed? There are several more domains that I don't want indexed.

The only possibility I know of is to enter each domain below this line:

# accept hosts in MY.DOMAIN.NAME

But should I add all 400 domains there? Is there no option to exclude or skip named domains?

I would be happy to get replies from you experts. An example would be great.

Daniel
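
To make the question concrete, here is a rough sketch of the kind of example being asked for. It is written in Python and only generates text. It assumes the URL filter file (for instance conf/crawl-urlfilter.txt) takes one regular expression per line, prefixed with "+" to accept or "-" to reject, and that the first matching line wins. The function names host_pattern and build_filter_rules are made up for this illustration and are not part of Nutch.

```python
from urllib.parse import urlparse


def host_pattern(host):
    """Build a regex in the style of the MY.DOMAIN.NAME sample line.

    Assumption: a leading "www." is dropped so that the pattern covers
    the bare domain and any of its subdomains.
    """
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]
    return r"^http://([a-z0-9]*\.)*" + host.replace(".", r"\.") + "/"


def build_filter_rules(seed_urls, excluded_hosts):
    """Return filter lines: rejects first, then accepts, then a catch-all."""
    rules = []
    # Reject rules go first because the first matching line decides.
    for host in excluded_hosts:
        rules.append("-" + host_pattern(host))
    seen = set()
    for url in seed_urls:
        host = urlparse(url.strip()).hostname
        if not host:
            continue  # skip blank or malformed lines
        pattern = host_pattern(host)
        if pattern not in seen:
            seen.add(pattern)
            rules.append("+" + pattern)
    # Skip everything that no earlier line accepted.
    rules.append("-.")
    return rules


if __name__ == "__main__":
    seeds = ["http://www.domain1.com", "http://www.domain2.com"]
    print("\n".join(build_filter_rules(seeds, ["www.paypal.com"])))
```

If the filter file really ends with a "-." line as assumed here, any host that is not listed with "+" is skipped anyway, so the explicit "-" line for www.paypal.com would only matter when the file ends with an accept-everything rule instead. The seed list could be read from the "urls" file rather than typed in, so the 400 accept lines would not have to be written by hand.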
