I think you need to put mycity.gov/water in your crawl-urlfilter.txt file.
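
To make that concrete, here is a rough Python sketch of how I understand the filter to behave: rules are tried in order, the first pattern found in the URL decides, and a URL that matches no rule is dropped. The two rules in RULES are only my guess at what the entries for the water subweb could look like, and the billing.html path is made up for illustration. This is not Nutch code, so please check it against your own crawl-urlfilter.txt.

```python
import re

# Hypothetical rules in the style of crawl-urlfilter.txt (an assumption,
# not copied from a real config): "+" accepts, "-" rejects.
RULES = [
    ("+", r"^http://www\.mycity\.gov/water"),  # keep only the water subweb
    ("-", r"."),                               # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    for url in [
        "http://www.mycity.gov/water",
        "http://www.mycity.gov/water/billing.html",  # made-up page
        "http://www.mycity.gov/",
        "http://www.mycity.gov/xxx",
        "http://subdomain.mycity.gov/",
    ]:
        print(url, accepts(url))
```

With a rule like +^http://([a-z0-9]*\.)mycity.gov/ in place, every page on www.mycity.gov is accepted, which would explain the crawl spreading from /water to the whole site. Note that the narrower pattern above would also accept a path such as /waterfront, so it may need tightening.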

Alex.

-----Original Message-----
From: Robert Edmiston <robert.edmis...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Thu, 26 Mar 2009 1:32 pm
Subject: Limiting crawls to subwebs

I am trying to limit a crawl to just a subweb. I work for a city government
and I have been asked to set up a separate crawl that is dedicated to just
our water department. So, if I were to run a crawl on
http://www.mycity.gov/water, how can I keep the crawl from including
http://subdomain.mycity.gov or root URLs of http://www.mycity.gov or
http://www.mycity.gov/xxx? I have tried going into the crawl-urlfilter.txt
file and making the following entries, which have not been successful:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)mycity.gov/
+^http://([a-z0-9]*\.)www.mycity.gov/
+^http://localhost

I am using a urls.txt file that just has the URL of
http://www.mycity.gov/water but it manages to crawl back to the city
homepage from there and then do a full crawl of the entire city website.

Thank you in advance