Re: Limiting crawls to subwebs

Alex Basa Thu, 26 Mar 2009 14:04:59 -0700

You should be able to use just this one accept rule in crawl-urlfilter.txt, in place of the broader mycity.gov entries:

+^http://www.mycity.gov/water/
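
To see why the original entries let the whole site through, here is a rough Python sketch of how I understand the filter file to be evaluated. It assumes the rules are tried top to bottom, the first pattern found in the URL decides, and anything that matches no rule is dropped. It also assumes the file ends with the usual "-." catch-all. The accepts() helper and the rates.html page are made up for illustration and are not part of Nutch.

```python
import re

# Hypothetical stand-in for the regex URL filter (not Nutch code):
# rules are tried in order, the first pattern found in the URL decides,
# and a URL that matches no rule at all is rejected.
def accepts(rules, url):
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# The entries from the original message, plus an assumed "-." catch-all.
original_rules = [
    r"+^http://([a-z0-9]*\.)mycity.gov/",
    r"+^http://([a-z0-9]*\.)www.mycity.gov/",
    r"+^http://localhost",
    "-.",
]

# The single rule suggested above, plus the same assumed catch-all.
suggested_rules = [
    r"+^http://www.mycity.gov/water/",
    "-.",
]

urls = [
    "http://www.mycity.gov/water/",
    "http://www.mycity.gov/water/rates.html",  # made-up page
    "http://www.mycity.gov/water",             # seed without trailing slash
    "http://www.mycity.gov/",
    "http://www.mycity.gov/xxx",
    "http://subdomain.mycity.gov/",
]

# Print each URL with the verdict under the original and suggested rules.
for url in urls:
    print(url, accepts(original_rules, url), accepts(suggested_rules, url))
```

In this sketch the first original rule already matches every URL on any mycity.gov host, which would explain the full-site crawl. Note also that the suggested rule ends in a slash, so a seed written as http://www.mycity.gov/water (no trailing slash) would not match it. If the filter behaves as assumed here, listing the seed as http://www.mycity.gov/water/ in urls.txt avoids that.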



--- On Thu, 3/26/09, Robert Edmiston <robert.edmis...@gmail.com> wrote:

> From: Robert Edmiston <robert.edmis...@gmail.com>
> Subject: Limiting crawls to subwebs
> To: nutch-user@lucene.apache.org
> Date: Thursday, March 26, 2009, 3:32 PM
> I am trying to limit a crawl to just a subweb. I work for a
> city government
> and I have been asked to set up a separate crawl that is
> dedicated to just
> our water department. So, if I were to run a crawl on
> http://www.mycity.gov/water, how can I keep the crawl from
> including
> http://subdomain.mycity.gov or root URLs of
> http://www.mycity.gov or
> http://www.mycity.gov/xxx? I have tried going into the
> crawl-urlfilter.txt
> file and making the following entries, which have not been
> successful:
> 
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)mycity.gov/
> +^http://([a-z0-9]*\.)www.mycity.gov/
> +^http://localhost
> 
> I am using a urls.txt file that just has the URL of
> http://www.mycity.gov/water but it manages to crawl back to
> the city
> homepage from there and then do a full crawl of the entire
> city website.
> 
> Thank you in advance
