What do you mean by the "job file"?

On Sun, Nov 25, 2012 at 10:43 PM, AC Nutch <acnu...@gmail.com> wrote:
> Hello,
>
> I am using Nutch 1.5.1 and I am looking to do something specific with it. I have a few million base domains in a Solr index, for example: http://www.nutch.org, http://www.apache.org, http://www.whatever.com, etc. I am trying to crawl each of these base domains in deploy mode and retrieve all of the sub-URLs associated with that domain as efficiently as possible. The workflow I am trying to achieve:
>
> (1) Grab a base domain, say http://www.nutch.org.
> (2) Crawl that base domain for all URLs in the domain, e.g. http://www.nutch.org/page1, http://www.nutch.org/page2, http://www.nutch.org/page3, etc.
> (3) Store these results somewhere (perhaps another Solr instance).
> (4) Move on to the next base domain in my Solr index and repeat the process.
>
> Essentially, I am trying to grab all links associated with a page and then move on to the next page.
>
> The part I am having trouble with is making this workflow efficient. The only way I can think to do it is:
>
> (1) Grab a base domain from Solr in my shell script (simple enough).
> (2) Add an entry to regex-urlfilter restricting the crawl to that domain; in the example above, an entry that only keeps sub-pages of http://www.nutch.org/ (see the example entry below).
> (3) Recreate the Nutch job file (~25 sec.).
> (4) Start the crawl for pages associated with the domain and do the indexing (see the shell sketch below).
>
> My issue is with step #3. AFAIK, if I want to restrict a crawl to a specific domain, I have to change regex-urlfilter and rebuild the job file. That is a significant problem: adding 25 seconds every time I start a new base domain adds far too much time to the workflow (25 sec x a few million domains = way too much time). Finally, the question: is there a way to add URL filters on the fly when I start a crawl, and/or to restrict a crawl to a particular domain on the fly? Or can you think of a decent solution to the problem / am I missing something?
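
For reference, the regex-urlfilter.txt entry described in step (2) of the quoted mail could look something like the following for the http://www.nutch.org case. This is an illustration, not a tested drop-in: the stock file also ships with a few reject rules (protocols, file suffixes, probable session IDs) that you would normally keep above the domain rule, and Nutch's regex filter accepts or rejects a URL on the first +/- pattern that matches.

    # keep only sub-pages of www.nutch.org
    +^https?://www\.nutch\.org/
    # reject everything else
    -.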
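
And a rough shell sketch of the per-domain loop described above. The paths, the domains.txt input, the depth/topN values and the Solr URL are all placeholders, the crawl invocation follows the 1.x "bin/nutch crawl" usage, and in deploy mode the seed and crawl directories would actually live on HDFS rather than the local filesystem:

    #!/bin/bash
    # Per-domain loop: regenerate the filter, rebuild the job file, crawl, index.
    NUTCH_HOME=/path/to/apache-nutch-1.5.1
    SOLR_URL=http://localhost:8983/solr

    while read host; do                          # domains.txt: one host per line, e.g. www.nutch.org
        esc=${host//./\\.}                       # escape dots for the regex filter
        {
          echo "+^https?://$esc/"                # keep only sub-pages of this host
          echo "-."                              # reject everything else
        } > "$NUTCH_HOME/conf/regex-urlfilter.txt"

        (cd "$NUTCH_HOME" && ant)                # rebuild the .job file -- the ~25 sec step in question

        echo "http://$host/" > seeds/seed.txt    # single seed for this domain
        "$NUTCH_HOME/runtime/deploy/bin/nutch" crawl seeds \
            -solr "$SOLR_URL" -dir "crawl-$host" -depth 3 -topN 1000
    done < domains.txt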