What do you mean by the "job file"?

On Sun, Nov 25, 2012 at 10:43 PM, AC Nutch <acnu...@gmail.com> wrote:

> Hello,
>
> I am using Nutch 1.5.1 and I am looking to do something specific with it. I
> have a few million base domains in a Solr index, for example:
> http://www.nutch.org, http://www.apache.org, http://www.whatever.com, etc.
> I am trying to crawl each of these base domains in deploy mode and retrieve
> all of the sub-URLs associated with each domain in the most efficient way
> possible. To give you an example of the workflow I am trying to achieve:
>
> (1) Grab a base domain, say http://www.nutch.org
> (2) Crawl the base domain for all URLs in that domain, say
> http://www.nutch.org/page1, http://www.nutch.org/page2,
> http://www.nutch.org/page3, etc.
> (3) Store these results somewhere (perhaps another Solr instance)
> (4) Move on to the next base domain in my Solr index and repeat the process
>
> Essentially I am just trying to grab all links associated with a page and
> then move on to the next page.
>
> The part I am having trouble with is ensuring that this workflow is
> efficient. The only way I can think to do this would be:
>
> (1) Grab a base domain from Solr in my shell script (simple enough)
> (2) Add an entry to regex-urlfilter.txt restricting the crawl to that
> domain; in the example above, an entry that keeps only sub-pages of
> http://www.nutch.org/
> (3) Recreate the Nutch job file (~25 sec.)
> (4) Start the crawl for pages associated with the domain and do the indexing
>
> My issue is with step #3. AFAIK, if I want to restrict a crawl to a specific
> domain, I have to change regex-urlfilter.txt and rebuild the job file. This
> is a pretty significant problem, since an extra 25 seconds for every base
> domain adds far too much time to my workflow (25 sec x a few million domains
> = way too much time). Finally, the question: is there a way to add URL
> filters on the fly when I start a crawl, and/or restrict a crawl to a
> particular domain on the fly? Or can you think of a decent solution to the
> problem? Am I missing something?
>
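For what it's worth, the per-domain restriction described above (regenerating a regex-urlfilter entry for each base domain, then crawling) can be sketched as a shell loop. This is only a sketch: the domain list, the conf/seeds directory layout, and the crawl depth are placeholder assumptions, not part of the original setup, and the crawl invocation follows the Nutch 1.x CLI.

```shell
#!/bin/sh
# Sketch only: restrict each crawl to one base domain by regenerating the
# regex-urlfilter.txt rules before starting it. Domains, paths, and depth
# below are placeholder values.
mkdir -p conf seeds

# Emit a regex-urlfilter rule set that keeps only sub-pages of one domain.
filter_rules() {
    # Escape the dots so they match literally in the regex.
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf '+^http://%s/\n' "$escaped"   # keep URLs under this domain
    printf '%s\n' '-.'                   # reject everything else
}

for domain in www.nutch.org www.apache.org; do
    filter_rules "$domain" > conf/regex-urlfilter.txt
    printf 'http://%s/\n' "$domain" > seeds/seed.txt
    # In deploy mode the filter change only takes effect after the job file
    # is rebuilt (e.g. via "ant job"), which is exactly the ~25 s cost at
    # issue in this thread.
    # bin/nutch crawl seeds -dir "crawl-$domain" -depth 3
done
```

The loop itself is cheap; what it does not solve is the open question in the mail, namely avoiding the job-file rebuild between iterations.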
