In my own case, it will be much fewer than millions of base URLs. It sounds like AC has something bigger in mind.
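For what it's worth, at that scale I was planning to just drop all the base URLs into one seed file and run the standard cycle. A rough sketch of what I have in mind, in local mode with placeholder paths and a placeholder Solr URL (I haven't run this exact sequence yet, and the solrindex arguments may differ between versions, so check bin/nutch solrindex first):

  # seed directory with one base URL per line in the seed file
  mkdir -p urls
  cp base-urls.txt urls/seed.txt

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  # segment names are timestamps, so the last one listed is the newest (local mode)
  SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"
  bin/nutch invertlinks crawl/linkdb "$SEGMENT"
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"

The regex-urlfilter.txt patterns would stay in a single config for the whole run, as discussed below.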
On Thu, Nov 29, 2012 at 3:20 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Thu 29-Nov-2012 23:15
> > To: user@nutch.apache.org
> > Subject: Re: Nutch efficiency and multiple single URL crawls
> >
> > I realize I might be asking a real naive question here :)
>
> no problem
>
> > Why can't we put all the base URLs in a single seed file, and use the same
> > config file (which has all the filtering patterns) to do the crawl? Is
> > there anything wrong with this approach?
>
> No, there's nothing wrong in doing that. You can use the FreeGenerator
> tool to generate a fetch list from a single (or multiple) seed file(s).
>
> But what are you trying to accomplish? Something large scale (tens of
> millions) or much larger or smaller?
>
> Usually, concerns about what is going to be crawled only matter if you're
> doing something on a large or massive scale.
>
> > On Thu, Nov 29, 2012 at 2:46 PM, Alejandro Caceres <
> > acace...@hyperiongray.com> wrote:
> >
> > > Got it, I will try that out, that's an excellent feature. Thank you for
> > > the help.
> > >
> > > On Thu, Nov 29, 2012 at 4:06 AM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> > >
> > > > As I said, you don't rebuild, you just overwrite the config file in the
> > > > Hadoop config directory on the data nodes. Config files are looked up
> > > > there as well. Just copy the file to the data nodes.
> > > >
> > > > -----Original message-----
> > > > > From: AC Nutch <acnu...@gmail.com>
> > > > > Sent: Thu 29-Nov-2012 05:38
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Nutch efficiency and multiple single URL crawls
> > > > >
> > > > > Thanks for the help. Perhaps I am misunderstanding, what would be the
> > > > > proper way to leverage this? I am a bit new to Nutch 1.5.1, I've been
> > > > > using 1.4 and have generally been using runtime/deploy/bin/nutch with
> > > > > a .job file. I notice things are done a bit differently in 1.5.1 with
> > > > > the lack of a nutch runtime and nutch deploy directories. How can I
> > > > > run a crawl while leveraging this functionality and not having to
> > > > > rebuild the job file each new crawl? More specifically, I'm picturing
> > > > > the following workflow...
> > > > >
> > > > > (1) update config file to restrict domain crawls -> (2) run command
> > > > > that crawls a domain with changes from config file while not having
> > > > > to rebuild the job file -> (3) index to Solr
> > > > >
> > > > > What would the (general) command be for step (2) is my question.
> > > > >
> > > > > On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma
> > > > > <markus.jel...@openindex.io> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Rebuilding the job file for each domain is not a good idea indeed,
> > > > > > plus it adds the Hadoop overhead. But you don't have to, we write
> > > > > > dynamic config files to each node's Hadoop configuration directory
> > > > > > and it is picked up instead of the embedded configuration file.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From: AC Nutch <acnu...@gmail.com>
> > > > > > > Sent: Mon 26-Nov-2012 06:50
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Nutch efficiency and multiple single URL crawls
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am using Nutch 1.5.1 and I am looking to do something specific
> > > > > > > with it. I have a few million base domains in a Solr index, so
> > > > > > > for example: http://www.nutch.org, http://www.apache.org,
> > > > > > > http://www.whatever.com etc. I am trying to crawl each of these
> > > > > > > base domains in deploy mode and retrieve all of their sub-URLs
> > > > > > > associated with that domain in the most efficient way possible.
> > > > > > > To give you an example of the workflow I am trying to achieve:
> > > > > > > (1) Grab a base domain, let's say http://www.nutch.org (2) Crawl
> > > > > > > the base domain for all URLs in that domain, let's say
> > > > > > > http://www.nutch.org/page1, http://www.nutch.org/page2,
> > > > > > > http://www.nutch.org/page3, etc. etc. (3) Store these results
> > > > > > > somewhere (perhaps another Solr instance) and (4) move on to the
> > > > > > > next base domain in my Solr index and repeat the process.
> > > > > > > Essentially just trying to grab all links associated with a page
> > > > > > > and then move on to the next page.
> > > > > > >
> > > > > > > The part I am having trouble with is ensuring that this workflow
> > > > > > > is efficient. The only way I can think to do this would be:
> > > > > > > (1) Grab a base domain from Solr from my shell script (simple
> > > > > > > enough) (2) Add an entry to regex-urlfilter with the domain I am
> > > > > > > looking to restrict the crawl to, in the example above that would
> > > > > > > be an entry that says to only keep sub-pages of
> > > > > > > http://www.nutch.org/ (3) Recreate the Nutch job file (~25 sec.)
> > > > > > > (4) Start the crawl for pages associated with a domain and do the
> > > > > > > indexing
> > > > > > >
> > > > > > > My issue is with step #3, AFAIK if I want to restrict a crawl to
> > > > > > > a specific domain I have to change regex-urlfilter and reload the
> > > > > > > job file. This is a pretty significant problem, since adding 25
> > > > > > > seconds every single time I start a new base domain is going to
> > > > > > > add way too much time to my workflow (25 sec x a few million =
> > > > > > > way too much time). Finally, the question... is there a way to
> > > > > > > add URL filters on the fly when I start a crawl and/or restrict a
> > > > > > > crawl to a particular domain on the fly? OR can you think of a
> > > > > > > decent solution to the problem/am I missing something?
> > >
> > > --
> > > ___
> > > Alejandro Caceres
> > > Hyperion Gray, LLC
> > > Owner/CTO
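P.S. For the archives: the FreeGenerator tool Markus mentions is exposed through the nutch script. A minimal sketch of how I understand it would be used (directory names are placeholders; check bin/nutch freegen for the exact options in your version):

  # urls/ contains plain text file(s) with one URL per line; freegen turns
  # them straight into a segment/fetch list without the crawldb generate step
  bin/nutch freegen urls crawl/segments

  # optionally run the configured URL filters and normalizers over the input
  bin/nutch freegen urls crawl/segments -filter -normalize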
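And for AC's per-domain case, my reading of Markus's suggestion is something like the loop below: rewrite the filter file, copy it into the Hadoop conf directory on each node so it shadows the copy packaged in the .job file, then run the crawl. This is only an illustration; the node names, conf path and filter contents are made up, and the crawl step itself is abbreviated:

  DOMAIN=www.nutch.org            # pulled from the Solr index of base domains

  # filter that only accepts pages under the current base domain
  # (dots in $DOMAIN should really be escaped in the regex; omitted for brevity)
  printf '+^http://%s/\n-.\n' "$DOMAIN" > regex-urlfilter.txt

  # copy the filter into every node's Hadoop conf dir -- no job file rebuild
  for NODE in node1 node2 node3; do
    scp regex-urlfilter.txt "$NODE":/etc/hadoop/conf/
  done

  # then the usual inject/generate/fetch/parse/updatedb cycle for this domain,
  # followed by solrindex, before moving on to the next base domain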