In my case, it will be far fewer than millions of base URLs.

It sounds like AC has something bigger in mind.

On Thu, Nov 29, 2012 at 3:20 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

>
> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Thu 29-Nov-2012 23:15
> > To: user@nutch.apache.org
> > Subject: Re: Nutch efficiency and multiple single URL crawls
> >
> > I realize I might be asking a really naive question here :)
>
> no problem
>
> >
> > Why can't we put all the base URLs in a single seed file and use the same
> > config file (which has all the filtering patterns) to do the crawl? Is
> > there anything wrong with this approach?
>
> No, there's nothing wrong with doing that. You can use the FreeGenerator
> tool to generate a fetch list from a single seed file (or multiple files).
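
For illustration only (none of the following is from the thread, and all file
and directory names are placeholders): the single-seed-file approach might
look like the sketch below, with FreeGenerator reached through the freegen
alias in bin/nutch.

    # seeds/seeds.txt - one base URL per line
    http://www.nutch.org/
    http://www.apache.org/

    # conf/regex-urlfilter.txt - keep only pages under those hosts, drop the rest
    +^https?://www\.nutch\.org/
    +^https?://www\.apache\.org/
    -.

    # generate a fetch list from the seed file(s); -filter applies the URL
    # filters above and -normalize applies the configured URL normalizers
    bin/nutch freegen seeds/ crawl/segments -filter -normalize
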
>
> But what are you trying to accomplish? Something large scale (tens of
> millions) or much larger or smaller?
>
> Usually, concerns about what is going to be crawled only matter if you're
> doing something on a large or massive scale.
>
> >
> > On Thu, Nov 29, 2012 at 2:46 PM, Alejandro Caceres <
> > acace...@hyperiongray.com> wrote:
> >
> > > Got it, I will try that out, that's an excellent feature. Thank you for
> > > the help.
> > >
> > >
> > > On Thu, Nov 29, 2012 at 4:06 AM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> > >
> > > > As I said, you don't rebuild; you just overwrite the config file in the
> > > > Hadoop config directory on the data nodes. Config files are looked up
> > > > there as well. Just copy the file to the data nodes.
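
For illustration (not from the thread): "just copy the file to the data nodes"
might look roughly like the following, assuming three nodes and /etc/hadoop/conf
as the Hadoop config directory; both are placeholders for your own cluster
layout.

    # push the edited filter file into each node's Hadoop config directory,
    # where it is picked up instead of the copy embedded in the .job file
    for node in datanode1 datanode2 datanode3; do
        scp conf/regex-urlfilter.txt "$node":/etc/hadoop/conf/
    done
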
> > > >
> > > > -----Original message-----
> > > > > From: AC Nutch <acnu...@gmail.com>
> > > > > Sent: Thu 29-Nov-2012 05:38
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Nutch efficiency and multiple single URL crawls
> > > > >
> > > > > Thanks for the help. Perhaps I am misunderstanding: what would be the
> > > > > proper way to leverage this? I am a bit new to Nutch 1.5.1; I've been
> > > > > using 1.4 and have generally been using runtime/deploy/bin/nutch with
> > > > > a .job file. I notice things are done a bit differently in 1.5.1 with
> > > > > the lack of a nutch runtime and nutch deploy directories. How can I
> > > > > run a crawl while leveraging this functionality without having to
> > > > > rebuild the job file for each new crawl? More specifically, I'm
> > > > > picturing the following workflow:
> > > > >
> > > > > (1) update the config file to restrict the crawl to a domain -> (2)
> > > > > run a command that crawls the domain with the changes from the config
> > > > > file, without having to rebuild the job file -> (3) index to Solr
> > > > >
> > > > > My question is: what would the (general) command be for step (2)?
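
For illustration, a driver for step (2) could look roughly like the sketch
below, assuming the edited regex-urlfilter.txt has already been copied to the
nodes so the .job file itself never changes. All paths, the number of rounds,
the -topN value, and the Solr URL are placeholders, and in deploy mode the
segment listing would go through "hadoop fs -ls" rather than a local ls.

    bin/nutch inject crawl/crawldb seeds/
    for round in 1 2 3; do
        bin/nutch generate crawl/crawldb crawl/segments -topN 1000
        # pick up the segment that generate just created
        segment=crawl/segments/$(ls crawl/segments | sort | tail -1)
        bin/nutch fetch "$segment"
        bin/nutch parse "$segment"
        bin/nutch updatedb crawl/crawldb "$segment"
    done
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
        -linkdb crawl/linkdb crawl/segments/*
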
> > > > >
> > > > > On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma
> > > > > <markus.jel...@openindex.io> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Rebuilding the job file for each domain is indeed not a good idea,
> > > > > > plus it adds the Hadoop overhead. But you don't have to: we write
> > > > > > dynamic config files to each node's Hadoop configuration directory,
> > > > > > and they are picked up instead of the embedded configuration file.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From: AC Nutch <acnu...@gmail.com>
> > > > > > > Sent: Mon 26-Nov-2012 06:50
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Nutch efficiency and multiple single URL crawls
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am using Nutch 1.5.1 and I am looking to do something specific
> > > > > > > with it. I have a few million base domains in a Solr index, for
> > > > > > > example: http://www.nutch.org, http://www.apache.org,
> > > > > > > http://www.whatever.com, etc. I am trying to crawl each of these
> > > > > > > base domains in deploy mode and retrieve all of the sub-URLs
> > > > > > > associated with each domain in the most efficient way possible. To
> > > > > > > give you an example of the workflow I am trying to achieve: (1)
> > > > > > > grab a base domain, let's say http://www.nutch.org, (2) crawl the
> > > > > > > base domain for all URLs in that domain, let's say
> > > > > > > http://www.nutch.org/page1, http://www.nutch.org/page2,
> > > > > > > http://www.nutch.org/page3, etc., (3) store these results
> > > > > > > somewhere (perhaps another Solr instance), and (4) move on to the
> > > > > > > next base domain in my Solr index and repeat the process.
> > > > > > > Essentially I am just trying to grab all links associated with a
> > > > > > > page and then move on to the next page.
> > > > > > >
> > > > > > > The part I am having trouble with is ensuring that this workflow
> > > > > > > is efficient. The only way I can think to do this would be: (1)
> > > > > > > grab a base domain from Solr via my shell script (simple enough),
> > > > > > > (2) add an entry to regex-urlfilter with the domain I want to
> > > > > > > restrict the crawl to; in the example above, that would be an
> > > > > > > entry that says to only keep sub-pages of http://www.nutch.org/
> > > > > > > (a sketch follows below), (3) recreate the Nutch job file
> > > > > > > (~25 sec.), and (4) start the crawl for pages associated with the
> > > > > > > domain and do the indexing.
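
As a sketch of what steps (1) and (2) could look like in the calling shell
script (everything here is an assumption, not from the thread; $DOMAIN stands
for whatever base domain the script pulled from Solr):

    # rewrite the filter file for the current base domain; dots in $DOMAIN are
    # left unescaped for brevity, so they match any character rather than only
    # a literal dot
    printf '+^https?://%s/\n-.\n' "$DOMAIN" > conf/regex-urlfilter.txt
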
> > > > > > >
> > > > > > > My issue is with step #3: AFAIK, if I want to restrict a crawl to
> > > > > > > a specific domain, I have to change regex-urlfilter and reload the
> > > > > > > job file. This is a pretty significant problem, since adding 25
> > > > > > > seconds every single time I start a new base domain is going to
> > > > > > > add way too many seconds to my workflow (25 sec x a few million =
> > > > > > > way too much time). Finally, the question: is there a way to add
> > > > > > > URL filters on the fly when I start a crawl and/or restrict a
> > > > > > > crawl to a particular domain on the fly? Or can you think of a
> > > > > > > decent solution to the problem / am I missing something?
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > ___
> > >
> > > Alejandro Caceres
> > > Hyperion Gray, LLC
> > > Owner/CTO
> > >
> >
>
