Re: Nutch efficiency and multiple single URL crawls
In my own case, it will be much fewer than millions of base URLs. It sounds like AC has something bigger in mind.

On Thu, Nov 29, 2012 at 3:20 PM, Markus Jelsma wrote:

> But what are you trying to accomplish? Something large scale (tens of
> millions), or much larger or smaller?
RE: Nutch efficiency and multiple single URL crawls
-----Original message-----
> From: Joe Zhang
> Sent: Thu 29-Nov-2012 23:15
> To: user@nutch.apache.org
> Subject: Re: Nutch efficiency and multiple single URL crawls
>
> I realize I might be asking a real naive question here :)

No problem.

> Why can't we put all the base URLs in a single seed file, and use the same
> config file (which has all the filtering patterns) to do the crawl? Is
> there anything wrong with this approach?

No, there's nothing wrong with doing that. You can use the FreeGenerator tool to generate a fetch list from a single (or multiple) seed file(s).

But what are you trying to accomplish? Something large scale (tens of millions), or much larger or smaller?

Usually, concerns about what is going to be crawled only matter if you're doing something on a large or massive scale.
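For anyone following along, a minimal local-mode sketch of the FreeGenerator route might look like the commands below. The paths and segment handling are assumptions for illustration; freegen in the 1.x line reads plain-text URL lists and writes a fetch list into a new segment, skipping the CrawlDb generate step, but check bin/nutch freegen against your own version for the exact flags.

    # seeds/ holds one or more plain-text files with one URL per line (paths are placeholders)
    bin/nutch freegen seeds/ crawl/segments -filter -normalize

    # fetch, parse and fold the newest segment back into the CrawlDb
    SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"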
Re: Nutch efficiency and multiple single URL crawls
I realize I might be asking a real naive question here :)

Why can't we put all the base URLs in a single seed file, and use the same config file (which has all the filtering patterns) to do the crawl? Is there anything wrong with this approach?

On Thu, Nov 29, 2012 at 2:46 PM, Alejandro Caceres <acace...@hyperiongray.com> wrote:

> Got it, I will try that out, that's an excellent feature. Thank you for the
> help.
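As a concrete illustration of this single-seed-file idea (the file names and patterns below are hypothetical, not taken from the thread), seeds/urls.txt would hold one base URL per line:

    http://www.nutch.org/
    http://www.apache.org/

and a shared conf/regex-urlfilter.txt would carry one accept pattern per domain plus a catch-all reject (rules are applied first match wins):

    +^https?://www\.nutch\.org/
    +^https?://www\.apache\.org/
    -.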
Re: Nutch efficiency and multiple single URL crawls
Got it, I will try that out, that's an excellent feature. Thank you for the help.

On Thu, Nov 29, 2012 at 4:06 AM, Markus Jelsma wrote:

> As I said, you don't rebuild, you just overwrite the config file in the
> Hadoop config directory on the data nodes. Config files are looked up there
> as well. Just copy the file to the data nodes.
RE: Nutch efficiency and multiple single URL crawls
As I said, you don't rebuild; you just overwrite the config file in the Hadoop config directory on the data nodes. Config files are looked up there as well. Just copy the file to the data nodes.

-----Original message-----
> From: AC Nutch
> Sent: Thu 29-Nov-2012 05:38
> To: user@nutch.apache.org
> Subject: Re: Nutch efficiency and multiple single URL crawls
>
> How can I run a crawl while leveraging this functionality and not having to
> rebuild the job file each new crawl?
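A hedged sketch of what "just copy the file to the data nodes" could look like in practice follows; the node list (nodes.txt) and the Hadoop configuration directory are illustrative assumptions, not details from the thread. Per Markus's note, the copy sitting in each node's Hadoop config directory is picked up instead of the one embedded in the .job file.

    # push the updated URL filter to every node's Hadoop conf directory
    # (node hostnames in nodes.txt and the conf path are illustrative assumptions)
    for NODE in $(cat nodes.txt); do
        scp conf/regex-urlfilter.txt "$NODE":/etc/hadoop/conf/regex-urlfilter.txt
    done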
Re: Nutch efficiency and multiple single URL crawls
Thanks for the help. Perhaps I am misunderstanding: what would be the proper way to leverage this? I am a bit new to Nutch 1.5.1; I've been using 1.4 and have generally been using runtime/deploy/bin/nutch with a .job file. I notice things are done a bit differently in 1.5.1 with the lack of a nutch runtime and nutch deploy directories. How can I run a crawl while leveraging this functionality and not having to rebuild the job file each new crawl? More specifically, I'm picturing the following workflow:

(1) update the config file to restrict the crawl to a domain -> (2) run a command that crawls the domain with the changes from the config file, without rebuilding the job file -> (3) index to Solr

My question is: what would the (general) command be for step (2)?

On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma wrote:

> Rebuilding the job file for each domain is not a good idea indeed, plus it
> adds the Hadoop overhead. But you don't have to, we write dynamic config
> files to each node's Hadoop configuration directory and it is picked up
> instead of the embedded configuration file.
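One possible shape for step (2), offered only as a hedged sketch: it assumes the updated regex-urlfilter.txt has already been copied to the nodes as sketched earlier, and it uses the one-shot crawl command that the 1.x series still ships. The seed path, crawl directory, Solr URL, depth and topN are placeholders, not values from the thread.

    # step (2): crawl one domain against the freshly pushed filter config;
    # -solr also covers step (3) by indexing to Solr as part of the same run
    # (in deploy mode the seed directory would live on HDFS, e.g. via hadoop fs -put)
    echo "http://www.nutch.org/" > seeds/urls.txt
    bin/nutch crawl seeds/ -solr http://localhost:8983/solr \
        -dir crawl-www.nutch.org -depth 2 -topN 1000

Because nothing in the .job file changes between domains, the ~25-second rebuild from the original post disappears; only the small filter file is rewritten and copied per domain.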
RE: Nutch efficiency and multiple single URL crawls
Hi,

Rebuilding the job file for each domain is not a good idea indeed, plus it adds the Hadoop overhead. But you don't have to: we write dynamic config files to each node's Hadoop configuration directory, and they are picked up instead of the embedded configuration file.

Cheers,

-----Original message-----
> From: AC Nutch
> Sent: Mon 26-Nov-2012 06:50
> To: user@nutch.apache.org
> Subject: Nutch efficiency and multiple single URL crawls
>
> Finally, the question: is there a way to add URL filters on the fly when I
> start a crawl and/or restrict a crawl to a particular domain on the fly? OR
> can you think of a decent solution to the problem, or am I missing
> something?
Re: Nutch efficiency and multiple single URL crawls
What do you mean by the "job file"?

On Sun, Nov 25, 2012 at 10:43 PM, AC Nutch wrote:

> Hello,
>
> I am using Nutch 1.5.1 and I am looking to do something specific with it. I
> have a few million base domains in a Solr index, so for example:
> http://www.nutch.org, http://www.apache.org, http://www.whatever.com etc. I
> am trying to crawl each of these base domains in deploy mode and retrieve
> all of the sub-URLs associated with each domain in the most efficient way
> possible. To give you an example of the workflow I am trying to achieve:
> (1) grab a base domain, let's say http://www.nutch.org; (2) crawl the base
> domain for all URLs in that domain, let's say http://www.nutch.org/page1,
> http://www.nutch.org/page2, http://www.nutch.org/page3, etc.; (3) store
> these results somewhere (perhaps another Solr instance); and (4) move on to
> the next base domain in my Solr index and repeat the process. Essentially I
> am just trying to grab all links associated with a page and then move on to
> the next page.
>
> The part I am having trouble with is ensuring that this workflow is
> efficient. The only way I can think to do this would be: (1) grab a base
> domain from Solr in my shell script (simple enough); (2) add an entry to
> regex-urlfilter with the domain I am looking to restrict the crawl to, in
> the example above an entry that says to only keep sub-pages of
> http://www.nutch.org/; (3) recreate the Nutch job file (~25 sec.); and (4)
> start the crawl for pages associated with the domain and do the indexing.
>
> My issue is with step #3: AFAIK, if I want to restrict a crawl to a
> specific domain I have to change regex-urlfilter and reload the job file.
> This is a pretty significant problem, since adding 25 seconds every single
> time I start a new base domain adds way too much time to my workflow
> (25 sec x a few million = way too much time). Finally, the question: is
> there a way to add URL filters on the fly when I start a crawl and/or
> restrict a crawl to a particular domain on the fly? OR can you think of a
> decent solution to the problem, or am I missing something?
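For reference, the regex-urlfilter entry described in step (2) might look like the lines below; the pattern is an illustrative assumption for www.nutch.org (regex-urlfilter.txt applies the first matching + or - rule, and a URL that matches no rule is dropped):

    # keep only sub-pages of www.nutch.org (illustrative pattern, not from the thread)
    +^https?://www\.nutch\.org/
    # reject everything else
    -.

This is the file the rest of the thread suggests overwriting in each node's Hadoop config directory instead of rebuilding the .job file for every domain.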