Hi - the generator tool partitions URLs by host, domain, or IP address, so for 
a single domain they'll all end up in the same fetch list. Since you're crawling 
only one domain, there is no need to run additional mappers. If you want to 
crawl the URLs as fast as possible (and you are allowed to do that), use only 
one mapper and increase the number of fetcher threads and the number of threads 
per queue.
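A minimal nutch-site.xml sketch of the settings above - the values are illustrative, not recommendations, and you should only use them on a host you own or have permission to hammer:

```xml
<!-- nutch-site.xml overrides for fast single-host fetching.
     Illustrative values; tune to what the host can tolerate. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
    <description>Total number of fetcher threads.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>10</value>
    <description>Threads allowed to fetch from the same host queue
    at the same time (the default of 1 enforces politeness).</description>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.0</value>
    <description>Minimum delay between requests to the same host
    when fetcher.threads.per.queue is greater than 1.</description>
  </property>
</configuration>
```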

Keep in mind that it is considered impolite to crawl a host with too many 
threads and too little delay between successive fetches. You should only do 
that if you own the host or have an agreement with its operator. Reuters.com 
won't appreciate having many URLs fetched with 10 threads and no delay.
 
-----Original message-----
> From:shri_s_ram <shrirama...@gmail.com>
> Sent: Thu 18-Oct-2012 22:40
> To: user@nutch.apache.org
> Subject: Nutch generate fetch lists for a single domain (but with multiple 
> urls) crawl
> 
> Hi, I am using Apache Nutch to crawl a website (say reuters.com). My seed
> URLs look like the following:
> 1. http://www.reuters.com/news/archive?view=page&page=1&pageSize=10
> 2. http://www.reuters.com/news/archive?view=page&page=2&pageSize=10
> ...
> Now when I use the crawl command with mapred.map.tasks set to 100 and
> partition.url.mode set to byHost, Nutch generates 100 fetch lists, but only
> one of them has all the URLs. This in turn means that out of 100 fetch jobs,
> one of them takes a long time (actually all the time). I need to fetch URLs
> from the same domain (but different URLs) in multiple fetch jobs. Can someone
> help me out with the parameter settings for this? Is this possible? Cheers
> Shriram Sridharan 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-generate-fetch-lists-for-a-single-domain-but-with-multiple-urls-crawl-tp4014573.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
