Thanks. The problem is that if I reduce the seed list to any 5 URLs, all of them are crawled, which tells me it is not a URL filtering issue. It just seems Nutch is not able to crawl more than 5 domains from the seed list. Is there a property that I am setting by mistake that's causing this behavior?

On Wed, Aug 20, 2014 at 11:38 AM, Bin Wang <binwang...@gmail.com> wrote:

> Hi S.L.,
>
> 1. Nutch follows a site's robots.txt file by default; maybe you can take
> a look at the robots rules for the missing domains by going to
> http://example.com/robots.txt?
>
> 2. Also, there are some URL filters that will be applied; maybe you can
> paste the output after you inject the seed.txt (nutch inject), so you can
> make sure all the URLs passed the filtering process.
>
> Bin
>
>
> On Tue, Aug 19, 2014 at 11:03 PM, S.L <simpleliving...@gmail.com> wrote:
>
> > Hi All,
> >
> > I have 10 domains in the seed list. *Nutch 1.7* consistently crawls
> > only 5 of those domains and ignores the other 5. Can you please let me
> > know what's preventing it from crawling all the domains?
> >
> > I am running this on *Hadoop 2.3.0* in cluster mode and giving a
> > *depth of 10* when submitting the job. I have already set the
> > *db.ignore.external.links* property to true, as I only intend to crawl
> > the domains in the seed list.
> >
> > Some relevant properties that I have set are mentioned below; *please
> > advise*.
> >
> > <property>
> >   <name>fetcher.threads.per.queue</name>
> >   <value>5</value>
> >   <description>This number is the maximum number of threads that
> >     should be allowed to access a queue at one time. Replaces
> >     deprecated parameter 'fetcher.threads.per.host'.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>db.ignore.external.links</name>
> >   <value>true</value>
> >   <description>If true, outlinks leading from a page to external hosts
> >     will be ignored. This is an effective way to limit the crawl to
> >     include only initially injected hosts, without creating complex
> >     URLFilters.
> >   </description>
> > </property>
> >
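
For reference, the two properties quoted above could be collected into one well-formed override file. This is only a sketch: it assumes the overrides live in conf/nutch-site.xml under the usual Hadoop-style <configuration> root, which the thread does not show, and it uses the values exactly as posted. It is not a suggested fix for the 5-domain limit.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml (file location assumed, not stated in the thread). -->
<configuration>

  <!-- As posted: at most 5 threads may access one fetch queue at a time. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
  </property>

  <!-- As posted: ignore outlinks that lead to hosts other than the page's own. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

</configuration>
```

Note that the asterisks around the property names in the original mail are emphasis markers from the mail client; they must not appear inside the <name> elements of the real file.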