Thanks, the problem is that if I reduce the URLs in the seed list to any 5,
all of them are crawled, which tells me it's not a URL filtering issue. It
just seems that Nutch is not able to crawl more than 5 domains from the seed
list. Is there a property that I am setting by mistake that's causing this
behavior?
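
(To double-check the injection/filtering side that Bin mentions below, the
readdb tool can list what actually made it into the CrawlDb. The paths
crawl/crawldb, urls/ and crawldb_dump are only placeholders -- substitute
your own crawl and seed directories, and read the dump with hadoop fs -text
instead of grep if it lands on HDFS.)

    # Inject the seeds and summarize what ended up in the CrawlDb
    bin/nutch inject crawl/crawldb urls/
    bin/nutch readdb crawl/crawldb -stats

    # Dump the CrawlDb as text to see exactly which seed URLs survived
    # injection and URL filtering
    bin/nutch readdb crawl/crawldb -dump crawldb_dump
    grep -h "^http" crawldb_dump/part-* | sort

    # Per Bin's first point, also check the robots.txt of a skipped domain
    curl -s http://example.com/robots.txt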


On Wed, Aug 20, 2014 at 11:38 AM, Bin Wang <binwang...@gmail.com> wrote:

> Hi S.L.,
>
> 1. Nutch follows a site's robots.txt file by default; maybe you can take a
> look at the robots rules for the missing domains by going to
> http://example.com/robots.txt?
>
> 2. Also, some URL filters will be applied; maybe you can paste the output
> after you inject the seed.txt (nutch inject), so you can make sure all the
> URLs pass the filtering process.
>
> Bin
>
>
> On Tue, Aug 19, 2014 at 11:03 PM, S.L <simpleliving...@gmail.com> wrote:
>
> > Hi All,
> >
> > I have 10 domains in the seed list. *Nutch 1.7* consistently crawls only
> > 5 of those domains and ignores the other 5. Can you please let me know
> > what's preventing it from crawling all the domains?
> >
> > I am running this on *Hadoop 2.3.0* in cluster mode and giving a *depth
> > of 10* when submitting the job. I have already set the
> > *db.ignore.external.links* property to true, as I only intend to crawl
> > the domains in the seed list.
> >
> > Some relevant properties that I have set are mentioned below; *please
> > advise*.
> >
> > <property>
> >     <name>fetcher.threads.per.queue</name>
> >     <value>5</value>
> >     <description>This number is the maximum number of threads that
> >         should be allowed to access a queue at one time. Replaces
> >         deprecated parameter 'fetcher.threads.per.host'.
> >     </description>
> > </property>
> >
> > <property>
> >     <name>db.ignore.external.links</name>
> >     <value>true</value>
> >     <description>If true, outlinks leading from a page to external hosts
> >         will be ignored. This is an effective way to limit the crawl to
> >         include only initially injected hosts, without creating complex
> >         URLFilters.
> >     </description>
> > </property>
> >
>
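
(One more thing that might be worth checking, in addition to the two
properties quoted above: the generator can also cap how many URLs per host
or domain it selects each cycle, which would show up as only a subset of the
seed domains being fetched. A quick, hedged way to inspect the effective
settings -- the conf/ paths assume a standard Nutch 1.x layout and may
differ in your deployment:)

    # Show any per-cycle limits that could explain only some hosts being
    # fetched each round (property names as in nutch-default.xml for 1.x)
    grep -A 3 -E "generate\.max\.count|generate\.count\.mode|fetcher\.queue\.mode|fetcher\.timelimit\.mins" \
      conf/nutch-site.xml conf/nutch-default.xml

It may also be worth checking the -topN value passed to the generate step,
since that caps the total number of URLs selected per fetch cycle.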
