late a
> lot of things for most records and then the reducer limits records by host
> or domain taking a lot of additional CPU time and RAM.
>
> You must disable filtering and normalizing but this will only help for a
> short while. If the CrawlDB grows again you must use a cluster to do the work.
Hi,
This is not quite similar but there's a new parameter for the generator
in Nutch 1.5 where you can restrict selection by status.
Cheers
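
For reference, a minimal sketch of how that restriction might be configured
in conf/nutch-site.xml. It assumes the parameter meant above is the
generate.restrict.status property and that db_unfetched is the status name
for unfetched entries; neither name is given in this thread, so check both
against the nutch-default.xml shipped with your release before relying on it.

```xml
<!-- Sketch only: the property name and the status value are assumptions,
     not quoted from this thread. Verify against nutch-default.xml. -->
<property>
  <name>generate.restrict.status</name>
  <value>db_unfetched</value>
  <description>Have the generator select only CrawlDB entries with this status.</description>
</property>
```

With a setting like this the generator should skip entries in any other
status, so only unfetched URLs would end up in the fetch list.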
Original Message
Subject: Re: Make Nutch to crawl internal urls only
Date: Thu, 10 May 2012 02:24:18 -0700 (PDT)
From: Greg Field
I have a similar problem. Is there a way I can force the fetcher to only take
URLs from the unfetched URL list?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976568.html
Sent from the Nutch - User mailing list archive at Nabble.com.
You must disable filtering and normalizing but this will only help for
a short while. If the CrawlDB grows again you must use a cluster to do
the work.
Thanks,
James Ford
--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976511.html
ng so long? It can't take that
much time selecting X urls from a database of about 10 million URLs?
Thanks,
James Ford
> for each iteration?
>
> What should I set "db.max.outlinks.per.page" to? I was wondering about
> setting it to 4, to get 4*5k=20k for the first iteration?
>
> Can anyone help me?
>
> Thanks,
> James Ford
Thanks,
James Ford
--
View this message in context:
http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex