Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Julien Nioche
late a > lot of things for most records and then the reducer limits records by host > or domain taking a lot of additional CPU time and RAM. > > You must disable filtering and normalizing but this will only help for a > short while. If the CrawlDB grows again you must use a cluster t

Fwd: Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma
Hi, This is not quite similar but there's a new parameter for the generator in Nutch 1.5 where you can restrict selection by status. Cheers Original Message Subject: Re: Make Nutch to crawl internal urls only Date: Thu, 10 May 2012 02:24:18 -0700 (PDT) From: Greg Field

Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Greg Fields
I have a similar problem. Is there a way I can force the fetcher to only take URLs from the unfetched URL list? -- View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397p3976568.html Sent from the Nutch - User mailing list archive at

Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma
isable filtering and normalizing but this will only help for a short while. If the CrawlDB grows again you must use a cluster to do the work. Thanks, James Ford

Re: Make Nutch to crawl internal urls only

2012-05-10 Thread James Ford
ng so long? It can't take that much time selecting X urls from a database of about 10 million URLs? Thanks, James Ford

Re: Make Nutch to crawl internal urls only

2012-05-09 Thread Ken Krugler
> for each iteration? > > What should I set "db.max.outlinks.per.page" to? I was wondering about > setting it to 4, to get 4*5k=20k for the first iteration? > > Can anyone help me? > > Thanks, > James Ford > > -- > View this message in context: > htt
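
The "4*5k=20k" estimate quoted above can be checked with a small sketch. It assumes that "5k" means about 5,000 pages fetched in the first iteration (the snippet does not say so explicitly) and that "db.max.outlinks.per.page" caps the outlinks kept per fetched page. The function name is hypothetical, and the result is only an upper bound: duplicate URLs and URL filters would normally reduce the real number.

```python
def max_new_urls(pages_fetched: int, max_outlinks_per_page: int) -> int:
    """Upper bound on outlinks one fetch round can add to the CrawlDB.

    Hypothetical helper for illustration only; it is not part of Nutch.
    It ignores deduplication and URL filtering, so real counts are lower.
    """
    if pages_fetched < 0 or max_outlinks_per_page < 0:
        raise ValueError("both arguments must be non-negative")
    return pages_fetched * max_outlinks_per_page


# Assumed reading of the post: 5,000 pages, at most 4 outlinks kept per page.
first_iteration_bound = max_new_urls(5_000, 4)
print(first_iteration_bound)  # 20000
```

If the bound holds for every round, repeating the call with the previous result as pages_fetched gives a rough ceiling for later iterations.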

Re: Make Nutch to crawl internal urls only

2012-05-09 Thread Markus Jelsma
anks, James Ford -- Markus Jelsma - CTO - Openindex

Make Nutch to crawl internal urls only

2012-05-09 Thread James Ford
ext: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html Sent from the Nutch - User mailing list archive at Nabble.com.