Hi,
I seeded 4 URLs, all in the same domain.
I am running fetch with 20 threads and 80 numTasks. The job is stuck on
the last reduce task.
I ran a dump of the readdb to check the status, and I see that 122K of the
total 133K URLs are still 'status_unfetched'. This is after 12 hours. The
delay between fetches is 5s (the default).
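
For reference, these are (if I read nutch-default.xml correctly) the default
politeness settings that apply here; the notes in parentheses are my own
reading of them:

  fetcher.server.delay      = 5.0     (seconds between requests to the same host)
  fetcher.threads.per.queue = 1       (fetch threads allowed per host queue)
  fetcher.threads.fetch     = 10      (default; I run the fetch with 20 threads)
  fetcher.queue.mode        = byHost  (all URLs of one host share one queue)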

My Hadoop cluster has 10 datanodes, each with about 24 cores and 48 GB of RAM.
The average page size is 150 KB, and the site I am crawling responds
fast enough (it is internal).
So I do not understand where the bottleneck is.

The crawl is still not complete.



On Tue, Jul 2, 2013 at 5:12 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi,
>
> Nutch can easily scale to many billions of records; it just depends on how
> many nodes you have and how powerful they are. Crawl speed is not very
> relevant as it is always very fast; the problem is usually updating the
> databases. If you spread your data over more machines you will increase
> your throughput! We can easily manage 2M records on a very small 1-core,
> 1 GB VPS, but we can also manage dozens of billions of records on a small
> cluster of five 16-core, 16 GB nodes. It depends on your cluster!
>
> Cheers,
> Markus
>
>
>
> -----Original message-----
> > From: h b <hb6...@gmail.com>
> > Sent: Tuesday 2nd July 2013 7:35
> > To: user@nutch.apache.org
> > Subject: Nutch scalability tests
> >
> > Hi
> > Does anyone have some stats on scalability, i.e. how many URLs you
> > crawled and how long it took? These stats obviously depend on the
> > environment and the site(s) crawled, but it would be nice to see some
> > numbers here.
> >
> > I used Nutch with HBase and Solr and have a nice working environment;
> > so far I have been able to crawl a limited (really, a very limited) set
> > of URLs satisfactorily. Now that I have a proof of concept, I want to
> > run it full blown, but before I do that I want to see whether my setup
> > can even handle it. If not, I want to see how I can throttle my runs.
> > So some stats/test results would be nice to have.
> >
> >
> > Regards
> > Hemant
> >
>
