Hi,

Please try *http://s.apache.org/mo*, specifically the generate.max.count property.

A great many URLs are unfetched here... look into the logs and see what is going on. This is really quite bad, and there is most likely one, or a small number, of reasons which ultimately explain why so many URLs remain unfetched.

Golden rule here: unless you know that large fetches can be left to their own devices (unmonitored), generate many small fetch lists and check on their progress. This really helps you improve throughput and increases the productivity-to-time ratio of your fetching tasks.

hth
Lewis
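P.S. As a rough sketch (the property names are as I recall them from nutch-default.xml, and the values are purely illustrative, not recommendations), something along these lines in conf/nutch-site.xml caps how many URLs from any single host or domain end up in a given fetch list:

  <property>
    <name>generate.max.count</name>
    <!-- illustrative cap: at most 1000 URLs per host/domain in each fetch
         list; the default of -1 means unlimited -->
    <value>1000</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <!-- apply the cap per domain rather than per host -->
    <value>domain</value>
  </property>

Combined with a modest -topN on the generate step, this keeps each segment small enough that you can watch it finish and spot problems early, rather than leaving one huge fetch to run unmonitored.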
On Tue, Jul 2, 2013 at 2:48 PM, h b <hb6...@gmail.com> wrote:

> Hi,
> I seeded 4 urls, all in the same domain.
> I am running fetch with 20 threads and 80 numTasks. The reducer is stuck
> on the last reduce.
> I ran a dump of the readdb to see the status, and I see 122K of the total
> 133K urls are 'status_unfetched'. This is after 12 hours. The delay
> between fetches is 5s (default).
>
> My Hadoop cluster has 10 datanodes, each with about 24 cores and 48G RAM.
> The average size of each page is 150KB. The site I am crawling responds
> fast enough (it is internal), so I do not understand where the bottleneck
> is.
>
> It is still not complete.
>
>
> On Tue, Jul 2, 2013 at 5:12 AM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Hi,
> >
> > Nutch can easily scale to many, many billions of records; it just
> > depends on how many nodes you have and how powerful they are. Crawl
> > speed is not very relevant as it is always very fast; the problem is
> > usually updating the databases. If you spread your data over more
> > machines you will increase your throughput! We can easily manage 2m
> > records on a very small 1-core, 1GB VPS, but we can also manage dozens
> > of billions of records on a small cluster of five 16-core, 16GB nodes.
> > It depends on your cluster!
> >
> > Cheers,
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From: h b <hb6...@gmail.com>
> > > Sent: Tuesday 2nd July 2013 7:35
> > > To: user@nutch.apache.org
> > > Subject: Nutch scalability tests
> > >
> > > Hi,
> > > Does anyone have some stats around scalability, i.e. how many urls
> > > you crawled and how long it took? These stats certainly depend on the
> > > environment and the site(s) crawled, but it would be nice to see some
> > > numbers here.
> > >
> > > I used Nutch with HBase and Solr and have got a nice working
> > > environment, and so far have been able to crawl a limited, rather a
> > > very very limited, set of urls satisfactorily. Now that I have a
> > > proof of concept, I want to run it full blown, but before I do that,
> > > I want to see if my setup can even handle this. If not, I want to see
> > > how I can throttle my runs. So some stats/test results would be nice
> > > to have.
> > >
> > >
> > > Regards
> > > Hemant
> > >

--
*Lewis*