@otis: or did you mean nutch/host? I have only one server for my tests.
@larsson: my spinwaiting phase is usually less than 30 minutes.
Something I noticed as well is that the speed in the beginning is so fast that I can't read the screen. I'm not sure whether the standard output line is printed at start_fetch or at fetch_complete.

-Raymond-

2009/5/27 Raymond Balmès <[email protected]>

> I have many URLs per host, of course. I need to get all the pages of the
> sites; I don't understand the question.
>
> -Raymond
>
> 2009/5/26 Otis Gospodnetic <[email protected]>
>
>> But how, Ray, if you have only 1 URL per host?
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> > From: Raymond Balmès <[email protected]>
>> > To: [email protected]
>> > Sent: Tuesday, May 26, 2009 4:11:27 PM
>> > Subject: Re: threads get stuck in spinwaiting
>> >
>> > Observing what my crawls do, I believe Ken must be right.
>> > Towards the end of the crawl (when fetchQueues.totalSize="xxxx" counts
>> > down), in some cases I'm fetching from roughly two sites only, so
>> > politeness does start to play a role there, or at least it should.
>> >
>> > -Ray-
>> >
>> > 2009/5/26 Raymond Balmès
>> >
>> > > Please read this too:
>> > >
>> > > http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>> > >
>> > > Interesting build from Ken.
>> > >
>> > > 2009/5/26 Raymond Balmès
>> > >
>> > >> Yes, already reported in multiple threads.
>> > >> I noted that if one does a "recrawl" you don't get this behavior...
>> > >> no idea why.
>> > >>
>> > >> -Raymond-
>> > >>
>> > >> 2009/5/26 Larsson85
>> > >>
>> > >>> When I try to do my crawl it seems like the threads get stuck in some
>> > >>> spinwaiting mode. At first the crawl goes as planned, and I couldn't
>> > >>> be happier. But after some time, it starts reporting more of these
>> > >>> spinwaiting messages.
>> > >>>
>> > >>> I print a log here to show you what it looks like. As you can see, it
>> > >>> gets stuck, and the queue decreases by 1 all the time. I've tried
>> > >>> doing a smaller crawl, and what happens is that it counts down until
>> > >>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
>> > >>>
>> > >>> But the problem is that this countdown is very slow; there's no
>> > >>> effective crawling going on, and it uses neither bandwidth nor CPU
>> > >>> power. Basically, this costs way too much time; I can't let it go on
>> > >>> like this for hours before it is done.
>> > >>> How can I fix this?
>> > >>>
>> > >>> After about an hour of crawling, this is what the log looks like:
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>> > >>> - fetching http://home.swipnet.se/~w-147200/
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>> > >>> - fetching http://biphome.spray.se/alarsson/
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>> > >>> - fetching http://home.swipnet.se/~w-31853/html/
>> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
>> > >>>
>> > >>> ....
>> > >>>
>> > >>> --
>> > >>> View this message in context:
>> > >>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
>> > >>> Sent from the Nutch - User mailing list archive at Nabble.com.
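
For anyone following the politeness argument above, here is a rough back-of-the-envelope sketch in Python (not Nutch code) of why the countdown can be so slow once only a couple of hosts remain. It assumes each host is fetched by one thread at a time with a fixed delay between requests. The function name min_drain_seconds, the 5-second delay, the example.* hostnames and the even split of 2526 URLs over two hosts are all illustrative assumptions, not values taken from anyone's configuration.

```python
from collections import Counter
from urllib.parse import urlparse


def min_drain_seconds(urls, delay_seconds):
    """Rough time to drain a fetch queue under per-host politeness.

    Assumes one request at a time per host and a fixed delay between
    requests to the same host; the actual download time is ignored.
    """
    per_host = Counter(urlparse(u).netloc for u in urls)
    if not per_host:
        return 0.0
    # Different hosts can be fetched in parallel, so the host with the
    # longest queue determines how long the whole queue takes.
    return max(per_host.values()) * delay_seconds


# Hypothetical queue: 2526 URLs split evenly across two hosts.
queue = [f"http://host-a.example/page{i}" for i in range(1263)]
queue += [f"http://host-b.example/page{i}" for i in range(1263)]

seconds = min_drain_seconds(queue, delay_seconds=5.0)
print(f"roughly {seconds / 3600:.2f} hours")
```

Under these assumptions the number of fetcher threads does not appear in the estimate at all: with only two hosts left, at most two threads can do useful work at any moment, which would be consistent with nearly all threads showing up as spinWaiting in the log.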
