I have many URLs per host, of course. I need to get all the pages of the sites, so I don't understand the question.
-Raymond

2009/5/26 Otis Gospodnetic <[email protected]>

> But how, Ray, if you have only 1 URL per host?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Raymond Balmès <[email protected]>
> > To: [email protected]
> > Sent: Tuesday, May 26, 2009 4:11:27 PM
> > Subject: Re: threads get stuck in spinwaiting
> >
> > Observing what my crawls do, I believe Ken must be right.
> > Towards the end of the crawl (when the fetchqueues.totalSize="xxxx"
> > counts down), in some cases I'm only fetching from roughly two sites,
> > so politeness does start to play a role there, or at least it should.
> >
> > -Ray-
> >
> > 2009/5/26 Raymond Balmès
> >
> > > Please read this too:
> > >
> > > http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> > >
> > > Interesting build from ken.
> > >
> > > 2009/5/26 Raymond Balmès
> > >
> > >> Yes, already reported in multiple threads.
> > >> I noted that if one does a "recrawl" you don't get this behavior... no
> > >> idea why.
> > >>
> > >> -Raymond-
> > >>
> > >> 2009/5/26 Larsson85
> > >>
> > >>> When I try to do my crawl, it seems like the threads get stuck in
> > >>> some spinwaiting mode. At first the crawl goes as planned, and I
> > >>> couldn't be happier. But after some time, it starts reporting more
> > >>> of these spinwaiting messages.
> > >>>
> > >>> I print a log here to show you what it looks like. As you can see, it
> > >>> gets stuck, and the queue decreases by 1 all the time. I've tried
> > >>> doing a smaller crawl, and what happens is that it counts down until
> > >>> the "fetchQueues.totalSize" reaches 0, and then the crawl is done.
> > >>>
> > >>> But the problem is that this countdown is very slow; there's no
> > >>> effective crawling going on, and it uses neither bandwidth nor CPU
> > >>> power. Basically, this costs way too much time. I can't let it go on
> > >>> like this for hours to be done.
> > >>> How can I fix this?
> > >>>
> > >>> After about an hour of crawling, this is what the log looks like:
> > >>>
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>> - fetching http://home.swipnet.se/~w-147200/
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>> - fetching http://biphome.spray.se/alarsson/
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>> - fetching http://home.swipnet.se/~w-31853/html/
> > >>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
> > >>>
> > >>> ....
> > >>>
> > >>> --
> > >>> View this message in context:
> > >>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
> > >>> Sent from the Nutch - User mailing list archive at Nabble.com.
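To put rough numbers on the politeness explanation discussed in this thread, here is a small Python sketch. It is only a back-of-the-envelope model, not Nutch's actual scheduler: the function name `min_drain_seconds`, the 5-second delay, the even split of the 2526 queued URLs over two hosts, and the host names are all assumptions made for illustration.

```python
import math


def min_drain_seconds(queue_sizes, delay_s, threads_per_host=1):
    """Lower bound on the time needed to empty per-host fetch queues.

    Simplified model (an assumption, not Nutch's real scheduler): each host
    accepts at most `threads_per_host` requests per round, and `delay_s`
    seconds must pass between successive rounds on the same host.
    Download time is ignored, so a real crawl would take longer.
    """
    worst = 0.0
    for host, pending in queue_sizes.items():
        rounds = math.ceil(pending / threads_per_host)
        # The first round starts immediately; every later round waits delay_s.
        worst = max(worst, max(0, rounds - 1) * delay_s)
    return worst


# Hypothetical split of the 2526 queued URLs over two hosts, 5 s delay.
remaining = {"host-a.example": 1263, "host-b.example": 1263}
print(min_drain_seconds(remaining, delay_s=5.0))
```

Under these assumptions the queue cannot drain in less than 6310 seconds, about 1 hour 45 minutes. The total thread count does not appear anywhere in the calculation, which would be consistent with 1000 threads all reporting spinWaiting while the queue shrinks by one URL at a time.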
