You're probably right that it has something to do with the politeness. I didn't notice it before, but now that you mention it, I can see that all the pages it's fetching at the end of the crawl are from the same domain. Is there any way to turn off the politeness, or perhaps make it less polite to speed things up? I did a test run today, and it has been stuck in this spinwaiting state for about 3 hours, which is not acceptable.
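As a rough sanity check on whether politeness alone could explain a stall of this length, here is a small sketch that estimates a lower bound on the remaining fetch time from a list of queued URLs. It assumes the crawler fetches one URL at a time per host with a fixed pause between requests; the 5-second default and the name `estimate_tail_seconds` are illustrative assumptions, not values read from my Nutch configuration.

```python
from collections import Counter
from urllib.parse import urlparse


def estimate_tail_seconds(urls, delay_per_host=5.0):
    """Lower bound on the time needed to drain the queue.

    Assumes one request at a time per host and a fixed delay between
    requests to the same host (delay_per_host is an assumed value).
    The host with the most queued URLs determines the bound.
    """
    # Count how many queued URLs belong to each host.
    per_host = Counter(urlparse(u).hostname for u in urls)
    if not per_host:
        return 0.0
    # The busiest host cannot be drained faster than one URL per delay.
    return max(per_host.values()) * delay_per_host


if __name__ == "__main__":
    queued = [
        "http://home.swipnet.se/~w-147200/",
        "http://home.swipnet.se/~w-31853/html/",
        "http://biphome.spray.se/alarsson/",
    ]
    print(estimate_tail_seconds(queued))
```

Under these assumptions, if the 2526 URLs left in the queue all sat on a single host, the bound would be 2526 × 5 = 12630 seconds, roughly 3.5 hours, which is in the same range as the stall I'm seeing.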
Perhaps the problem is that my starting URL list is too small. I'm using the dmoz list from the Nutch tutorial, and I have a filter on .se and .nu domains, which probably disqualifies a lot of the URLs in the list. Any tips on where to get a bigger list? And most importantly, any tips on how I can turn off the politeness, or at least make it less polite?

Thanks for all the help.

Raymond Balmès wrote:
>
> Observing what my crawls do, I believe Ken must be right.
> Towards the end of the crawl (when the fetchqueues.totalSize="xxxx" counts
> down) in some cases I'm only fetching on roughly two sites, so indeed the
> politeness starts to play a role there, or at least it should.
>
> -Ray-
>
> 2009/5/26 Raymond Balmès <[email protected]>
>
>> Please read this too:
>>
>> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>>
>> Interesting build from ken.
>>
>> 2009/5/26 Raymond Balmès <[email protected]>
>>
>> yes already reported in multiple-threads.
>>> I noted that if one does a "recrawl" you don't get this behavior... no
>>> idea why.
>>>
>>> -Raymond-
>>>
>>> 2009/5/26 Larsson85 <[email protected]>
>>>
>>>> When I try to do my crawl it seems like the threads get stuck in some
>>>> spinwaiting mode. At first the crawl goes as planned, and I couldn't be
>>>> happier. But after some time, it starts reporting more of these
>>>> spinwaiting messages.
>>>>
>>>> I print a log here to show you what it looks like. As you can see it
>>>> gets stuck, and the queue decreases by 1 all the time. I've tried doing
>>>> a smaller crawl, and what happens is that it counts down until the
>>>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
>>>>
>>>> But the problem is that this countdown is very slow; there's no
>>>> effective crawling going on, and it's not using either bandwidth or CPU
>>>> power. Basically, this costs way too much time; I can't let it go on
>>>> like this for hours to be done.
>>>> How can I fix this?
>>>>
>>>> after about an hour of crawling this is what the log looks like
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>>>> - fetching http://home.swipnet.se/~w-147200/
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>>>> - fetching http://biphome.spray.se/alarsson/
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>>>> - fetching http://home.swipnet.se/~w-31853/html/
>>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
>>>>
>>>> ....
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
>>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>>
>>>
>>
>

--
View this message in context: http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23742537.html
Sent from the Nutch - User mailing list archive at Nabble.com.
