That could be true, but is that something I, as a Nutch user, can configure?
It's interesting that your spin-waiting only takes about 30 minutes, while
mine takes a whole lot longer: at least a couple of hours when the queue size
is around 2,000-3,000.

I've tried adding the following lines to my nutch-default.xml; perhaps that
will help. I'm going to do a test run soon.
(For some reason, it doesn't seem like the things I put in nutch-site.xml get
loaded, which is why I put it in nutch-default.xml.)

 <property>
   <name>fetcher.threads.per.host</name>
   <value>10</value>
   <description>This number is the maximum number of threads that
     should be allowed to access a host at one time.</description>
 </property>
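
For reference, a minimal sketch of how the same override might look in
conf/nutch-site.xml, assuming a Nutch 1.0-style setup. Properties are only
read when they sit inside a <configuration> root element, so a missing root
element is one possible reason (an assumption, not confirmed in this thread)
for settings appearing not to load. generate.max.per.host is recalled from the
stock nutch-default.xml and relates to the generate-phase suggestion quoted
below; the values shown are illustrative guesses, not tested recommendations.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml; all values are illustrative assumptions. -->
<configuration>

  <!-- Same override as above: allow several fetcher threads per host. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value>
  </property>

  <!-- Assumed cap on URLs per host in one generated fetch list
       (-1 is believed to mean "no limit"), so that a single domain
       cannot dominate the tail of the crawl. -->
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>

</configuration>
```

With a single thread per host, fetcher.server.delay is, as far as I recall, the
setting that controls the wait between requests to the same server. Loosening
any of these puts more load on the crawled hosts, so small values are probably
only appropriate for sites you are allowed to hit that hard.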


Raymond Balmès wrote:
> 
> Maybe the problem is not in the fetcher but rather in the generate fetch
> list phase, which should take care not to group all URLs from the same
> domain together.
> 
> -Ray-
> 
> 2009/5/27 Larsson85 <[email protected]>
> 
>>
>> You're probably right that it has something to do with the politeness. I
>> didn't notice it before, but now that you mention it I can see that all
>> the pages it's fetching at the end of the crawl are from the same domain.
>> Is there any way to turn off the politeness, or perhaps make it less
>> polite to speed things up? I've been doing a test run today, and the
>> result is that it has been stuck in this spinwaiting state for about
>> 3 hours, which is not acceptable.
>>
>> Perhaps I'm using too small a URL list to start with. I'm using
>> the dmoz list from the Nutch tutorial, and I have a filter on .se and .nu
>> domains, which probably disqualifies a lot of the URLs in the list. Any
>> tips on where to get a bigger list? And most importantly, any tips on how
>> I can turn off the politeness, or at least make it less polite?
>> Thanks for all the help.
>>
>>
>> Raymond Balmès wrote:
>> >
>> > Observing what my crawls do, I believe Ken must be right.
>> > Towards the end of the crawl (when the fetchQueues.totalSize="xxxx"
>> > counts down), in some cases I'm fetching from only about two sites,
>> > so the politeness does start to play a role there, or at least it should.
>> >
>> > -Ray-
>> >
>> > 2009/5/26 Raymond Balmès <[email protected]>
>> >
>> >> Please read this too:
>> >>
>> >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>> >>
>> >> Interesting post from Ken.
>> >>
>> >> 2009/5/26 Raymond Balmès <[email protected]>
>> >>
>> >>> Yes, this has already been reported in multiple threads.
>> >>> I noted that if one does a "recrawl" you don't get this behavior...
>> >>> no idea why.
>> >>>
>> >>> -Raymond-
>> >>>
>> >>> 2009/5/26 Larsson85 <[email protected]>
>> >>>
>> >>>
>> >>>> When I try to do my crawl, it seems like the threads get stuck in some
>> >>>> spinwaiting mode. At first the crawl goes as planned, and I couldn't
>> >>>> be happier. But after some time, it starts reporting more of these
>> >>>> spinwaiting messages.
>> >>>>
>> >>>> I've printed a log here to show you what it looks like. As you can see,
>> >>>> it gets stuck, and the queue decreases by 1 each time. I've tried doing
>> >>>> a smaller crawl, and what happens is that it counts down until
>> >>>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
>> >>>>
>> >>>> But the problem is that this countdown is very slow. There's no
>> >>>> effective crawling going on, and it isn't using either bandwidth or CPU
>> >>>> power. Basically, this costs way too much time; I can't let it go on
>> >>>> like this for hours to be done.
>> >>>> How can I fix this?
>> >>>>
>> >>>>
>> >>>> After about an hour of crawling, this is what the log looks like:
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>> >>>>  - fetching http://home.swipnet.se/~w-147200/
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>> >>>>  - fetching http://biphome.spray.se/alarsson/
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>> >>>>  - fetching http://home.swipnet.se/~w-31853/html/
>> >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
>> >>>>
>> >>>> ....
>> >>>>
>> >>>> --
>> >>>> View this message in context:
>> >>>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
>> >>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>> >
>>
> 
> 

