I have many URLs per host, of course; I need to fetch all the pages of each
site. I don't understand the question.

-Raymond

2009/5/26 Otis Gospodnetic <[email protected]>

>
> But how, Ray, if you have only 1 URL per host?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
> > From: Raymond Balmès <[email protected]>
> > To: [email protected]
> > Sent: Tuesday, May 26, 2009 4:11:27 PM
> > Subject: Re: threads get stuck in spinwaiting
> >
> > Observing what my crawls do, I believe Ken must be right.
> > Towards the end of the crawl (when fetchqueues.totalSize="xxxx" counts
> > down), in some cases I'm only fetching from roughly two sites, so
> > politeness does indeed start to play a role there, or at least it should.
> >
> > -Ray-
> >
> > 2009/5/26 Raymond Balmès
> >
> > > Please read this too:
> > >
> > > http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> > >
> > > Interesting build from Ken.
> > >
> > > 2009/5/26 Raymond Balmès
> > >
> > > >> Yes, this was already reported in multiple threads.
> > >> I noted that if one does a "recrawl" you don't get this behavior... no
> > >> idea why.
> > >>
> > >> -Raymond-
> > >>
> > >> 2009/5/26 Larsson85
> > >>
> > >>
> > > >>> When I try to do my crawl, it seems like the threads get stuck in some
> > > >>> spinwaiting mode. At first the crawl goes as planned, and I couldn't be
> > > >>> happier. But after some time, it starts reporting more of these
> > > >>> spinwaiting messages.
> > > >>>
> > > >>> I print a log here to show you what it looks like. As you can see, it
> > > >>> gets stuck, and the queue decreases by 1 all the time. I've tried doing
> > > >>> a smaller crawl, and what happens is that it counts down until
> > > >>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
> > > >>>
> > > >>> But the problem is that this countdown is very slow; there's no
> > > >>> effective crawling going on, and it is not using either bandwidth or
> > > >>> CPU power. Basically, this costs way too much time; I can't let it go
> > > >>> on like this for hours to be done.
> > > >>> How can I fix this?
> > >>>
> > >>>
> > > >>> After about an hour of crawling, this is what the log looks like:
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>  - fetching http://home.swipnet.se/~w-147200/
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>  - fetching http://biphome.spray.se/alarsson/
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>  - fetching http://home.swipnet.se/~w-31853/html/
> > >>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
> > >>>
> > >>> ....
> > >>>
> > >>> --
> > > >>> View this message in context:
> > > >>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
> > >>> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >>>
> > >>>
> > >>
> > >
>
>
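
As a rough illustration of the politeness effect discussed in this thread, the
sketch below estimates how long a queue takes to drain when requests to the
same host must be spaced out. It is not Nutch code. The function names are
hypothetical, and the numbers are assumptions: a 5-second delay between
requests to one host, one fetching thread per host, and the 2526 queued URLs
from the log split evenly between the two hosts that appear in it.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_queue_sizes(urls):
    """Count how many pending URLs belong to each host."""
    return Counter(urlsplit(u).hostname for u in urls)


def min_remaining_seconds(urls, delay_seconds=5.0):
    """Rough time needed to drain the queue with one thread per host.

    Requests to a host are spaced `delay_seconds` apart, so the largest
    per-host queue sets the pace no matter how many threads exist.
    """
    sizes = host_queue_sizes(urls)
    if not sizes:
        return 0.0
    return max(sizes.values()) * delay_seconds


def busy_thread_limit(urls):
    """With one thread per host, at most one thread per distinct host works."""
    return len(host_queue_sizes(urls))


if __name__ == "__main__":
    # Hypothetical split of the 2526 queued URLs across two hosts.
    pending = (
        ["http://home.swipnet.se/page%d" % i for i in range(1263)]
        + ["http://biphome.spray.se/page%d" % i for i in range(1263)]
    )
    print(min_remaining_seconds(pending))  # 6315.0 seconds, about 1.75 hours
    print(busy_thread_limit(pending))      # 2 threads can fetch; the rest wait
```

Under these assumptions only 2 of the 1000 threads can do useful work, which
would match a log where nearly every thread is reported as spinwaiting while
the queue counts down one URL at a time.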
