How can I set the max URLs/host to be fetched per run?
I use the following shell script for my crawling. As you can see, I try
to set the topN value very low; here I have 5. But it doesn't seem to
care about the topN value: the first segment it crawled with this script
was about 140 MB. Are you talking about the topN value when you say I
should set the max URLs/host, or is there another setting I haven't
found yet?

http://pastebin.com/m33bb6e6b
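
From what I can tell, topN only caps the total size of each generated
fetchlist, not the share that any one host gets. Is generate.max.per.host
perhaps the setting you mean? It shows up in the nutch-default.xml that
ships with my version (default -1, i.e. unlimited), and I'm guessing the
override in conf/nutch-site.xml would look roughly like this (the value
100 is only illustrative):

  <property>
    <name>generate.max.per.host</name>
    <!-- 100 is only an illustrative value; -1 (the default) means unlimited -->
    <value>100</value>
    <description>Maximum number of URLs per host allowed into a single
      fetchlist.</description>
  </property>

With something like that, each generate/fetch cycle would take at most
100 pages from any one host, and repeated cycles would pick up the rest.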


Ken Krugler wrote:
> 
>>That could be true, but is that something I, as a Nutch user, can
>>configure?
>>It's interesting that your spin-waiting only takes about 30 minutes,
>>while mine takes a whole lot longer: at least a couple of hours if the
>>queue size is about 2,000-3,000.
>>
>>I've tried adding the following lines to my nutch-default.xml; perhaps
>>that can help. I'm going to do a test run soon.
>>(For some reason, the things I put in nutch-site.xml don't seem to get
>>loaded, which is why I put them in nutch-default.xml.)
>>
>>  <property>
>>    <name>fetcher.threads.per.host</name>
>>    <value>10</value>
>>    <description>This number is the maximum number of threads that
>>      should be allowed to access a host at one time.</description>
>>  </property>
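>>
>>(If I understand the setup right, overrides are supposed to live in
>>conf/nutch-site.xml, wrapped in a single <configuration> element,
>>roughly like the sketch below, so either my wrapper or my classpath
>>must be off.)
>>
>><?xml version="1.0"?>
>><configuration>
>>  <!-- sketch only: conf/ has to be on the classpath for this file
>>       to be picked up -->
>>  <property>
>>    <name>fetcher.threads.per.host</name>
>>    <value>10</value>
>>  </property>
>></configuration>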
> 
> If you are going to be hitting servers that you
> don't control or have special arrangements with
> (e.g. domains that end in .se and .nu), and 
> you're using a very impolite setting like 10 
> threads/host, then please, please make sure your 
> user agent string clearly specifies how the 
> enraged IT people can directly get in touch with 
> you.
> 
> Though including your home phone number might not be a good idea :)
> 
> And by all means do not leave it set to the 
> default Nutch user agent string - in fact, it 
> would be best if you didn't mention Nutch at all, 
> as a number of sites block any crawler that has 
> "nutch" in the user agent string due to people 
> running crawls with settings such as what you've 
> referenced above.
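> 
> For example, something along these lines in conf/nutch-site.xml (the
> http.agent.* property names are as I recall them from nutch-default.xml;
> the values here are just placeholders):
> 
>   <property>
>     <name>http.agent.name</name>
>     <!-- placeholder value - use a name that identifies you, not Nutch -->
>     <value>AcmeSiteCrawler</value>
>   </property>
>   <property>
>     <name>http.agent.email</name>
>     <!-- placeholder value - a contact address that someone actually reads -->
>     <value>crawler-admin@example.com</value>
>   </property>
>   <property>
>     <name>http.agent.url</name>
>     <!-- placeholder value - a page describing the crawl and how to opt out -->
>     <value>http://example.com/crawler.html</value>
>   </property>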
> 
> When crawling a limited number of domains, the 
> best approach is to assume that you aren't going 
> to get all of the pages from every domain in your 
> first pass. Set the max URLs/host to something 
> reasonable, and then do multiple fetch cycles.
> 
> -- Ken
> 
> PS - In Bixo I just added support for adaptive 
> crawl delay. For the special case of crawling 
> pages from partners, this will dynamically reduce 
> the crawl delay down to a specified minimum, to 
> try to fetch all of the pages from a domain 
> within the target crawl duration. Something 
> similar might be useful for Nutch.
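> 
> (For comparison, stock Nutch only has static knobs here, if I'm
> remembering the property names right - e.g. something like this in
> conf/nutch-site.xml for a crawl limited to partner sites, with
> illustrative values:
> 
>   <property>
>     <name>fetcher.server.delay</name>
>     <!-- illustrative value; the shipped default is considerably higher -->
>     <value>1.0</value>
>   </property>
>   <property>
>     <name>fetcher.server.min.delay</name>
>     <!-- only consulted when fetcher.threads.per.host is greater than 1 -->
>     <value>0.5</value>
>   </property>
> 
> That gives you a fixed delay per host rather than one that adapts to
> the target crawl duration.)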
> 
> 
>>Raymond Balmès wrote:
>>>
>>>  Maybe the problem is not in the fetcher but rather in the generate
>>>  fetch list phase, which should take care not to put all the URLs for
>>>  the same domain together.
>>>
>>>  -Ray-
>>>
>>>  2009/5/27 Larsson85 <[email protected]>
>>>
>>>>
>>>>  You're probably right that it has something to do with the politeness.
>>>>  I didn't notice it before, but now that you mention it I can see that
>>>>  all the pages it's fetching at the end of the crawl are from the same
>>>>  domain. Is there any way to turn off the politeness, or perhaps make
>>>>  it less polite to speed things up? I've been doing a test run today,
>>>>  and the result is that it has been stuck in this spin-waiting state
>>>>  for about 3 hours, which is not acceptable.
>>>>
>>>>  Perhaps the problem is that I'm using too small a URL list to start
>>>>  with. I'm using the DMOZ list from the Nutch tutorial, and I have a
>>>>  filter on .se and .nu domains, which probably disqualifies a lot of
>>>>  the URLs in the list. Any tip on where to get a bigger list? And,
>>>>  most important, any tip on how I can turn off the politeness, or at
>>>>  least make it less polite?
>>>>  Thanks for all the help.
>>>>
>>>>
>>>>  Raymond Balmès wrote:
>>>>  >
>>>>  > Observing what my crawls do, I believe Ken must be right.
>>>>  > Towards the end of the crawl (when the fetchQueues.totalSize number
>>>>  > counts down), in some cases I'm only fetching from roughly two
>>>>  > sites, so indeed the politeness starts to play a role there, or at
>>>>  > least it should.
>>>>  >
>>>>  > -Ray-
>>>>  >
>>>>  > 2009/5/26 Raymond Balmès <[email protected]>
>>>>  >
>>>>  >> Please read this too:
>>>>  >>
>>>>  >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>>>>  >>
>>>>  >> Interesting build from Ken.
>>>>  >>
>>>>  >> 2009/5/26 Raymond Balmès <[email protected]>
>>>>  >>
>>>>  >>> Yes, already reported in multiple threads.
>>>>  >>> I noted that if one does a "recrawl" you don't get this
>>>>  >>> behavior... no idea why.
>>>>  >>>
>>>>  >>> -Raymond-
>>>>  >>>
>>>>  >>> 2009/5/26 Larsson85 <[email protected]>
>>>>  >>>
>>>>  >>>
>>>>  >>>> When I try to do my crawl it seems like the threads get stuck in
>>>>  >>>> some spin-waiting mode. At first the crawl goes as planned, and I
>>>>  >>>> couldn't be happier. But after some time, it starts reporting more
>>>>  >>>> of these spin-waiting messages.
>>>>  >>>>
>>>>  >>>> I've pasted a log excerpt here to show you what it looks like. As
>>>>  >>>> you can see, it gets stuck, and the queue only decreases by 1 at a
>>>>  >>>> time. I've tried doing a smaller crawl, and what happens is that it
>>>>  >>>> counts down until "fetchQueues.totalSize" reaches 0, and then the
>>>>  >>>> crawl is done.
>>>>  >>>>
>>>>  >>>> But the problem is that this countdown is very slow; there's no
>>>>  >>>> effective crawling going on, using neither bandwidth nor CPU power.
>>>>  >>>> Basically, this costs way too much time, and I can't let it go on
>>>>  >>>> like this for hours until it's done.
>>>>  >>>> How can I fix this?
>>>>  >>>>
>>>>  >>>>
>>>>  >>>> after about an hour of crawling this is what the log looks like
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
>>>>  >>>>  - fetching http://home.swipnet.se/~w-147200/
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
>>>>  >>>>  - fetching http://biphome.spray.se/alarsson/
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
>>>>  >>>>  - fetching http://home.swipnet.se/~w-31853/html/
>>>>  >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
>>>>  >>>>
>>>>  >>>> ....
>>>>  >>>>
>>>>  >>>>
>>>>  >>>
>>>>  >>
>>>>  >
>>>>  >
>>>>
>>>>
>>>
>>>
>>
> 
> 
> --
> Ken Krugler
> +1 530-210-6378
> 
> 

