[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875947#comment-13875947 ]

Tejas Patil commented on NUTCH-1630:
------------------------------------

Hi [~talat],
So from the 2nd depth onwards, you would ping the host in the generate phase and 
get the response time. 
For large scale crawl setups, the Generator itself might run for a few hours, and 
at the time you ping the host it might be loaded or there might be network 
traffic. When the actual fetch phase runs, the response time might be different 
depending upon the load on the server. As I mentioned in an earlier comment, I 
thought you were doing a cumulative sum of the response times of several urls of 
a host and then taking an average of them... which would give better response 
time numbers. That would be harder to code in the existing codebase and might 
look ugly, as the fetcher needs to pass this information on to the generator.
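
Just to make that idea concrete, here is a rough sketch of such a per-host 
accumulation. None of these class, field or method names exist in Nutch; they 
are placeholders, assuming the fetcher could record the response time of every 
url it fetches and expose the per-host averages (e.g. through host metadata) to 
the next generate cycle:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical accumulator: the fetcher records the response time of every
 * fetched url, keyed by host, so that an average per host can be handed to
 * the generator for the next depth. Illustration only, not existing Nutch code.
 */
public class HostResponseTimeAccumulator {

  private static class Stats {
    final AtomicLong totalMillis = new AtomicLong();
    final AtomicLong count = new AtomicLong();
  }

  private final ConcurrentHashMap<String, Stats> statsByHost =
      new ConcurrentHashMap<String, Stats>();

  /** Called by the fetcher after every successful fetch of a url. */
  public void record(String host, long responseMillis) {
    Stats s = statsByHost.get(host);
    if (s == null) {
      Stats fresh = new Stats();
      Stats existing = statsByHost.putIfAbsent(host, fresh);
      s = (existing == null) ? fresh : existing;
    }
    s.totalMillis.addAndGet(responseMillis);
    s.count.incrementAndGet();
  }

  /** Average response time for a host in ms, or -1 if nothing was fetched from it. */
  public long averageMillis(String host) {
    Stats s = statsByHost.get(host);
    if (s == null || s.count.get() == 0) {
      return -1;
    }
    return s.totalMillis.get() / s.count.get();
  }
}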

+A broader concern for crawls which run for days+
Server response times themselves change as the local time changes. For example, 
during the day (say 8:00 - 11:00 am) there might be a fair number of requests 
from users to the server, compared to night time (say 1:00 - 4:00 am) when very 
few users are requesting it. Pinging the server at a single point in the 24 hour 
day would not give a good approximation of the response time for long running 
crawls. 

+Effect on the crawlspace of slow servers+
If a server is genuinely slow (say due to low end hardware), then it would always 
have a slower response time compared to other servers. Effectively, we would end 
up with a smaller fetch queue for that host, creating a huge backlog of its urls 
which would sit in the crawldb without being generated, over and over again. I 
would take your side on this: try to fetch as much as we can. But some crawl 
owners might be unhappy with this.

> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, due to the disproportionate sizes of the queues, fetching has to 
> wait a long time for the long lasting queues after the shorter ones are 
> finished. That means you may have to wait a couple of days for some of the 
> queues.
> Normally we define the max queue size with generate.max.count, but that is a 
> static value. However, the number of URLs to be fetched increases with each 
> depth. Defining the same length for all queues does not mean all queues will 
> finish around the same time. This problem has been addressed by other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, byIP). Our 
> solution is applicable to all three modes.
> 1- Define a "fetch workload of the current queue" (FW) value for each queue, 
> based on the previous fetches of that queue.
> We calculate this as:
>     FW = average response time of previous depth * number of urls in current queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of 
> the current depth (AW).
> 3- Get the length of a queue by dividing AW by the previously known average 
> response time of that queue:
>     Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish up around 
> the same time.
> As soon as possible I will send my patch. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion, the harmonic mean is best in our case because our data has 
> a few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
