[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875972#comment-13875972 ]

Talat UYARER commented on NUTCH-1630:
-------------------------------------

Hi [~tejasp],

I guess you misunderstood me. I implemented it exactly as you suggested: it 
does not send any ping to the host in GeneratorJob. The response time of every 
URL is saved via NUTCH-1413, and DBUpdateReducer sums those per-URL response 
times into the host table. GeneratorJob only calculates the average response 
time and the harmonic mean, and it always does so from the previous depth's 
data. You can follow the code flow as: first Fetch, first DBUpdateReducer, 
second GeneratorJob, second GenerateReducer.
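
The queue-sizing math from the issue description can be sketched roughly like 
this (a minimal illustration, not the actual patch: the host names, numbers, 
and the in-memory dict standing in for the host table are all made up; the 
real implementation is the Java code in GeneratorJob/DBUpdateReducer):

```python
from statistics import harmonic_mean

# Hypothetical per-host data gathered at the previous depth:
# average response time (ms) and number of URLs queued for this depth.
hosts = {
    "fast.example.com": {"avg_rt": 50.0,  "url_count": 9000},
    "slow.example.com": {"avg_rt": 800.0, "url_count": 9000},
}

# 1) Fetch workload (FW) per queue = avg response time * number of URLs.
fw = {h: d["avg_rt"] * d["url_count"] for h, d in hosts.items()}

# 2) Average workload (AW) of this depth = harmonic mean of all FWs.
aw = harmonic_mean(list(fw.values()))

# 3) Queue length = AW / that queue's average response time.
queue_len = {h: int(aw / d["avg_rt"]) for h, d in hosts.items()}

for h, n in queue_len.items():
    # drain time ~= queue length * avg response time; it comes out
    # roughly equal for both queues, which is the point of the algorithm
    print(h, n, n * hosts[h]["avg_rt"])
```

With these toy numbers the fast host gets a much longer queue than the slow 
one, but both queues take roughly the same total time to drain, which is the 
"finish at the same time" behaviour described below.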

This code only takes information from the previous depth into account. If a 
web site's response time changes, the code will learn that after the next 
fetch, much like the adaptive fetch scheduler. 

This code is an option for generating fetch queues; you can turn it on or off 
in nutch-site.xml. If a server is genuinely slow, there is no way to fetch it 
quickly, so this is a trade-off: if you need to discover more URLs in the same 
amount of time, enable this option; if you specifically want to fetch 
genuinely slow websites, turn it off. 

At present we fetch ~140 million URLs per day. Before this code, we always 
ended up waiting only on the slow websites, so our bandwidth usage followed an 
inverse curve: at the beginning of a fetch cycle we used all of it, and at the 
end only a small piece. Google says: "Googlebot uses an algorithmic process: 
computer programs determine which sites to crawl, how often, and how many 
pages to fetch from each site." [0] I developed this algorithm to decide how 
many pages to fetch from each site depending on response time.

[0] https://support.google.com/webmasters/answer/70897?hl=en


> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, due to the disproportionate sizes of the queues, fetching has 
> to wait a long time for the long-lasting queues after the shorter ones are 
> finished. That means you may have to wait a couple of days for some queues.
> Normally we limit queue size with generate.max.count, but that is a static 
> value, while the number of URLs to be fetched grows with each depth. Giving 
> all queues the same length does not mean they will all finish around the 
> same time. This problem has been raised by other users before [1], so we 
> came up with a different approach.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, byIp). 
> Our solution is applicable to all three modes.
> 1- Define a "fetch workload of current queue" (FW) value for each queue, 
> based on the previous fetches of that queue.
> We calculate this by:
>     FW = average response time of previous depth * number of URLs in current 
> queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of 
> the current depth (AW).
> 3- Get the length of a queue by dividing AW by the previously known average 
> response time of that queue:
>     Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish around 
> the same time.
> I will send my patch as soon as possible. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion, the harmonic mean fits our case best because our data 
> has a few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
