[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

Talat UYARER (JIRA) Sun, 19 Jan 2014 11:19:05 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875987#comment-13875987
 ]


Talat UYARER commented on NUTCH-1630:
-------------------------------------

Hi again [~tejasp] :),

Thanks for your comment. Code has a configuration. it can turned on or turned 
off by db.fetch.adaptive.queue.size. I am not sure How can I develop as a 
plugin. I cared of code should have less changes for this. But this code for 
main crawler running style. Same as urlpartitioner. I am open for any 
suggestion about this code. 





> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, due to unproportional size of queues; fetching needs to wait 
> for a long time for long lasting queues when shorter ones are finished. That 
> means you may have to wait for a couple of days for some of queues.
> Normally we define max queue size with generate.max.count but that's a static 
> value. However number of URLs to be fetched increases with each depth. 
> Defining same length for all queues does not mean all queues will finish 
> around the same time. This problem has been addressed by some other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three mods for creating fetch queues (byHost, byDomain, ByIp). Our 
> solution can be applicable to all three mods.
> 1-Define a "fetch workload of current queue" (FW) value for each queue based 
> on the previous fetches of that queue.
> We calculate this by:
>     FW=average response time of previous depth * number of urls in current 
> queue
> 2- Calculate the harmonic mean [2] of all FW's to get the average workload of 
> current depth (AW)
> 3- Get the length for a queue by dividing AW by previously known average 
> response time of that queue:
>     Queue Length=AW / average response time
> Using this algoritm leads to a fetch phase where all queues finish up around 
> the same time.
> As soon as posible i will send my patch. Do you have any comments ? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion; harmonic mean is best in our case because our data has a 
> few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

Reply via email to