[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

Julien Nioche (JIRA) Wed, 27 Nov 2013 03:00:23 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833653#comment-13833653
 ]


Julien Nioche commented on NUTCH-1630:
--------------------------------------

This is a large patch that seems to affect the code in several places, so 
unless it is completely pluggable I would recommend being very cautious, and 
testing and discussing it thoroughly before anything is committed.

Re-formatting: Nutch 2.x has a format template for Eclipse that can be used to 
do that automatically (eclipse-codeformat.xml).


> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch
>
>
> Problem Definition:
> When crawling, queue sizes are disproportionate, so the fetch has to wait a 
> long time for the long-running queues after the shorter ones have finished. 
> That means you may have to wait a couple of days for some of the queues.
> Normally we define the maximum queue size with generate.max.count, but that 
> is a static value, whereas the number of URLs to be fetched increases with 
> each depth. Defining the same length for all queues does not mean that all 
> queues will finish around the same time. This problem has been raised by 
> other users before [1], so we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, byIP). Our 
> solution is applicable to all three modes.
> 1- Define a "fetch workload of current queue" (FW) value for each queue, 
> based on the previous fetches of that queue. We calculate this by:
>     FW = average response time of previous depth * number of URLs in current 
> queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of 
> the current depth (AW).
> 3- Get the length for a queue by dividing AW by the previously known average 
> response time of that queue:
>     Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase in which all queues finish at 
> around the same time.
> I will send my patch as soon as possible. Do you have any comments?
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion, the harmonic mean is the best fit in our case because our 
> data has a few points that are much higher than the rest.
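
For readers following the thread, the three steps quoted above (FW, AW, Queue 
Length) can be sketched roughly as below. This is only an illustration of the 
formulas in the issue description, not code from NUTCH-1630.patch. The class 
name AdaptiveQueueSize, its method names, the rejection of non-positive inputs 
and the rounding down to a whole number of URLs are all assumptions made for 
the example.

```java
import java.util.Arrays;

/** Illustrative sketch only; the names here are hypothetical, not Nutch API. */
class AdaptiveQueueSize {

    /** Step 1: FW = average response time of previous depth * URLs in the queue. */
    static double[] fetchWorkloads(double[] avgResponseTime, long[] urlCount) {
        if (avgResponseTime.length != urlCount.length) {
            throw new IllegalArgumentException("one response time per queue is required");
        }
        double[] fw = new double[urlCount.length];
        for (int i = 0; i < fw.length; i++) {
            // Assumption: zero or negative inputs are rejected, because the
            // harmonic mean below divides by each workload.
            if (avgResponseTime[i] <= 0 || urlCount[i] <= 0) {
                throw new IllegalArgumentException("inputs must be positive");
            }
            fw[i] = avgResponseTime[i] * urlCount[i];
        }
        return fw;
    }

    /** Step 2: harmonic mean = n / sum(1 / x_i), used as the average workload AW. */
    static double harmonicMean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        double sumOfReciprocals = 0.0;
        for (double v : values) {
            sumOfReciprocals += 1.0 / v;
        }
        return values.length / sumOfReciprocals;
    }

    /** Step 3: Queue Length = AW / average response time, rounded down (assumption). */
    static long[] queueLengths(double[] avgResponseTime, long[] urlCount) {
        double aw = harmonicMean(fetchWorkloads(avgResponseTime, urlCount));
        long[] lengths = new long[avgResponseTime.length];
        for (int i = 0; i < lengths.length; i++) {
            lengths[i] = (long) Math.floor(aw / avgResponseTime[i]);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Three queues with 1000 URLs each and average response times of
        // 0.5 s, 1 s and 2 s: FW = 500, 1000, 2000 and AW is about 857.14.
        double[] avgResponseTime = {0.5, 1.0, 2.0};
        long[] urlCount = {1000, 1000, 1000};
        // prints [1714, 857, 428]
        System.out.println(Arrays.toString(queueLengths(avgResponseTime, urlCount)));
    }
}
```

In this example the fastest queue is given a length larger than the 1000 URLs 
it actually holds. The issue description does not say how that case should be 
handled, so the sketch leaves the value uncapped.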



--
This message was sent by Atlassian JIRA
(v6.1#6144)
