[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833653#comment-13833653 ]
Julien Nioche commented on NUTCH-1630:
--------------------------------------

This is a large patch which seems to affect the code in several places, so unless it is completely pluggable I would recommend testing and discussing it very carefully before anything is committed.

Re-formatting: Nutch 2.x has a format template for Eclipse (eclipse-codeformat.xml) that can be used to do that automatically.

> How to achieve finishing fetch approximately at the same time for each queue
> (a.k.a adaptive queue size)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch
>
>
> Problem Definition:
> When crawling, queues of disproportionate size force the fetch phase to wait
> for the longest-running queues long after the shorter ones have finished.
> That means you may have to wait a couple of days for some queues.
> Normally we define the maximum queue size with generate.max.count, but that
> is a static value, while the number of URLs to be fetched increases with each
> depth. Giving all queues the same length does not mean they will all finish
> around the same time. This problem has been raised by other users before [1],
> so we came up with a different approach to it.
>
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, byIP). Our
> solution is applicable to all three modes.
> 1- Define a "fetch workload of current queue" (FW) value for each queue,
> based on the previous fetches of that queue. We calculate this by:
>      FW = average response time of previous depth * number of URLs in current
> queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of
> the current depth (AW).
> 3- Get the length for a queue by dividing AW by the previously known average
> response time of that queue:
>      Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish at around
> the same time.
> I will send my patch as soon as possible. Do you have any comments?
>
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion, the harmonic mean fits best here because our data has a
> few points that are much higher than the rest.
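
For anyone following the discussion, the three steps in the quoted proposal can be sketched in a few lines. This is only an illustration of the arithmetic, not code from NUTCH-1630.patch: the function name, the dictionary inputs, the rounding, and the cap at the number of URLs actually available in a queue are all assumptions made for the example, and response times are assumed to be strictly positive.

```python
from statistics import harmonic_mean


def adaptive_queue_lengths(avg_response_time, url_counts):
    """Sketch of the proposed queue sizing (hypothetical helper, not Nutch API).

    avg_response_time: queue id -> average response time at the previous depth
    url_counts:        queue id -> number of URLs currently in that queue
    """
    # Step 1: FW = average response time of previous depth * URLs in queue.
    workloads = {q: avg_response_time[q] * url_counts[q] for q in url_counts}

    # Step 2: AW = harmonic mean of all FW values.
    aw = harmonic_mean(list(workloads.values()))

    # Step 3: queue length = AW / average response time of that queue.
    # Assumptions for this sketch: round to a whole number of URLs and never
    # ask for more URLs than the queue actually holds.
    return {
        q: min(url_counts[q], round(aw / avg_response_time[q]))
        for q in url_counts
    }


# Example with made-up numbers: a fast host, a slow host, and a small host.
response_times = {"a": 0.5, "b": 2.0, "c": 1.0}   # seconds per fetch
counts = {"a": 1000, "b": 1000, "c": 100}          # URLs waiting per queue
lengths = adaptive_queue_lengths(response_times, counts)
```

With these inputs the workloads are 500, 2000 and 100, their harmonic mean is 240, and the resulting lengths are 480 for "a", 120 for "b" and 100 for "c" (capped by its size). Queues "a" and "b" then both represent roughly 240 seconds of fetching, which is the equal-finish behaviour the proposal is after. The harmonic mean pulls AW toward the smaller workloads, so a few very large queues do not inflate the target.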