[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875987#comment-13875987 ]
Talat UYARER commented on NUTCH-1630:
-------------------------------------

Hi again [~tejasp] :),

Thanks for your comment. The code has a configuration option: it can be turned on or off with db.fetch.adaptive.queue.size. I am not sure how I could develop this as a plugin. I took care to keep the code changes to a minimum. However, this code concerns the main way the crawler runs, the same as urlpartitioner. I am open to any suggestions about this code.

> How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, queue sizes are disproportionate, so the fetch phase has to wait a long time for long-running queues after the shorter ones have finished. That means you may have to wait a couple of days for some of the queues.
> Normally we define the maximum queue size with generate.max.count, but that is a static value, while the number of URLs to be fetched increases with each depth. Defining the same length for all queues does not mean that all queues will finish at around the same time. This problem has been addressed by some other users before [1], so we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, byIP). Our solution is applicable to all three modes.
> 1- Define a "fetch workload of current queue" (FW) value for each queue based on the previous fetches of that queue.
> We calculate this by:
>     FW = average response time of previous depth * number of URLs in current queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of the current depth (AW).
> 3- Get the length of each queue by dividing AW by the previously known average response time of that queue:
>     Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish at around the same time.
> I will send my patch as soon as possible. Do you have any comments?
>
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion, the harmonic mean is the best choice in our case because our data has a few points that are much higher than the rest.
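
For readers following the thread, the three steps in the quoted description can be sketched roughly as below. This is only an illustration in Python and not the code from the attached patches: the function name adaptive_queue_lengths, the dictionary inputs, the example host names, and the choice to round to the nearest integer with a minimum of 1 are all assumptions made for the example.

```python
from statistics import harmonic_mean


def adaptive_queue_lengths(avg_response_time, url_count):
    """Sketch of the FW / AW / queue-length steps (hypothetical helper).

    avg_response_time: queue id -> average response time (seconds) at the previous depth
    url_count:         queue id -> number of URLs currently in that queue
    """
    # Step 1: fetch workload FW = average response time * number of URLs
    workload = {q: avg_response_time[q] * url_count[q] for q in url_count}

    # Step 2: average workload AW = harmonic mean of all FW values
    aw = harmonic_mean(list(workload.values()))

    # Step 3: queue length = AW / average response time of that queue.
    # Rounding and the floor of 1 are assumptions, not part of the description.
    return {q: max(1, round(aw / avg_response_time[q])) for q in url_count}


# Two queues with the same number of URLs but different response times.
lengths = adaptive_queue_lengths(
    {"fast.example": 0.5, "slow.example": 2.0},
    {"fast.example": 1000, "slow.example": 1000},
)
print(lengths)
```

With these made-up numbers the workloads are 500 and 2000, their harmonic mean is 800, and the resulting lengths are 1600 for the fast queue and 400 for the slow one, so both queues would need about 800 seconds. The description does not say whether a length should be capped at the number of URLs actually available, so the sketch leaves that out.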