[ https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764255#comment-13764255 ]
Markus Jelsma commented on NUTCH-1630:
--------------------------------------

It is indeed 1.x that lives in trunk, and 1.x is the main version of Nutch; 2.x is provided separately. Is it possible for you to port it to 1.x/trunk as well? That would be awesome.

I also see you're doing host lookups in the update mapper. I think this will cause issues if the set is even slightly large. We've made a HostDB for Nutch 1.x (NUTCH-1325) and look up hosts in the reducer using a thread pool, because looking up hosts sequentially takes a very long time. Although I'm not too familiar with 2.x, isn't it better to do it in the reducer? In 1.x the mapper can receive the same record multiple times, so that would mean redundant lookups.

> How to achieve finishing fetch approximately at the same time for each queue
> (a.k.a adaptive queue size)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1630
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1630
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1, 2.2, 2.2.1
>            Reporter: Talat UYARER
>              Labels: improvement
>             Fix For: 2.3
>
>         Attachments: NUTCH-1630.patch
>
>
> Problem Definition:
> When crawling, queues of disproportionate size force the fetch phase to wait
> a long time for long-running queues after the shorter ones have finished.
> That means you may have to wait a couple of days for some of the queues.
> Normally we define the maximum queue size with generate.max.count, but that
> is a static value, whereas the number of URLs to be fetched increases with
> each depth. Defining the same length for all queues does not mean all queues
> will finish around the same time. This problem has been raised by other
> users before [1], so we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). Our
> solution is applicable to all three modes.
> 1- Define a "fetch workload of current queue" (FW) value for each queue based
> on the previous fetches of that queue. We calculate this by:
>     FW = average response time of previous depth * number of URLs in current queue
> 2- Calculate the harmonic mean [2] of all FWs to get the average workload of
> the current depth (AW).
> 3- Get the length for a queue by dividing AW by the previously known average
> response time of that queue:
>     Queue Length = AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish at around
> the same time.
> I will send my patch as soon as possible. Do you have any comments?
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion, the harmonic mean is best in our case because our data has
> a few points that are much higher than the rest.
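
For readers who want to see the arithmetic, the three steps quoted above can be sketched as follows. This is only an illustration of the formulas in the issue description, not the code in NUTCH-1630.patch, and it is written in Python rather than Nutch's Java. The function name, the dictionary inputs, and the choices to round down and to keep at least one URL per queue are assumptions made for the example. Response times are assumed to be strictly positive.

```python
from statistics import harmonic_mean


def adaptive_queue_lengths(avg_response_time, url_count):
    """Suggest a length for each fetch queue (hypothetical helper).

    avg_response_time: queue id -> average response time at the previous depth
    url_count:         queue id -> number of URLs currently in that queue
    """
    # Step 1: FW = average response time of previous depth * URLs in the queue.
    workloads = {
        queue: avg_response_time[queue] * url_count[queue]
        for queue in url_count
    }

    # Step 2: AW = harmonic mean of all FW values.
    average_workload = harmonic_mean(list(workloads.values()))

    # Step 3: queue length = AW / average response time of that queue.
    # Rounding down and the minimum of 1 are assumptions of this sketch.
    return {
        queue: max(1, int(average_workload / avg_response_time[queue]))
        for queue in workloads
    }


if __name__ == "__main__":
    # Made-up numbers: response times in seconds, URL counts per queue.
    response_times = {"a": 0.5, "b": 2.0, "c": 1.0}
    url_counts = {"a": 512, "b": 512, "c": 128}
    print(adaptive_queue_lengths(response_times, url_counts))
```

With these lengths, the product of queue length and average response time is roughly the same for every queue, which is what lets the queues finish at about the same time. Note that the sketch does not cap a queue's length at the number of URLs it actually holds; the issue description does not say how that case is handled.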