[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653412#action_12653412 ]

Todd Lipcon commented on NUTCH-207:
-----------------------------------

Are both fetcher and fetcher2 supposed to be supported for the foreseeable future? Or could I simply implement this for one of them and not have it integrated until the other is removed in the future?

> Bandwidth target for fetcher rather than a thread count
> -------------------------------------------------------
>
>                 Key: NUTCH-207
>                 URL: https://issues.apache.org/jira/browse/NUTCH-207
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Rod Taylor
>         Attachments: ratelimit.patch
>
>
> Increases or decreases the number of threads from the starting value
> (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve
> a target bandwidth (fetcher.threads.bandwidth).
> It seems to be able to keep within 10% of the target bandwidth even when
> large numbers of errors are found or when a number of large pages is run
> across.
> To achieve more accurate tracking Nutch should keep track of protocol
> overhead as well as the volume of pages downloaded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653404#action_12653404 ]

Dennis Kubes commented on NUTCH-207:
------------------------------------

I think this would be an interesting addition. It would need to be ported to fetcher2 as well as fetcher. If you want to take on the task of porting it, that would be great. If you have any questions, feel free to ask.
[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653368#action_12653368 ]

Todd Lipcon commented on NUTCH-207:
-----------------------------------

Any word on this JIRA? This would be a very useful feature for me. We are bandwidth-constrained in the sense that we could easily pull a couple hundred Mbit/s but don't want to go over our 95th-percentile commit. I imagine others are in a similar situation. Tweaking the number of fetchers gets us in the ballpark, but a feature like this would be far superior, since crawls often start off pulling more than our commit and then slow to 60% of our commit later on.

If it's an issue of porting the patch against the current code, I can take that on.
[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count
[ http://issues.apache.org/jira/browse/NUTCH-207?page=comments#action_12365462 ]

Rod Taylor commented on NUTCH-207:
----------------------------------

Code was by Radu Mateescu with additional kibitzing by myself.

> Bandwidth target for fetcher rather than a thread count
> -------------------------------------------------------
>
>          Key: NUTCH-207
>          URL: http://issues.apache.org/jira/browse/NUTCH-207
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Rod Taylor
>  Attachments: ratelimit.patch
>
> Increases or decreases the number of threads from the starting value
> (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve
> a target bandwidth (fetcher.threads.bandwidth).
> It seems to be able to keep within 10% of the target bandwidth even when
> large numbers of errors are found or when a number of large pages is run
> across.
> To achieve more accurate tracking Nutch should keep track of protocol
> overhead as well as the volume of pages downloaded.
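
For readers skimming the thread, the thread adjustment described in the issue summary might look roughly like the sketch below. This is not the code in ratelimit.patch, which is not reproduced in this thread. The class name, the method name, the one-thread-per-step adjustment, the floor of one thread, the 10% dead band, and the bytes-per-second unit are all assumptions made for illustration. Only the three property names in the comments come from the issue description.

```java
/**
 * Hypothetical sketch of a bandwidth-driven thread controller.
 * It is NOT taken from ratelimit.patch; names and policy are assumptions.
 */
class BandwidthThreadController {
    private final int maxThreads;           // would come from fetcher.threads.maximum
    private final double targetBytesPerSec; // would come from fetcher.threads.bandwidth (unit assumed)
    private final double tolerance;         // assumed dead band around the target, e.g. 0.10

    BandwidthThreadController(int maxThreads, double targetBytesPerSec, double tolerance) {
        if (maxThreads < 1 || targetBytesPerSec <= 0 || tolerance < 0) {
            throw new IllegalArgumentException("invalid controller settings");
        }
        this.maxThreads = maxThreads;
        this.targetBytesPerSec = targetBytesPerSec;
        this.tolerance = tolerance;
    }

    /**
     * Returns the thread count to use for the next measurement interval.
     * The first call would pass the starting value (fetcher.threads.fetch).
     */
    int nextThreadCount(int currentThreads, double observedBytesPerSec) {
        int next = currentThreads;
        if (observedBytesPerSec < targetBytesPerSec * (1.0 - tolerance)) {
            next = currentThreads + 1; // below the band: add one fetcher thread
        } else if (observedBytesPerSec > targetBytesPerSec * (1.0 + tolerance)) {
            next = currentThreads - 1; // above the band: drop one fetcher thread
        }
        // Clamp to [1, maxThreads]; the floor of 1 is an assumption.
        return Math.max(1, Math.min(maxThreads, next));
    }
}
```

A caller would sample throughput once per interval and feed it back in, for example `threads = controller.nextThreadCount(threads, bytesLastInterval / intervalSeconds)`. Counting protocol overhead in the observed figure, as the description suggests, would only change how `observedBytesPerSec` is measured, not the control step.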