[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements
[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]

Andrzej Bialecki updated NUTCH-339:

    Attachment: patch4-fixed.txt

Sorry, the patch was incomplete - please try patch4-fixed.txt instead.

Refactor nutch to allow fetcher improvements

    Key: NUTCH-339
    URL: http://issues.apache.org/jira/browse/NUTCH-339
    Project: Nutch
    Issue Type: Task
    Components: fetcher
    Affects Versions: 0.8
    Environment: n/a
    Reporter: Sami Siren
    Assigned To: Andrzej Bialecki
    Fix For: 0.9.0
    Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt

As I (and Stefan?) see it, there are two major areas where the current fetcher could be improved (in terms of speed):

1. The politeness code and how it is implemented is the biggest problem of the current fetcher (together with robots.txt handling). A simple code change, replacing it with a PriorityQueue-based solution, showed very promising results in increased IO.
2. Changing the fetcher to use non-blocking IO (this requires a great amount of work, as we need to implement the protocols from scratch again).

I would like to start working towards #1 by first refactoring the current code (plugins, actually) in the following way:

1. Move robots.txt handling away from the (lib-http) plugin. Even if this is related only to http, leaving it in lib-http does not allow other kinds of scheduling strategies to be implemented (it is hardcoded to fetch robots.txt from the same thread when requesting a page from a site from which it hasn't yet tried to load robots.txt).
2. Move the politeness code away from the (lib-http) plugin. It is really usable outside http, and the current design also limits changing the implementation (to a queue-based one).

Where to move these? My suggestion is the nutch core; does anybody see problems with this?

These refactoring activities are to be done in such a way that none of the current functionality is (at least deliberately) changed, leaving room to build the next-generation fetcher(s) without destroying the old one at the same time.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
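The issue above mentions replacing the politeness code with a PriorityQueue-based solution but does not show one. The following is only a minimal sketch of what such a scheduler might look like, not the code from any of the attached patches. The class name PolitenessQueue, its methods, and the fixed per-host delay are all assumptions made for illustration; time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch, not Nutch code: hosts are ordered by the earliest time they may be fetched again. */
class PolitenessQueue {

    /** A host together with the earliest time (in ms) at which it may be contacted. */
    private static final class HostSlot implements Comparable<HostSlot> {
        final String host;
        final long readyAt;

        HostSlot(String host, long readyAt) {
            this.host = host;
            this.readyAt = readyAt;
        }

        @Override
        public int compareTo(HostSlot other) {
            return Long.compare(readyAt, other.readyAt);
        }
    }

    private final long delayMillis;                                    // assumed fixed delay per host
    private final PriorityQueue<HostSlot> slots = new PriorityQueue<>();
    private final Map<String, Deque<String>> pending = new HashMap<>(); // host -> queued URLs
    private final Map<String, Long> nextAllowed = new HashMap<>();      // host -> earliest next fetch

    PolitenessQueue(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Queues a URL; a host with no queued URLs gets a slot at its earliest allowed time. */
    void add(String host, String url, long now) {
        Deque<String> urls = pending.get(host);
        if (urls == null) {
            urls = new ArrayDeque<>();
            pending.put(host, urls);
            long readyAt = Math.max(now, nextAllowed.getOrDefault(host, 0L));
            slots.add(new HostSlot(host, readyAt));
        }
        urls.add(url);
    }

    /** Returns the next URL whose host is due at time now, or null if no host is due. */
    String poll(long now) {
        HostSlot head = slots.peek();
        if (head == null || head.readyAt > now) {
            return null;
        }
        slots.poll();
        Deque<String> urls = pending.get(head.host);
        String url = urls.poll();
        long readyAt = now + delayMillis;
        nextAllowed.put(head.host, readyAt);
        if (urls.isEmpty()) {
            pending.remove(head.host);
        } else {
            slots.add(new HostSlot(head.host, readyAt)); // host becomes due again after the delay
        }
        return url;
    }
}
```

The point of the design is that a fetcher thread never blocks on one busy host: it asks the queue for whichever host is due next, which is the kind of scheduling freedom the refactoring is meant to make possible.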
RE: [jira] Created: (NUTCH-408) Plugin development documentation
I agree with you that documentation is vital, not just for extending the current version but also for any plugins and patches created. I have spent almost two weeks trying to adapt nutch to my project, but I spend more time reading code and trying to understand what it does before I can even start to fix a problem. Come on guys, documentation is good coding practice; we can't read your mind to know exactly what you were trying to achieve just by looking at the implementation code. This is just good constructive criticism. :)

Armel

-----Original Message-----
From: nutch.newbie (JIRA) [mailto:[EMAIL PROTECTED]
Sent: 25 November 2006 03:45
To: nutch-dev@lucene.apache.org
Subject: [jira] Created: (NUTCH-408) Plugin development documentation

Plugin development documentation

    Key: NUTCH-408
    URL: http://issues.apache.org/jira/browse/NUTCH-408
    Project: Nutch
    Issue Type: Improvement
    Affects Versions: 0.8.1
    Environment: Linux Fedora
    Reporter: nutch.newbie

Documentation is rare! But it is vital for extending the current (0.9) nutch. The docs on the wiki for 0.7 plugin development were good, but they don't apply to 0.9, and new developers who join directly at 0.9 find the 0.7 documentation insufficient. More practical plugin-writing documentation for 0.9 is desired, also explaining the plugin principles in practical terms, i.e. extension points, libs, etc. Furthermore, it would be good to provide some best-practice examples, e.g. check whether the lib you are planning to use is already in the lib folder, since that version of the external lib may be good enough for plugin development rather than using another version, things like that.
Re: [jira] Created: (NUTCH-408) Plugin development documentation
Did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home
Nothing big, but it will give you some ideas, also about plugins.

On 25.11.2006, at 06:32, Armel T. Nene wrote:

> I agree with you that documentation is vital, not just for extending the current version but also for any plugins and patches created. I have spent almost two weeks trying to adapt nutch to my project, but I spend more time reading code and trying to understand what it does before I can even start to fix a problem. Come on guys, documentation is good coding practice; we can't read your mind to know exactly what you were trying to achieve just by looking at the implementation code. This is just good constructive criticism. :)
>
> Armel

~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com
[jira] Commented: (NUTCH-408) Plugin development documentation
[ http://issues.apache.org/jira/browse/NUTCH-408?page=comments#action_12452610 ]

nutch.newbie commented on NUTCH-408:

Yes, I have gone through the media-style documentation and it is a good start, and there is also some very good documentation in the Nutch wiki. My thinking was to compile the existing documentation in a coherent way so that you get the whole picture. I would like to give writing it a shot, but I cannot do this all by myself, as I lack background info and I haven't written any plugins myself. So if any of you would like to help me, I would like to do this job. Feel free to mail me directly if you are up for it. I will probably ask a lot of stupid questions, but we will have some development documentation at least :=) What do you say? Anyone up for this?

Regards

Plugin development documentation

    Key: NUTCH-408
    URL: http://issues.apache.org/jira/browse/NUTCH-408
    Project: Nutch
    Issue Type: Improvement
    Affects Versions: 0.8.1
    Environment: Linux Fedora
    Reporter: nutch.newbie
[jira] Updated: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling
[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]

Doug Cook updated NUTCH-409:

    Attachment: shortcircuit.patch

Add short circuit notion to filters to speedup mixed site/subsite crawling

    Key: NUTCH-409
    URL: http://issues.apache.org/jira/browse/NUTCH-409
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 0.8
    Reporter: Doug Cook
    Priority: Minor
    Attachments: shortcircuit.patch

In the case where one is crawling a mixture of sites and sub-sites, the prefix matcher can match the sites quite quickly, but either the regex or automaton filters are considerably slower matching the sub-sites. In the current model of AND-ing all the filters together, the pattern-matching filter will be run on every site that matches the prefix matcher -- even if that entire site is to be crawled and there are no sub-site patterns. If only a small portion of the sites actually need sub-site pattern matching, this is much slower than it should be.

I propose (and attach) a simple modification allowing considerable speedup for this usage pattern. I define the notion of a "short circuit" match that means "accept this URL and don't run any of the remaining filters in the filter chain." Though with this change any filter plugin can in theory return a short-circuit match, I have only implemented the short-circuit match for the PrefixURLFilter. The configuration file format is backwards-compatible; shortcircuit matches just have SHORTCIRCUIT: in front of them.

One minor gotcha:

* Because the shortcircuit matches will avoid running any later filters, all of the site-independent filters need to be BEFORE the PrefixURLFilter in the chain.

I get my best performance using the following filter chain:

1) The SuffixURLFilter to throw away anything with unwanted extensions
2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping mailto:, bulletin-board pages, etc.)
3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT the sites needing subsite matching
4) The AutomatonURLFilter to match those sites needing subsite pattern matching.

I have tens of thousands of sites and an order of magnitude fewer subsites, so skipping step #4 90% of the time speeds things up considerably (my reduce time on a round of crawling is down from some 26 hours to less than 10).

There are only two drawbacks to the implementation, and I think they're pretty minor:

1) Because I pass a special token (_PASS_) in the place of the URL to implement the short circuit, if for some reason someone wanted to crawl a URL named _PASS_, there would be problems. I find this highly unlikely, since that's an invalid URL.
2) The correct behavior of steps #3 and #4 above depends upon coordination of the config files between the prefix and automaton filters, making an opportunity for user screwup. I thought about creating a new kind of filter which essentially combined the prefix and automaton filters' behaviors, took one config file, and internally handled the short-circuiting. But I think the approach I took is simpler, cleaner, more flexible, and avoids creating yet another kind of filter. Coordinating the config files is pretty easy (I generate them programmatically).

As this is my first contribution to Nutch I'm sure that there are things I've missed, whether in coding style or desired patch format. I welcome any feedback, suggestions, etc.

Doug
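The short-circuit idea described in this issue can be sketched roughly as follows. This is not the contents of shortcircuit.patch: the UrlFilter interface and the ShortCircuitChain class are hypothetical stand-ins written for illustration, and only the _PASS_ token and the "accept and skip the remaining filters" behaviour are taken from the description. A filter returns the URL to accept it, null to reject it, or the token to short-circuit.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a URL filter: returns the URL to accept it, null to reject it. */
interface UrlFilter {
    String filter(String url);
}

/** Hypothetical sketch of an AND-ed filter chain that honours a short-circuit token. */
class ShortCircuitChain {
    /** Special token a filter returns in place of the URL to request a short circuit. */
    static final String PASS = "_PASS_";

    private final List<UrlFilter> filters;

    ShortCircuitChain(UrlFilter... filters) {
        this.filters = Arrays.asList(filters);
    }

    /** Returns the accepted URL, or null if any filter that ran rejected it. */
    String filter(String url) {
        String accepted = url;
        for (UrlFilter f : filters) {
            String result = f.filter(accepted);
            if (result == null) {
                return null;          // normal AND semantics: one rejection ends the chain
            }
            if (PASS.equals(result)) {
                return accepted;      // short circuit: accept now, skip the remaining filters
            }
            accepted = result;
        }
        return accepted;
    }
}
```

With a chain ordered as in steps 1) to 4), any filter placed after the prefix filter is never invoked for URLs on short-circuited sites, which is where the reported saving would come from.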
Re: More fetcher speed increases
Done. See http://issues.apache.org/jira/browse/NUTCH-409

This is my first Nutch contribution, so hopefully I've got it right ;-) Any suggestions/questions/feedback welcome. Hope this is useful to others.

D

scott green wrote:
> Hi Doug,
>
> Your idea about the PrefixURLFilter and AutomatonURLFilter combination sounds interesting. Could you please attach the patch to JIRA?
>
> Thanks
> - Scott
>
> On 11/17/06, Doug Cook [EMAIL PROTECTED] wrote:
>> Hi, folks,
>>
>> I, too, was slowed down by reduce operations in fetch. Some benchmarking showed that in my case, the limiting operation was filtering (though a distant second was the time spent calculating Levenshtein distances, presumably part of the spellchecking that Sami just removed to speed things up, though I haven't looked at it yet).
>>
>> I've fixed the problem, and my reduce speed is better by about a factor of three. However, the fix is limited to certain usage patterns. In my case, I have tens of thousands of sites and subsites I'm crawling, and I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I essentially use the prefix filter to limit to the set of sites, and then automaton to pattern-match within those sites. I only have subsite matches on 10% of the sites, however, so I was clearly wasting a lot of time running the automaton patterns on sites that didn't need them. And automaton, though much faster than RegexURLFilter, is still dog-slow with that many patterns.
>>
>> A simple fix was to extend the current "AND all the filters together" model to have the notion of a short-circuit match, which allows a filter to say "let this URL through and DON'T run the other filters" by returning a special token to URLFilters. Now I have a version of PrefixURLFilter that can return both normal matches and short circuit matches, and only returns normal matches for those sites that need to run subsite patterns. It seems to work well, the overhead is negligible when not in use, and the speedup is massive for my usage pattern.
>>
>> I'd like to contribute it back, if people would find this useful (not that it's rocket science!). First, is there anyone out there besides me who would find this useful?
>>
>> Second, I've been thinking about the best way to handle PrefixURLFilter configuration. I can see a few options:
>>
>> 1. Have two different config files, one for normal matches, and one for short-circuit matches.
>> 2. Have one config file, with a syntax to say "make this pattern a short-circuit match", and make the default be a normal match, so it is backwards compatible with the current version.
>> 3. Make a new type of filter which internally combines Prefix and Automaton, takes one config file, and decides internally which patterns should generate automaton inputs vs normal or short circuit prefix matches.
>>
>> Approach #3 requires no changes to the URLFilter model, and makes it difficult to screw up by making config files which are inconsistent (e.g. forgetting to put in a prefix pattern for one of the automaton patterns). It is also the least flexible, requires the most code, and introduces yet another kind of filter.
>>
>> I tend to like the changed URLFilter model; it's more flexible, even if it requires a little more care in configuration (a simple Perl script, in my case, to generate the config files correctly and consistently). I'm leaning towards approach #2. I'm thinking something simple, syntax-wise, like putting SHORTCIRCUIT: before the patterns which should short-circuit. Any suggestions for a better syntax? Or reasons why I should consider a different approach?
>>
>> Doug
>>
>> --
>> View this message in context: http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7543634
Sent from the Nutch - Dev mailing list archive at Nabble.com.
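To make approach #2 from the quoted message concrete, here is a rough sketch of a single prefix file in which lines starting with SHORTCIRCUIT: become short-circuit matches and all other lines stay normal matches. Everything except the SHORTCIRCUIT: marker is an assumption: the PrefixRules class, the handling of blank lines and # comments, the first-match-wins rule, and the linear scan (an actual prefix filter would more likely use a trie) are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a backwards-compatible prefix config with short-circuit entries. */
class PrefixRules {
    enum Match { NONE, NORMAL, SHORTCIRCUIT }

    /** Marker proposed in the message; lines without it keep their old meaning. */
    static final String MARKER = "SHORTCIRCUIT:";

    private final List<String> prefixes = new ArrayList<>();
    private final List<Boolean> shortCircuit = new ArrayList<>();

    /** Parses config lines; skipping blank lines and # comments is an assumption of this sketch. */
    static PrefixRules parse(List<String> lines) {
        PrefixRules rules = new PrefixRules();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            boolean marked = line.startsWith(MARKER);
            rules.prefixes.add(marked ? line.substring(MARKER.length()).trim() : line);
            rules.shortCircuit.add(marked);
        }
        return rules;
    }

    /** Classifies a URL by the first prefix in file order that matches it. */
    Match match(String url) {
        for (int i = 0; i < prefixes.size(); i++) {
            if (url.startsWith(prefixes.get(i))) {
                return shortCircuit.get(i) ? Match.SHORTCIRCUIT : Match.NORMAL;
            }
        }
        return Match.NONE;
    }
}
```

A file containing only plain prefixes parses exactly as before, which is the backwards compatibility the message asks for.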
[jira] Commented: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling
[ http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ]

Doug Cook commented on NUTCH-409:

I should also note that this approach is still not optimal (though it is faster for my usage pattern). I'm still running the site-independent regular expressions (ad removal, etc.) on *every* URL; really, they should just be run on the URLs which belong to the set of sites I'm crawling.

One could think of a slight extension to the change here, where each filter has a parameter: (A) "run me on all URLs which have passed the prior filters" or (B) "run me on only the non-shortcircuit matches." This would allow us to put the RegexURLFilter *after* the PrefixURLFilter and make it a type A (site-independent) filter, while the Automaton would be type B (site-dependent). Simple code-wise, but a little more complexity in configuration.

Or one could return to the notion of a "super filter" which takes one config file, internally combines these effects and automatically optimizes the filtering. A little more ambitious code-wise, but ultimately easier to use.

At any rate, the attached change is pretty simple, and at least helpful for me, if not perfect; thought I would share it.

D

Add short circuit notion to filters to speedup mixed site/subsite crawling

    Key: NUTCH-409
    URL: http://issues.apache.org/jira/browse/NUTCH-409
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 0.8
    Reporter: Doug Cook
    Priority: Minor
    Attachments: shortcircuit.patch
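The type A / type B extension floated in Doug's comment on NUTCH-409 is not part of the attached patch. As a rough sketch of how it might behave, under the assumption that a short-circuit match no longer ends the chain but only disables type B filters for that URL, consider the following. The ModalFilterChain class, its Mode enum and the use of java.util.function.Function as the filter type are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: each filter declares whether it also runs on short-circuited URLs. */
class ModalFilterChain {
    /** ALL_URLS corresponds to type A in the comment, NON_SHORTCIRCUIT_ONLY to type B. */
    enum Mode { ALL_URLS, NON_SHORTCIRCUIT_ONLY }

    static final String PASS = "_PASS_";

    private final List<Function<String, String>> filters = new ArrayList<>();
    private final List<Mode> modes = new ArrayList<>();

    /** Appends a filter; a filter returns the URL, null to reject, or PASS to short-circuit. */
    ModalFilterChain add(Function<String, String> filter, Mode mode) {
        filters.add(filter);
        modes.add(mode);
        return this;
    }

    /** Returns the accepted URL, or null if any filter that ran rejected it. */
    String filter(String url) {
        boolean shortCircuited = false;
        String accepted = url;
        for (int i = 0; i < filters.size(); i++) {
            if (shortCircuited && modes.get(i) == Mode.NON_SHORTCIRCUIT_ONLY) {
                continue;                       // type B: skipped after a short-circuit match
            }
            String result = filters.get(i).apply(accepted);
            if (result == null) {
                return null;
            }
            if (PASS.equals(result)) {
                shortCircuited = true;          // keep going, but only type A filters still run
            } else {
                accepted = result;
            }
        }
        return accepted;
    }
}
```

Placing the prefix filter first, a site-independent cleanup filter second as ALL_URLS, and the sub-site matcher last as NON_SHORTCIRCUIT_ONLY means URLs outside the crawled sites are rejected before the cleanup patterns are ever evaluated, which is the inefficiency the comment points out.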