[ https://issues.apache.org/jira/browse/NUTCH-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-409:
---------------------------------------
    Patch Info: Patch Available
    Fix Version/s: 2.2
                   1.7

> Add "short circuit" notion to filters to speedup mixed site/subsite crawling
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-409
>                 URL: https://issues.apache.org/jira/browse/NUTCH-409
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Doug Cook
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>         Attachments: shortcircuit.patch
>
>
> In the case where one is crawling a mixture of sites and sub-sites, the
> prefix matcher can match the sites quite quickly, but either the regex or
> automaton filters are considerably slower matching the sub-sites. In the
> current model of AND-ing all the filters together, the pattern-matching
> filter will be run on every site that matches the prefix matcher -- even if
> that entire site is to be crawled and there are no sub-site patterns. If only
> a small portion of the sites actually need sub-site pattern matching, this is
> much slower than it should be.
>
> I propose (and attach) a simple modification allowing considerable speedup
> for this usage pattern. I define the notion of a "short circuit" match that
> means "accept this URL and don't run any of the remaining filters in the
> filter chain."
>
> Though with this change, any filter plugin can in theory return a
> short-circuit match, I have only implemented the short-circuit match for the
> PrefixURLFilter. The configuration file format is backwards-compatible;
> shortcircuit matches just have SHORTCIRCUIT: in front of them.
>
> One minor "gotcha":
> * Because the shortcircuit matches will avoid running any later filters, all
> of the site-independent filters need to be BEFORE the PrefixURLFilter in the
> chain.
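
[Editorial sketch, not part of the original message or of shortcircuit.patch.] The short-circuit idea described above can be illustrated roughly as follows. This is a minimal, self-contained Java sketch under stated assumptions: the class `ShortCircuitSketch`, the `UrlFilter` interface, `PrefixFilter`, and `runChain` are hypothetical names invented for illustration and are not Nutch's plugin API. Only the `SHORTCIRCUIT:` rule marker and the `_PASS_` token come from the issue text. Here the chain runner itself checks for the token, which may well differ from how the attached patch does it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of a short-circuiting filter chain; all names are hypothetical. */
public class ShortCircuitSketch {

    /** Token meaning "accept and skip the remaining filters" (named in the issue). */
    static final String PASS = "_PASS_";

    /** Returns the URL to keep it, null to reject it, or PASS to short-circuit. */
    interface UrlFilter {
        String filter(String url);
    }

    /** Prefix filter whose rules may carry the SHORTCIRCUIT: marker. */
    static class PrefixFilter implements UrlFilter {
        private static final String MARKER = "SHORTCIRCUIT:";
        private final List<String> plain = new ArrayList<>();
        private final List<String> shortCircuit = new ArrayList<>();

        PrefixFilter(List<String> rules) {
            for (String rule : rules) {
                if (rule.startsWith(MARKER)) {
                    shortCircuit.add(rule.substring(MARKER.length()));
                } else {
                    plain.add(rule); // unmarked rules behave as before
                }
            }
        }

        @Override
        public String filter(String url) {
            for (String prefix : shortCircuit) {
                if (url.startsWith(prefix)) return PASS; // whole site accepted
            }
            for (String prefix : plain) {
                if (url.startsWith(prefix)) return url;  // later filters still decide
            }
            return null; // no prefix matched: reject
        }
    }

    /** Runs filters in order, stopping on a rejection or a short-circuit match. */
    static String runChain(List<UrlFilter> chain, String url) {
        String current = url;
        for (UrlFilter f : chain) {
            String result = f.filter(current);
            if (result == null) return null;         // rejected
            if (PASS.equals(result)) return current; // accepted, skip the rest
            current = result;
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for a slow sub-site pattern filter (assumption for the demo).
        UrlFilter subsiteOnly = url -> url.contains("/docs/") ? url : null;
        List<UrlFilter> chain = Arrays.asList(
            new PrefixFilter(Arrays.asList(
                "SHORTCIRCUIT:http://whole-site.example/",
                "http://partial-site.example/")),
            subsiteOnly);

        System.out.println(runChain(chain, "http://whole-site.example/any/page"));
        System.out.println(runChain(chain, "http://partial-site.example/docs/a"));
        System.out.println(runChain(chain, "http://partial-site.example/blog/b"));
    }
}
```

In this sketch, a URL on a `SHORTCIRCUIT:` site never reaches the second filter, while URLs on unmarked sites are still passed through it. That is the behavior the "gotcha" above depends on: anything that must apply to every URL has to sit ahead of the prefix filter in the chain.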
> I get my best performance using the following filter chain:
> 1) The SuffixURLFilter to throw away anything with unwanted extensions
> 2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping
> mailto:, bulletin-board pages, etc.)
> 3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT
> the sites needing subsite matching
> 4) The AutomatonURLFilter to match those sites needing subsite pattern
> matching.
>
> I have tens of thousands of sites and an order of magnitude fewer subsites,
> so skipping step #4 90% of the time speeds things up considerably (my reduce
> time on a round of crawling is down from some 26 hours to less than 10).
>
> There are only two drawbacks to the implementation, and I think they're
> pretty minor:
> 1) Because I pass a special token (_PASS_) in the place of the URL to
> implement the short circuit, if for some reason someone wanted to crawl a URL
> named "_PASS_", there would be problems. I find this highly unlikely, since
> that's an invalid URL.
> 2) The correct behavior of steps #3 and #4 above depends upon coordination of
> the config files between the prefix and automaton filters, making an
> opportunity for user screwup. I thought about creating a "new kind of filter"
> which essentially combined prefix & automaton's behaviors, took one config
> file, and internally handled the short-circuiting. But I think the approach I
> took is simpler, cleaner, more flexible, and avoids creating yet another kind
> of filter. Coordinating the config files is pretty easy (I generate them
> programmatically).
>
> As this is my first contribution to Nutch I'm sure that there are things I've
> missed, whether in coding style or desired patch format. I welcome any
> feedback, suggestions, etc.
>
> Doug

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira