[ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]
Doug Cook updated NUTCH-409:
----------------------------
Attachment: shortcircuit.patch
> Add "short circuit" notion to filters to speedup mixed site/subsite crawling
> ----------------------------------------------------------------------------
>
> Key: NUTCH-409
> URL: http://issues.apache.org/jira/browse/NUTCH-409
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.8
> Reporter: Doug Cook
> Priority: Minor
> Attachments: shortcircuit.patch
>
>
> When crawling a mixture of whole sites and sub-sites, the prefix matcher
> can match the sites quite quickly, but the regex or automaton filter is
> considerably slower at matching the sub-sites. In the current model of
> AND-ing all the filters together, the pattern-matching filter is run on
> every URL that matches the prefix matcher -- even if that entire site is to
> be crawled and has no sub-site patterns. If only a small portion of the
> sites actually need sub-site pattern matching, this is much slower than it
> should be.
> I propose (and attach) a simple modification allowing considerable speedup
> for this usage pattern. I define the notion of a "short circuit" match that
> means "accept this URL and don't run any of the remaining filters in the
> filter chain."
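A minimal sketch of the short-circuit idea, not the code in shortcircuit.patch. It assumes each filter maps a URL to either the URL (accept), null (reject), or a sentinel meaning "accept and stop"; the sentinel name is taken from the _PASS_ token mentioned later in this description. The class name ShortCircuitChain and the use of UnaryOperator are illustrative assumptions, not Nutch APIs.

```java
import java.util.List;
import java.util.function.UnaryOperator;

class ShortCircuitChain {
    // Assumed sentinel meaning "accept this URL and skip the remaining filters".
    static final String PASS = "_PASS_";

    /** Returns the accepted URL, or null if some filter rejects it. */
    static String filter(List<UnaryOperator<String>> filters, String url) {
        String current = url;
        for (UnaryOperator<String> f : filters) {
            String result = f.apply(current);
            if (result == null) {
                return null;            // ordinary AND semantics: one rejection is final
            }
            if (PASS.equals(result)) {
                return current;         // short circuit: later filters never run
            }
            current = result;           // plain accept: continue down the chain
        }
        return current;
    }
}
```

Without the PASS branch this is the existing AND-ed chain, so filters that never return the sentinel behave as before.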
> With this change any filter plugin can, in theory, return a short-circuit
> match, but I have only implemented it for the PrefixURLFilter. The
> configuration file format is backwards-compatible; short-circuit matches
> just have SHORTCIRCUIT: in front of them.
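To make the config format concrete, here is a hedged sketch of how a prefix list with SHORTCIRCUIT: markers might be read and applied. The class PrefixRules, the '#' comment handling, the linear scan, and the example hostnames are all assumptions for illustration; the actual PrefixURLFilter and the attached patch may differ.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixRules {
    static final String MARKER = "SHORTCIRCUIT:"; // marker described in the issue
    static final String PASS = "_PASS_";           // sentinel named in the issue

    final List<String> shortCircuit = new ArrayList<>();
    final List<String> normal = new ArrayList<>();

    /** Splits config lines into short-circuit prefixes and ordinary prefixes. */
    static PrefixRules parse(List<String> lines) {
        PrefixRules rules = new PrefixRules();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // assumed: blank lines and '#' comments are ignored
            }
            if (line.startsWith(MARKER)) {
                rules.shortCircuit.add(line.substring(MARKER.length()).trim());
            } else {
                rules.normal.add(line); // unmarked lines keep their old meaning
            }
        }
        return rules;
    }

    /** PASS for a short-circuit prefix, the URL for a normal prefix, else null. */
    String filter(String url) {
        for (String p : shortCircuit) {
            if (url.startsWith(p)) return PASS;
        }
        for (String p : normal) {
            if (url.startsWith(p)) return url;
        }
        return null;
    }
}
```

A file with no SHORTCIRCUIT: lines yields only ordinary prefixes, which is what makes the format backwards-compatible.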
> One minor "gotcha":
> * Because the shortcircuit matches will avoid running any later filters, all
> of the site-independent filters need to be BEFORE the PrefixURLFilter in the
> chain.
> I get my best performance using the following filter chain:
> 1) The SuffixURLFilter to throw away anything with unwanted extensions
> 2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping
> mailto:, bulletin-board pages, etc.)
> 3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT
> the sites needing subsite matching
> 4) The AutomatonURLFilter to match those sites needing subsite pattern
> matching.
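The ordering "gotcha" and the four-stage chain above can be illustrated with stand-in filters. Everything here is hypothetical (class name, lambdas, hostnames, the ".gif" and "mailto:" rules); the lambdas only mimic the roles of the suffix, regex, prefix, and automaton filters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class ChainOrderDemo {
    static final String PASS = "_PASS_"; // assumed short-circuit sentinel

    // Stand-ins for the four stages; each returns the URL, null, or PASS.
    static final UnaryOperator<String> SUFFIX = u -> u.endsWith(".gif") ? null : u;
    static final UnaryOperator<String> CLEANUP = u -> u.startsWith("mailto:") ? null : u;
    static final UnaryOperator<String> PREFIX = u ->
        u.startsWith("http://whole.example/") ? PASS
            : u.startsWith("http://partial.example/") ? u : null;
    static final UnaryOperator<String> SUBSITE = u ->
        u.startsWith("http://partial.example/docs/") ? u : null;

    /** Runs the chain; PASS accepts immediately, null rejects. */
    static String run(List<UnaryOperator<String>> chain, String url) {
        String current = url;
        for (UnaryOperator<String> f : chain) {
            String result = f.apply(current);
            if (result == null) return null;
            if (PASS.equals(result)) return current;
            current = result;
        }
        return current;
    }

    // Recommended order: site-independent filters come before the prefix filter.
    static List<UnaryOperator<String>> goodOrder() {
        return Arrays.asList(SUFFIX, CLEANUP, PREFIX, SUBSITE);
    }

    // Problematic order: the short circuit fires before the suffix check runs.
    static List<UnaryOperator<String>> badOrder() {
        return Arrays.asList(PREFIX, SUFFIX, CLEANUP, SUBSITE);
    }
}
```

With goodOrder() an image on a short-circuited site is still rejected by the suffix stage; with badOrder() the same URL is accepted because the prefix stage short-circuits first.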
> I have tens of thousands of sites and an order of magnitude fewer subsites,
> so skipping step #4 90% of the time speeds things up considerably (my reduce
> time on a round of crawling is down from some 26 hours to less than 10).
> There are only two drawbacks to the implementation, and I think they're
> pretty minor:
> 1) Because I pass a special token (_PASS_) in the place of the URL to
> implement the short circuit, if for some reason someone wanted to crawl a URL
> named "_PASS_", there would be problems. I find this highly unlikely, since
> that's an invalid URL.
> 2) The correct behavior of steps #3 and #4 above depends upon coordination of
> the config files between the prefix and automaton filters, which creates an
> opportunity for user error. I thought about creating a "new kind of filter"
> which essentially combined the prefix and automaton behaviors, took one config
> file, and internally handled the short-circuiting. But I think the approach I
> took is simpler, cleaner, more flexible, and avoids creating yet another kind
> of filter. Coordinating the config files is pretty easy (I generate them
> programmatically).
> As this is my first contribution to Nutch I'm sure that there are things I've
> missed, whether in coding style or desired patch format. I welcome any
> feedback, suggestions, etc.
> Doug
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers