[jira] [Updated] (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2013-05-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-409:
--

Fix Version/s: 1.8

 Add short circuit notion to filters to speedup mixed site/subsite crawling
 

 Key: NUTCH-409
 URL: https://issues.apache.org/jira/browse/NUTCH-409
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cook
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: shortcircuit.patch


 In the case where one is crawling a mixture of sites and sub-sites, the 
 prefix matcher can match the sites quite quickly, but either the regex or 
 automaton filters are considerably slower matching the sub-sites. In the 
 current model of AND-ing all the filters together, the pattern-matching 
 filter will be run on every site that matches the prefix matcher -- even if 
 that entire site is to be crawled and there are no sub-site patterns. If only 
 a small portion of the sites actually need sub-site pattern matching, this is 
 much slower than it should be.
 I propose (and attach) a simple modification allowing considerable speedup 
 for this usage pattern. I define the notion of a "short-circuit" match, 
 which means "accept this URL and don't run any of the remaining filters in 
 the filter chain." 
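
 The idea above can be sketched roughly as follows. This is a minimal 
 illustration, not the code in shortcircuit.patch: the class name 
 ShortCircuitChain, the use of UnaryOperator as a stand-in for a URL filter, 
 and the example.* hosts are all assumptions made for the example. It relies 
 only on the convention that a filter returns the URL to accept it or null to 
 reject it, plus the _PASS_ token mentioned later in this description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a filter chain; not the actual Nutch classes.
class ShortCircuitChain {
    // Sentinel a filter returns in place of the URL to request a short circuit.
    static final String PASS = "_PASS_";

    // Runs the filters in order. A filter returns the (possibly rewritten) URL
    // to accept it, null to reject it, or PASS to accept it and skip the rest.
    static String filter(List<UnaryOperator<String>> filters, String url) {
        String current = url;
        for (UnaryOperator<String> f : filters) {
            String result = f.apply(current);
            if (result == null) {
                return null;        // rejected: stop immediately
            }
            if (PASS.equals(result)) {
                return current;     // short circuit: accept, skip later filters
            }
            current = result;       // ordinary accept: keep AND-ing
        }
        return current;
    }

    public static void main(String[] args) {
        // Cheap prefix check: one site is crawled whole, one needs sub-site rules.
        UnaryOperator<String> prefix = u ->
                u.startsWith("http://whole.example/") ? PASS
                : u.startsWith("http://partial.example/") ? u : null;
        // Stand-in for the slower pattern-matching filter.
        UnaryOperator<String> subsite = u -> u.contains("/docs/") ? u : null;

        List<UnaryOperator<String>> chain = Arrays.asList(prefix, subsite);
        System.out.println(filter(chain, "http://whole.example/anything"));
        System.out.println(filter(chain, "http://partial.example/docs/a"));
        System.out.println(filter(chain, "http://partial.example/blog/a"));
    }
}
```

 In this sketch the second filter is never invoked for URLs under the 
 whole-site prefix, which is where the saving is meant to come from.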
 With this change any filter plugin can, in theory, return a short-circuit 
 match, but I have only implemented it for the PrefixURLFilter. The 
 configuration file format is backwards-compatible; short-circuit matches 
 just have SHORTCIRCUIT: in front of them.
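
 To make the configuration format concrete, here is a hedged sketch of how 
 prefix lines with and without the SHORTCIRCUIT: marker might be read and 
 applied. The class PrefixRules, the '#' comment handling, the linear scan 
 over prefixes, and the example.* hosts are assumptions for illustration; the 
 real PrefixURLFilter and the attached patch may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for a prefix config; not the actual PrefixURLFilter.
class PrefixRules {
    static final String MARKER = "SHORTCIRCUIT:";
    static final String PASS = "_PASS_";

    private final List<String> plain = new ArrayList<>();
    private final List<String> shortCircuit = new ArrayList<>();

    PrefixRules(List<String> configLines) {
        for (String raw : configLines) {
            String line = raw.trim();
            // Assumption: blank lines and '#' comments are ignored.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            if (line.startsWith(MARKER)) {
                String prefix = line.substring(MARKER.length()).trim();
                if (!prefix.isEmpty()) {
                    shortCircuit.add(prefix);
                }
            } else {
                plain.add(line);    // old-style line: behaves as before
            }
        }
    }

    // Returns PASS for a short-circuit prefix, the URL for a plain prefix,
    // or null when no prefix matches. Short-circuit prefixes are checked
    // first, an arbitrary choice made for this sketch.
    String filter(String url) {
        for (String p : shortCircuit) {
            if (url.startsWith(p)) return PASS;
        }
        for (String p : plain) {
            if (url.startsWith(p)) return url;
        }
        return null;
    }

    public static void main(String[] args) {
        PrefixRules rules = new PrefixRules(Arrays.asList(
                "# sites crawled in full",
                "SHORTCIRCUIT:http://whole.example/",
                "# sites that still need sub-site matching",
                "http://partial.example/"));
        System.out.println(rules.filter("http://whole.example/page"));
        System.out.println(rules.filter("http://partial.example/docs/"));
        System.out.println(rules.filter("http://other.example/"));
    }
}
```

 A config file with no SHORTCIRCUIT: lines goes entirely down the plain path 
 here, which is the sense in which the format stays backwards-compatible.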
 One minor gotcha:
 * Because short-circuit matches skip all later filters, all of the 
 site-independent filters need to come BEFORE the PrefixURLFilter in the 
 chain.
 I get my best performance using the following filter chain:
 1) The SuffixURLFilter to throw away anything with unwanted extensions
 2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping 
 mailto:, bulletin-board pages, etc.)
 3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT 
 the sites needing subsite matching
 4) The AutomatonURLFilter to match those sites needing subsite pattern 
 matching.
 I have tens of thousands of sites and an order of magnitude fewer subsites, 
 so skipping step #4 90% of the time speeds things up considerably (my reduce 
 time on a round of crawling is down from some 26 hours to less than 10).
 There are only two drawbacks to the implementation, and I think they're 
 pretty minor:
 1) Because I pass a special token (_PASS_) in the place of the URL to 
 implement the short circuit, if for some reason someone wanted to crawl a URL 
 named _PASS_, there would be problems. I find this highly unlikely, since 
 that's an invalid URL.
 2) The correct behavior of steps #3 and #4 above depends upon coordination of 
 the config files between the prefix and automaton filters, making an 
 opportunity for user screwup. I thought about creating a new kind of filter 
 which essentially combined the prefix and automaton filters' behaviors, took one config 
 file, and internally handled the short-circuiting. But I think the approach I 
 took is simpler, cleaner, more flexible, and avoids creating yet another kind 
 of filter. Coordinating the config files is pretty easy (I generate them 
 programmatically).
 As this is my first contribution to Nutch I'm sure that there are things I've 
 missed, whether in coding style or desired patch format. I welcome any 
 feedback, suggestions, etc.
 Doug

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-409:
---

   Patch Info: Patch Available
Fix Version/s: 2.2, 1.7

 Add short circuit notion to filters to speedup mixed site/subsite crawling
 

 Key: NUTCH-409
 URL: https://issues.apache.org/jira/browse/NUTCH-409
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cook
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: shortcircuit.patch


