Hi Doug,

Your idea about combining PrefixURLFilter and AutomatonURLFilter sounds
interesting. Could you please attach the patch to JIRA?

Thanks
- Scott

On 11/17/06, Doug Cook <[EMAIL PROTECTED]> wrote:
>
> Hi, folks,
>
> I, too, was slowed down by reduce operations in fetch. Some benchmarking
> showed that in my case, the limiting operation was filtering (though a
> distant second was the time spent calculating Levenshtein distances,
> presumably part of the spellchecking that Sami just removed to speed things
> up, though I haven't looked at it yet).
>
> I've fixed the problem, and my reduce speed is better by about a factor of
> three. However, the fix is limited to certain usage patterns.
>
> In my case, I have tens of thousands of sites and subsites I'm crawling, and
> I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I
> essentially use the prefix filter to limit to the set of sites, and then
> automaton to pattern-match within those sites. I only have subsite matches
> on < 10% of the sites, however, so I was clearly wasting a lot of time
> running the automaton patterns on sites that didn't need them. And automaton,
> though much faster than RegexURLFilter, is still dog-slow with that many
> patterns.
>
> A simple fix was to extend the current "AND all the filters together" model
> to have the notion of a "short-circuit" match, which allows a filter to say
> "let this URL through and DON'T run the other filters" by returning a
> special token to URLFilters. Now I have a version of PrefixURLFilter that
> can return both "normal" matches and "short-circuit" matches, and only
> returns "normal" matches for those sites that need to run subsite patterns.
> It seems to work well, the overhead is negligible when not in use, and the
> speedup is massive for my usage pattern.
>
> I'd like to contribute it back, if people would find this useful (not that
> it's rocket science!).
>
> First, is there anyone out there besides me who would find this useful?
>
> Second, I've been thinking about the best way to handle PrefixURLFilter
> configuration. I can see a few options:
>
> 1. Have two different config files, one for "normal" matches, and one for
> "short-circuit" matches.
> 2. Have one config file, with a syntax to say "make this pattern a
> short-circuit match," and make the default be a "normal" match, so it is
> backwards compatible with the current version.
> 3. Make a new type of filter which internally combines Prefix and Automaton,
> takes one config file, and decides internally which patterns should generate
> automaton inputs vs "normal" or "short-circuit" prefix matches.
>
> Approach #3 requires no changes to the URLFilter model, and makes it
> difficult to screw up by making config files which are inconsistent (e.g.
> forgetting to put in a prefix pattern for one of the automaton patterns). It
> is also the least flexible, requires the most code, and introduces yet
> another kind of filter.
>
> I tend to like the changed URLFilter model; it's more flexible, even if it
> requires a little more care in configuration (a simple Perl script, in my
> case, to generate the config files correctly and consistently). I'm leaning
> towards approach #2. I'm thinking something simple, syntax-wise, like
> putting SHORTCIRCUIT: before the patterns which should short-circuit. Any
> suggestions for a better syntax? Or reasons why I should consider a
> different approach?
>
> Doug
>
> --
> View this message in context:
> http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
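
For anyone following the thread, here is a rough sketch of how the
short-circuit model and the SHORTCIRCUIT: syntax from approach #2 might fit
together. It is not Doug's patch and not Nutch's actual API: the Verdict enum,
the Filter interface, the PrefixFilter class, the accept() helper, the '#'
comment handling, and the rule that short-circuit prefixes are checked first
are all assumptions made for illustration. A real prefix filter would
presumably use a trie rather than a linear scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical, not Nutch code. */
final class ShortCircuitSketch {

    /** Possible outcomes of one filter in the chain. */
    enum Verdict { REJECT, ACCEPT, ACCEPT_SHORT_CIRCUIT }

    /** Stand-in for a URL filter that can signal a short-circuit match. */
    interface Filter {
        Verdict filter(String url);
    }

    /** Prefix filter built from config lines; "SHORTCIRCUIT:" marks special prefixes. */
    static final class PrefixFilter implements Filter {
        private static final String MARKER = "SHORTCIRCUIT:";
        private final List<String> normal = new ArrayList<>();
        private final List<String> shortCircuit = new ArrayList<>();

        PrefixFilter(List<String> configLines) {
            for (String raw : configLines) {
                String line = raw.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // assumed: blank lines and '#' comments are ignored
                }
                if (line.startsWith(MARKER)) {
                    shortCircuit.add(line.substring(MARKER.length()).trim());
                } else {
                    normal.add(line); // unmarked lines stay "normal" (backwards compatible)
                }
            }
        }

        @Override
        public Verdict filter(String url) {
            // Assumed precedence: short-circuit prefixes are checked before normal ones.
            for (String prefix : shortCircuit) {
                if (url.startsWith(prefix)) {
                    return Verdict.ACCEPT_SHORT_CIRCUIT;
                }
            }
            for (String prefix : normal) {
                if (url.startsWith(prefix)) {
                    return Verdict.ACCEPT;
                }
            }
            return Verdict.REJECT;
        }
    }

    /** ANDs the filters together, but stops early on a short-circuit accept. */
    static boolean accept(List<Filter> chain, String url) {
        for (Filter f : chain) {
            Verdict v = f.filter(url);
            if (v == Verdict.REJECT) {
                return false;
            }
            if (v == Verdict.ACCEPT_SHORT_CIRCUIT) {
                return true; // remaining (expensive) filters are skipped
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Filter prefix = new PrefixFilter(Arrays.asList(
                "# whole site, no subsite patterns needed",
                "SHORTCIRCUIT:http://a.example/",
                "http://b.example/"));
        // Stand-in for the automaton filter: only /docs/ pages pass.
        Filter subsite = u -> u.contains("/docs/") ? Verdict.ACCEPT : Verdict.REJECT;
        List<Filter> chain = Arrays.asList(prefix, subsite);

        System.out.println(accept(chain, "http://a.example/anything"));
        System.out.println(accept(chain, "http://b.example/docs/page"));
        System.out.println(accept(chain, "http://b.example/other"));
    }
}
```

In this sketch, URLs under a SHORTCIRCUIT: prefix never reach the second
filter, which is where the saving would come from when most sites have no
subsite patterns. Unmarked prefixes behave as they do today, so an existing
config file would keep working unchanged.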
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
