Hi, folks,

I, too, was slowed down by reduce operations in fetch. Some benchmarking
showed that in my case, the limiting operation was filtering (though a
distant second was the time spent calculating Levenshtein distances,
presumably part of the spellchecking that Sami just removed to speed things
up, though I haven't looked at it yet).

I've fixed the problem, and my reduce speed is better by about a factor of
three. However, the fix is limited to certain usage patterns.

In my case, I have tens of thousands of sites and subsites I'm crawling, and
I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I
essentially use the prefix filter to limit to the set of sites, and then
automaton to pattern-match within those sites. I only have subsite matches
on < 10% of the sites, however, so I was clearly wasting a lot of time
running the automaton patterns against URLs that didn't need them. And
automaton, though much faster than RegexURLFilter, is still dog-slow with
that many patterns.

A simple fix was to extend the current "AND all the filters together" model
to have the notion of a "short-circuit" match, which allows a filter to say
"let this URL through and DON'T run the other filters" by returning a
special token to URLFilters. Now I have a version of PrefixURLFilter that
can return both "normal" matches and "short circuit" matches, and only
returns "normal" matches for those sites that need to run subsite patterns.
It seems to work well: the overhead is negligible when not in use, and the
speedup is massive for my usage pattern.
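
To make the idea concrete, here is a rough sketch of the short-circuit
model described above. It is not the actual patch and does not use the real
Nutch interfaces: ShortCircuitFilter, Verdict, FilterChain and
DemoPrefixFilter are all made-up names for illustration, and the linear
prefix scan is only a stand-in for whatever lookup the real prefix filter
does.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical filter contract (an assumption, not the Nutch URLFilter API). */
interface ShortCircuitFilter {
    /** REJECT drops the URL, ACCEPT continues the chain, ACCEPT_AND_STOP skips the rest. */
    enum Verdict { REJECT, ACCEPT, ACCEPT_AND_STOP }

    Verdict check(String url);
}

/** ANDs filters together, but lets any filter end the chain early with a pass. */
final class FilterChain {
    private final List<ShortCircuitFilter> filters;

    FilterChain(List<ShortCircuitFilter> filters) {
        this.filters = filters;
    }

    /** Returns the URL if it passes, or null if some filter rejects it. */
    String filter(String url) {
        for (ShortCircuitFilter f : filters) {
            ShortCircuitFilter.Verdict v = f.check(url);
            if (v == ShortCircuitFilter.Verdict.REJECT) {
                return null;            // normal AND semantics: one rejection is final
            }
            if (v == ShortCircuitFilter.Verdict.ACCEPT_AND_STOP) {
                return url;             // short circuit: later filters never run
            }
        }
        return url;                     // every filter returned a normal ACCEPT
    }
}

/** Illustrative prefix filter: each prefix is flagged as short-circuit or normal. */
final class DemoPrefixFilter implements ShortCircuitFilter {
    private final Map<String, Boolean> prefixes;  // prefix -> short-circuit?

    DemoPrefixFilter(Map<String, Boolean> prefixes) {
        this.prefixes = new LinkedHashMap<>(prefixes);
    }

    @Override
    public Verdict check(String url) {
        // First matching prefix wins; a linear scan keeps the sketch short.
        for (Map.Entry<String, Boolean> e : prefixes.entrySet()) {
            if (url.startsWith(e.getKey())) {
                return e.getValue() ? Verdict.ACCEPT_AND_STOP : Verdict.ACCEPT;
            }
        }
        return Verdict.REJECT;
    }
}
```

With a chain of DemoPrefixFilter followed by an expensive pattern filter,
URLs under a short-circuit prefix never reach the second filter, while URLs
under a "normal" prefix still have to pass both.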

I'd like to contribute it back, if people would find this useful (not that
it's rocket science!).

First, is there anyone out there besides me who would find this useful?

Second, I've been thinking about the best way to handle PrefixURLFilter
configuration. I can see a few options:

1. Have two different config files, one for "normal" matches, and one for
"short-circuit" matches.
2. Have one config file, with a syntax to say "make this pattern a
short-circuit match," and make the default be a "normal" match, so it is
backwards compatible with the current version.
3. Make a new type of filter which internally combines Prefix and Automaton,
takes one config file, and decides internally which patterns should generate
automaton inputs vs "normal" or "short circuit" prefix matches.

Approach #3 requires no changes to the URLFilter model, and makes it hard to
end up with inconsistent config files (e.g. forgetting to put in a prefix
pattern for one of the automaton patterns). It is also the least flexible,
requires the most code, and introduces yet another kind of filter.

I tend to like the changed URLFilter model; it's more flexible, even if it
requires a little more care in configuration (a simple Perl script, in my
case, to generate the config files correctly and consistently). I'm leaning
towards approach #2. I'm thinking something simple, syntax-wise, like
putting SHORTCIRCUIT: before the patterns which should short-circuit. Any
suggestions for a better syntax? Or reasons why I should consider a
different approach?
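
For approach #2, the parsing could look something like the sketch below.
This is only an illustration of the proposed syntax, not existing code:
PrefixConfigParser is a hypothetical name, and the handling of blank lines
and '#' comments is an assumption about what the config file should allow.
Lines without the marker come out as "normal" matches, which is what keeps
old config files working unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for a prefix config with optional SHORTCIRCUIT: markers. */
final class PrefixConfigParser {
    static final String MARKER = "SHORTCIRCUIT:";

    /**
     * Maps each prefix to true (short-circuit match) or false (normal match).
     * Example input lines:
     *   # sites with no subsite patterns
     *   SHORTCIRCUIT:http://a.example/
     *   # sites that still need the automaton filter
     *   http://b.example/
     */
    static Map<String, Boolean> parse(Iterable<String> lines) {
        Map<String, Boolean> prefixes = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;                       // skip blanks and comments (assumed syntax)
            }
            boolean shortCircuit = line.startsWith(MARKER);
            String prefix = shortCircuit
                    ? line.substring(MARKER.length()).trim()
                    : line;                     // unmarked lines default to "normal"
            if (!prefix.isEmpty()) {
                prefixes.put(prefix, shortCircuit);
            }
        }
        return prefixes;
    }
}
```

One consequence of this syntax is that a real prefix can never itself begin
with the literal text SHORTCIRCUIT:, which seems harmless for URL prefixes
since they start with a scheme such as http://.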

Doug

-- 
View this message in context: 
http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
Sent from the Nutch - Dev mailing list archive at Nabble.com.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers