Hi Doug,

Your idea about combining PrefixURLFilter and AutomatonURLFilter
sounds interesting. Could you please attach the patch to JIRA? Thanks.

- Scott

On 11/17/06, Doug Cook <[EMAIL PROTECTED]> wrote:
>
> Hi, folks,
>
> I, too, was slowed down by reduce operations in fetch. Some benchmarking
> showed that in my case, the limiting operation was filtering (though a
> distant second was the time spent calculating Levenshtein distances,
> presumably part of the spellchecking that Sami just removed to speed things
> up, though I haven't looked at it yet).
>
> I've fixed the problem, and my reduce speed is better by about a factor of
> three. However, the fix is limited to certain usage patterns.
>
> In my case, I have tens of thousands of sites and subsites I'm crawling, and
> I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I
> essentially use the prefix filter to limit to the set of sites, and then
> automaton to pattern-match within those sites. I only have subsite matches
> on < 10% of the sites, however, so I was clearly wasting a lot of time
> running the automaton patterns on sites that didn't need them. And
> automaton, though much faster than RegexURLFilter, is still dog-slow with
> that many patterns.
>
> A simple fix was to extend the current "AND all the filters together" model
> to have the notion of a "short-circuit" match, which allows a filter to say
> "let this URL through and DON'T run the other filters" by returning a
> special token to URLFilters. Now I have a version of PrefixURLFilter that
> can return both "normal" matches and "short circuit" matches, and only
> returns "normal" matches for those sites that need to run subsite patterns.
> It seems to work well, the overhead is negligible when not in use, and the
> speedup is massive for my usage pattern.
>
> I'd like to contribute it back, if people would find this useful (not that
> it's rocket science!).
>
> First, is there anyone out there besides me who would find this useful?
>
> Second, I've been thinking about the best way to handle PrefixURLFilter
> configuration. I can see a few options:
>
> 1. Have two different config files, one for "normal" matches, and one for
> "short-circuit" matches.
> 2. Have one config file, with a syntax to say "make this pattern a
> short-circuit match," and make the default be a "normal" match, so it is
> backwards compatible with the current version.
> 3. Make a new type of filter which internally combines Prefix and Automaton,
> takes one config file, and decides internally which patterns should generate
> automaton inputs vs "normal" or "short circuit" prefix matches.
>
> Approach #3 requires no changes to the URLFilter model, and makes it
> difficult to screw up by making config files which are inconsistent (e.g.
> forgetting to put in a prefix pattern for one of the automaton patterns). It
> is also the least flexible, requires the most code, and introduces yet
> another kind of filter.
>
> I tend to like the changed URLFilter model; it's more flexible, even if it
> requires a little more care in configuration (a simple Perl script, in my
> case, to generate the config files correctly and consistently). I'm leaning
> towards approach #2. I'm thinking of something simple, syntax-wise, like
> putting SHORTCIRCUIT: before the patterns which should short-circuit. Any
> suggestions for a better syntax? Or reasons why I should consider a
> different approach?
>
> Doug
>
> --
> View this message in context: 
> http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>
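
For anyone following the thread, here is a minimal sketch of the
short-circuit model quoted above, combined with the SHORTCIRCUIT: prefix
syntax from option #2. It is not Doug's patch and it does not use the real
Nutch URLFilter API: the Verdict enum, the Filter interface, PrefixFilter,
and accept() are hypothetical names made up for illustration. Skipping "#"
comment lines, checking short-circuit prefixes before normal ones, and the
linear prefix scan are all assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class ShortCircuitSketch {
    /** Hypothetical three-way result; stands in for the "special token". */
    enum Verdict { REJECT, ACCEPT, ACCEPT_SHORT_CIRCUIT }

    /** Hypothetical filter interface used only by this sketch. */
    interface Filter {
        Verdict filter(String url);
    }

    /** Prefix filter whose config lines may start with "SHORTCIRCUIT:". */
    static final class PrefixFilter implements Filter {
        static final String MARKER = "SHORTCIRCUIT:";
        private final List<String> normal = new ArrayList<>();
        private final List<String> shortCircuit = new ArrayList<>();

        PrefixFilter(List<String> configLines) {
            for (String raw : configLines) {
                String line = raw.trim();
                // Assumption: blank lines and "#" comments are ignored.
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                if (line.startsWith(MARKER)) {
                    shortCircuit.add(line.substring(MARKER.length()).trim());
                } else {
                    // Unmarked lines stay "normal" matches, so existing
                    // config files keep their current meaning.
                    normal.add(line);
                }
            }
        }

        @Override
        public Verdict filter(String url) {
            // Assumption: a short-circuit prefix wins over a normal one.
            for (String prefix : shortCircuit) {
                if (url.startsWith(prefix)) {
                    return Verdict.ACCEPT_SHORT_CIRCUIT;
                }
            }
            for (String prefix : normal) {
                if (url.startsWith(prefix)) {
                    return Verdict.ACCEPT;
                }
            }
            return Verdict.REJECT;
        }
    }

    /** ANDs the filters together, stopping early on a short-circuit match. */
    static boolean accept(List<Filter> chain, String url) {
        for (Filter f : chain) {
            Verdict v = f.filter(url);
            if (v == Verdict.REJECT) {
                return false;
            }
            if (v == Verdict.ACCEPT_SHORT_CIRCUIT) {
                return true; // remaining filters are not run
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Filter prefix = new PrefixFilter(List.of(
                "SHORTCIRCUIT:http://a.example/",
                "http://b.example/"));
        // Stand-in for the expensive automaton patterns.
        Filter pattern = url ->
                url.contains("/docs/") ? Verdict.ACCEPT : Verdict.REJECT;
        List<Filter> chain = List.of(prefix, pattern);
        System.out.println(accept(chain, "http://a.example/anything"));
        System.out.println(accept(chain, "http://b.example/docs/1"));
        System.out.println(accept(chain, "http://b.example/other"));
    }
}
```

With this layout, only sites listed without the marker pay for the pattern
filter; whole-site entries carry the marker and return after the prefix check.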

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
