[ http://issues.apache.org/jira/browse/NUTCH-410?page=all ]
Doug Cook updated NUTCH-410:
----------------------------
Attachment: betterRegexNorm.patch
> Faster RegexNormalize with more features
> ----------------------------------------
>
> Key: NUTCH-410
> URL: http://issues.apache.org/jira/browse/NUTCH-410
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.8
> Environment: Tested on MacOS X 10.4.7/10.4.8
> Reporter: Doug Cook
> Priority: Minor
> Attachments: betterRegexNorm.patch
>
>
> The patch associated with this is backwards-compatible and has several
> improvements over the stock 0.8 RegexURLNormalizer:
> 1) About a 34% performance improvement, from only executing the superclass
> (BasicURLNormalizer) once in most cases, instead of twice as the stock
> version did.
> 2) Support for expensive host-specific normalizations with good performance.
> Each <regex> block optionally takes a list of hosts to which to apply the
> associated regex. If supplied, the regex will only be applied to these hosts.
> This should have scalable performance; the comparison is O(1) regardless of
> the number of hosts. The format is:
> <regex>
>   <host>www.host1.com</host>
>   <host>host2.site2.com</host>
>   <pattern> my pattern here </pattern>
>   <substitution> my substitution here </substitution>
> </regex>
> 3) Support for decoding URLs with escaped character encodings (e.g. %20,
> etc.). This is useful, for example, to decode "jump redirects" which have the
> target URL encoded within the source, as on Yahoo. I tried to create an
> extensible notion of "options," the first of which is "unescape." The
> unescape function is applied *after* the substitution and *only* if the
> substitution pattern matches. A simple pattern to unescape Yahoo directory
> redirects would be something like:
> <regex>
>   <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&]+)</pattern>
>   <substitution>$1</substitution>
>   <options>unescape</options>
> </regex>
> 4) Added the notion of iterating the pattern chain. This is useful when the
> result of a normalization can itself be normalized. While some of this can be
> handled in the stock version by repeating patterns, or by careful ordering of
> patterns, the notion of iterating is cleaner and more powerful. The chain is
> defined to iterate only when the previous iteration changes the input, up to
> a configurable maximum number of iterations. The config parameter to change
> is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous
> behavior). The change is performance-neutral when disabled, and has a
> relatively small performance cost when enabled.
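For readers who want to see how items 2 and 4 fit together, here is a minimal sketch. It is not the code from betterRegexNorm.patch; the class name, the Rule type, the demo rules and the sample URLs are all made up for illustration. It assumes the host list is kept in a hash set (which is what would make the per-URL host check O(1)) and that the chain is re-run only while the previous pass changed the URL, up to a maximum number of iterations.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not the patch's actual classes.
class RegexChainSketch {

    /** One <regex> block: optional host list, pattern, substitution. */
    static final class Rule {
        final Set<String> hosts;      // empty set means "apply to every host"
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution, String... hosts) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
            this.hosts = new HashSet<>(Arrays.asList(hosts));
        }

        /** Hash lookup, so the cost does not grow with the number of hosts. */
        boolean appliesTo(String host) {
            return hosts.isEmpty() || hosts.contains(host);
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final int maxIterations;  // stands in for urlnormalizer.regex.maxiterations

    RegexChainSketch(int maxIterations) {
        this.maxIterations = maxIterations;
    }

    void add(Rule rule) {
        rules.add(rule);
    }

    /** Extracts the lower-cased host, or "" if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host.toLowerCase();
        } catch (URISyntaxException e) {
            return "";
        }
    }

    /** Runs the rule chain, repeating only while a pass changed the URL. */
    String normalize(String url) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String host = hostOf(current);  // re-read: a rule may have changed the host
            String next = current;
            for (Rule rule : rules) {
                if (rule.appliesTo(host)) {
                    next = rule.pattern.matcher(next).replaceAll(rule.substitution);
                }
            }
            if (next.equals(current)) {
                break;                      // nothing changed, stop iterating
            }
            current = next;
        }
        return current;
    }

    /** Builds a two-rule demo chain (both rules are invented examples). */
    static RegexChainSketch demo(int maxIterations) {
        RegexChainSketch chain = new RegexChainSketch(maxIterations);
        chain.add(new Rule("/index\\.html$", "/"));                    // all hosts
        chain.add(new Rule("[?&]sid=[0-9a-f]+", "", "www.host1.com")); // one host only
        return chain;
    }

    public static void main(String[] args) {
        String url = "http://www.host1.com/index.html?sid=abc123";
        System.out.println(demo(1).normalize(url));
        System.out.println(demo(3).normalize(url));
    }
}
```

With the demo rules in this order, a single pass only strips the session id on www.host1.com; a second pass is what lets the index.html rule fire on the result, which is the case where iterating helps. URLs on other hosts never reach the host-restricted rule.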
> Pardon any potentially unconventional Java on my part. I've got lots of C/C++
> search engine experience, but Nutch is my first large Java app. I welcome any
> feedback, and hope this is useful.
> Doug
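The "unescape" option from item 3 can be sketched in a similar way. Again this is only an illustration under stated assumptions, not the patch itself: the class name and the sample redirect URL are invented, and java.net.URLDecoder is used as a stand-in for whatever decoding the patch performs. The pattern and substitution are the ones quoted in the description; decoding happens after the substitution and only when the pattern matched.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a rule carrying the "unescape" option.
class UnescapeOptionSketch {

    // Pattern taken from the issue description (Java string escaping added).
    static final Pattern REDIRECT =
        Pattern.compile("^http://[a-z\\.]*\\.yahoo\\.com/.*/\\*+(http[^&]+)");

    /**
     * Applies the substitution "$1" and then percent-decodes the result,
     * but only if the pattern matched; other URLs are returned untouched.
     */
    static String normalize(String url) {
        Matcher m = REDIRECT.matcher(url);
        if (!m.find()) {
            return url;                      // no match: no substitution, no decoding
        }
        String substituted = m.replaceFirst("$1");
        // Assumption: URLDecoder stands in for the patch's decoder. Note that
        // it also turns '+' into a space, which a real normalizer may not want.
        return URLDecoder.decode(substituted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Invented redirect URL with the target percent-encoded inside it.
        System.out.println(normalize(
            "http://rds.yahoo.com/S=123/K=foo/*http%3A//www.example.com/%7Euser/"));
        // Does not match the pattern, so its %20 is left alone.
        System.out.println(normalize("http://www.example.com/a%20b"));
    }
}
```

Restricting the decode step to matched URLs keeps escapes in ordinary URLs intact, which matters because a blanket decode could change what a URL points to.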
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers