[ https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-410:
---------------------------------------

       Patch Info: Patch Available
    Fix Version/s: 2.2
                   1.7

> Faster RegexNormalize with more features
> ----------------------------------------
>
>                 Key: NUTCH-410
>                 URL: https://issues.apache.org/jira/browse/NUTCH-410
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: Tested on MacOS X 10.4.7/10.4.8
>            Reporter: Doug Cook
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>         Attachments: betterRegexNorm.patch
>
>
> The patch associated with this is backwards-compatible and has several
> improvements over the stock 0.8 RegexURLNormalizer:
> 1) About a 34% performance improvement, from only executing the superclass
> (BasicURLNormalizer) once in most cases, instead of twice as the stock
> version did.
> 2) Support for expensive host-specific normalizations with good performance.
> Each <regex> block optionally takes a list of hosts to which to apply the
> associated regex. If supplied, the regex will only be applied to these hosts.
> This should have scalable performance; the comparison is O(1) regardless of
> the number of hosts. The format is:
> <regex>
>   <host>www.host1.com</host>
>   <host>host2.site2.com</host>
>   <pattern> my pattern here </pattern>
>   <substitution> my substitution here </substitution>
> </regex>
> 3) Support for decoding URLs with escaped character encodings (e.g. %20,
> etc.). This is useful, for example, to decode "jump redirects" which have the
> target URL encoded within the source, as on Yahoo. I tried to create an
> extensible notion of "options," the first of which is "unescape." The
> unescape function is applied *after* the substitution and *only* if the
> substitution pattern matches.
> A simple pattern to unescape Yahoo directory redirects would be something like:
> <regex>
>   <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&]+)</pattern>
>   <substitution>$1</substitution>
>   <options>unescape</options>
> </regex>
> 4) Added the notion of iterating the pattern chain. This is useful when the
> result of a normalization can itself be normalized. While some of this can be
> handled in the stock version by repeating patterns, or by careful ordering of
> patterns, the notion of iterating is cleaner and more powerful. The chain is
> defined to iterate only when the previous iteration changes the input, up to
> a configurable maximum number of iterations. The config parameter to change
> is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous
> behavior). The change is performance-neutral when disabled, and has a
> relatively small performance cost when enabled.
> Pardon any potentially unconventional Java on my part. I've got lots of C/C++
> search engine experience, but Nutch is my first large Java app. I welcome any
> feedback, and hope this is useful.
> Doug
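
For readers without the patch at hand, the behaviour described in points 2) to 4) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in betterRegexNorm.patch: the class RegexRuleChainSketch, its Rule type, and the helpers hostOf, decode, normalize and demoRules are all hypothetical names, and using java.net.URLDecoder for the "unescape" option is an assumption (it also turns '+' into a space, which the real patch may or may not do). The host-scoped /index.html rule is invented for the demo; the Yahoo pattern is the one quoted in the description.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the code from betterRegexNorm.patch.
class RegexRuleChainSketch {

    /** One <regex> block: optional hosts, a pattern, a substitution, the "unescape" option. */
    static final class Rule {
        final Set<String> hosts;      // empty means "apply to every host"
        final Pattern pattern;
        final String substitution;
        final boolean unescape;

        Rule(Set<String> hosts, String regex, String substitution, boolean unescape) {
            this.hosts = hosts;
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
            this.unescape = unescape;
        }

        String apply(String url) {
            // With a hash-based set, this check does not grow with the number of hosts.
            if (!hosts.isEmpty() && !hosts.contains(hostOf(url))) {
                return url;
            }
            Matcher m = pattern.matcher(url);
            if (!m.find()) {
                return url;           // no match: no substitution and no unescape
            }
            String result = m.replaceAll(substitution);
            return unescape ? decode(result) : result;   // unescape runs after substitution
        }
    }

    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? "" : host;
        } catch (IllegalArgumentException e) {
            return "";                // unparsable URL: treat as having no host
        }
    }

    static String decode(String s) {
        try {
            // Assumption: URLDecoder stands in for "unescape"; it also maps '+' to a space.
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return s;                 // leave malformed escapes untouched
        }
    }

    /** Runs the whole chain repeatedly, stopping when a pass changes nothing. */
    static String normalize(String url, List<Rule> rules, int maxIterations) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String next = current;
            for (Rule rule : rules) {
                next = rule.apply(next);
            }
            if (next.equals(current)) {
                break;                // fixed point reached
            }
            current = next;
        }
        return current;
    }

    /** Demo chain: an invented host-scoped rule, then the Yahoo rule from the description. */
    static List<Rule> demoRules() {
        Rule indexPage = new Rule(
            new HashSet<>(Arrays.asList("www.host1.com")),
            "/index\\.html$", "/", false);
        Rule yahooJump = new Rule(
            Collections.<String>emptySet(),
            "^http://[a-z\\.]*\\.yahoo\\.com/.*/\\*+(http[^&]+)", "$1", true);
        return Arrays.asList(indexPage, yahooJump);
    }

    public static void main(String[] args) {
        String url = "http://rds.yahoo.com/x/*http%3A//www.host1.com/index.html";
        System.out.println(normalize(url, demoRules(), 1)); // http://www.host1.com/index.html
        System.out.println(normalize(url, demoRules(), 3)); // http://www.host1.com/
    }
}
```

With a single pass (the default of 1 for urlnormalizer.regex.maxiterations), the host-scoped rule is skipped because the URL still points at the Yahoo host, so only the redirect is unwrapped. Allowing more passes lets the host-scoped rule fire on the unwrapped URL, which is the case point 4) is meant to cover without having to reorder or repeat patterns.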