[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features

Sebastian Nagel (JIRA) Wed, 22 May 2013 00:21:26 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Nagel updated NUTCH-410:
----------------------------------

    Fix Version/s: 1.8
    
> Faster RegexNormalize with more features
> ----------------------------------------
>
>                 Key: NUTCH-410
>                 URL: https://issues.apache.org/jira/browse/NUTCH-410
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: Tested on MacOS X 10.4.7/10.4.8
>            Reporter: Doug Cook
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: betterRegexNorm.patch
>
>
> The patch associated with this is backwards-compatible and has several 
> improvements over the stock 0.8 RegexURLNormalizer:
> 1) About a 34% performance improvement, from only executing the superclass 
> (BasicURLNormalizer) once in most cases, instead of twice as the stock 
> version did. 
> 2) Support for expensive host-specific normalizations with good performance. 
> Each <regex> block optionally takes a list of hosts to which to apply the 
> associated regex. If supplied, the regex will only be applied to these hosts. 
> This should have scalable performance; the comparison is O(1) regardless of 
> the number of hosts. The format is:
>     <regex>
>         <host>www.host1.com</host>
>         <host>host2.site2.com</host>
>         <pattern> my pattern here </pattern>
>         <substitution> my substitution here </substitution>
>     </regex>
> 3) Support for decoding URLs with escaped character encodings (e.g. %20, 
> etc.). This is useful, for example, to decode "jump redirects" which have the 
> target URL encoded within the source, as on Yahoo. I tried to create an 
> extensible notion of "options," the first of which is "unescape." The 
> unescape function is applied *after* the substitution and *only* if the 
> substitution pattern matches. A simple pattern to unescape Yahoo directory 
> redirects would be something like:
> <regex>
>   <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
>   <substitution>$1</substitution>
>   <options>unescape</options>
> </regex>
> 4) Added the notion of iterating the pattern chain. This is useful when the 
> result of a normalization can itself be normalized. While some of this can be 
> handled in the stock version by repeating patterns, or by careful ordering of 
> patterns, the notion of iterating is cleaner and more powerful. The chain is 
> defined to iterate only when the previous iteration changes the input, up to 
> a configurable maximum number of iterations. The config parameter to change 
> is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous 
> behavior). The change is performance-neutral when disabled, and has a 
> relatively small performance cost when enabled.
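
The iteration rule in point 4 can be illustrated with a minimal sketch. It is
not taken from betterRegexNorm.patch: the class and method names below
(IteratedChain, Rule, applyChain, normalize) are hypothetical, and only the
behaviour (re-run the chain while the previous pass changed the URL, up to a
configured maximum) follows the description above.

```java
import java.util.List;
import java.util.regex.Pattern;

final class IteratedChain {

    /** One pattern/substitution pair (hypothetical holder, not the patch's class). */
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    /** Runs every rule of the chain once, in order. */
    static String applyChain(String url, List<Rule> rules) {
        for (Rule rule : rules) {
            url = rule.pattern.matcher(url).replaceAll(rule.substitution);
        }
        return url;
    }

    /**
     * Repeats the chain only while the previous pass changed the URL,
     * and never more than maxIterations times. With maxIterations = 1
     * this is a single pass over the chain.
     */
    static String normalize(String url, List<Rule> rules, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            String next = applyChain(url, rules);
            if (next.equals(url)) {
                break; // nothing changed, so further passes cannot change anything
            }
            url = next;
        }
        return url;
    }
}
```

With a single rule that removes one "dir/../" segment per pass, a URL such as
http://example.com/a/b/../../c needs two passes to settle, which is the kind
of case where a limit above 1 helps.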
> Pardon any potentially unconventional Java on my part. I've got lots of C/C++ 
> search engine experience, but Nutch is my first large Java app. I welcome any 
> feedback, and hope this is useful.
> Doug
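
For readers who want a concrete picture of points 2 and 3, here is a rough
sketch of how host-scoped rules and the "unescape" option could fit together.
It is an illustration under assumptions, not the code in the attached patch:
HostScopedNormalizer, Rule, hostOf and unescape are made-up names, the host
check is a HashSet lookup (constant time on average, whatever the number of
hosts), and java.net.URLDecoder stands in for whatever decoding the patch
actually uses. Note that URLDecoder also turns "+" into a space and throws
IllegalArgumentException on malformed escapes, which may not be wanted for
every URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HostScopedNormalizer {

    /** One <regex> block: optional hosts, pattern, substitution, unescape option. */
    static final class Rule {
        final Set<String> hosts = new HashSet<>(); // empty set = applies to every host
        final Pattern pattern;
        final String substitution;
        final boolean unescape;

        Rule(String regex, String substitution, boolean unescape, String... hosts) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
            this.unescape = unescape;
            for (String h : hosts) {
                this.hosts.add(h.toLowerCase(Locale.ROOT));
            }
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void add(Rule rule) {
        rules.add(rule);
    }

    /** Applies the rules once, in the order they were added. */
    String normalize(String url) {
        for (Rule rule : rules) {
            // Host-scoped rule: skip unless the URL's host is in the rule's set.
            if (!rule.hosts.isEmpty() && !rule.hosts.contains(hostOf(url))) {
                continue;
            }
            Matcher m = rule.pattern.matcher(url);
            if (!m.find()) {
                continue; // no match: no substitution and no unescaping
            }
            url = m.replaceAll(rule.substitution);
            if (rule.unescape) {
                url = unescape(url); // decode only after a successful substitution
            }
        }
        return url;
    }

    /** Lower-cased host of the URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Percent-decodes the string as UTF-8 (an assumed encoding). */
    static String unescape(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```

In this sketch a rule with no hosts behaves like a stock rule, so an existing
configuration keeps working unchanged, in line with the backwards
compatibility mentioned in the description.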

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
