[Nutch-dev] [jira] Updated: (NUTCH-410) Faster RegexNormalize with more features

Doug Cook (JIRA) Wed, 29 Nov 2006 11:46:54 -0800

     [ http://issues.apache.org/jira/browse/NUTCH-410?page=all ]


Doug Cook updated NUTCH-410:
----------------------------

    Attachment: betterRegexNorm.patch

> Faster RegexNormalize with more features
> ----------------------------------------
>
>                 Key: NUTCH-410
>                 URL: http://issues.apache.org/jira/browse/NUTCH-410
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: Tested on MacOS X 10.4.7/10.4.8
>            Reporter: Doug Cook
>            Priority: Minor
>         Attachments: betterRegexNorm.patch
>
>
> The attached patch is backwards-compatible and makes several 
> improvements over the stock 0.8 RegexURLNormalizer:
> 1) About a 34% performance improvement, from executing the superclass 
> (BasicURLNormalizer) only once in most cases, instead of twice as the stock 
> version does. 
> 2) Support for expensive host-specific normalizations with good performance. 
> Each <regex> block optionally takes a list of hosts to which to apply the 
> associated regex. If supplied, the regex will only be applied to these hosts. 
> This should have scalable performance; the comparison is O(1) regardless of 
> the number of hosts. The format is:
>     <regex>
>         <host>www.host1.com</host>
>         <host>host2.site2.com</host>
>         <pattern> my pattern here </pattern>
>         <substitution> my substitution here </substitution>
>     </regex>
> 3) Support for decoding URLs with escaped character encodings (e.g. %20). 
> This is useful, for example, to decode "jump redirects" which have the 
> target URL encoded within the source, as on Yahoo. I tried to create an 
> extensible notion of "options," the first of which is "unescape." The 
> unescape function is applied *after* the substitution and *only* if the 
> substitution pattern matches. A simple pattern to unescape Yahoo directory 
> redirects would be something like:
> <regex>
>   <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
>   <substitution>$1</substitution>
>   <options>unescape</options>
> </regex>
> 4) Added the notion of iterating the pattern chain. This is useful when the 
> result of a normalization can itself be normalized. While some of this can be 
> handled in the stock version by repeating patterns, or by careful ordering of 
> patterns, the notion of iterating is cleaner and more powerful. The chain is 
> defined to iterate only when the previous iteration changes the input, up to 
> a configurable maximum number of iterations. The config parameter to change 
> is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous 
> behavior). The change is performance-neutral when disabled, and has a 
> relatively small performance cost when enabled.
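As a rough illustration of items 2 and 4 above, here is a minimal Java sketch of a rule chain with optional per-rule host lists and a bounded number of passes. It is not the code from betterRegexNorm.patch: the class RuleChainSketch, its Rule type, and the method names are hypothetical, and the host is passed in separately rather than parsed from the URL. Only the host/pattern/substitution fields and the maximum-iterations idea come from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch; class and method names are not taken from the patch.
final class RuleChainSketch {

    /** One <regex> block: optional host list, pattern, substitution. */
    static final class Rule {
        final Set<String> hosts;      // empty set = rule applies to every host
        final Pattern pattern;
        final String substitution;

        Rule(String pattern, String substitution, String... hosts) {
            this.hosts = new HashSet<>(Arrays.asList(hosts));
            this.pattern = Pattern.compile(pattern);
            this.substitution = substitution;
        }

        boolean appliesTo(String host) {
            // Hash lookup: the cost does not grow with the number of listed hosts.
            return hosts.isEmpty() || hosts.contains(host);
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final int maxIterations;  // plays the role of urlnormalizer.regex.maxiterations

    RuleChainSketch(int maxIterations) {
        this.maxIterations = maxIterations;
    }

    void add(Rule rule) {
        rules.add(rule);
    }

    /** Runs the whole chain, repeating only while the previous pass changed the URL. */
    String normalize(String host, String url) {
        String current = url;
        for (int pass = 0; pass < maxIterations; pass++) {
            String before = current;
            for (Rule rule : rules) {
                if (rule.appliesTo(host)) {
                    current = rule.pattern.matcher(current).replaceAll(rule.substitution);
                }
            }
            if (current.equals(before)) {
                break;                // nothing changed, so another pass cannot help
            }
        }
        return current;
    }

    public static void main(String[] args) {
        RuleChainSketch chain = new RuleChainSketch(5);
        // Removes one "segment/../" pair per match; nested pairs need another pass.
        chain.add(new Rule("/[^/]+/\\.\\./", "/"));
        System.out.println(chain.normalize("www.host1.com", "http://www.host1.com/a/b/../../c"));
    }
}
```

With a limit of 1 (the default described above), the example URL comes out as http://www.host1.com/a/../c, because one pass removes only the inner "b/../" pair. With a limit of 5, the second pass removes the remaining pair, the third pass changes nothing, and the loop stops at http://www.host1.com/c.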
> Pardon any potentially unconventional Java on my part. I've got lots of C/C++ 
> search engine experience, but Nutch is my first large Java app. I welcome any 
> feedback, and hope this is useful.
> Doug
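
The "unescape" option in item 3 of the description can be pictured roughly as follows. This is a sketch under assumptions, not the patch itself: the class UnescapeSketch and its normalize method are hypothetical, java.net.URLDecoder is simply one standard way to decode percent-escapes and may not be what the patch uses, and the redirect URL in main is made up for illustration. The pattern is the one quoted in the description, with the XML entity &amp; written as a plain '&'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative, not taken from the patch.
final class UnescapeSketch {

    // Pattern from the description, with &amp; unescaped to '&' as an XML parser would do.
    static final Pattern YAHOO_REDIRECT =
        Pattern.compile("^http://[a-z\\.]*\\.yahoo\\.com/.*/\\*+(http[^&]+)");

    /** Substitutes first, then decodes percent-escapes only if the pattern matched. */
    static String normalize(String url) {
        Matcher m = YAHOO_REDIRECT.matcher(url);
        if (!m.find()) {
            return url;               // no match: escapes are left untouched
        }
        String substituted = m.replaceAll("$1");
        try {
            // URLDecoder follows form-encoding rules, so it also turns '+' into a space.
            return URLDecoder.decode(substituted, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Made-up redirect URL in the style the description mentions.
        String redirect = "http://rds.yahoo.com/S=1/K=foo/*http%3A//www.example.com/page%3Fid%3D1";
        System.out.println(normalize(redirect));
    }
}
```

Decoding only after a successful match keeps escapes in ordinary URLs intact, which matches the "*after* the substitution and *only* if the substitution pattern matches" wording in the description.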

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

