[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features

2013-05-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-410:
--

Fix Version/s: 1.8

 Faster RegexNormalize with more features
 

 Key: NUTCH-410
 URL: https://issues.apache.org/jira/browse/NUTCH-410
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
 Environment: Tested on MacOS X 10.4.7/10.4.8
Reporter: Doug Cook
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: betterRegexNorm.patch


 The patch associated with this is backwards-compatible and has several 
 improvements over the stock 0.8 RegexURLNormalizer:
 1) About a 34% performance improvement, from only executing the superclass 
 (BasicURLNormalizer) once in most cases, instead of twice as the stock 
 version did. 
 2) Support for expensive host-specific normalizations with good performance. 
 Each regex block optionally takes a list of hosts to which to apply the 
 associated regex. If supplied, the regex will only be applied to these hosts. 
 This should have scalable performance; the comparison is O(1) regardless of 
 the number of hosts. The format is:
 <regex>
   <host>www.host1.com</host>
   <host>host2.site2.com</host>
   <pattern> my pattern here </pattern>
   <substitution> my substitution here </substitution>
 </regex>
 3) Support for decoding URLs with escaped character encodings (e.g. %20). 
 This is useful, for example, to decode jump redirects which have the 
 target URL encoded within the source, as on Yahoo. I tried to create an 
 extensible notion of options, the first of which is unescape. The 
 unescape function is applied *after* the substitution and *only* if the 
 substitution pattern matches. A simple pattern to unescape Yahoo directory 
 redirects would be something like:
 <regex>
   <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
   <substitution>$1</substitution>
   <options>unescape</options>
 </regex>
 4) Added the notion of iterating the pattern chain. This is useful when the 
 result of a normalization can itself be normalized. While some of this can be 
 handled in the stock version by repeating patterns, or by careful ordering of 
 patterns, the notion of iterating is cleaner and more powerful. The chain is 
 defined to iterate only when the previous iteration changes the input, up to 
 a configurable maximum number of iterations. The config parameter to change 
 is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous 
 behavior). The change is performance-neutral when disabled, and has a 
 relatively small performance cost when enabled.
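 As a rough illustration of points 2 to 4, and not the code in 
 betterRegexNorm.patch, the sketch below shows one way host-scoped rules, the 
 unescape option, and bounded iteration of the chain could fit together. The 
 class RegexChainSketch, its Rule type, and the normalize, decode, and hostOf 
 methods are hypothetical names chosen for this example. It also assumes the 
 host is read once per pass and that unescaping means UTF-8 percent-decoding 
 via java.net.URLDecoder; the patch may do either differently.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: all names here are hypothetical, not taken from the patch.
class RegexChainSketch {

    /** One regex block: pattern, substitution, optional unescape, optional hosts. */
    static final class Rule {
        final Pattern pattern;
        final String substitution;
        final boolean unescape;
        final Set<String> hosts; // empty set means "apply to every host"

        Rule(String pattern, String substitution, boolean unescape, String... hosts) {
            this.pattern = Pattern.compile(pattern);
            this.substitution = substitution;
            this.unescape = unescape;
            this.hosts = new HashSet<>(Arrays.asList(hosts));
        }

        /** Hash lookup, so the cost does not grow with the number of hosts. */
        boolean appliesTo(String host) {
            return hosts.isEmpty() || hosts.contains(host);
        }

        String apply(String url) {
            Matcher m = pattern.matcher(url);
            if (!m.find()) {
                return url; // no match: no substitution and no unescape
            }
            String out = m.replaceAll(substitution);
            return unescape ? decode(out) : out;
        }
    }

    /** Percent-decodes as UTF-8. Note: URLDecoder also turns '+' into a space. */
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Returns the host of the URL, or "" if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? "" : host;
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    /** Runs the whole chain up to maxIterations times, stopping once a pass changes nothing. */
    static String normalize(List<Rule> rules, String url, int maxIterations) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String host = hostOf(current); // assumption: host is re-read once per pass
            String next = current;
            for (Rule rule : rules) {
                if (rule.appliesTo(host)) {
                    next = rule.apply(next);
                }
            }
            if (next.equals(current)) {
                break; // nothing changed, so further passes would be no-ops
            }
            current = next;
        }
        return current;
    }
}
```

 For example, with a single rule that replaces "/./" by "/", the input 
 http://a.com/././x needs two passes: a maxIterations of 1 leaves 
 http://a.com/./x, while a value of 3 yields http://a.com/x and stops early 
 on the third pass because nothing changed.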
 Pardon any potentially unconventional Java on my part. I've got lots of C/C++ 
 search engine experience, but Nutch is my first large Java app. I welcome any 
 feedback, and hope this is useful.
 Doug

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-410:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 Faster RegexNormalize with more features
 

 Key: NUTCH-410
 URL: https://issues.apache.org/jira/browse/NUTCH-410
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
 Environment: Tested on MacOS X 10.4.7/10.4.8
Reporter: Doug Cook
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: betterRegexNorm.patch


