Normalize Host during Generate
------------------------------
Key: NUTCH-253
URL: http://issues.apache.org/jira/browse/NUTCH-253
Project: Nutch
Type: New Feature
Components: fetcher
Versions: 0.8-dev
Reporter: Rod Taylor
Extend the URL Normalizer to allow normalization of the hostname during Generate.
By default, no rules are applied.
In short, this allows foo.bar.com, bif.baz.bar.com and bar.com to be counted as
the same host for generate.max.per.host if an appropriate regex is used.
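
As a rough illustration of the idea (not the code from the attached patch), the
sketch below collapses hostnames with a regex before counting them. The class
name, the rule and the replacement string are all hypothetical; the rule simply
keeps the last two labels of the hostname, which is too naive for suffixes such
as co.uk.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

public class HostNormalizerSketch {
    // Hypothetical rule: strip every leading label and keep "domain.tld".
    private static final Pattern RULE =
        Pattern.compile("^(?:[^.]+\\.)*([^.]+\\.[^.]+)$");
    private static final String REPLACEMENT = "$1";

    // Hosts that do not match the rule (e.g. "localhost") are only lower-cased.
    public static String normalizeHost(String host) {
        return RULE.matcher(host.toLowerCase(Locale.ROOT)).replaceFirst(REPLACEMENT);
    }

    // Counts entries per normalized host, the way a per-host limit would see them.
    public static Map<String, Integer> countPerHost(List<String> hosts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String host : hosts) {
            counts.merge(normalizeHost(host), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hosts =
            Arrays.asList("foo.bar.com", "bif.baz.bar.com", "bar.com", "example.org");
        System.out.println(countPerHost(hosts)); // prints {bar.com=3, example.org=1}
    }
}
```

With this rule all three bar.com hosts share a single counter, so a per-host
limit would apply to them as a group rather than to each one separately.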
Add "urlnormalizer-regex" to plugin.includes in nutch-site.xml in order to
enable it.
Since several modules now extend the urlnormalizer base, a "scope" parameter
within plugin.xml differentiates between the various urlnormalizer modules so
that the right module is selected for Generate.
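
The scope idea could be pictured roughly as follows. This is a minimal sketch
and not Nutch's plugin machinery: the registry class, its methods and the scope
names "generate_host" and "fetch" are assumptions made up for illustration.
Each normalizer is registered under a scope, and a caller only runs the
normalizers registered for the scope it asks for.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ScopedNormalizersSketch {
    // scope name -> normalizers registered for that scope, in registration order
    private final Map<String, List<UnaryOperator<String>>> byScope = new HashMap<>();

    public void register(String scope, UnaryOperator<String> normalizer) {
        byScope.computeIfAbsent(scope, k -> new ArrayList<>()).add(normalizer);
    }

    // Applies only the normalizers of the requested scope.
    // An unknown scope has no rules, so the value comes back unchanged.
    public String normalize(String scope, String value) {
        String result = value;
        List<UnaryOperator<String>> chain =
            byScope.getOrDefault(scope, Collections.emptyList());
        for (UnaryOperator<String> normalizer : chain) {
            result = normalizer.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        ScopedNormalizersSketch registry = new ScopedNormalizersSketch();
        // Hypothetical scope names; real ones would come from plugin.xml.
        registry.register("generate_host", h -> h.toLowerCase(Locale.ROOT));
        registry.register("generate_host", h -> h.replaceFirst("^www\\.", ""));
        registry.register("fetch", h -> h.trim());

        System.out.println(registry.normalize("generate_host", "WWW.Bar.com")); // prints bar.com
        System.out.println(registry.normalize("other", "WWW.Bar.com"));         // prints WWW.Bar.com
    }
}
```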