[ https://issues.apache.org/jira/browse/NUTCH-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-363.
-------------------------------

Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Fetcher normalizes everything at least twice
> --------------------------------------------
>
> Key: NUTCH-363
> URL: https://issues.apache.org/jira/browse/NUTCH-363
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8
> Environment: OS X 10.4.7
> Reporter: Doug Cook
> Priority: Minor
> Fix For: 2.0
>
>
> New links are normalized twice by the fetcher:
> First in DOMContentUtils.getOutlinks, where the constructor
> Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.
> The second time is in ParseOutputFormat.write().
> For some URLs (e.g. those repeated on a page) a given URL may be normalized a
> number of times, but it is always normalized at least twice.
> For those of us with expensive normalizations, this is probably burning some
> CPU.
> I'd gladly fix this, but I'm not yet familiar enough with the code to know if
> there are some hidden assumptions which rely on this behavior.
> [A related note is that URLs are normalized *before* filtering; this is
> causing a lot of extra normalization as well. In general, filters may not be
> safe to run before normalization, but there is likely a class of them which
> are (filtering out .gif/.jpg etc). Perhaps the notion of a "pre-normalizer
> filter" would be a useful one?]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
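
To make the two ideas in the quoted report concrete (normalizing each URL only once, and running a cheap "pre-normalizer filter" before the expensive step), here is a minimal Java sketch. It is not Nutch code: the class OutlinkPipelineSketch and its methods are hypothetical names invented for illustration, the normalizer is passed in as a plain function, and the suffix list is just the .gif/.jpg example from the report. Whether a given filter is really safe to run on un-normalized URLs is exactly the open question raised above, so treat this as one possible shape under those assumptions rather than a fix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch; none of these names are Nutch classes. */
class OutlinkPipelineSketch {
    // Stand-in for an expensive URL normalizer (assumption: pure function of the raw URL).
    private final UnaryOperator<String> normalizer;
    // Remembers results so a URL repeated on a page is normalized only once.
    private final Map<String, String> cache = new HashMap<>();
    private int normalizeCalls = 0;

    OutlinkPipelineSketch(UnaryOperator<String> normalizer) {
        this.normalizer = normalizer;
    }

    /** Cheap check that does not need a normalized URL: reject image suffixes. */
    static boolean passesPreFilter(String rawUrl) {
        String lower = rawUrl.toLowerCase(Locale.ROOT);
        // Ignore any query string or fragment when looking at the suffix.
        int end = lower.length();
        int q = lower.indexOf('?');
        if (q >= 0) end = q;
        int h = lower.indexOf('#');
        if (h >= 0 && h < end) end = h;
        String path = lower.substring(0, end);
        return !(path.endsWith(".gif") || path.endsWith(".jpg"));
    }

    /** Runs the expensive normalizer at most once per distinct raw URL. */
    String normalizeOnce(String rawUrl) {
        if (!cache.containsKey(rawUrl)) {
            normalizeCalls++;
            cache.put(rawUrl, normalizer.apply(rawUrl));
        }
        return cache.get(rawUrl);
    }

    /** Pre-filter first, then normalize only the URLs that survive. */
    List<String> process(List<String> rawUrls) {
        List<String> out = new ArrayList<>();
        for (String raw : rawUrls) {
            if (passesPreFilter(raw)) {
                out.add(normalizeOnce(raw));
            }
        }
        return out;
    }

    int normalizeCalls() {
        return normalizeCalls;
    }
}
```

With a page that links to the same URL twice plus two images, the normalizer in this sketch runs once instead of four times. The cache is keyed on the raw string, so it only helps with exact repeats, and it would need to be scoped (per page or per task) to keep memory bounded.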