[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505.patch

New patch. This is sort of a release candidate; if there are no objections, I think this patch can go in as it is.

The biggest change is that ParseData is no longer a Configurable. In the current implementation, when a parse data reaches ParseOutputFormat, it contains at most db.max.outlinks.per.page outlinks; ParseOutputFormat then filters them and outputs whatever remains. For example, if ignoreExternalLinks is true and the first hundred links (assuming db.max.outlinks.per.page is 100) are all external, no outlinks will be extracted, even if there are internal urls past the 100th outlink.

So, parse data now reads all outlinks, and ParseOutputFormat processes them and outputs at most db.max.outlinks.per.page outlinks (the resulting parse data also contains at most db.max.outlinks.per.page outlinks). I think this is a better approach, but it may be a bit slower.

Besides this change, the UrlValidator code is cleaned up and moved into the org.apache.nutch.net package. Also, outlinks are no longer normalized in ParseOutputFormat, since they are already normalized in Outlink.Outlink. There is no point in normalizing them twice.

> Outlink urls should be validated
> --------------------------------
>
>                 Key: NUTCH-505
>                 URL: https://issues.apache.org/jira/browse/NUTCH-505
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Priority: Minor
>         Attachments: NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation system that tests these urls and filters out garbage.
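
To illustrate the outlink-limit change described in the comment above, here is a minimal sketch of the two orderings (truncate then filter, versus filter then truncate). It is not the code from the patch: the class name, the method names, and the example urls are all hypothetical, and a plain Predicate stands in for whatever filtering ParseOutputFormat applies (for instance the ignoreExternalLinks check).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only: these names are hypothetical and are not Nutch classes.
class OutlinkLimitSketch {

    // Old ordering: keep only the first `max` links, then filter them.
    // Links that would pass the filter but sit past the cap are lost.
    static List<String> truncateThenFilter(List<String> links, int max, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(max, links.size());
        for (int i = 0; i < limit; i++) {
            String url = links.get(i);
            if (keep.test(url)) {
                out.add(url);
            }
        }
        return out;
    }

    // New ordering: consider every link, and stop once `max` have passed the filter.
    static List<String> filterThenTruncate(List<String> links, int max, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String url : links) {
            if (out.size() >= max) {
                break;
            }
            if (keep.test(url)) {
                out.add(url);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two external links come first, followed by three internal ones.
        List<String> links = List.of(
            "http://other.example/a",
            "http://other.example/b",
            "http://site.example/1",
            "http://site.example/2",
            "http://site.example/3");
        // Stand-in for an "ignore external links" filter.
        Predicate<String> internal = u -> u.startsWith("http://site.example/");

        System.out.println(truncateThenFilter(links, 2, internal)); // prints []
        System.out.println(filterThenTruncate(links, 2, internal)); // prints [http://site.example/1, http://site.example/2]
    }
}
```

With a cap of 2, the old ordering yields nothing because both of the first two links are external, while the new ordering still finds two internal links. The cost is that every extracted link must be examined until the cap is reached, which matches the "may be a bit slower" caveat.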