[ https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211126#comment-16211126 ]
Sebastian Nagel commented on NUTCH-2443:
----------------------------------------

Keep it simple for now, and open a new issue to work on it systematically?
Not missing any link means some work: there are many attributes where URLs appear.
Of course, only <a href=...> and <img src=...> are really frequent, see
https://gist.github.com/sebastian-nagel/ff4379f9e2115d3c922416d520274b86

> Extract links from the video tag with the parse-html plugin
> -----------------------------------------------------------
>
>                 Key: NUTCH-2443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2443
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, plugin
>    Affects Versions: 1.13
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Assignee: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>             Fix For: 1.14
>
>
> At the moment the {{parse-html}} plugin extracts links from the tags {{a, area,
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow
> extracting links to binary files (images), extracting links from the
> {{video}} tag should also be supported.
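
To illustrate the idea of a table of tags and the attributes that carry URLs, here is a minimal sketch in Python. It is not the parse-html plugin's code: the `LINK_ATTRS` table, the `LinkExtractor` class and the `extract_links` helper are hypothetical names chosen for this example, and the choice to also read `poster` on `video` and `src` on nested `source` elements is an assumption about what "links from the video tag" might cover.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical tag -> URL-attribute table (an assumption for illustration,
# not the actual parse-html configuration).
LINK_ATTRS = {
    "a": ("href",),
    "area": ("href",),
    "frame": ("src",),
    "iframe": ("src",),
    "script": ("src",),
    "link": ("href",),
    "img": ("src",),
    "video": ("src", "poster"),
    "source": ("src",),
}


class LinkExtractor(HTMLParser):
    """Collects (tag, absolute URL) pairs for the configured attributes."""

    def __init__(self, base_url, link_attrs=LINK_ATTRS):
        super().__init__()
        self.base_url = base_url
        self.link_attrs = link_attrs
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        wanted = self.link_attrs.get(tag, ())
        for name, value in attrs:
            if name in wanted and value:
                # Resolve relative URLs against the page URL.
                self.links.append((tag, urljoin(self.base_url, value)))


def extract_links(html, base_url):
    """Return all outlinks found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(page_html, page_url)` returns the links in document order. Supporting another attribute is then a one-line change to the table, which is one way the systematic follow-up suggested in the comment could be approached.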