[ https://issues.apache.org/jira/browse/NUTCH-670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655513#action_12655513 ]
Todd Lipcon commented on NUTCH-670:
-----------------------------------

Turns out this is actually a bit trickier, if I'm understanding the code correctly. It looks like the feed plugin outputs parse data for each of the URLs listed in the feed rather than for the feed itself. In the reduce phase of parsing, however, multiple parse data entries for a single URL are reduced by simply picking the first. Therefore, if an RSS feed has an enclosure link, but the HTML version of the post is also in the index *without* that link, then the link may be lost.

I'm not entirely sure how to deal with this... any thoughts?

> feed plugin does not parse RSS2 enclosures
> ------------------------------------------
>
>                 Key: NUTCH-670
>                 URL: https://issues.apache.org/jira/browse/NUTCH-670
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Todd Lipcon
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The feed parser in plugins/feed does not count links found in RSS2
> "enclosure" tags as Outlinks.
> It's a pretty simple patch - SyndEntry has a getEnclosures call. I'll submit
> the patch tomorrow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
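
To make the concern in the comment concrete, here is a rough sketch of the difference between "pick the first" and merging outlinks when several parse results arrive for the same URL. Everything in it is hypothetical: the class name OutlinkMergeSketch, the methods pickFirst and mergeOutlinks, and the example.com URLs are stand-ins for illustration, and plain strings replace Nutch's real parse data and Outlink types. It is not the actual reducer code, and a union of outlinks is only one possible way to handle this, not a proposed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: plain strings stand in for Nutch's parse data and
// Outlink types. This is not the real reducer.
final class OutlinkMergeSketch {

    // Behaviour described in the comment: keep only the first parse result
    // seen for a URL and ignore the rest.
    static List<String> pickFirst(List<List<String>> outlinksPerParse) {
        if (outlinksPerParse.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(outlinksPerParse.get(0));
    }

    // One possible alternative (an assumption, not the project's decision):
    // take the union of all outlinks, keeping first-seen order and
    // dropping duplicates.
    static Set<String> mergeOutlinks(List<List<String>> outlinksPerParse) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> outlinks : outlinksPerParse) {
            merged.addAll(outlinks);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Outlinks found by parsing the HTML version of the post.
        List<String> fromHtml = Arrays.asList("http://example.com/about");
        // Outlinks found by the feed plugin for the same post URL,
        // including the enclosure link.
        List<String> fromFeed = Arrays.asList(
                "http://example.com/about",
                "http://example.com/episode1.mp3");

        List<List<String>> values = Arrays.asList(fromHtml, fromFeed);

        System.out.println("pick first: " + pickFirst(values));
        System.out.println("merged:     " + mergeOutlinks(values));
    }
}
```

If the HTML parse happens to come first, pickFirst never sees the enclosure link, while mergeOutlinks keeps it. Whether a merge like this is appropriate for the other fields of the parse data (title, metadata, status) is a separate question that this sketch does not address.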