[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1233.
----------------------------------
    Resolution: Fixed

Committed to trunk in revision 1730687.

> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch,
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt,
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be
> able to use them in Nutch, we need Tika to return the rel attribute value of
> each link, which it currently doesn't. There's a patch for Tika 1.1. If that
> patch is included in Tika and we upgrade to that new version, this issue can
> be worked on. Here's preliminary code that does both Tika and current outlink
> extraction. This also includes parts of the Boilerpipe code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)