[ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1233:
---------------------------------

    Fix Version/s: 1.6
                       (was: 1.5)

20120304-push-1.6
                
> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1233-1.5-wip.patch
>
>
> Tika provides outlink extraction features that Nutch does not currently use.
> To use them in Nutch, we need Tika to return the rel attribute value of each
> link, which it currently does not. There is a patch for Tika 1.1. Once that
> patch is included in Tika and we upgrade to the new version, work on this
> issue can proceed. The attached preliminary code performs both Tika and
> current outlink extraction. It also includes parts of the Boilerpipe code.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
