[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148632#comment-15148632 ]

Hudson commented on NUTCH-1233:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3346 (See [https://builds.apache.org/job/Nutch-trunk/3346/])
NUTCH-1233 Rely on Tika for outlink extraction (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1730687])
* trunk/CHANGES.txt
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch, NUTCH-1233.patch, NUTCH-1233.patch, post-1233-2.txt, 
> post-1233.txt, pre-1233-2.txt, pre-1233.txt
>
>
> Tika provides outlink extraction features that Nutch does not use. To use 
> them in Nutch, we need Tika to return the rel attribute value of each link, 
> which it currently doesn't. There's a patch for Tika 1.1. Once that patch is 
> included in Tika and we have upgraded to that new version, this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. It also includes parts of the Boilerpipe code.
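
The description above says the rel value of each link is needed but does not spell out why. A plausible reason, stated here as an assumption, is that the crawler has to skip links marked rel="nofollow". The sketch below illustrates only that filtering step with plain Java. ExtractedLink, isNoFollow and followableTargets are hypothetical names invented for this example; they are neither Tika nor Nutch API and are not taken from the attached patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RelAwareOutlinks {

    /** Hypothetical stand-in for a link reported by a parser: target URI plus raw rel value. */
    public static final class ExtractedLink {
        final String uri;
        final String rel; // null or empty when the anchor carries no rel attribute

        public ExtractedLink(String uri, String rel) {
            this.uri = uri;
            this.rel = rel;
        }
    }

    /** Returns true if the whitespace-separated rel value contains the token "nofollow" (case-insensitive). */
    public static boolean isNoFollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** Keeps the targets of all links not marked nofollow, preserving input order. */
    public static List<String> followableTargets(List<ExtractedLink> links) {
        List<String> targets = new ArrayList<>();
        for (ExtractedLink link : links) {
            if (!isNoFollow(link.rel)) {
                targets.add(link.uri);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<ExtractedLink> links = new ArrayList<>();
        links.add(new ExtractedLink("http://example.org/a", null));
        links.add(new ExtractedLink("http://example.org/b", "nofollow"));
        links.add(new ExtractedLink("http://example.org/c", "external NoFollow"));
        links.add(new ExtractedLink("http://example.org/d", "author"));
        System.out.println(followableTargets(links));
    }
}
```

Without the rel value, the filter in followableTargets has nothing to test, which is consistent with the issue being blocked until the parser reports that attribute.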



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
