[ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016465#comment-13016465 ]

Gabriele Kahlout commented on NUTCH-967:
----------------------------------------

Julien, why doesn't your patch modify the tika-parse plugin.xml to use tika-parsers-0.9 instead of tika-parsers-0.7? When I make that change myself, I get the following exception, for both HTML and PDF documents:

Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
	at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
	at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)

Setting it back to 0.7 is enough to make it work again.

> Upgrade to Tika 0.9
> -------------------
>
>                 Key: NUTCH-967
>                 URL: https://issues.apache.org/jira/browse/NUTCH-967
>             Project: Nutch
>          Issue Type: Task
>          Components: parser
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-967-1.3.patch
>
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira