Re: Duplicate HTML Metadata When Parsed with Tika

2014-07-09 Thread Julien Nioche
Hi, Can you please open a JIRA issue on https://issues.apache.org/jira/browse/NUTCH and include a URL which can be used to reproduce the problem? Thanks Julien On 9 July 2014 14:37, Jonathan Cooper-Ellis wrote: > Hello Julien, > > Thanks for the reply. Unfortunately, undoing the changes I ma

Re: Duplicate HTML Metadata When Parsed with Tika

2014-07-09 Thread Jonathan Cooper-Ellis
Hello Julien, Thanks for the reply. Unfortunately, undoing the changes I made to parse-plugins.xml and only removing parse-html from plugin.includes does not fix the double indexing issue. It might also be worth mentioning that this happens on a fresh version of Nutch 1.8, without using Bo

Re: Duplicate HTML Metadata When Parsed with Tika

2014-07-09 Thread Julien Nioche
Hi Jonathan, You shouldn't need to modify parse-plugins.xml to parse HTML docs with Tika: just remove parse-html from plugin.includes in nutch-site.xml. Could you please try that instead and see if that fixes your problem? Thanks Julien On 8 July 2014 19:41, Jonathan Cooper-Ellis wrote: >
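
For readers following along, Julien's suggestion amounts to an edit roughly like the one below in nutch-site.xml. This is only a sketch: the plugin list shown is an assumption (stock-style defaults plus the parse-metatags and index-metadata plugins used for metatag indexing), not the value from Jonathan's actual configuration, so adapt it to whatever plugin.includes already contains in your setup.

```xml
<!-- nutch-site.xml (sketch; the plugin list below is an assumed example,
     not the original poster's configuration) -->
<property>
  <name>plugin.includes</name>
  <!-- parse-html is deliberately left out of the parse-(...) group,
       leaving parse-tika to handle HTML as described in the reply above -->
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With this approach parse-plugins.xml stays unmodified; the only change is which parser plugins are activated.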

Duplicate HTML Metadata When Parsed with Tika

2014-07-08 Thread Jonathan Cooper-Ellis
Hello, I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed the steps for parsing metatags and had no issues while using parse-html for parsing HTML. The problem arises when I modify parse-plugins.xml to parse HTML docs with Tika. When Tika parses the doc and plugin.includes h