Hi,
Can you please open a JIRA issue on
https://issues.apache.org/jira/browse/NUTCH and include a URL which can be
used to reproduce the problem?
Thanks
Julien
On 9 July 2014 14:37, Jonathan Cooper-Ellis wrote:
> Hello Julien,
>
> Thanks for the reply. Unfortunately, undoing the changes I ma
Hello Julien,
Thanks for the reply. Unfortunately, undoing the changes I made to
parse-plugins.xml and only removing parse-html from plugin.includes does
not fix the double indexing issue. It might also be worth mentioning that
this also happens on a fresh version of Nutch 1.8, without using
Bo
Hi Jonathan
You shouldn't need to modify parse-plugins.xml to parse HTML docs with
Tika: just remove parse-html from plugin.includes in nutch-site.xml.
Could you please try that instead and see if that fixes your problem?
Thanks
Julien
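
For reference, the change suggested above could look roughly like this in
nutch-site.xml. This is a sketch only: the plugin list is assumed to resemble
the stock Nutch 1.x default and is not taken from the thread, so adapt it to
whatever your own plugin.includes already contains.

```xml
<!-- nutch-site.xml (sketch; the plugin list below is an assumption) -->
<property>
  <name>plugin.includes</name>
  <!-- The usual parse-(html|tika) entry is narrowed to parse-tika,
       so parse-html is no longer loaded and Tika handles HTML. -->
  <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Any metatag-related plugins already enabled in your setup would stay in the
value as they are; only the parse-html entry is dropped.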
On 8 July 2014 19:41, Jonathan Cooper-Ellis wrote:
>
Hello,
I'm using Nutch 1.8 and trying to index HTML metadata in Solr. I followed
the steps for parsing metatags and had no issues while using parse-html for
parsing HTML. The problem arises when I modify parse-plugins.xml to parse
HTML docs with Tika. When Tika parses the doc and plugin.includes h