Hi Arcondo,
On Mon, Jan 7, 2013 at 10:12 PM, Arcondo Dasilva
<[email protected]> wrote:
> My question: why can't I use Tika to parse HTML instead of Neko? Is it
> possible to get rid of Neko, or is it mandatory?
>
I would urge you to override the parsing logic in parse-plugins.xml [0]
(which by default uses Tika to guess the MIME type before assigning the
correct plugin).
You can do this as follows:
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
Please note that you will have to rebuild Nutch from source once this is
done.
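Before rebuilding, it may help to double-check which plugin each MIME type is mapped to. The snippet below is only a minimal sketch in Python (not part of Nutch): the SAMPLE string and the plugins_by_mime_type helper are hypothetical names made up for illustration, and the <parse-plugins> root element is assumed here. Point it at the contents of your own parse-plugins.xml instead of the sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the fragment above; a real parse-plugins.xml
# contains more entries. The <parse-plugins> root element is an assumption.
SAMPLE = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>
</parse-plugins>
"""


def plugins_by_mime_type(xml_text):
    """Return {MIME type name: [plugin ids in the order they are listed]}."""
    root = ET.fromstring(xml_text)
    return {
        mime.get("name"): [plugin.get("id") for plugin in mime.findall("plugin")]
        for mime in root.iter("mimeType")
    }


mapping = plugins_by_mime_type(SAMPLE)
for name, plugin_ids in mapping.items():
    print(name, "->", ", ".join(plugin_ids))
```

If text/html still lists parse-html first after your edit, the override has not been picked up.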
> The other weird thing with Neko is that when I dig into
> nutch21/src/plugins/lib-nekohtml, there are only build, ivy and plugin.xml, with
> no src folder with Java classes, whereas the other plugins have them. Is
> it important?
Yes, it is important, but it is not the root of the problem.
> How is it possible to build them if they aren't
> present?
>
The legacy HTML parsing logic resides in parse-html, which in turn uses
lib-nekohtml as a requirement. Please see [1].
I hope this works around the problem; however, there certainly seems to be a
problem here. Can you send the URL you are attempting to parse?
Thank you
Lewis
[0] http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
[1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/parse-html/plugin.xml