Re: Parsing error : java.lang.NoClassDefFoundError: org/cyberneko/html/LostText

Lewis John Mcgibbney Tue, 08 Jan 2013 15:22:39 -0800

Hi Arcondo,

On Mon, Jan 7, 2013 at 10:12 PM, Arcondo Dasilva wrote:


> My question: why can't I use Tika to parse HTML instead of Neko? Is it
> possible to get rid of Neko, or is it mandatory?
>

I would urge you to override the parsing logic in parse-plugins.xml [0]
(which by default uses Tika to guess the MIME type before assigning the
correct plugin).
You can do this by mapping the HTML MIME types to parse-tika, like so:

<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>


Please note that you will have to rebuild Nutch from source once this is
done.
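Before rebuilding, it may be worth checking that the edited mapping really
says what you intend. The following is only a minimal sketch: the class and
method names (ParsePluginsCheck, firstPluginPerMimeType) are made up for
illustration and are not part of Nutch, and the XML string is a stand-in for
the relevant part of your own parse-plugins.xml, wrapped in an assumed
<parse-plugins> root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: lists which plugin id comes first
// for each <mimeType> entry in a parse-plugins.xml style document.
public class ParsePluginsCheck {

    // Returns mimeType name -> id of the first <plugin> element listed under it.
    static Map<String, String> firstPluginPerMimeType(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList mimeTypes = doc.getElementsByTagName("mimeType");
        for (int i = 0; i < mimeTypes.getLength(); i++) {
            Element mimeType = (Element) mimeTypes.item(i);
            NodeList plugins = mimeType.getElementsByTagName("plugin");
            if (plugins.getLength() > 0) {
                Element first = (Element) plugins.item(0);
                result.put(mimeType.getAttribute("name"), first.getAttribute("id"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the edited file content; the root element is an assumption.
        String xml =
              "<parse-plugins>"
            + "<mimeType name=\"text/html\"><plugin id=\"parse-tika\"/></mimeType>"
            + "<mimeType name=\"application/xhtml+xml\"><plugin id=\"parse-tika\"/></mimeType>"
            + "</parse-plugins>";
        firstPluginPerMimeType(xml)
            .forEach((mime, plugin) -> System.out.println(mime + " -> " + plugin));
    }
}
```

If text/html or application/xhtml+xml still shows parse-html here, the
override has not taken effect in the file you are editing.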


> The other weird thing with Neko is that when I dig into
> nutch21/src/plugins/lib-nekohtml, there are only build, ivy and plugin.xml
> files, with no src folder containing Java classes, whereas the other
> plugins have them. Is it important?


Yes it is important, but it is not the root of the problem.

> How is it possible to build them if they aren't
> present?
>

Because the legacy HTML parsing logic resides in parse-html, which in turn
requires lib-nekohtml as a dependency. Please see [1].

I hope this works around the problem; however, there certainly seems to be
an underlying issue here. Can you send the URL you are attempting to parse?
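In the meantime, a rough way to narrow down the NoClassDefFoundError is to
check whether the class named in the error can be loaded at all. This is a
sketch only: NekoClasspathCheck and isLoadable are made-up names, and the
check only sees the classloader it runs under. Nutch loads plugins through
its own plugin classloaders, so a result from a plain command-line run (with
the nekohtml jar you expect on the classpath) is a hint, not a diagnosis.

```java
// Hypothetical diagnostic, not part of Nutch: reports whether a class can be
// loaded by the classloader this program runs under.
public class NekoClasspathCheck {

    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class, run no static initializers.
            Class.forName(className, false, NekoClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised for missing dependencies.
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error in the subject of this thread.
        String name = "org.cyberneko.html.LostText";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```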

Thank you

Lewis

[0] http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
[1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/parse-html/plugin.xml

Re: Parsing error : java.lang.NoClassDefFoundError: org/cyberneko/html/LostText
