Hi Lewis,

Thanks for your support. The URL I attempted to parse is www.ab-advisory.com, a simple website running Drupal 7 with only HTML pages. In the past I also tried to crawl www.tripadvisor.com and www.nytime.com, and I ended up with the same result.

kr,
Arcondo

On Wed, Jan 9, 2013 at 12:22 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Arcondo,
>
> On Mon, Jan 7, 2013 at 10:12 PM, Arcondo Dasilva <[email protected]> wrote:
>
> > My question : why I can't use Tika to parse Html instead of Neko ? is it
> > possible to get ride of Neko or it is mandatory ?
>
> I would urge you to override the parsing logic in parse-plugins.xml [0]
> (which by default uses Tika to guess the Mimetype before assigning the
> correct plugin).
> You can do this like
>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
>
> Please note that you will have to rebuild Nutch from source once this is
> done OK.
>
> > The other weird thing with neko is when I dig into
> > nutch21/src/plugins/lib-nekohtml, there only build, ivy and plugin.xml with
> > no src folder with java classes whereas the others plugins having them. is
> > it important ?
>
> Yes it is important, but it is not the root of the problem.
>
> > how it could be possible to build them if there aren't
> > present ?
>
> Because the legacy HTML parsing logic resides in parse-html, this then uses
> lib-nekohtml as a requirement, please see [1]
>
> I hope this overrides the problem, however there certainly seems to be a
> problem here. Can you pass the URL you are attempting to parse?
>
> Thank you
>
> Lewis
>
> [0] http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/parse-html/plugin.xml

