Re: Nutch HTML Parsers & tika-boilerpipe configuration

Saravanakumar Karunanithi Mon, 29 Jul 2013 04:36:26 -0700

After applying the patch, I tried the following command:

bin/nutch parsechecker -dumpText http://indiatoday.intoday.in/story/google-unveils-android-4.3-jelly-bean-operating-system/1/296208.html

This produced the expected results, but when I run the crawler, I get
parsing errors for ~98% of the pages.


I get the following error:

"Unable to successfully parse content URL"
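
For reference, "only use parse-tika" would correspond to a parse-plugins.xml
mapping roughly like the following. This is only a sketch adapted from the
snippet quoted below, on the assumption that the rest of the file is
unchanged; presumably plugin.includes would then also list parse-tika
without parse-html.

```xml
<!-- Sketch only: route HTML and XHTML to parse-tika instead of parse-html. -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```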



On Mon, Jul 29, 2013 at 4:53 PM, Markus Jelsma
<[email protected]> wrote:

> Simple, only use parse-tika and patch with NUTCH-961.
> https://issues.apache.org/jira/browse/NUTCH-961
>
> Extractor algorithms are fixed; it is not possible to pre-analyze a page
> and select an extractor accordingly.
>
>
> -----Original message-----
> > From:imran khan <[email protected]>
> > Sent: Monday 29th July 2013 11:25
> > To: [email protected]
> > Subject: Nutch HTML Parsers & tika-boilerpipe configuration
> >
> > Greetings,
> >
> > I am trying to understand the role/functionality of the different HTML
> > parser plugins (parse-html and parse-tika) in Nutch 2.2.
> >
> > My plugin.includes has "parse-(html|tika) " and my parse-plugins.xml has
> >
> > <mimeType name="*">
> >   <plugin id="parse-tika" />
> > </mimeType>
> >
> > <mimeType name="text/html">
> >   <plugin id="parse-html" />
> > </mimeType>
> >
> > <mimeType name="application/xhtml+xml">
> >   <plugin id="parse-html" />
> > </mimeType>
> >
> > So does this mean the "parse-html" plugin would be used for parsing HTML
> > pages? And to use Tika for parsing my HTML pages, would I simply replace
> > it with the "parse-tika" plugin?
> >
> > And if I want to remove boilerplate text such as menus and ad text from
> > my 'content' field in Nutch, then I guess I have to use Tika with
> > boilerpipe?
> >
> > Where can I configure Nutch to use boilerpipe with Tika and other
> > extractors? And is there any configuration in Tika/boilerpipe which would
> > automatically pick the right extractor for the current HTML page?
> >
> > Regards,
> > Imran
> >
>



-- 
Thanks & Regards,
Saravanakumar Karunanithi
