After applying the patch, I tried the following command:

bin/nutch parsechecker -dumpText http://indiatoday.intoday.in/story/google-unveils-android-4.3-jelly-bean-operating-system/1/296208.html

This produced the expected results, but when I run the crawler, I get ~98% errors while parsing. The error is:

"Unable to successfully parse content URL"

On Mon, Jul 29, 2013 at 4:53 PM, Markus Jelsma <[email protected]> wrote:

> Simple, only use parse-tika and patch with NUTCH-961.
> https://issues.apache.org/jira/browse/NUTCH-961
>
> Extractor algorithms are fixed, it is not possible to preanalyze a page
> and select an extractor accordingly.
>
> -----Original message-----
> > From: imran khan <[email protected]>
> > Sent: Monday 29th July 2013 11:25
> > To: [email protected]
> > Subject: Nutch HTML Parsers & tika-boilerpipe configuration
> >
> > Greetings,
> >
> > I am trying to understand the role/functionality of different html parsers
> > (parse-html and parse-tika) plugin in nutch 2.2.
> >
> > My plugin.includes has "parse-(html|tika)" and my parse-plugins.xml has
> >
> > <mimeType name="*">
> >   <plugin id="parse-tika" />
> > </mimeType>
> >
> > <mimeType name="text/html">
> >   <plugin id="parse-html" />
> > </mimeType>
> >
> > <mimeType name="application/xhtml+xml">
> >   <plugin id="parse-html" />
> > </mimeType>
> >
> > So does it mean for parsing html pages "parse-html" plugin would be used?
> > And to use Tika for parsing my html pages I would simply replace it with
> > "parse-tika" plugin?
> >
> > And if I want to remove the boilerplate text like menu, ads text etc. from
> > my 'content' field in nutch then I guess I have to use Tika with boilerpipe?
> >
> > Where can I configure nutch to use boilerpipe with Tika and other
> > extractors? And is there any configuration in Tika/boilerpipe which would
> > automatically pick the right extractor for Tika for current Html page?
> >
> > Regards,
> > Imran

--
Thanks & Regards,
Saravanakumar Karunanithi

