Well, then use the parsechecker tool with -dumpText to check whether it works. Reuters is notoriously hard to parse, but Boilerpipe should be able to make sense of it!
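
One way to make that check concrete: run bin/nutch parsechecker -dumpText <url> once with tika.use_boilerpipe set to false and once with it set to true, save each text dump to a file, and compare the two. The script below is only a minimal sketch of that comparison; the function name diff_dumps, the file arguments, and the diff labels are assumptions made for illustration and are not part of Nutch.

```python
import difflib
import sys


def diff_dumps(off_text, on_text):
    """Return unified-diff lines between two saved parsechecker text dumps.

    off_text: text dumped with tika.use_boilerpipe=false
    on_text:  text dumped with tika.use_boilerpipe=true
    An empty list means the two dumps are identical, which would suggest
    that Boilerpipe is not being applied at all.
    """
    return list(
        difflib.unified_diff(
            off_text.splitlines(),
            on_text.splitlines(),
            fromfile="boilerpipe_off",  # hypothetical label
            tofile="boilerpipe_on",     # hypothetical label
            lineterm="",
        )
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (file names are placeholders): python compare_dumps.py off.txt on.txt
    with open(sys.argv[1], encoding="utf-8") as f_off:
        off = f_off.read()
    with open(sys.argv[2], encoding="utf-8") as f_on:
        on = f_on.read()
    changes = diff_dumps(off, on)
    if changes:
        print("\n".join(changes))
    else:
        print("No difference: the setting had no visible effect on the text.")
```

If the diff comes back empty, the setting is probably not reaching the parser that handles the page; if it shows the side panes being dropped, the extractor is working and only its quality on that site is in question.
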
-----Original message-----
> From:Joe Zhang <[email protected]>
> Sent: Wednesday 7th August 2013 20:44
> To: user <[email protected]>
> Subject: Re: Boilerplate removal
>
> I didn't test this time around, but I think I did do testing before...
>
> Anything could possibly go wrong? Anything else I can do?
>
>
> On Wed, Aug 7, 2013 at 11:30 AM, Markus Jelsma <[email protected]> wrote:
>
> > You are sure the patch works? You get different text output with
> > tika.use_boilerpipe enabled and disabled?
> >
> >
> > -----Original message-----
> > > From:Joe Zhang <[email protected]>
> > > Sent: Wednesday 7th August 2013 20:12
> > > To: user <[email protected]>
> > > Subject: Boilerplate removal
> > >
> > > I'm having the following in my nutchsite.xml. Yet the boilerplate removal
> > > isn't quite successful. A lot of webpages (from reputable sources such as
> > > reuters.com) come with sidepanes and other junks that were not removed.
> > > Any suggestions from the experts?
> > >
> > > <name>plugin.includes</name>
> > > <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > <description>Regular expression naming plugin directory names to
> > > include. Any plugin not matching this expression is excluded.
> > > In any case you need at least include the nutch-extensionpoints plugin. By
> > > default Nutch includes crawling just HTML and plain text via HTTP,
> > > and basic indexing and search plugins. In order to use HTTPS please enable
> > > protocol-httpclient, but be aware of possible intermittent problems with
> > > the underlying commons-httpclient library.
> > > </description>
> > > </property>
> > > <!-- tika properties to use BoilerPipe, according to Marcus Jelsma -->
> > > <property>
> > > <name>tika.use_boilerpipe</name>
> > > <value>true</value>
> > > </property>
> > > <property>
> > > <name>tika.boilerpipe.extractor</name>
> > > <value>ArticleExtractor</value>
> > > </property>
> > >
> >

