I didn't test it this time around, but I believe I tested it before. Is there anything that could have gone wrong? Anything else I can try?
On Wed, Aug 7, 2013 at 11:30 AM, Markus Jelsma <[email protected]> wrote:

> You are sure the patch works? You get different text output with
> tika.use_boilerpipe enabled and disabled?
>
> -----Original message-----
> > From: Joe Zhang <[email protected]>
> > Sent: Wednesday 7th August 2013 20:12
> > To: user <[email protected]>
> > Subject: Boilerplate removal
> >
> > I'm having the following in my nutchsite.xml. Yet the boilerplate removal
> > isn't quite successful. A lot of webpages (from reputable sources such as
> > reuters.com) come with sidepanes and other junks that were not removed. Any
> > suggestions from the experts?
> >
> > <name>plugin.includes</name>
> > <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > <description>Regular expression naming plugin directory names to
> > include. Any plugin not matching this expression is excluded.
> > In any case you need at least include the nutch-extensionpoints plugin. By
> > default Nutch includes crawling just HTML and plain text via HTTP,
> > and basic indexing and search plugins. In order to use HTTPS please enable
> > protocol-httpclient, but be aware of possible intermittent problems with
> > the underlying commons-httpclient library.
> > </description>
> > </property>
> > <!-- tika properties to use BoilerPipe, according to Marcus Jelsma -->
> > <property>
> > <name>tika.use_boilerpipe</name>
> > <value>true</value>
> > </property>
> > <property>
> > <name>tika.boilerpipe.extractor</name>
> > <value>ArticleExtractor</value>
> > </property>
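
The check Markus asks about (does the parsed text differ with tika.use_boilerpipe enabled and disabled?) could be done by diffing the text from two runs. Below is a minimal sketch, assuming the parsed text of the same page has already been saved from each run by whatever means is convenient. The function name boilerplate_removed and the sample strings are hypothetical and not part of Nutch or Tika.

```python
import difflib


def boilerplate_removed(text_disabled, text_enabled):
    """Return lines present with boilerpipe disabled but absent when enabled."""
    diff = difflib.ndiff(text_disabled.splitlines(), text_enabled.splitlines())
    # ndiff marks lines found only in the first input with a "- " prefix.
    return [line[2:] for line in diff if line.startswith("- ")]


# Hypothetical sample outputs standing in for the two saved parse results.
disabled = "Home | Markets | Login\nStocks rose on Wednesday.\nMost read stories"
enabled = "Stocks rose on Wednesday."

removed = boilerplate_removed(disabled, enabled)
if removed:
    print("Lines dropped when boilerpipe is enabled:")
    for line in removed:
        print("  " + line)
else:
    print("Outputs are identical; the boilerpipe setting may not be taking effect.")
```

If the two outputs turn out identical, that would suggest the setting is not being applied to the page at all, which is a different problem from the extractor missing some sidebars.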

