Well then, use the parsechecker tool with -dumpText to check whether it works.
Reuters is notoriously hard to parse, but Boilerpipe should be able to make sense of it!
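
If it helps, here is a minimal sketch of how the two text dumps could be compared once you have them, one taken with tika.use_boilerpipe disabled and one with it enabled. The function name and the sample strings are made up for illustration and are not real parsechecker output; an empty result would suggest the setting has no effect on that page.

```python
import difflib


def boilerpipe_changed_output(text_without, text_with):
    """Return unified-diff lines between two text dumps.

    An empty list means the two dumps are identical, i.e. enabling
    Boilerpipe made no visible difference for this page.
    """
    return list(difflib.unified_diff(
        text_without.splitlines(),
        text_with.splitlines(),
        fromfile="boilerpipe_off",
        tofile="boilerpipe_on",
        lineterm="",
    ))


# Hypothetical sample dumps (assumption: not taken from a real crawl).
off = "Home | Markets | Sports\nCentral bank holds rates steady.\nMost read stories"
on = "Central bank holds rates steady."

diff = boilerpipe_changed_output(off, on)
if diff:
    print("\n".join(diff))
else:
    print("No difference: Boilerpipe is probably not being applied.")
```

Lines prefixed with "-" in the diff are text that only appears when Boilerpipe is off, which is roughly what you would expect for navigation and side panes.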

 
 
-----Original message-----
> From:Joe Zhang <[email protected]>
> Sent: Wednesday 7th August 2013 20:44
> To: user <[email protected]>
> Subject: Re: Boilerplate removal
> 
> I didn't test this time around, but I think I did do testing before...
> 
> Could anything else have gone wrong? Is there anything else I can do?
> 
> 
> On Wed, Aug 7, 2013 at 11:30 AM, Markus Jelsma
> <[email protected]> wrote:
> 
> > Are you sure the patch works? Do you get different text output with
> > tika.use_boilerpipe enabled and disabled?
> >
> >
> > -----Original message-----
> > > From:Joe Zhang <[email protected]>
> > > Sent: Wednesday 7th August 2013 20:12
> > > To: user <[email protected]>
> > > Subject: Boilerplate removal
> > >
> > > I have the following in my nutch-site.xml, yet the boilerplate removal
> > > isn't quite successful. A lot of webpages (from reputable sources such as
> > > reuters.com) come with side panes and other junk that was not removed.
> > > Any suggestions from the experts?
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >   <description>Regular expression naming plugin directory names to
> > >   include.  Any plugin not matching this expression is excluded.
> > >   In any case you need at least include the nutch-extensionpoints
> > >   plugin. By default Nutch includes crawling just HTML and plain text
> > >   via HTTP, and basic indexing and search plugins. In order to use HTTPS
> > >   please enable protocol-httpclient, but be aware of possible
> > >   intermittent problems with the underlying commons-httpclient library.
> > >   </description>
> > > </property>
> > > <!-- tika properties to use BoilerPipe, according to Markus Jelsma -->
> > > <property>
> > >   <name>tika.use_boilerpipe</name>
> > >   <value>true</value>
> > > </property>
> > > <property>
> > >   <name>tika.boilerpipe.extractor</name>
> > >   <value>ArticleExtractor</value>
> > > </property>
> > >
> >
> 
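
As a quick sanity check on the quoted plugin.includes value, the sketch below tests which plugin names the expression would accept. It assumes the expression has to match the whole plugin directory name (an assumption about how Nutch applies it, not something confirmed in this thread), and the helper name is made up.

```python
import re

# The plugin.includes value quoted in the message above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|metatags)|"
    "index-(basic|anchor|metadata|more)|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin name matches the expression.

    Assumption: full-string matching, so "protocol-httpclient" is not
    accepted just because it starts with "protocol-http".
    """
    return re.fullmatch(includes, plugin_id) is not None


for name in ("parse-tika", "parse-html", "protocol-httpclient"):
    print(name, is_included(name))
```

Under that assumption parse-tika is enabled by this value, so the tika.* properties at least have a parser to apply to.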
