Re: Boilerplate removal

Joe Zhang Wed, 07 Aug 2013 11:45:20 -0700

I didn't test this time around, but I think I did do testing before...

What could possibly be going wrong? Is there anything else I can do?



On Wed, Aug 7, 2013 at 11:30 AM, Markus Jelsma
<[email protected]> wrote:

> You are sure the patch works? You get different text output with
> tika.use_boilerpipe enabled and disabled?
>
>
> -----Original message-----
> > From: Joe Zhang <[email protected]>
> > Sent: Wednesday 7th August 2013 20:12
> > To: user <[email protected]>
> > Subject: Boilerplate removal
> >
> > I have the following in my nutch-site.xml, yet the boilerplate removal
> > isn't quite successful. Many webpages (from reputable sources such as
> > reuters.com) come with side panes and other junk that were not removed.
> > Any suggestions from the experts?
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>Regular expression naming plugin directory names to
> >   include.  Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin. By
> >   default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please enable
> >   protocol-httpclient, but be aware of possible intermittent problems with the
> >   underlying commons-httpclient library.
> >   </description>
> > </property>
> > <!-- tika properties to use BoilerPipe, according to Markus Jelsma -->
> > <property>
> >   <name>tika.use_boilerpipe</name>
> >   <value>true</value>
> > </property>
> > <property>
> >   <name>tika.boilerpipe.extractor</name>
> >   <value>ArticleExtractor</value>
> > </property>
> >
>
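
One way to run the check Markus asks about (does the extracted text actually differ with tika.use_boilerpipe enabled and disabled?) is to save the parsed text of the same page from both runs and compare them. The following is only a rough sketch under stated assumptions: the function name compare_extractions and the sample strings are made up for illustration, and it assumes you have already obtained the two text outputs as plain strings by whatever means you normally inspect parse text.

```python
import difflib


def compare_extractions(text_without: str, text_with: str):
    """Compare parse text from two runs of the same page.

    text_without: text extracted with tika.use_boilerpipe=false (assumed input)
    text_with:    text extracted with tika.use_boilerpipe=true  (assumed input)

    Returns (similarity ratio in [0, 1], lines that appear only in the
    non-boilerpipe text, i.e. what boilerplate removal dropped).
    """
    # Normalise to non-empty, stripped lines so whitespace noise is ignored.
    a = [ln.strip() for ln in text_without.splitlines() if ln.strip()]
    b = [ln.strip() for ln in text_with.splitlines() if ln.strip()]

    # Line-level similarity; 1.0 means the two outputs are identical.
    ratio = difflib.SequenceMatcher(None, a, b).ratio()

    kept = set(b)
    removed = [ln for ln in a if ln not in kept]
    return ratio, removed


if __name__ == "__main__":
    # Hypothetical sample texts, not real crawl output.
    before = "Home | World | Business\nHeadline\nBody paragraph.\nRelated links"
    after = "Headline\nBody paragraph."
    ratio, removed = compare_extractions(before, after)
    print(f"similarity: {ratio:.2f}")
    for line in removed:
        print("removed:", line)
```

If the ratio comes back as 1.0 and nothing is listed as removed, the two runs produced the same text, which would suggest the boilerpipe setting is not taking effect for that page.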
