It is interesting that you acknowledge BP as the most effective, but you
also said you don't use it any more.


On Tue, Jun 11, 2013 at 3:08 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> In my opinion Boilerpipe is the most effective free and open source tool
> for the job :)
>
> It does require some patching (see linked issues) and manual upgrade to
> Boilerpipe 1.2.0.
>
> -----Original message-----
> > From:Joe Zhang <smartag...@gmail.com>
> > Sent: Tue 11-Jun-2013 21:19
> > To: user <user@nutch.apache.org>
> > Subject: Re: using Tika within Nutch to remove boiler plates?
> >
> > > So what in your opinion is the most effective way of removing boilerplate
> > > in Nutch crawls?
> >
> >
> > On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> >
> > > > Yes, Boilerpipe is complex and difficult to adapt. It also requires you to
> > > > preset an extraction algorithm which is impossible for us. I've created an
> > > > extractor instead that works for most pages and ignores stuff like news
> > > > overviews and major parts of homepages. It's also tightly coupled with our
> > > > date extractor (based on [1]) and language detector (based on LangDetect)
> > > > and image extraction.
> > >
> > > > In many cases Boilerpipe's ArticleExtractor will work very well, but date
> > > > extraction such as NUTCH-141 won't do the trick as it only works on
> > > > extracted text as a whole and does not consider page semantics.
> > >
> > > [1]: https://issues.apache.org/jira/browse/NUTCH-1414
> > >
> > > -----Original message-----
> > > > From:Joe Zhang <smartag...@gmail.com>
> > > > Sent: Tue 11-Jun-2013 18:06
> > > > To: user <user@nutch.apache.org>
> > > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > > >
> > > > > Any particular reason why you don't use boilerpipe any more? So what do you
> > > > > suggest as an alternative?
> > > >
> > > >
> > > > On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma
> > > > > <markus.jel...@openindex.io> wrote:
> > > >
> > > > > > We don't use Boilerpipe anymore, so there is no point in sharing. Just set
> > > > > > those two configuration options in nutch-site.xml as
> > > > >
> > > > > >   <property>
> > > > > >     <name>tika.use_boilerpipe</name>
> > > > > >     <value>true</value>
> > > > > >   </property>
> > > > > >   <property>
> > > > > >     <name>tika.boilerpipe.extractor</name>
> > > > > >     <value>ArticleExtractor</value>
> > > > > >   </property>
> > > > >
> > > > > and it should work
> > > > >
> > > > > -----Original message-----
> > > > > > From:Joe Zhang <smartag...@gmail.com>
> > > > > > Sent: Tue 11-Jun-2013 01:42
> > > > > > To: user <user@nutch.apache.org>
> > > > > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > > > > >
> > > > > > > Markus, do you mind sharing a sample nutch-site.xml?
> > > > > >
> > > > > >
> > > > > > On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
> > > > > > > <markus.jel...@openindex.io> wrote:
> > > > > >
> > > > > > > > Those settings belong to nutch-site. Enable BP and set the correct
> > > > > > > > extractor and it should work just fine.
> > > > > > >
> > > > > > >
> > > > > > > -----Original message-----
> > > > > > > > From:Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> > > > > > > > Sent: Sun 09-Jun-2013 20:47
> > > > > > > > To: user@nutch.apache.org
> > > > > > > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > > > > > > >
> > > > > > > > Hi Joe,
> > > > > > > > > I've not used this feature, it would be great if one of the others could
> > > > > > > > > chime in here.
> > > > > > > > > From what I can infer from the correspondence on the issue, and the
> > > > > > > > > available patches, you should be applying the most recent one uploaded by
> > > > > > > > > Markus [0] as your starting point. This is dated as 22/11/2011.
> > > > > > > >
> > > > > > > > > On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang <smartag...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > One of the comments mentioned the following:
> > > > > > > > >
> > > > > > > > > tika.use_boilerpipe=true
> > > > > > > > > tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
> > > > > > > > >
> > > > > > > > > > Which part of the code is it referring to?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > You will see this included in one of the earlier patches uploaded by Markus
> > > > > > > > > on 11/05/2011 [1]
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > Also, within the current Nutch config, should I focus on parse-plugin.xml?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Look at the other patches and also Gabriele's comments. You will most likely
> > > > > > > > > need to alter something, but AFAICT the work has been done. It's just a case
> > > > > > > > > of pulling together several contributions.
> > > > > > > >
> > > > > > > > > Maybe you should look at the patch for 2.x (uploaded most recently by
> > > > > > > > > Roland) and see what is going on there.
> > > > > > > >
> > > > > > > > hth
> > > > > > > >
> > > > > > > > [0]
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
> > > > > > > > [1]
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
