We don't use Boilerpipe anymore, so there's no point in sharing ours. Just set
these two configuration options in nutch-site.xml:

  <property>
    <name>tika.use_boilerpipe</name>
    <value>true</value>
  </property>
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>

and it should work.
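
For reference, a minimal complete nutch-site.xml could look roughly like the
sketch below. The <configuration> wrapper is the usual Hadoop-style layout of
that file; the two properties are the ones named above. This assumes a build
in which the Boilerpipe support from the NUTCH-961 patches is present, and any
other properties your crawl already needs are omitted here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Turn on Boilerpipe in the Tika parser (assumes NUTCH-961 is applied) -->
  <property>
    <name>tika.use_boilerpipe</name>
    <value>true</value>
  </property>
  <!-- Extractor to use; CanolaExtractor is the other value mentioned in this thread -->
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```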
 
-----Original message-----
> From:Joe Zhang <smartag...@gmail.com>
> Sent: Tue 11-Jun-2013 01:42
> To: user <user@nutch.apache.org>
> Subject: Re: using Tika within Nutch to remove boiler plates?
> 
> Markus, do you mind sharing a sample nutch-site.xml?
> 
> 
> On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> 
> > Those settings belong to nutch-site. Enable BP and set the correct
> > extractor and it should work just fine.
> >
> >
> > -----Original message-----
> > > From:Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> > > Sent: Sun 09-Jun-2013 20:47
> > > To: user@nutch.apache.org
> > > Subject: Re: using Tika within Nutch to remove boiler plates?
> > >
> > > Hi Joe,
> > > I've not used this feature, it would be great if one of the others could
> > > chime in here.
> > > From what I can infer from the correspondence on the issue, and the
> > > available patches, you should be applying the most recent one uploaded by
> > > Markus [0] as your starting point. This is dated as 22/11/2011.
> > >
> > > On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang <smartag...@gmail.com> wrote:
> > >
> > > >
> > > > One of the comments mentioned the following:
> > > >
> > > > tika.use_boilerpipe=true
> > > > tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
> > > >
> > > > Which part of the code is it referring to?
> > > >
> > > >
> > > You will see this included in one of the earlier patches uploaded by
> > > Markus on 11/05/2011 [1]
> > >
> > >
> > > >
> > > > Also, within the current Nutch config, should I focus on
> > > > parse-plugin.xml?
> > > >
> > > >
> > > Look at the other patches and also Gabriele's comments. You may most likely
> > > need to alter something, but AFAICT the work has been done; it's just a case
> > > of pulling together several contributions.
> > >
> > > Maybe you should look at the patch for 2.x (uploaded most recently by
> > > Roland) and see what is going on there.
> > >
> > > hth
> > >
> > > [0] https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
> > > [1] https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
> > >
> >
> 
