Hi Hany,

As BlackIce said, there is an open issue at
https://issues.apache.org/jira/browse/NUTCH-585, specifically the
blacklist_whitelist_plugin patch. I'm not sure the patch can still be
applied directly to master (probably not), but it should give a good
general idea of how to write a custom plugin for removing specific HTML
nodes from the crawl.
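
To make the idea a bit more concrete, here is a rough sketch of the node
removal itself. It is not taken from the NUTCH-585 patch and does not use
any Nutch API: it only uses the JDK's DOM classes, and the class name
NodeBlacklist, the BLACKLIST set and the visibleText helper are all made up
for illustration. In a real plugin I would expect similar logic to run on
the parsed DOM inside something like an HtmlParseFilter, but treat that as
an assumption and check it against the patch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeBlacklist {

    // Hypothetical blacklist; a real plugin would read this from configuration.
    static final Set<String> BLACKLIST = Set.of("header", "footer", "nav");

    /** Recursively detaches every element whose tag name is blacklisted. */
    static void prune(Node node) {
        // Copy the children first: removing nodes while walking the live
        // NodeList would shift indices and skip siblings.
        NodeList live = node.getChildNodes();
        List<Node> children = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            children.add(live.item(i));
        }
        for (Node child : children) {
            boolean blocked = child.getNodeType() == Node.ELEMENT_NODE
                    && BLACKLIST.contains(child.getNodeName().toLowerCase(Locale.ROOT));
            if (blocked) {
                node.removeChild(child); // drops the element and its whole subtree
            } else {
                prune(child);
            }
        }
    }

    /** Parses well-formed markup, prunes it, and returns the remaining text. */
    static String visibleText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        prune(doc.getDocumentElement());
        return doc.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><header>Site menu</header>"
                + "<div>Main article text</div>"
                + "<footer>Copyright</footer></body></html>";
        System.out.println(visibleText(page)); // prints "Main article text"
    }
}
```

The sketch only handles well-formed markup and matches on tag names; matching
on id or class attributes would follow the same pattern.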

Hope it helps,
Jorge

On Fri, Nov 16, 2018 at 10:30 AM BlackIce <blackice...@gmail.com> wrote:

> A while ago there was a plugin that let you specify which tags should be
> indexed or excluded from indexing. If I'm not mistaken, it was this:
>
>
> http://www.longconnections.com/blog/2015/6/3/using-apache-nutchsolr-to-build-a-search-engine-with-auto-complete-feature
>
> Good luck, and please let me know what you come up with. Thank you!
>
> On Fri, Nov 16, 2018 at 10:04 AM <hany.n...@hsbc.com> wrote:
>
> > Has anyone faced this requirement before?
> >
> > Kind regards,
> > Hany Shehata
> > Solutions Architect, Marketing and Communications IT
> > Corporate Functions | HSBC Operations, Services and Technology (HOST)
> > ul. Kapelanka 42A, 30-347 Kraków, Poland
> > __________________________________________________________________
> >
> > Tie line: 7148 7689 4698
> > External: +48 123 42 0698
> > Mobile: +48 723 680 278
> > E-mail: hany.n...@hsbc.com
> > __________________________________________________________________
> > Protect our environment - please only print this if you have to!
> >
> >
> > -----Original Message-----
> > From: Hany NASR
> > Sent: Thursday, November 15, 2018 4:18 PM
> > To: user@nutch.apache.org
> > Subject: RE: Block certain parts of HTML code from being indexed
> >
> > Hello Markus,
> >
> > What if I want to remove a specific component or page section?
> >
> > Kind regards,
> > Hany Shehata
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Wednesday, November 14, 2018 4:11 PM
> > To: user@nutch.apache.org
> > Subject: RE: Block certain parts of HTML code from being indexed
> >
> > Hello Hany,
> >
> > Using parse-tika as your HTML parser, you can enable Boilerpipe (see
> > nutch-default).
> >
> > Regards,
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From:hany.n...@hsbc.com <hany.n...@hsbc.com>
> > > Sent: Wednesday 14th November 2018 15:53
> > > To: user@nutch.apache.org
> > > Subject: Block certain parts of HTML code from being indexed
> > >
> > > Hello All,
> > >
> > > I am using Nutch 1.15 and am wondering whether there is a feature for
> > > blocking certain parts of the HTML code (header & footer) from being
> > > indexed.
> > >
> > > Kind regards,
> > > Hany Shehata
> > >
> > > -----------------------------------------
> > > SAVE PAPER - THINK BEFORE YOU PRINT!
> > >
> > > This E-mail is confidential.
> > >
> > > It may also be legally privileged. If you are not the addressee you
> > > may not copy, forward, disclose or use any part of it. If you have
> > > received this message in error, please delete it and all copies from
> > > your system and notify the sender immediately by return E-mail.
> > >
> > > Internet communications cannot be guaranteed to be timely secure, error
> > > or virus-free.
> > > The sender does not accept liability for any errors or omissions.
> > >
> >
> >
> >
> >
>
