Hi Hany,

The Tika parser supports Boilerpipe for header and footer removal, but I don't know how well it works. You can test it online at https://boilerpipe-web.appspot.com/
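
In case it helps, here is a minimal sketch of what enabling Boilerpipe might look like in conf/nutch-site.xml. I am assuming the property names as I remember them from nutch-default.xml in the 1.x line (tika.extractor and tika.extractor.boilerpipe.algorithm), so please check them against the nutch-default.xml shipped with your version. As far as I know it only takes effect when parse-tika is the plugin actually handling HTML, which means parse-tika has to be in plugin.includes and mapped to text/html in parse-plugins.xml.

```xml
<!-- conf/nutch-site.xml: sketch only, verify names against your nutch-default.xml -->
<property>
  <name>tika.extractor</name>
  <!-- "none" disables extraction; "boilerpipe" turns Boilerpipe on -->
  <value>boilerpipe</value>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- Assumed valid values: DefaultExtractor, ArticleExtractor, CanolaExtractor -->
  <value>ArticleExtractor</value>
</property>
```

Which algorithm gives the cleanest result will probably depend on your page layout, so it is worth trying a few sample pages before recrawling.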

> -----Original Message-----
> From: hany.n...@hsbc.com <hany.n...@hsbc.com>
> Sent: 14 November 2018 16:53
> To: user@nutch.apache.org
> Subject: Block certain parts of HTML code from being indexed
>
> Hello All,
>
> I am using Nutch 1.15, and wondering if there is a feature for blocking
> certain parts of HTML code from being indexed (header & footer).
>
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT Corporate Functions |
> HSBC Operations, Services and Technology (HOST)