Hello, I use Nutch 1.18 to crawl our documentation with the parse-html plugin. Each page has elements, such as TOCs, that should not be included. I am aware of https://issues.apache.org/jira/browse/NUTCH-585 and have applied one of its patches.
However, I wonder whether there is already a built-in option to exclude HTML elements (such as a div with a given id or class, or other elements such as header). I also do not understand why this small patch has not already been added to Nutch. Are there drawbacks?

Regards,
Michael

Dr. Michael Fritsch
Technical Editor
CoreMedia GmbH