Exclude HTML elements from Crawl

Fritsch, Michael Thu, 21 Sep 2023 06:48:04 -0700

Hello,

I use Nutch 1.18 with the parse-html plugin to crawl our documentation. Each 
page contains elements such as TOCs that should not be included.
I am aware of https://issues.apache.org/jira/browse/NUTCH-585 and have applied 
one of the patches.


However, I wonder whether there is already a built-in option to exclude HTML 
elements (such as a div with a given id or class, or other elements such as 
header).
I also do not understand why this small patch has not yet been added to 
Nutch. Are there drawbacks?
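
To make the request concrete, here is a minimal sketch of the kind of 
exclusion meant above. It is not Nutch's API and not the NUTCH-585 patch: the 
class ElementExcluder and its methods are hypothetical names, and the sketch 
assumes only that the parsed page is available as an org.w3c.dom tree. It 
removes every element whose tag name, id, or one of its classes is on an 
exclusion list before the text is extracted. The demo helper parses 
well-formed XHTML with the JDK's XML parser purely for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: prunes unwanted elements from a DOM tree.
final class ElementExcluder {
    private final Set<String> tags;    // lower-case tag names, e.g. "header"
    private final Set<String> ids;     // values of the id attribute, e.g. "toc"
    private final Set<String> classes; // single class names, e.g. "sidebar"

    ElementExcluder(Set<String> tags, Set<String> ids, Set<String> classes) {
        this.tags = tags;
        this.ids = ids;
        this.classes = classes;
    }

    // True if the element matches any exclusion rule.
    boolean matches(Element e) {
        if (tags.contains(e.getTagName().toLowerCase(Locale.ROOT))) {
            return true;
        }
        if (ids.contains(e.getAttribute("id"))) {
            return true;
        }
        // getAttribute returns "" when the attribute is absent.
        for (String c : e.getAttribute("class").trim().split("\\s+")) {
            if (classes.contains(c)) {
                return true;
            }
        }
        return false;
    }

    // Recursively removes matching elements (with their subtrees) below node.
    void prune(Node node) {
        // Snapshot the children first, because removal changes the sibling chain.
        List<Node> children = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
        }
        for (Node c : children) {
            if (c instanceof Element && matches((Element) c)) {
                node.removeChild(c);
            } else {
                prune(c);
            }
        }
    }

    // Demo only: parses well-formed XHTML, prunes it, and returns the remaining text.
    static String textAfterPruning(String xhtml, ElementExcluder excluder) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            excluder.prune(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent()
                    .trim().replaceAll("\\s+", " ");
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse input", ex);
        }
    }
}
```

With the exclusion lists {header}, {toc} and {sidebar}, a page consisting of 
a header, a div with id "toc", a div with class "nav sidebar" and one 
paragraph would be reduced to the paragraph text only.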

Regards,
Michael

Dr. Michael Fritsch
Technical Editor

Elevate Experience. Drive Impact.

E-Mail: [email protected]<mailto:[email protected]>
Phone: +49 (0) 40 325 587 0
www.coremedia.com<https://www.coremedia.com/>
--------------------------------------------------------------------------------
CoreMedia GmbH
Rödingsmarkt 9, 20459 Hamburg, Germany
Managing Director: Sören Stamer
Commercial Register: Amtsgericht Hamburg, HRB 162480
--------------------------------------------------------------------------------
