Hi Michael,

> I wonder if there is not already a built-in option to exclude HTML
> elements (like a div with a given id or class or other elements like header).

No, there isn't one so far.


> I know https://issues.apache.org/jira/browse/NUTCH-585

> I also do not understand why this little patch has not already been added to
> Nutch? Are there drawbacks?

Well, good question. Don't know. I'll have a look...


Maybe one comment: I definitely agree that it would be very useful to have a configurable way to strip undesired content (headers, footers, etc.) from the HTML-to-text extract. Ideally, it should be possible to use the full expressive power of CSS selectors for that.
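
To make the idea concrete, here is a minimal sketch in Python of excluding elements by tag name, id, or class before text extraction. It is not Nutch code and not the NUTCH-585 patch; the names FilteredTextExtractor and extract_text are hypothetical, and it supports only these three simple rules rather than full CSS selectors. It also assumes that non-void elements are explicitly closed and properly nested; real-world HTML would need a more forgiving parser.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FilteredTextExtractor(HTMLParser):
    """Collect text, skipping subtrees whose root matches an exclusion rule.

    Hypothetical helper for illustration only; not part of Nutch.
    """

    def __init__(self, tags=(), ids=(), classes=()):
        super().__init__(convert_charrefs=True)
        self.tags = set(tags)
        self.ids = set(ids)
        self.classes = set(classes)
        self.skip_depth = 0  # > 0 while inside an excluded subtree
        self.chunks = []

    def _excluded(self, tag, attrs):
        attr = dict(attrs)
        element_classes = set((attr.get("class") or "").split())
        return (tag in self.tags
                or attr.get("id") in self.ids
                or bool(element_classes & self.classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an excluded subtree, every nested element deepens it.
        if self.skip_depth or self._excluded(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside all excluded subtrees.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, **rules):
    """Return the whitespace-joined text of `html` minus excluded elements."""
    parser = FilteredTextExtractor(**rules)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, extract_text(page, tags=["header"], ids=["toc"], classes=["nav"]) would drop any header element, the element with id "toc", and every element carrying the class "nav", together with everything nested inside them.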


Thanks for the suggestion and for reminding us! Nutch is a community project and
any contribution is welcome and appreciated!


Best,
Sebastian


On 9/21/23 15:46, Fritsch, Michael wrote:
Hello,

I use Nutch 1.18 to crawl our documentation with the parse-html plugin. Each page has elements like TOCs which should not be included.

I know https://issues.apache.org/jira/browse/NUTCH-585 and included one of the patches.

However, I wonder if there is not already a built-in option to exclude HTML elements (like a div with a given id or class or other elements like header).

I also do not understand why this little patch has not already been added to Nutch? Are there drawbacks?

Regards,

Michael

Dr. Michael Fritsch
Technical Editor

E-Mail: michael.frit...@coremedia.com

Phone: +49 (0) 40 325 587 0
www.coremedia.com

--------------------------------------------------------------------------------

CoreMedia GmbH

Rödingsmarkt 9, 20459 Hamburg, Germany

Managing Director: Sören Stamer

Commercial Register: Amtsgericht Hamburg, HRB 162480

--------------------------------------------------------------------------------
