Hi Michael,
> I wonder if there is not already a built-in option to exclude HTML
> elements (like a div with a given id or class or other elements like header).
No, there isn't one so far.
> I know https://issues.apache.org/jira/browse/NUTCH-585
> I also do not understand why this little patch has not already been added to
> Nutch? Are there drawbacks?
Well, good question. Don't know. I'll have a look...
Maybe one comment: I definitely agree that it would be very useful to have a
configurable way to strip undesired content (headers, footers, etc.) from the
HTML-to-text extract. Ideally, it should be possible to use the full
expressive power of CSS selectors for that.
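To make the idea concrete, here is a minimal sketch of such a filter. It is
written in Python with only the standard library for brevity. It is not Nutch
code and not the NUTCH-585 patch, and the names ExcludingTextExtractor and
extract_text are made up for illustration. It supports only tag names, ids and
classes (far less than full CSS selectors) and assumes reasonably well-formed
markup where every non-void element is closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ExcludingTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping excluded elements."""

    def __init__(self, tags=(), ids=(), classes=()):
        super().__init__(convert_charrefs=True)
        self.tags = set(tags)
        self.ids = set(ids)
        self.classes = set(classes)
        self.skip_depth = 0   # > 0 while inside an excluded element
        self.chunks = []

    def _is_excluded(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.tags or attrs.get("id") in self.ids:
            return True
        element_classes = (attrs.get("class") or "").split()
        return any(c in self.classes for c in element_classes)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an excluded element, every nested element deepens the skip.
        if self.skip_depth or self._is_excluded(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, tags=(), ids=(), classes=()):
    """Return the text of `html` without the excluded elements."""
    parser = ExcludingTextExtractor(tags, ids, classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<html><body><header>Site header</header>'
    '<div id="toc"><ul><li>Intro</li></ul></div>'
    '<div class="nav side">Menu<br>More</div>'
    '<p>Actual <b>content</b> here.</p></body></html>'
)
# prints: Actual content here.
print(extract_text(sample, tags=["header"], ids=["toc"], classes=["nav"]))
```

In Nutch the equivalent logic would have to operate on the DOM that parse-html
produces, with the exclusion list read from configuration; the sketch only
shows the filtering step itself.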
Thanks for the suggestion and for reminding us! Nutch is a community project and
any contribution is welcome and appreciated!
Best,
Sebastian
On 9/21/23 15:46, Fritsch, Michael wrote:
Hello,
I use Nutch 1.18 to crawl our documentation with the parse-html plugin. Each
page has elements like TOCs which should not be included.
I know https://issues.apache.org/jira/browse/NUTCH-585 and included one of the
patches.
However, I wonder if there is not already a built-in option to exclude HTML
elements (like a div with a given id or class or other elements like header).
I also do not understand why this little patch has not already been added to
Nutch? Are there drawbacks?
Regards,
Michael
Dr. Michael Fritsch
Technical Editor
*Elevate Experience. Drive Impact.*
E-Mail: michael.frit...@coremedia.com
Phone: +49 (0) 40 325 587 0
https://www.coremedia.com/
--------------------------------------------------------------------------------
CoreMedia GmbH
Rödingsmarkt 9, 20459 Hamburg, Germany
Managing Director: Sören Stamer
Commercial Register: Amtsgericht Hamburg, HRB 162480
--------------------------------------------------------------------------------