Just for reference, here's another nasty example. Different site, same
disorder:
https://www.bautzenerbote.de/250-jahre-dachrinnen-in-tschechien/

On Fri, 30 Aug 2024 at 11:40, Markus Jelsma <[email protected]> wrote:

> Hello,
>
> Using Tika and a custom ContentHandler we're parsing messy HTML into
> readable text. We limit the number of block-type elements, and of some
> line elements (li in this case), that we're willing to parse. This causes a
> document [1] to be skipped entirely, with no useful text extracted from it.
> The page lists thousands of articles, neatly wrapped in li's, in its
> header, so the 2k limit is reached and everything after it is skipped.
>
> Does anyone know of a clever trick to deal with this? Semantically there
> is nothing wrong with a page carrying a huge article listing, but
> delivering such HTML is of course not a very smart move; even my browser
> gets bogged down by it.
>
> Thanks,
> Markus
>
> [1]
> https://lobjectif.net/la-pratique-deliberee-au-dela-du-mythe-de-la-maitrise/
>
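One possible trick, sketched below with only the JDK's SAX classes (no Tika types; the class name, the MAX_LI constant, and its value are all made up for the demo): instead of aborting the parse once the li limit is hit, suppress character data inside the excess li's and keep extracting the rest of the page. With Tika the same counting idea could live in a decorator around the downstream ContentHandler. The demo feeds well-formed XHTML because plain SAX cannot handle messy HTML; the real pipeline would still rely on Tika's HTML parser for that.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class LiCapDemo {

    /** Keeps page text but drops character data inside li's beyond a cap. */
    static class LiCapHandler extends DefaultHandler {
        static final int MAX_LI = 2;   // tiny cap for the demo; the mail mentions 2k
        final StringBuilder text = new StringBuilder();
        int liSeen = 0;    // total li's encountered so far
        int liDepth = 0;   // >0 while inside an over-the-cap li

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("li".equals(qName) && ++liSeen > MAX_LI) {
                liDepth++;             // start suppressing text, but keep parsing
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("li".equals(qName)) {
                if (liDepth > 0) liDepth--;      // leaving a suppressed li
                else text.append('\n');          // a kept li ends a text line
            } else if ("p".equals(qName) && liDepth == 0) {
                text.append('\n');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (liDepth == 0) text.append(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><ul><li>one</li><li>two</li><li>three</li></ul>"
                    + "<p>Article body.</p></body></html>";
        LiCapHandler handler = new LiCapHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(html)), handler);
        System.out.print(handler.text);   // "three" is dropped, the paragraph survives
    }
}
```

The point of suppressing rather than aborting is that the per-element budget still bounds the output size, while the body text after the oversized listing is no longer lost.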
