Just for reference, here's another nasty example. Different site, same disorder: https://www.bautzenerbote.de/250-jahre-dachrinnen-in-tschechien/
On Fri, 30 Aug 2024 at 11:40, Markus Jelsma <[email protected]> wrote:
> Hello,
>
> Using Tika and a custom ContentHandler we're parsing messy HTML into
> readable text. We have a limit on the number of block-type elements, and
> some line elements (li in this case), we're willing to parse. This causes
> a document [1] to be skipped entirely, with no useful text extracted from
> it. The page has thousands of articles neatly listed in li's in its
> header, so the limit of 2k is reached and everything else is skipped.
>
> Does anyone know of some clever tricks to deal with this? Semantically
> there is nothing wrong with the page having a huge article listing, but
> of course it is not a very smart move to deliver such HTML; even my
> browsers get bogged down by it.
>
> Thanks,
> Markus
>
> [1]
> https://lobjectif.net/la-pratique-deliberee-au-dela-du-mythe-de-la-maitrise/
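One possible trick, sketched below with plain JDK SAX (no Tika dependency; Tika's ContentHandler is a SAX handler, so the same idea carries over): instead of a global block-element limit that aborts everything once it's hit, cap only how much text the repeated elements may contribute, and keep parsing so the article body after the oversized list still comes through. The tag name (`li`) and cap value here are illustrative assumptions, not Tika's actual configuration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Skips text inside <li> elements once a cap is reached, but keeps
// parsing so text after the oversized list is still extracted.
public class CappedHandler extends DefaultHandler {
    private final int liCap;  // max number of li's whose text we keep (assumed knob)
    private int liSeen = 0;   // li's encountered so far
    private int liDepth = 0;  // > 0 while we are inside an li
    final StringBuilder text = new StringBuilder();

    public CappedHandler(int liCap) { this.liCap = liCap; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("li".equals(qName)) { liSeen++; liDepth++; }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("li".equals(qName)) liDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Drop text only while inside an li beyond the cap; keep all other text.
        if (liDepth > 0 && liSeen > liCap) return;
        text.append(ch, start, length);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<ul><li>a1</li><li>a2</li><li>a3</li><li>a4</li></ul>"
                + "<p>article body</p></body></html>";
        CappedHandler h = new CappedHandler(2);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), h);
        System.out.println(h.text); // prints: a1a2article body
    }
}
```

With this shape the huge header listing burns through the cap, but the parse never stops, so the actual article text further down the page is still captured. In real Tika use the same logic would live in a ContentHandlerDecorator wrapped around the existing handler rather than a bare DefaultHandler.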
