Hello,

Using Tika and a custom ContentHandler, we're parsing messy HTML into readable text. The handler caps the number of block-type elements, and some line elements (li in this case), that we're willing to parse. This causes a document [1] to be skipped without any useful text being extracted from it: the page lists thousands of articles in li elements in its header, so the limit of 2k is reached and everything after it is dropped.
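For reference, the limiting behaviour looks roughly like the following. This is a minimal JDK-only SAX sketch, not our actual handler: the class name, the element set, and the tiny limit of 5 are all made up for illustration, standing in for the real 2k cap.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BlockLimitHandler extends DefaultHandler {
    // Hypothetical cap, standing in for the real 2k block-element limit.
    private static final int MAX_BLOCKS = 5;
    private static final Set<String> BLOCKS = Set.of("p", "div", "li");

    private int blockCount = 0;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (BLOCKS.contains(qName.toLowerCase())) {
            blockCount++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Once the cap is hit, all further text is dropped -- including the body.
        if (blockCount <= MAX_BLOCKS) {
            text.append(ch, start, length);
        }
    }

    public String getText() { return text.toString(); }

    public static void main(String[] args) throws Exception {
        // A header with many list items, followed by the actual article text.
        StringBuilder html = new StringBuilder("<html><body><ul>");
        for (int i = 0; i < 10; i++) html.append("<li>link ").append(i).append("</li>");
        html.append("</ul><p>actual article text</p></body></html>");

        BlockLimitHandler handler = new BlockLimitHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(html.toString())), handler);
        // The article paragraph never makes it into the extracted text.
        System.out.println(handler.getText().contains("actual article text")); // prints "false"
    }
}
```

With that page, the header's li elements alone exhaust the budget before the article body is ever reached.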
Does anyone know of any clever tricks to deal with this? Semantically there is nothing wrong with a page carrying such a huge article listing, but delivering HTML like that is of course not a very smart move; even my browser gets bogged down by it.

Thanks,
Markus

[1] https://lobjectif.net/la-pratique-deliberee-au-dela-du-mythe-de-la-maitrise/
