Hello,

Using Tika and a custom ContentHandler we're parsing messy HTML into
readable text. We enforce a limit on the number of block-type elements,
and on some line elements (li in this case), that we're willing to
parse. This causes a document [1] to be skipped entirely, with no
useful text extracted from it. The page lists thousands of articles in
li's in its header, so the limit of 2k is reached and everything after
it is skipped.
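
For context, the limiting works roughly like the following simplified
sketch (the counted element set, the 2k cap and the abort-on-limit
behaviour are illustrative stand-ins for our real handler):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class LimitedBlockHandler extends ContentHandlerDecorator {

    // Illustrative values; the real handler distinguishes block
    // elements from line elements such as li.
    private static final int MAX_ELEMENTS = 2000;
    private static final Set<String> COUNTED = Set.of(
            "p", "div", "li", "h1", "h2", "h3", "blockquote", "pre");

    private int count = 0;

    public LimitedBlockHandler(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        // Abort the parse once the cap is reached; everything after
        // this point in the document is lost.
        if (COUNTED.contains(localName.toLowerCase())
                && ++count > MAX_ELEMENTS) {
            throw new SAXException("element limit of " + MAX_ELEMENTS
                    + " reached");
        }
        super.startElement(uri, localName, qName, atts);
    }

    public static void main(String[] args) throws Exception {
        BodyContentHandler body = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new HtmlParser().parse(in, new LimitedBlockHandler(body),
                    new Metadata(), new ParseContext());
        } catch (SAXException e) {
            // On the linked page, the header's huge li listing trips
            // the limit before the article body is ever seen.
        }
        System.out.println(body.toString());
    }
}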

Does anyone know of any clever tricks to deal with this? Semantically
there is nothing wrong with a page carrying a huge article listing,
but of course delivering such HTML is not a very smart move; even my
browsers get bogged down by it.

Thanks,
Markus

[1]
https://lobjectif.net/la-pratique-deliberee-au-dela-du-mythe-de-la-maitrise/
