Tim Allison created TIKA-4654:
---------------------------------
Summary: Experiment with docstrum for clustering TextPositions for PDFs
Key: TIKA-4654
URL: https://issues.apache.org/jira/browse/TIKA-4654
Project: Tika
Issue Type: Task
Reporter: Tim Allison
We currently allow users to enable {{detectAngles}} and/or {{sortByPosition}}.
We should experiment with other methods for clustering text positions, and
perhaps add heuristics based on the resulting clusters to identify headers and
footers.
Docstrum is one (older) approach; there are likely more modern alternatives.
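To make the idea concrete, here is a minimal sketch of the core Docstrum steps in Python: gather k-nearest-neighbor (distance, angle) pairs, take the peak of the angle histogram as the text orientation, then join neighbors that lie along that orientation into lines. It is an illustration under stated assumptions, not a proposed implementation: the points stand in for glyph centroids that would come from TextPositions, all function names and thresholds ({{k}}, {{angle_tol}}, {{dist_factor}}) are hypothetical, and the later Docstrum stages (between-line spacing, block formation) are left out.

```python
import math
import statistics
from collections import Counter


def nearest_neighbors(points, k=5):
    """For each point, return its k nearest neighbors as (distance, angle, index)."""
    result = []
    for i, (xi, yi) in enumerate(points):
        pairs = []
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            # Fold the angle into [0, 180) so i->j and j->i agree.
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            pairs.append((math.hypot(dx, dy), angle, j))
        pairs.sort()
        result.append(pairs[:k])
    return result


def dominant_angle(neighbors, bin_width=5.0):
    """Peak of the neighbor-angle histogram: a rough text orientation estimate."""
    n_bins = round(180 / bin_width)
    bins = Counter()
    for pairs in neighbors:
        for _dist, angle, _j in pairs:
            bins[round(angle / bin_width) % n_bins] += 1
    peak_bin, _count = bins.most_common(1)[0]
    return peak_bin * bin_width


def angular_difference(a, b):
    """Smallest difference between two orientations, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def cluster_lines(points, k=5, angle_tol=15.0, dist_factor=3.0):
    """Group point indices into text lines (assumed thresholds, see lead-in)."""
    neighbors = nearest_neighbors(points, k)
    theta = dominant_angle(neighbors)
    # Distances to neighbors lying along the dominant orientation.
    along = [d for pairs in neighbors for d, a, _j in pairs
             if angular_difference(a, theta) <= angle_tol]
    if not along:
        return [[i] for i in range(len(points))]
    spacing = statistics.median(along)

    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Transitive closure over "same line" neighbor links.
    for i, pairs in enumerate(neighbors):
        for d, a, j in pairs:
            if angular_difference(a, theta) <= angle_tol and d <= dist_factor * spacing:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

On a toy input of two well-separated horizontal rows of evenly spaced points, {{cluster_lines}} puts each row into its own cluster. Real PDFs would need the thresholds tuned, and a brute-force neighbor search like this one is quadratic in the number of glyphs, so a spatial index would likely be needed for dense pages.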
While VLMs are amazing, we should see if we can improve on our current options
for rebuilding the cow from the hamburger of PDFs.