Tim Allison created TIKA-4654:
---------------------------------

             Summary: Experiment with docstrum for clustering TextPositions for PDFs
                 Key: TIKA-4654
                 URL: https://issues.apache.org/jira/browse/TIKA-4654
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


We currently allow users to turn on {{detectAngles}} and/or {{sortByPosition}}.
We should experiment with other methods for clustering text positions, and
perhaps add heuristics based on the resulting clusters to identify headers and
footers.

Docstrum is one (older) approach. There are likely more modern alternatives.
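
As a rough illustration of the core Docstrum idea only (not a proposed
implementation), the sketch below pairs each glyph centroid with its k nearest
neighbors, histograms the pair angles to estimate the text-line orientation,
and takes the median pair distance along that orientation as a guess at
within-line spacing. Everything here is an assumption for illustration:
DocstrumSketch and its methods are hypothetical names, not Tika or PDFBox API,
and the centroids are assumed to have been computed from TextPositions
beforehand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of Docstrum-style neighbor statistics; not Tika or PDFBox API. */
final class DocstrumSketch {

    /** Histogram bin width in degrees (an assumed value; must divide 180). */
    static final int BIN_WIDTH_DEG = 5;

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /**
     * For every point {x, y}, emits one {distance, angleDeg} pair per nearest
     * neighbor (up to k). Angles are folded into [0, 180) so that left and right
     * neighbors on the same line agree. Brute force, O(n^2 log n).
     */
    static List<double[]> nearestNeighborPairs(double[][] pts, int k) {
        List<double[]> pairs = new ArrayList<>();
        for (int i = 0; i < pts.length; i++) {
            final double[] p = pts[i];
            List<double[]> others = new ArrayList<>();
            for (int j = 0; j < pts.length; j++) {
                if (j != i) {
                    others.add(pts[j]);
                }
            }
            others.sort((a, b) -> Double.compare(dist(p, a), dist(p, b)));
            for (int n = 0; n < Math.min(k, others.size()); n++) {
                double[] q = others.get(n);
                double angle = Math.toDegrees(Math.atan2(q[1] - p[1], q[0] - p[0]));
                angle = ((angle % 180.0) + 180.0) % 180.0;
                pairs.add(new double[] {dist(p, q), angle});
            }
        }
        return pairs;
    }

    /** Returns the center (in degrees) of the fullest angle bin; bin 0 wraps around 0/180. */
    static double dominantAngle(List<double[]> pairs) {
        int[] bins = new int[180 / BIN_WIDTH_DEG];
        for (double[] pair : pairs) {
            int idx = (int) (Math.round(pair[1] / BIN_WIDTH_DEG) % bins.length);
            bins[idx]++;
        }
        int best = 0;
        for (int i = 1; i < bins.length; i++) {
            if (bins[i] > bins[best]) {
                best = i;
            }
        }
        return best * (double) BIN_WIDTH_DEG;
    }

    /** Median distance of pairs whose angle is within tolDeg of angleDeg (modulo 180). */
    static double medianDistanceNear(List<double[]> pairs, double angleDeg, double tolDeg) {
        List<Double> distances = new ArrayList<>();
        for (double[] pair : pairs) {
            double diff = Math.abs(pair[1] - angleDeg);
            diff = Math.min(diff, 180.0 - diff);
            if (diff <= tolDeg) {
                distances.add(pair[0]);
            }
        }
        if (distances.isEmpty()) {
            return Double.NaN;
        }
        Collections.sort(distances);
        return distances.get(distances.size() / 2);
    }
}
```

Calling medianDistanceNear again with the dominant angle plus 90 degrees would
give a similar rough estimate of between-line spacing, provided k is large
enough that neighbors on adjacent lines are included. Those two spacings are
what a clustering step could then use to group glyphs into lines and blocks.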

While VLMs are amazing, we should see if we can improve on our current options 
for rebuilding the cow from the hamburger of PDFs.


