Samved Chandrakant Divekar created PDFBOX-5796:
--------------------------------------------------
Summary: PDFBox cannot extract vector text from a PDF
Key: PDFBOX-5796
URL: https://issues.apache.org/jira/browse/PDFBOX-5796
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.28
Environment: macOS Sonoma 14.4.1, OpenJDK 11 (can reproduce in other
environments too)
Reporter: Samved Chandrakant Divekar
Attachments: Pre-flght_example.png, Sample_Working.png,
Sample_not_Working.png
PDFBox does not extract any text from a PDF in which all text is encoded as
vector objects.
Unfortunately, I cannot attach the original document here (confidentiality), but
I have attached screenshots of the pre-flight analysis of a working file and a
non-working file from Adobe Acrobat Pro. I can't copy and paste the text
directly; however, Adobe's "Recognize Text" function works on the document. I
verified that the whole page is not an image, but all text is definitely
encoded as vector objects. I have also attached an example of what the
pre-flight analysis shows for a single letter.
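
Since the original file cannot be shared, one way to check the "text as vector objects" observation on a similar document is to look at a page's decoded content stream and see whether it contains any text-showing operators at all. The sketch below is only an illustration under stated assumptions: it is plain Java and not PDFBox code, the class and method names (ContentStreamScan, count, looksLikeOutlinedText) are hypothetical, and it assumes the content stream has already been decompressed into a string. The tokenizer is deliberately naive (whitespace splitting), so string literals that contain spaces or operator-like words can skew the counts.

```java
import java.util.Set;

// Hypothetical helper, not part of PDFBox: a naive scan of an already-decoded
// page content stream for text-showing versus path-painting operators.
class ContentStreamScan {
    // Text-showing operators defined by the PDF specification.
    static final Set<String> TEXT_OPS = Set.of("Tj", "TJ", "'", "\"");
    // Path-painting operators (fill, stroke, and their combinations).
    static final Set<String> PAINT_OPS =
            Set.of("f", "F", "f*", "S", "s", "B", "B*", "b", "b*");

    // Counts whitespace-separated tokens that exactly match one of the operators.
    static int count(String content, Set<String> ops) {
        int n = 0;
        for (String token : content.trim().split("\\s+")) {
            if (ops.contains(token)) {
                n++;
            }
        }
        return n;
    }

    // Heuristic: shapes are painted, but no text is ever shown.
    static boolean looksLikeOutlinedText(String content) {
        return count(content, TEXT_OPS) == 0 && count(content, PAINT_OPS) > 0;
    }
}
```

For example, a stream such as "BT /F1 12 Tf 72 700 Td (Hello) Tj ET" contains a Tj operator and is not flagged, while a stream made up only of path construction and fill operators is. A page that is flagged this way carries no character codes for a text extractor to read, which would be consistent with OCR ("Recognize Text") being the only thing that works on it.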