[ https://issues.apache.org/jira/browse/PDFBOX-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-5796.
-----------------------------------
Resolution: Not A Bug
I'm closing this because it's outside of our scope, and not a bug. Please use
Apache Tika (which uses PDFBox and Tesseract), or implement your own OCR logic.
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
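
As a rough illustration of the "own logic" route, the sketch below tries the
embedded text first and falls back to OCR only when nothing comes back. It is
not PDFBox or Tika code: the class OcrFallback and its methods are hypothetical
names, and the two suppliers are stand-ins. In a real program the first would
typically wrap PDFTextStripper.getText(document), and the second would render
each page (for example with PDFRenderer.renderImageWithDPI) and pass the image
to an OCR engine of your choice.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of PDFBox or Tika): prefer the text that is
 * embedded in the PDF and run OCR only when that text is missing.
 */
final class OcrFallback {

    private OcrFallback() {}

    /** Returns true when extraction produced nothing but whitespace. */
    static boolean needsOcr(String extracted) {
        return extracted == null || extracted.trim().isEmpty();
    }

    /**
     * Calls the text-layer extractor first; the OCR supplier is invoked only
     * if the extractor returned no usable text, so the expensive step is
     * skipped for ordinary PDFs.
     */
    static String extractText(Supplier<String> textLayer, Supplier<String> ocr) {
        String text = textLayer.get();
        return needsOcr(text) ? ocr.get() : text;
    }

    public static void main(String[] args) {
        // Stand-in for a PDF whose glyphs were converted to vector outlines:
        // the text extractor finds no text operators and returns whitespace.
        Supplier<String> textLayer = () -> "\n";

        // Stand-in for "render the page to an image, then run OCR on it".
        Supplier<String> ocr = () -> "text recognized by OCR";

        System.out.println(extractText(textLayer, ocr));
    }
}
```

Checking for blank output is only a heuristic. A document that mixes real text
with outlined text would pass the check and still lose the outlined parts, so
a per-page check may be the safer design.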
> PDFBox cannot extract vector text from a PDF
> --------------------------------------------
>
> Key: PDFBOX-5796
> URL: https://issues.apache.org/jira/browse/PDFBOX-5796
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.28
> Environment: macOS Sonoma 14.4.1, OpenJDK 11 (can reproduce in other
> environments too)
> Reporter: Samved Chandrakant Divekar
> Priority: Major
> Attachments: Pre-flght_example.png, Sample_Working.png,
> Sample_not_Working.png
>
>
> PDFBox does not extract any text from a PDF in which all text is encoded as
> vector objects.
> Unfortunately, I cannot attach the original document here (confidentiality),
> but I have attached screenshots of the pre-flight analysis of a working file
> and a non-working file, made with Adobe Acrobat Pro. I can't copy and paste
> the text directly; however, Adobe's "Recognize Text" function works on the
> document. I verified that the whole page is not an image, but definitely all
> text is encoded as vector objects. I have attached an example of what the
> pre-flight analysis shows for a letter.