[ https://issues.apache.org/jira/browse/PDFBOX-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833456#comment-17833456 ]
Tilman Hausherr commented on PDFBOX-5796:
-----------------------------------------
Maybe Adobe is using OCR? PDFBox doesn't have a feature to do OCR. Apache Tika
supports the use of an external OCR engine (tesseract).
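
As a rough illustration of the suggestion above, the following sketch tries ordinary text extraction first and falls back to an OCR step only when no text comes back. The class name `TextOrOcr` and both suppliers are hypothetical placeholders, not PDFBox or Tika API; in a real setup the first supplier might wrap `new PDFTextStripper().getText(document)` and the second an external OCR engine such as Tesseract.

```java
import java.util.function.Supplier;

public class TextOrOcr {

    /** Returns true when extraction produced no usable text (null or whitespace only). */
    static boolean looksEmpty(String extracted) {
        return extracted == null || extracted.trim().isEmpty();
    }

    /**
     * Runs the plain text extractor first and calls the OCR step only if
     * the extractor yields nothing. Both suppliers are stand-ins supplied
     * by the caller; this method knows nothing about PDFs itself.
     */
    static String extract(Supplier<String> textExtractor, Supplier<String> ocrFallback) {
        String text = textExtractor.get();
        return looksEmpty(text) ? ocrFallback.get() : text;
    }

    public static void main(String[] args) {
        // Simulates a PDF whose glyphs are drawn as vector paths:
        // the extractor returns only whitespace, so the OCR stand-in is used.
        String result = extract(() -> "  \n", () -> "text recovered by OCR");
        System.out.println(result);
    }
}
```

The OCR step is passed as a supplier so that it is only invoked when needed, since OCR is typically far slower than reading text operators from the content stream.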
> PDFBox cannot extract vector text from a PDF
> --------------------------------------------
>
> Key: PDFBOX-5796
> URL: https://issues.apache.org/jira/browse/PDFBOX-5796
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.28
> Environment: MacOS Sonoma 14.4.1 OpenJDK 11 (Can reproduce in other
> environments too)
> Reporter: Samved Chandrakant Divekar
> Priority: Major
> Attachments: Pre-flght_example.png, Sample_Working.png,
> Sample_not_Working.png
>
>
> PDFBox does not extract any text from a PDF in which all text is encoded as
> vector objects.
> Unfortunately, I cannot attach the original document here (confidentiality),
> but I have attached screenshots of the pre-flight analysis of a working file
> and a non-working file using Adobe Acrobat Pro. I can't copy and paste the
> text directly; however, Adobe's "Recognize Text" function works on the
> document. I verified that the whole page is not an image, but all text is
> definitely encoded as vector objects. I have attached an example of what the
> pre-flight analysis shows for a letter.