[jira] [Commented] (PDFBOX-5796) PDFBox cannot extract vector text from a PDF

Tilman Hausherr (Jira) Wed, 03 Apr 2024 01:58:48 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833456#comment-17833456
 ]


Tilman Hausherr commented on PDFBOX-5796:
-----------------------------------------

Maybe Adobe is using OCR? PDFBox has no OCR feature. Apache Tika supports the 
use of an external OCR engine (Tesseract).
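
A rough sketch of an OCR fallback, in case it helps. It assumes the text 
has already been extracted (e.g. with PDFTextStripper) and that each page has 
been rendered to an image file (e.g. with PDFRenderer); neither step is shown, 
so the code needs only the JDK. The class and method names are made up for 
illustration, and the tesseract binary is assumed to be on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tika.
final class OcrFallback {

    /** True when extraction produced no usable text, e.g. glyphs drawn as paths. */
    static boolean needsOcr(String extractedText) {
        return extractedText == null || extractedText.isBlank();
    }

    /** Builds the CLI call; the output base "stdout" makes tesseract print the text. */
    static List<String> tesseractCommand(Path pageImage, String language) {
        return List.of("tesseract", pageImage.toString(), "stdout", "-l", language);
    }

    /** Runs tesseract on one rendered page image and returns the recognized text. */
    static String ocr(Path pageImage, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(tesseractCommand(pageImage, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop progress messages
                .start();
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

Usage would be along the lines of: if needsOcr(text) is true for a page, call 
ocr(pageImage, "eng") on the rendered image of that page. Whether OCR quality 
is acceptable depends on the render resolution and the fonts involved.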

> PDFBox cannot extract vector text from a PDF
> --------------------------------------------
>
>                 Key: PDFBOX-5796
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5796
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.28
>         Environment: macOS Sonoma 14.4.1, OpenJDK 11 (can reproduce in other 
> environments too)
>            Reporter: Samved Chandrakant Divekar
>            Priority: Major
>         Attachments: Pre-flght_example.png, Sample_Working.png, 
> Sample_not_Working.png
>
>
> PDFBox does not extract any text from a PDF in which all text is encoded as 
> vector objects. 
> Unfortunately, I cannot attach the original document here (confidentiality), 
> but I have attached screenshots of the pre-flight analysis of a working file 
> and a non-working file using Adobe Acrobat Pro. I can't copy and paste the 
> text directly; however, Adobe's "Recognize Text" function works on the 
> document. I verified that the whole page is not an image, but definitely all 
> text is encoded as vector objects. I have attached an example of what the 
> pre-flight analysis shows for a letter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

