[ 
https://issues.apache.org/jira/browse/PDFBOX-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-5796.
-----------------------------------
    Resolution: Not A Bug

I'm closing this because it's outside of our scope and not a bug. Please use
Apache Tika (which uses PDFBox and Tesseract), or implement your own OCR logic.
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
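
For the "own logic" route, here is a minimal sketch of the fallback decision,
assuming the usual approach of trying normal text extraction first and running
OCR only when that yields nothing. It is an illustration, not part of PDFBox:
OcrFallback and both suppliers are hypothetical names. In a real program the
first supplier would probably wrap PDFTextStripper, and the second would render
the page (for example with PDFRenderer) and pass the image to an OCR engine.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not a PDFBox class): text extraction with OCR fallback. */
final class OcrFallback {
    private OcrFallback() {}

    /** True when extraction produced nothing but whitespace. */
    static boolean isBlank(String text) {
        return text == null || text.trim().isEmpty();
    }

    /**
     * Returns the text layer if it contains anything, otherwise the OCR result.
     * The OCR supplier is only invoked when the text layer is blank, so the
     * expensive render-and-recognize step is skipped for ordinary PDFs.
     */
    static String extract(Supplier<String> textLayer, Supplier<String> ocr) {
        String text = textLayer.get();
        return isBlank(text) ? ocr.get() : text;
    }
}
```

A PDF whose glyphs were converted to vector paths contains no text operators,
so the first supplier would return an empty string and the OCR branch would be
taken.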

> PDFBox cannot extract vector text from a PDF
> --------------------------------------------
>
>                 Key: PDFBOX-5796
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5796
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.28
>         Environment: macOS Sonoma 14.4.1, OpenJDK 11 (can reproduce in other 
> environments too)
>            Reporter: Samved Chandrakant Divekar
>            Priority: Major
>         Attachments: Pre-flght_example.png, Sample_Working.png, 
> Sample_not_Working.png
>
>
> PDFBox does not extract any text from a PDF in which all text is encoded as 
> vector objects. 
> Unfortunately, I cannot attach the original document here (confidentiality), 
> but I have attached screenshots of the pre-flight analysis of a working file 
> and a non-working file using Adobe Acrobat Pro. I can't copy and paste the 
> text directly; however, Adobe's "Recognize Text" function works on the 
> document. I verified that the whole page is not an image, but definitely all 
> text is encoded as vector objects. I have attached an example of what the 
> pre-flight analysis shows for a letter.



