[ https://issues.apache.org/jira/browse/PDFBOX-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-5796.
-----------------------------------
    Resolution: Not A Bug

I'm closing this because it's outside of our scope, and not a bug. Please use Apache Tika (which uses PDFBox and Tesseract), or use your own logic to use OCR.
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

> PDFBox cannot extract vector text from a PDF
> --------------------------------------------
>
>                 Key: PDFBOX-5796
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5796
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.28
>         Environment: MacOS Sonoma 14.4.1, OpenJDK 11 (can reproduce in other environments too)
>            Reporter: Samved Chandrakant Divekar
>            Priority: Major
>         Attachments: Pre-flght_example.png, Sample_Working.png, Sample_not_Working.png
>
>
> PDFBox does not extract any text from a PDF in which all text is encoded as
> vector objects.
> Unfortunately, I cannot attach the original document here (confidentiality),
> but I have attached screenshots of the pre-flight analysis of a working file
> and a non-working file using Adobe Acrobat Pro. I can't copy and paste the
> text directly; however, Adobe's "Recognize Text" function works on the
> document. I verified that the whole page is not an image, but definitely all
> text is encoded as vector objects. I have attached an example of what the
> pre-flight analysis for a letter shows.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
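
For readers who want to write their "own logic to use OCR" as suggested above, the following is a minimal sketch of the fallback decision only, not code from the PDFBox or Tika projects. The class `OcrFallback`, its methods, and the `minChars` threshold are hypothetical names introduced for illustration. The two suppliers are stand-ins: in practice the first might wrap PDFBox's `PDFTextStripper.getText(...)` and the second a Tika or Tesseract call, but that wiring is an assumption and is not shown.

```java
import java.util.function.Supplier;

// Hypothetical helper: decide whether the text layer is usable, else fall back to OCR.
final class OcrFallback {

    // True when the extracted text has fewer than minChars non-whitespace characters.
    // A PDF whose glyphs are drawn as vector outlines typically yields empty text.
    static boolean needsOcr(String extracted, int minChars) {
        if (extracted == null) {
            return true;
        }
        long visible = extracted.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return visible < minChars;
    }

    // Try the text layer first; call the (usually much slower) OCR supplier
    // only when the text layer is effectively empty.
    static String extractWithFallback(Supplier<String> textLayer,
                                      Supplier<String> ocr,
                                      int minChars) {
        String text = textLayer.get();
        return needsOcr(text, minChars) ? ocr.get() : text;
    }

    public static void main(String[] args) {
        // Stand-in suppliers (assumptions): a real program would extract text
        // from the document and run an OCR engine on a rendered page image.
        String result = extractWithFallback(
                () -> "  \n",                     // text layer came back blank
                () -> "text recovered by OCR",    // placeholder OCR result
                1);
        System.out.println(result); // prints "text recovered by OCR"
    }
}
```

The OCR step is passed as a supplier so that it is only evaluated when the text layer turns out to be empty. The threshold of one visible character is arbitrary; a per-page check with a higher threshold may suit mixed documents better.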