[ 
https://issues.apache.org/jira/browse/PDFBOX-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833653#comment-17833653
 ] 

Tilman Hausherr commented on PDFBOX-5796:
-----------------------------------------

There's no flag. When I look at such PDFs, it's a guess: I see text, there 
is no extraction, and the content stream has a lot of "c" (curve) operators.
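
To illustrate the guess described above, here is a rough sketch and not anything PDFBox provides. It assumes the page's content stream has already been decoded to plain text, and it uses a naive whitespace tokenizer, so a "c" inside a string literal would be miscounted. The class name, the method names and the minCurves threshold are all made up for illustration. A real check would take its tokens from a proper content stream parser instead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class OutlinedTextHeuristic {

    /** Counts whitespace-separated tokens in an already decoded content stream. */
    static Map<String, Integer> countTokens(String contentStream) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : contentStream.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Guesses that a page draws its glyphs as paths: no text-showing
     * operators (Tj, TJ, ', ") but at least minCurves "c" operators.
     * minCurves is an arbitrary, assumed threshold.
     */
    static boolean looksLikeOutlinedText(String contentStream, int minCurves) {
        Map<String, Integer> counts = countTokens(contentStream);
        int curves = counts.getOrDefault("c", 0);
        int textShowing = counts.getOrDefault("Tj", 0)
                + counts.getOrDefault("TJ", 0)
                + counts.getOrDefault("'", 0)
                + counts.getOrDefault("\"", 0);
        return textShowing == 0 && curves >= minCurves;
    }

    public static void main(String[] args) {
        // Two tiny hand-written streams: one path-only, one with real text.
        String outlined = "10 10 m 20 30 40 50 60 70 c 60 70 80 90 100 110 c h f";
        String realText = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET";
        System.out.println(looksLikeOutlinedText(outlined, 2));
        System.out.println(looksLikeOutlinedText(realText, 2));
    }
}
```

This stays a guess: a page full of curved vector drawings with no text at all would match just as well, which is why looking at the rendered page is still part of it.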

> PDFBox cannot extract vector text from a PDF
> --------------------------------------------
>
>                 Key: PDFBOX-5796
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5796
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.28
>         Environment: MacOS Sonoma 14.4.1 OpenJDK 11 (Can reproduce in other 
> environments too)
>            Reporter: Samved Chandrakant Divekar
>            Priority: Major
>         Attachments: Pre-flght_example.png, Sample_Working.png, 
> Sample_not_Working.png
>
>
> PDFBox does not extract any text from a PDF in which all text is encoded as 
> vector objects. 
> Unfortunately, I cannot attach the original document here (confidentiality), 
> but I have attached screenshots of the pre-flight analysis of a working file 
> and a non-working file using Adobe Acrobat Pro. I can't copy and paste the 
> text directly; however, Adobe's "Recognize Text" function works on the 
> document. I verified that the whole page is not an image, but definitely all 
> text is encoded as vector objects. I have attached an example of what the 
> pre-flight analysis shows for a letter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
