[jira] [Created] (PDFBOX-5796) PDFBox cannot extract vector text from a PDF

Samved Chandrakant Divekar (Jira) Tue, 02 Apr 2024 15:04:05 -0700

Samved Chandrakant Divekar created PDFBOX-5796:
--------------------------------------------------


             Summary: PDFBox cannot extract vector text from a PDF
                 Key: PDFBOX-5796
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5796
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.28
         Environment: macOS Sonoma 14.4.1, OpenJDK 11 (can reproduce in other 
environments too)
            Reporter: Samved Chandrakant Divekar
         Attachments: Pre-flght_example.png, Sample_Working.png, 
Sample_not_Working.png

PDFBox does not extract any text from a PDF in which all text is encoded as 
vector objects. 

Unfortunately, I cannot attach the original document here (confidentiality), 
but I have attached screenshots of the pre-flight analysis of a working file 
and a non-working file using Adobe Acrobat Pro. I can't copy and paste the text 
directly; however, Adobe's "Recognize Text" function works on the document. I 
verified that the whole page is not an image, but all text is definitely 
encoded as vector objects. I have attached an example of what the pre-flight 
analysis shows for a letter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

