[
https://issues.apache.org/jira/browse/PDFBOX-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527136#comment-13527136
]
Andreas Lehmkühler commented on PDFBOX-1438:
--------------------------------------------
Maybe there is a misunderstanding. There is no "real" text in this PDF. It is
stored as graphics (a pile of lines, curves and boxes). Do the "Adobe test" [1]
to double-check that.
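To illustrate what "no real text" means at the content-stream level, here is a minimal sketch that counts text-showing operators (Tj, TJ, ', ") against path-construction operators (m, l, c, v, y, h, re) in a page content stream. The class and method names (ContentStreamProbe, countOperators) are hypothetical and not part of PDFBox. The sketch assumes the stream has already been decoded to plain text and does not handle inline images, so treat it as a rough indicator only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: a rough probe of a decoded content stream.
final class ContentStreamProbe {
    // Text-showing operators from the PDF specification.
    private static final Set<String> TEXT_OPS =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));
    // Path-construction operators: moveto, lineto, curves, closepath, rectangle.
    private static final Set<String> PATH_OPS =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));

    /** Returns {textOperatorCount, pathOperatorCount}. */
    static int[] countOperators(String stream) {
        int text = 0, path = 0, i = 0, n = stream.length();
        while (i < n) {
            char ch = stream.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
            } else if (ch == '(') {
                i = skipLiteralString(stream, i);      // string contents are not operators
            } else if (ch == '%') {                    // comment runs to end of line
                while (i < n && stream.charAt(i) != '\n' && stream.charAt(i) != '\r') i++;
            } else if (ch == '<') {
                if (i + 1 < n && stream.charAt(i + 1) == '<') {
                    i += 2;                            // dictionary start "<<"
                } else {                               // hex string "<...>"
                    while (i < n && stream.charAt(i) != '>') i++;
                    if (i < n) i++;
                }
            } else {
                int start = i++;
                // Delimiters other than '/' form one-character tokens;
                // names and operators run to the next whitespace or delimiter.
                if (!isDelimiter(ch) || ch == '/') {
                    while (i < n && !Character.isWhitespace(stream.charAt(i))
                            && !isDelimiter(stream.charAt(i))) i++;
                }
                String token = stream.substring(start, i);
                if (TEXT_OPS.contains(token)) text++;
                else if (PATH_OPS.contains(token)) path++;
            }
        }
        return new int[] {text, path};
    }

    private static boolean isDelimiter(char c) {
        return "()<>[]{}/%".indexOf(c) >= 0;
    }

    // Skips a literal string starting at '(' and returns the index just past its end,
    // honoring backslash escapes and balanced nested parentheses.
    private static int skipLiteralString(String s, int open) {
        int depth = 0, i = open;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '\\') { i += 2; continue; }
            if (ch == '(') depth++;
            else if (ch == ')' && --depth == 0) return i + 1;
            i++;
        }
        return s.length();
    }

    public static void main(String[] args) {
        int[] counts = countOperators("0 0 m 10 10 l 0 0 100 50 re S");
        System.out.println("text ops: " + counts[0] + ", path ops: " + counts[1]);
    }
}
```

A page whose streams yield zero text operators but many path operators is most likely drawn as vector graphics, which matches what the Adobe test shows for this file.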
IMO there is only one possible workaround for such cases: convert the PDF to an
image and use OCR software to extract the text. But in your particular case that
seems hard to do, as the text is spread across the whole page in
different font sizes.
[1] http://pdfbox.apache.org/userguide/faq.html (first hint in the section
about text extraction)
> Problems with Image Extraction from PDF
> ---------------------------------------
>
> Key: PDFBOX-1438
> URL: https://issues.apache.org/jira/browse/PDFBOX-1438
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.7.1
> Environment: Windows XP
> Reporter: Christian Czech
> Attachments: Korrespondenz_000.jpg, Korrespondenz_001.jpg,
> Korrespondenz.PDF
>
>
> PDFBox doesn't extract images from the PDF document correctly