[jira] [Commented] (PDFBOX-1438) Problems with Image Extraction from PDF

JIRA Sat, 08 Dec 2012 05:23:28 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527136#comment-13527136
 ]


Andreas Lehmkühler commented on PDFBOX-1438:
--------------------------------------------

Maybe there is a misunderstanding. There is no "real" text in this PDF. It is 
stored as graphics (a pile of lines, curves and boxes). Do the "Adobe test" [1] 
to double-check that.

IMO there is only one possible workaround for such cases: convert the PDF to an 
image and use OCR software to extract the text. But in your particular case 
that seems hard to do, as the text is spread across the whole page in 
different font sizes.
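
A minimal sketch of the OCR half of that workaround, assuming each page has 
already been rendered to a java.awt.image.BufferedImage (PDFBox 1.x should be 
able to do this via PDPage.convertToImage(), but that step is not shown or 
tested here). The class name OcrWorkaround, its method names, and the choice 
of the external "tesseract" command-line tool are assumptions made for 
illustration; none of them come from this issue.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
final class OcrWorkaround {

    // Writes one rendered page as a PNG (lossless, so glyph edges stay sharp for OCR).
    static void savePage(BufferedImage page, Path target) throws IOException {
        if (!ImageIO.write(page, "png", target.toFile())) {
            throw new IOException("No PNG writer available for " + target);
        }
    }

    // Builds the command line "tesseract <image> <outputBase>".
    // Assumes the tesseract binary is installed and on the PATH;
    // tesseract writes the recognized text to <outputBase>.txt.
    static List<String> ocrCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs the OCR tool on one page image and returns its exit code.
    static int runOcr(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(ocrCommand(image, outputBase))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

For each rendered page one would call savePage(...) and then runOcr(...). With 
text scattered across the page in several font sizes, as in this document, the 
recognized text may still come out in an unhelpful order.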

[1] http://pdfbox.apache.org/userguide/faq.html (first hint in the section 
about text extraction)
                
> Problems with Image Extraction from PDF
> ---------------------------------------
>
>                 Key: PDFBOX-1438
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1438
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.7.1
>         Environment: Windows XP
>            Reporter: Christian Czech
>         Attachments: Korrespondenz_000.jpg, Korrespondenz_001.jpg, 
> Korrespondenz.PDF
>
>
> PDFBox doesn't extract images from the PDF document correctly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
