[ 
https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-582.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2.0

I've attached the final result (PDFBOX582-pg_00051.png). Rendering the 
embedded TIFF requires the ImageIO library, which is not part of PDFBox.
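
For anyone who wants to check their own setup: the sketch below asks the 
standard javax.imageio registry whether any reader is registered for a given 
format name such as "tiff". The class and method names (TiffReaderCheck, 
readersFor, hasReader) are made up for illustration and are not part of 
PDFBox. Whether a TIFF reader shows up depends on the Java runtime and on 
which ImageIO plugins are on the classpath, so treat this as a diagnostic 
aid, not as a description of how PDFBox itself does the lookup.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper, not part of PDFBox: reports which ImageIO readers
// are registered for a format name.
class TiffReaderCheck {

    // Returns the class names of all registered readers for formatName.
    static List<String> readersFor(String formatName) {
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> readers =
                ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            names.add(readers.next().getClass().getName());
        }
        return names;
    }

    // True if at least one reader is registered for formatName.
    static boolean hasReader(String formatName) {
        return !readersFor(formatName).isEmpty();
    }

    public static void main(String[] args) {
        // Pick up plugins that were added to the classpath.
        ImageIO.scanForPlugins();
        if (hasReader("tiff")) {
            System.out.println("TIFF readers: " + readersFor("tiff"));
        } else {
            System.out.println("No TIFF reader registered; "
                    + "an ImageIO TIFF plugin is needed on the classpath.");
        }
    }
}
```

If no TIFF reader is reported, an embedded TIFF image most likely cannot be 
decoded until a suitable ImageIO plugin is added to the classpath.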

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>             Fix For: 1.2.0
>
>         Attachments: PageDrawer.patch, PDFBOX582-pg_00051.png, pg_0005.pdf, 
> pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in 
> scanned form. However, sometimes they seem to have run OCR and added the 
> recovered text as an overlay to give the end user a "native PDF" feel, in 
> the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, 
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part 
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the 
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the 
> image part and work upon the text part.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
