[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-582.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2.0

I've attached the final result (PDFBOX582-pg_00051.png). The ImageIO library, which isn't part of PDFBox, has to be used to render the embedded TIFF.

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>             Fix For: 1.2.0
>
>         Attachments: PageDrawer.patch, PDFBOX582-pg_00051.png, pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in
> scanned form. However, sometimes they seem to have run OCR and added
> the recovered text as an overlay in order to give the end user a "native PDF"
> feeling, in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0,
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the
> image part and work on the text part.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
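
Regarding the ImageIO remark in the resolution comment: whether an embedded TIFF can be decoded depends on a TIFF reader being registered with javax.imageio at runtime. The following is a minimal sketch, not taken from the issue or from PDFBox, for checking this before rendering. The class name ImageIoFormatCheck and the method hasReader are hypothetical names chosen for illustration; it uses only the standard javax.imageio API. As far as I know, a TIFF reader ships with the JDK from Java 9 onward, while older runtimes need a separate plugin on the classpath. Which plugin was used for the attached result is not stated in the issue.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper (not part of PDFBox): reports whether ImageIO
// currently has a reader registered for a given image format.
public class ImageIoFormatCheck {

    // Returns true if at least one ImageIO reader is registered
    // for the given informal format name, e.g. "tiff" or "png".
    public static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers =
                ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Rescan the classpath so newly added plugins are registered.
        ImageIO.scanForPlugins();
        for (String format : new String[] {"png", "jpeg", "tiff"}) {
            System.out.println(format + " reader available: "
                    + hasReader(format));
        }
    }
}
```

If the "tiff" line reports false, the scanned image on a page like the one in pg_0005.pdf would presumably not be decodable until a TIFF-capable ImageIO plugin is added to the classpath.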