[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853446#action_12853446 ]
Andreas Lehmkühler commented on PDFBOX-582:
-------------------------------------------

The given example contains an inverted imagemap, which has to be taken into account when calculating the colormodel. With version 930910 I've added a patch that uses an approach similar to the one used for PDFBOX-672.

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: PageDrawer.patch, PDFBOX582-pg_00051.png, pg_0005.pdf,
> pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in
> scanned form. However, sometimes they seem to have conducted OCR and added
> the recovered text as an overlay in order to give the end user a "native PDF"
> feeling, in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0,
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the
> image part and work upon the text part.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.