[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852028#action_12852028 ]
Ken Weinert commented on PDFBOX-582: ------------------------------------ This last comment fits with my experience. We frequently overlay transparent text on top of an image so that the user can select the text for copy/paste (I believe it's mode 3 text IIRC.) So it makes sense that if PDFBox doesn't support that mode that the text will be visible. > Ignoring text over images > ------------------------- > > Key: PDFBOX-582 > URL: https://issues.apache.org/jira/browse/PDFBOX-582 > Project: PDFBox > Issue Type: Improvement > Components: Text extraction, Utilities > Affects Versions: 0.8.0-incubator > Reporter: Villu Ruusmann > Attachments: pg_0005.pdf, pg_0005.png > > > Scientific publishers often publish older articles (year 2000 and earlier) in > scanned form. However, sometimes they seem to have conducted OCR, and added > the recovered text as an overlay in order to give the end user a "native PDF" > feeling in a sense that it is possible to copy and paste text. > PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, > Foxit Reader 3.1, iText 2.1) so that it tries to render both the image part > and the textual overlay part, which may produce confusing results. > Actually, there are two separate cases: > *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the > image part and ignore the text part. > *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the > image part and work upon the text part. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.