[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851883#action_12851883 ]
Igor Podolskiy commented on PDFBOX-582:
---------------------------------------

AFAIK there are no special provisions in PDFs and/or readers to handle those scanned documents, although I'm fairly familiar with the PDF format. There's text, and then there's an opaque image over it, that's all. It's image over text, not text over image, so there's nothing to ignore.

I occasionally create such PDFs myself, for example with the hocr2pdf tool. I remember running into this problem recently (PDFBox displaying both OCR text and images). I didn't have time to debug it to the end, but I think the problem was somehow related to my scanner producing 1-bit TIFFs and PDFBox's PageDrawer not displaying them correctly (what should be white appeared as transparent). The order was all right (image on top of text), but this transparency made it look reversed and confusing.

I'll try to find time today or tomorrow to gather the details and post them here, but I definitely know that 1-bit images were somehow key to this.

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in
> scanned form. However, sometimes they seem to have conducted OCR, and added
> the recovered text as an overlay in order to give the end user a "native PDF"
> feeling in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0,
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the
> image part and work upon the text part.
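
The transparency effect described in the comment (white pixels of a 1-bit scan showing up as transparent, so the OCR text underneath becomes visible) can be illustrated with plain java.awt, without PDFBox. This is only a minimal sketch of the suspected mechanism, not PDFBox's actual PageDrawer code: the class OverlayDemo and its methods are hypothetical names, the black "text layer" is a stand-in for real glyphs, and the per-pixel loop is a simplified source-over composite.

```java
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;

// Hypothetical demo class; not part of PDFBox.
final class OverlayDemo {
    static final int BLACK = 0xFF000000;

    /** Builds an all-white 1-bit image, standing in for a blank area of a scanned page. */
    static BufferedImage whiteScan(int size, boolean whiteIsTransparent) {
        byte[] lut = {0, (byte) 255}; // palette: index 0 = black, index 1 = white
        IndexColorModel cm = whiteIsTransparent
                ? new IndexColorModel(1, 2, lut, lut, lut, 1) // index 1 is fully transparent
                : new IndexColorModel(1, 2, lut, lut, lut);   // both entries opaque
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_BYTE_BINARY, cm);
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                img.getRaster().setSample(x, y, 0, 1); // every pixel uses the "white" index
            }
        }
        return img;
    }

    /** Draws a black "text" layer, then the scan on top, and returns the visible centre pixel. */
    static int visibleColor(boolean whiteIsTransparent) {
        int size = 8;
        BufferedImage page = new BufferedImage(size, size, BufferedImage.TYPE_INT_ARGB);
        BufferedImage scan = whiteScan(size, whiteIsTransparent);
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                page.setRGB(x, y, BLACK);        // OCR text layer is drawn first
                int argb = scan.getRGB(x, y);    // scanned image is drawn second
                if ((argb >>> 24) != 0) {        // fully transparent pixels leave the page as is
                    page.setRGB(x, y, argb);
                }
            }
        }
        return page.getRGB(size / 2, size / 2);
    }

    public static void main(String[] args) {
        System.out.printf("opaque white:      %08X%n", visibleColor(false)); // FFFFFFFF
        System.out.printf("transparent white: %08X%n", visibleColor(true));  // FF000000
    }
}
```

With an opaque palette the scan hides the text layer completely, as other viewers show it. With white marked as transparent the drawing order is unchanged, yet the text shows through, which would look like "text over image".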