[ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maruan Sahyoun updated PDFBOX-582:
----------------------------------

    Attachment: PageDrawer.patch

The patch adds a basic implementation for PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT in order to support documents where text is included invisibly in a PDF as part of an OCR result. A more generic approach still needs to be implemented to fully support the different text rendering modes.

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: PageDrawer.patch, pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in
> scanned form. However, sometimes they seem to have conducted OCR, and added
> the recovered text as an overlay in order to give the end user a "native PDF"
> feeling in a sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0,
> Foxit Reader 3.1, iText 2.1) so that it tries to render both the image part
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the
> image part and work upon the text part.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
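
For illustration, the two cases from the issue (page rendering versus text extraction) can be sketched without any PDFBox dependency. The sketch below is not the attached patch: the enum, the class RenderingModeSketch and its method names are hypothetical, and only the mode numbers 0 to 7 come from the PDF specification's text rendering mode (Tr) operator. It assumes that a renderer should skip glyphs whose mode neither fills nor strokes, while a text extractor keeps them regardless of mode.

```java
/** Text rendering modes 0-7 of the PDF "Tr" operator (names here are illustrative). */
enum TextRenderingMode {
    FILL(0, true, false, false),
    STROKE(1, false, true, false),
    FILL_THEN_STROKE(2, true, true, false),
    NEITHER_FILL_NOR_STROKE(3, false, false, false), // invisible text, e.g. an OCR overlay
    FILL_AND_CLIP(4, true, false, true),
    STROKE_AND_CLIP(5, false, true, true),
    FILL_THEN_STROKE_AND_CLIP(6, true, true, true),
    CLIP(7, false, false, true);

    final int code;
    final boolean fills;
    final boolean strokes;
    final boolean clips;

    TextRenderingMode(int code, boolean fills, boolean strokes, boolean clips) {
        this.code = code;
        this.fills = fills;
        this.strokes = strokes;
        this.clips = clips;
    }

    /** Looks up a mode by its numeric operand; rejects values outside 0-7. */
    static TextRenderingMode fromCode(int code) {
        for (TextRenderingMode mode : values()) {
            if (mode.code == code) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown text rendering mode: " + code);
    }

    /** True when the mode puts visible ink on the page. */
    boolean paintsGlyphs() {
        return fills || strokes;
    }
}

/** Hypothetical helper, not part of PDFBox. */
final class RenderingModeSketch {

    /** Page rendering: draw glyphs only if the mode fills or strokes them. */
    static boolean shouldDrawWhenRendering(TextRenderingMode mode) {
        return mode.paintsGlyphs();
    }

    /** Text extraction: assumption that text is kept whatever its rendering mode. */
    static boolean shouldKeepWhenExtracting(TextRenderingMode mode) {
        return true;
    }

    public static void main(String[] args) {
        for (TextRenderingMode mode : TextRenderingMode.values()) {
            System.out.printf("%d %-26s draw=%b extract=%b%n",
                    mode.code, mode, shouldDrawWhenRendering(mode),
                    shouldKeepWhenExtracting(mode));
        }
    }
}
```

A basic fix of the kind described in the comment corresponds to handling mode 3 only; a fuller implementation would also have to honour the clipping behaviour of modes 4 to 7, which this sketch records in the clips flag but does not act on.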