[jira] Commented: (PDFBOX-582) Ignoring text over images

Ken Weinert (JIRA) Wed, 31 Mar 2010 11:33:50 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852028#action_12852028 ]


Ken Weinert commented on PDFBOX-582:
------------------------------------

This last comment fits with my experience. We frequently overlay transparent
text on top of an image so that the user can select the text for copy/paste
(text rendering mode 3, IIRC).

So it makes sense that, if PDFBox doesn't support that mode, the text will
be visible.
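
To illustrate the point about mode 3, here is a minimal sketch, not PDFBox
code. It assumes a heavily simplified content stream: whitespace-separated
tokens, string operands without spaces, and only the Tr (set text rendering
mode) and Tj (show text) operators. Graphics-state save/restore is ignored.
The class and method names are hypothetical. The idea is that a renderer
would skip strings shown in mode 3 (neither fill nor stroke), while a text
extractor would still collect them.

```java
import java.util.ArrayList;
import java.util.List;

class TextRenderModeSketch {

    // Text rendering mode 3 in the PDF specification: neither fill nor stroke.
    static final int INVISIBLE = 3;

    /**
     * Collects the strings shown by Tj operators in a simplified content stream.
     * When forRendering is true, strings shown while the mode is 3 are skipped;
     * when false (text extraction), every shown string is kept.
     */
    static List<String> shownText(String contentStream, boolean forRendering) {
        List<String> result = new ArrayList<>();
        int mode = 0; // the default text rendering mode is 0 (fill)
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            String operand = tokens[i - 1]; // the operand precedes its operator
            if (tokens[i].equals("Tr")) {
                mode = Integer.parseInt(operand);
            } else if (tokens[i].equals("Tj")) {
                boolean invisible = (mode == INVISIBLE);
                if (!(forRendering && invisible)) {
                    // Drop the enclosing parentheses of the string operand.
                    result.add(operand.substring(1, operand.length() - 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "BT 0 Tr (Caption) Tj 3 Tr (OCR) Tj ET";
        System.out.println(shownText(stream, true));  // renderer's view
        System.out.println(shownText(stream, false)); // extractor's view
    }
}
```

With the sample stream, the renderer's view contains only "Caption", while
the extractor's view contains both "Caption" and "OCR". A viewer that does
not track the Tr operator would treat every string as mode 0 and paint the
overlay text on top of the scanned image.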


> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in 
> scanned form. However, sometimes they seem to have run OCR and added 
> the recovered text as an overlay in order to give the end user a "native PDF" 
> feeling, in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, 
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part 
> and the textual overlay part, which may produce confusing results.
> Actually, there are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the 
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the 
> image part and work upon the text part.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
