[jira] Updated: (PDFBOX-582) Ignoring text over images

Maruan Sahyoun (JIRA) Wed, 31 Mar 2010 13:23:51 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Maruan Sahyoun updated PDFBOX-582:
----------------------------------

    Attachment: PageDrawer.patch

The patch adds a basic implementation of 
PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT to support use cases 
where text is included invisibly in a PDF as part of an OCR result.

A more generic approach still needs to be implemented to fully support the 
different text rendering modes.
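
To illustrate what a more generic approach might look like, here is a minimal, 
self-contained sketch. It is not the code from PageDrawer.patch and does not 
use the PDFBox API: the class TextRenderingModeSketch and all of its methods 
are hypothetical names made up for this example. The only thing taken as given 
is the set of eight text rendering modes (0 to 7) that the PDF specification 
defines for the Tr operator, where mode 3 means "neither fill nor stroke".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the code from PageDrawer.patch and not PDFBox API.
class TextRenderingModeSketch {

    // Text rendering modes set by the Tr operator (PDF specification).
    static final int FILL = 0;
    static final int STROKE = 1;
    static final int FILL_THEN_STROKE = 2;
    static final int NEITHER_FILL_NOR_STROKE = 3; // invisible text, e.g. an OCR layer
    static final int FILL_AND_CLIP = 4;
    static final int STROKE_AND_CLIP = 5;
    static final int FILL_STROKE_AND_CLIP = 6;
    static final int CLIP_ONLY = 7;

    static void checkMode(int mode) {
        if (mode < FILL || mode > CLIP_ONLY) {
            throw new IllegalArgumentException("Unknown text rendering mode: " + mode);
        }
    }

    // True if the glyph outlines are filled in this mode.
    static boolean fills(int mode) {
        checkMode(mode);
        return mode == FILL || mode == FILL_THEN_STROKE
                || mode == FILL_AND_CLIP || mode == FILL_STROKE_AND_CLIP;
    }

    // True if the glyph outlines are stroked in this mode.
    static boolean strokes(int mode) {
        checkMode(mode);
        return mode == STROKE || mode == FILL_THEN_STROKE
                || mode == STROKE_AND_CLIP || mode == FILL_STROKE_AND_CLIP;
    }

    // True if the glyph outlines are added to the clipping path in this mode.
    static boolean addsToClip(int mode) {
        checkMode(mode);
        return mode >= FILL_AND_CLIP;
    }

    // True if the text leaves visible marks on the page.
    static boolean isPainted(int mode) {
        return fills(mode) || strokes(mode);
    }

    // Page rendering: keep only the text runs that are actually painted.
    static List<String> runsToDraw(String[] texts, int[] modes) {
        if (texts.length != modes.length) {
            throw new IllegalArgumentException("texts and modes must have the same length");
        }
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < texts.length; i++) {
            if (isPainted(modes[i])) {
                result.add(texts[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] texts = {"Figure 1", "ocr overlay text"};
        int[] modes = {FILL, NEITHER_FILL_NOR_STROKE};
        // A renderer would skip the invisible run; text extraction keeps every run.
        System.out.println("draw:    " + runsToDraw(texts, modes));
        System.out.println("extract: " + Arrays.asList(texts));
    }
}
```

Splitting each mode into fill, stroke and clip flags means a renderer would not 
need a special case per mode: invisible OCR text (mode 3) simply has no flag 
set and is skipped when drawing, while text extraction can ignore the mode 
altogether.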

> Ignoring text over images
> -------------------------
>
>                 Key: PDFBOX-582
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-582
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>         Attachments: PageDrawer.patch, pg_0005.pdf, pg_0005.png
>
>
> Scientific publishers often publish older articles (year 2000 and earlier) in 
> scanned form. However, some of them appear to have run OCR and added 
> the recovered text as an overlay to give the end user a "native PDF" 
> feel, in the sense that it is possible to copy and paste text.
> PDFBox differs from other PDF viewers (tested with Adobe Acrobat Reader 7.0, 
> Foxit Reader 3.1, iText 2.1) in that it tries to render both the image part 
> and the textual overlay part, which may produce confusing results.
> There are two separate cases:
> *) Page rendering (class org.apache.pdfbox.pdfviewer.PageDrawer): Render the 
> image part and ignore the text part.
> *) Text extraction (class org.apache.pdfbox.util.PDFTextStripper): Ignore the 
> image part and work on the text part.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
