[ 
https://issues.apache.org/jira/browse/PDFBOX-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983173#comment-14983173
 ] 

John Hewson commented on PDFBOX-3074:
-------------------------------------

Transparency groups are just forms, which may have an opacity. Marked content 
can encapsulate any sequence of PDF content, which might happen to include a 
transparency group, but marked content is purely for accessibility and has no 
impact on rendering. The two are orthogonal.

PDFBox extracts all text regardless of whether it is visible, so if you have 
text in a transparency group with an opacity of zero, it will still be 
extracted. Try overriding the following method in PDFTextStripper (or in your 
case PDFMarkedContentExtractor):

{code}
public void showTransparencyGroup(PDTransparencyGroup form) throws IOException
{code}

You should be able to inspect the form's opacity, which is actually a mask 
available via getGraphicsState().getSoftMask(). You can then call 
super.showTransparencyGroup(form) only if the content is opaque, and skip the 
call otherwise.
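
A rough sketch of that override pattern follows. So that it compiles without 
PDFBox on the classpath, SoftMask, GraphicsState, TransparencyGroup and 
TextStripper are hypothetical stand-ins, not the real PDFBox classes; only 
showTransparencyGroup, getGraphicsState() and getSoftMask() mirror names 
mentioned above. Treating every group drawn under a soft mask as hidden is 
also an assumption; a real check would probably need to look at the mask 
itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the PDFBox types, reduced to what this sketch needs.
class SoftMask { }

class GraphicsState {
    private final SoftMask softMask; // null means no soft mask is active

    GraphicsState(SoftMask softMask) { this.softMask = softMask; }

    SoftMask getSoftMask() { return softMask; }
}

class TransparencyGroup {
    private final String text;

    TransparencyGroup(String text) { this.text = text; }

    String getText() { return text; }
}

// Stand-in for the stripper: extracts text from every group it is shown.
class TextStripper {
    private final List<String> extracted = new ArrayList<>();
    private GraphicsState graphicsState = new GraphicsState(null);

    GraphicsState getGraphicsState() { return graphicsState; }

    void setGraphicsState(GraphicsState state) { this.graphicsState = state; }

    public void showTransparencyGroup(TransparencyGroup form) {
        extracted.add(form.getText());
    }

    List<String> getExtracted() { return extracted; }
}

// Skips groups that are drawn while a soft mask is active.
class VisibleTextStripper extends TextStripper {
    @Override
    public void showTransparencyGroup(TransparencyGroup form) {
        if (getGraphicsState().getSoftMask() != null) {
            return; // assumed hidden: never reach the base-class extraction
        }
        super.showTransparencyGroup(form);
    }
}
```

With the real classes, the same shape should apply: subclass PDFTextStripper 
(or PDFMarkedContentExtractor), do the check inside showTransparencyGroup, and 
only delegate to super for groups you consider visible.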

> Mark transparency groups
> ------------------------
>
>                 Key: PDFBOX-3074
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3074
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Daniel Persson
>            Priority: Minor
>              Labels: github-import
>             Fix For: 2.0.0
>
>         Attachments: mark_transparency_groups.patch
>
>
> We try to read text from PDF files but some of the files include extra data 
> that is never shown. These segments are usually grouped in transparency 
> groups. So for us this function to flag a marked content as a transparency 
> group is quite useful.
> If there is a way to do this please tell me or if there is a better way to 
> remove text that isn't presented or drawn when the PDF is viewed then I'm all 
> ears.



