[jira] [Commented] (PDFBOX-3912) Command line : ExtractText, Duplicated text

Tilman Hausherr (JIRA) Mon, 28 Aug 2017 23:29:44 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144816#comment-16144816 ]


Tilman Hausherr commented on PDFBOX-3912:
-----------------------------------------

Well, the problem is with the creator of that PDF. Maybe he/she got confused 
and made content invisible instead of deleting it. PDFBox just extracts what's 
there. It can't know that something is "invisible".

> Command line : ExtractText, Duplicated text
> -------------------------------------------
>
>                 Key: PDFBOX-3912
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3912
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>            Reporter: Hasan Karaoğlu
>         Attachments: bugzilla867751.html
>
>
> When I convert some pages of a PDF file to HTML, the output contains 
> duplicated pages. For example, when I convert the seventh page of a PDF 
> file, the result also contains the sixth page's content.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

